Using a monkey-patched XML parser to convert journal articles to ePub format

Question

I am working on a project that involves XML parsing, and for the job I am using xml.dom.minidom. During development I identified several patterns of processing that I refactored into discrete methods. The code shown in the snippet below shows my definition of an Article class that is instantiated during initial processing and is later passed to other classes for interpretation and output.

Because I wanted to consolidate my refactored methods into a single definition — and because I felt they were best thought of as extended minidom methods, I removed them from the Interpretation/Output classes and then monkey-patched them into the minidom module so that they would be available to any class operating on the Article's document.

# -*- coding: utf-8 -*-
import openaccess_epub.utils.element_methods as element_methods
import openaccess_epub.utils as utils
from openaccess_epub.jpts.jptsmetadata import JPTSMetaData20, JPTSMetaData23, JPTSMetaData30
import os.path
import sys
import shutil
import xml.dom.minidom as minidom
import logging

log = logging.getLogger('Article')

#Monkey patching in some extended methods for xml.dom.minidom classes
minidom.Node.removeSelf = element_methods.removeSelf
minidom.Node.replaceSelfWith = element_methods.replaceSelfWith
minidom.Node.elevateNode = element_methods.elevateNode
minidom.Element.getChildrenByTagName = element_methods.getChildrenByTagName
minidom.Element.removeAllAttributes = element_methods.removeAllAttributes
minidom.Element.getAllAttributes = element_methods.getAllAttributes
minidom.Element.getOptionalChild = element_methods.getOptionalChild


class Article(object):
    """
    A journal article; the top-level element (document element) of the
    Journal Publishing DTD, which contains all the metadata and content for
    the article.
    3.0 Tagset:
    http://dtd.nlm.nih.gov/publishing/tag-library/3.0/n-3q20.html
    2.0 Tagset:
    http://dtd.nlm.nih.gov/publishing/tag-library/2.0/n-9kc0.html
    2.3 Tagset:
    http://dtd.nlm.nih.gov/publishing/tag-library/2.3/n-zxc2.html
    """
    def __init__(self, xml_file):
        """
        The __init__() method has to do the following specific jobs. It must
        parse the article using xml.dom.minidom. It must check the parsed
        article to detect its DTD and version; it must also detect the
        publisher using self.identify_publisher(). It is responsible for
        using this information to create an instance of a metadata class
        such as found in jptsmeta.py to serve as the article's metadata
        attribute.
        """
        log.info('Parsing file - {0}'.format(xml_file))
        doc = minidom.parse(xml_file)
        #Here we check the doctype for the DTD under which the article was
        #published. This affects how we will parse metadata and content.
        dtds = {'-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN':
                '2.0',
                '-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN':
                '2.3',
                '-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN':
                '3.0'}
        try:
            self.dtd = dtds[doc.doctype.publicId]
            dtdStatus = 'Article published with Journal Publishing DTD v{0}'
            log.debug(dtdStatus.format(self.dtd))
        except KeyError:
            print('The article\'s DOCTYPE declares an unsupported Journal \
Publishing DTD: \n{0}'.format(doc.doctype.publicId))
            sys.exit()
        #Access the root tag of the document name
        self.root_tag = doc.documentElement
        #Determine the publisher
        self.publisher = self.identify_publisher()
        log.info('Publisher - {0}'.format(self.publisher))
        #Create instance of article metadata
        if self.dtd == '2.0':
            self.metadata = JPTSMetaData20(doc, self.publisher)
        elif self.dtd == '2.3':
            self.metadata = JPTSMetaData23(doc, self.publisher)
        elif self.dtd == '3.0':
            self.metadata = JPTSMetaData30(doc, self.publisher)
        #The <article> tag has a handful of potential attributes, we can check
        #to make sure the mandated ones are valid
        self.attrs = {'article-type': None, 'dtd-version': None,
                      'xml:lang': None, 'xmlns:mml': None,
                      'xmlns:xlink': None, 'xmlns:xsi': None}
        for attr in self.attrs:
            #getAttribute() returns an empty string if the attribute DNE
            self.attrs[attr] = self.root_tag.getAttribute(attr)
        self.validate_attributes()  # Log errors for invalid attribute values
        try:
            self.body = self.root_tag.getElementsByTagName('body')[0]
        except IndexError:
            self.body = None

    def identify_publisher(self):
        """
        This method determines the publisher of the document based on an
        an internal declaration. For both JP-DTDv2.0 and JP-DTDv2.3, there are
        two important signifiers of publisher, <publisher> under <journal-meta>
        and <article-id pub-id-type="doi"> under <article-meta>.
        """
        log.info('Determining Publisher')
        pubs = {'Frontiers Research Foundation': 'Frontiers',
                'Public Library of Science': 'PLoS'}
        dois = {'10.3389': 'Frontiers',
                '10.1371': 'PLoS'}
        if self.dtd in ['2.0', '2.3']:
            #The publisher node will be the primary mode of identification
            publisher = self.root_tag.getElementsByTagName('publisher')
            pname = False
            if publisher:
                log.debug('Located publisher element')
                pname = publisher[0].getElementsByTagName('publisher-name')[0]
                pname = pname.firstChild.data
                try:
                    return pubs[pname]
                except KeyError:
                    log.debug('Strange publisher name - {0}'.format(pname))
                    log.debug('Falling back to article-id DOI')
                    pname = False
            if not pname:  # If pname is undeclared, check article-id
                art_IDs = self.root_tag.getElementsByTagName('article-id')
                for aid in art_IDs:
                    if aid.getAttribute('pub-id-type') == 'doi':
                        idstring = aid.firstChild.data
                        pub_doi = idstring.split('/')[0]
                try:
                    return dois[pub_doi]
                except KeyError:
                    print('Unable to identify publisher by DOI, aborting!')
                    sys.exit()

    def validate_attributes(self):
        """
        Most of the time, attributes are not required nor do they have fixed
        values. But in this case, there are some mandatory requirements.
        """
        #I would love to check xml:lang against RFC 4646:
        # http://www.ietf.org/rfc/rfc4646.txt
        #I don't know a good tool for it though, so it gets a pass for now.
        mandates = [('xmlns:mml', 'http://www.w3.org/1998/Math/MathML'),
                    ('xmlns:xlink', 'http://www.w3.org/1999/xlink'),
                    ('xmlns:xsi', 'http://www.w3.org/2001/XMLSchema-instance')]
        attr_err = 'Article attribute {0} has improper value: {1}'
        for key, val in mandates:
            if self.attrs[key] and not self.attrs[key] == val:
                log.error(attr_err.format(key, self.attrs[key]))
        if self.attrs['article-type'] not in utils.suggested_article_types:
            art_type_err = 'article-type value is not a suggested value - {0}'
            log.warning(art_type_err.format(self.attrs['article-type']))

    def get_DOI(self):
        """
        A method for returning the DOI identifier of an article
        """
        return self.metadata.article_id['doi']

A lot of the code in this file is old and definitely should be revised. Though I welcome comment on anything, my primary question is specifically about whether or not monkey-patching is a good (or acceptable) solution. If not, what would be better instead? If it is okay, how could I improve my use?

When I recently made the choice to use the monkey-patching solution, part of my justification was that I was not changing any of the native functions of the xml.dom.minidom module; all of these are new methods that do not change expected behavior that might cause confusion. Additionally this allowed me to use methods that fit minidom's "style" which I hoped would emphasize the analogous behavior through analogous idioms.

For reference, the full codebase is on GitHub, and the element methods are here:

Maybe you should use composition? OK, as a rule, I despise monkey patching, but composition is a good way to "hide" what is wrong with older/crappy implementations. — fge
– fge, Commented Jun 1, 2013 at 0:12

Sean Perry · Accepted Answer · 2014-01-25 04:33:57Z

3

The problem here is the minidom API is a well known API. Someone new to the code needs to know that you monkey patched it and why you did it. Otherwise they would be scouring the minidom docs looking for your methods. This is generally why monkey patching is a bad idea because it can be confusing to the next reader. Especially when the next reader is a person less experienced with the API or programming language.

As @fge suggests, using composition in some way would be preferable here.

edited Jan 25, 2014 at 4:33

answered Jan 24, 2014 at 23:40

Sean Perry

1,2099 silver badges15 bronze badges

Add a comment |

Billal BEGUERADJ · Accepted Answer · 2018-04-25 07:11:00Z

I do not find the idea of parsing the file within the class initializer a good one. That place is for for initializing -instance- variables as the name suggests. I suggest you to delegate the processing (parsing) in separate functions. This is a note about the general design of your class Article

To instantiate the class, we need to pass an XML file as an argument. But within __init__() you make assumption it is a valid XML file. We can not trust an input. That is a security rule. However, in your case, the minimu you can do is to check for the validity of the input: what if the XML file is not well formed? What if the wrong path was given and simply does not exist? To remedy to these problems, you can improve your code concerning this point as follows:

First, you need to import xml.parsers.expat with which you can raise an ExpatError exception in case your XML file is malformed. So in the import sections of your code, add this line:

from xml.parsers.expat import ExpatError

And wrap doc = minidom.parse(xml_file) like this:

try:
   self.doc = minidom.parse(xml_file)
except ExpatError as err:
   print('Malformed XML file.')
   print('Exception: {}'.format(err.code))
   print('Line: {}'.format(err.lineno))
   print('Offset: {}.format(err.offset)')
   raise err
except IOError as ioerr:
   print('Invalide path, can not read the file.')
   print(ioerr.strerror, ioerr.errno)
   raise ioerr

The second except above checks if the file exist within the provided path. Note also that you do not need to import a specific module to make use of IOError as it is a built-in exception.

Stack Exchange Network

Using a monkey-patched XML parser to convert journal articles to ePub format

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Using a monkey-patched XML parser to convert journal articles to ePub format

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions