6

I'm using SimpleXML to load in some xml files (which I didn't write/provide and can't really change the format of).

Occasionally (e.g. one or two files out of every 50 or so) they don't escape special characters (mostly &, but sometimes other random invalid things too). This creates an issue because SimpleXML in PHP just fails, and I don't know of any good way to handle parsing invalid XML.

My first idea was to preprocess the XML as a string and put ALL fields in as CDATA so it would work, but for some ungodly reason the XML I need to process puts all of its data in the attribute fields. Thus I can't use the CDATA idea. An example of the XML being:

 <Author v="By Someone & Someone" />

What's the best way to process this and replace all the invalid characters in the XML before I load it with SimpleXML?

2
  • if it’s just the &, can’t you escape it prior to loading?
    – Dormilich
    Commented May 22, 2010 at 23:36
  • It's more than just the & that's invalid.
    – Paul
    Commented May 22, 2010 at 23:58
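
For the ampersand-only case raised in the comments above, escaping before loading could be sketched roughly as follows. The helper name escape_bare_ampersands is made up for illustration, and the sketch assumes that any "&" not starting one of XML's five predefined entities or a numeric character reference should become &amp;. It does nothing for the other invalid characters mentioned, and it would also rewrite an "&" inside CDATA sections or comments.

```php
<?php
// Hypothetical helper (not from the thread): escape every "&" that does not
// already start a predefined XML entity or a numeric character reference.
function escape_bare_ampersands(string $xml): string
{
    $escaped = preg_replace(
        '/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)/',
        '&amp;',
        $xml
    );

    // preg_replace() returns null on a regex error; fall back to the input.
    return $escaped ?? $xml;
}

$raw = '<Author v="By Someone & Someone" />';
$sxe = simplexml_load_string(escape_bare_ampersands($raw));
echo $sxe['v'], "\n"; // By Someone & Someone
```

Already escaped text such as &amp; or &#38; is left alone, so running the helper twice should not double-escape anything.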

3 Answers

7

What you need is something that will use libxml's internal errors to locate invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for error info.

function load_invalid_xml($xml)
{
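    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments

    $sxe = simplexml_load_string($xml);

    // Compare against false: an empty, attribute-less element is falsy
    if ($sxe !== false)
    {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }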

    $fixed_xml = '';
    $last_pos  = 0;

    foreach (libxml_get_errors() as $error)
    {
        // $pos is the position of the faulty character,
        // you have to compute it yourself
        $pos = compute_position($error->line, $error->column);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}
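
The mockup leaves compute_position() undefined. One possible sketch is below; it is not the answer author's code. It takes the XML string as an extra first argument (so the call would become compute_position($xml, $error->line, $error->column)), and it assumes that line and column are both 1-based and that the column counts bytes. libxml often reports a column just past the offending character, so the result may need adjusting for a given error code.

```php
<?php
// Sketch of the compute_position() helper the answer leaves undefined.
// Assumes $line and $column are 1-based and that $column counts bytes.
function compute_position(string $xml, int $line, int $column): int
{
    $offset = 0;
    $lines  = explode("\n", $xml);

    // Add the length of every line before the target line, plus its "\n".
    for ($i = 0; $i < $line - 1 && $i < count($lines); $i++) {
        $offset += strlen($lines[$i]) + 1;
    }

    // Clamp so a column of 0 (not set by libxml) never gives a negative
    // offset, and so the result never points past the end of the string.
    return min(strlen($xml), $offset + max(0, $column - 1));
}

$xml = "<a>\n<b v=\"x & y\"/>\n</a>";
echo $xml[compute_position($xml, 2, 9)], "\n"; // &
```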
1
  • 2
    Posting an example of compute_position would be handy! Commented Aug 1, 2012 at 10:24
2

I think a workaround for having to write a compute_position function is to flatten the XML string before processing. Here is the code posted by Josh, rewritten:

function load_invalid_xml($xml)
{
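    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments

    $sxe = simplexml_load_string($xml);

    // Compare against false: an empty, attribute-less element is falsy
    if ($sxe !== false)
    {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }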

    $fixed_xml = '';
    $last_pos  = 0;

    // make string flat
    $xml = str_replace(array("\r\n", "\r", "\n"), "", $xml);

    // get file encoding
    $encoding = mb_detect_encoding($xml);

    foreach (libxml_get_errors() as $error)
    {
        $pos = $error->column;
        $invalid_char = mb_substr($xml, $pos, 1, $encoding);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($invalid_char);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}

I've added the encoding handling because I had problems with the plain array[index] way of getting a character from a string.

This should all work but, for reasons I don't understand, $error->column gives me a different number than it should. I tried to debug this by adding some invalid characters to the XML and checking what value it returned, but had no luck. I hope someone can tell me what is wrong with this approach.

1
  • While your method runs, it does not solve my particular problem that generates this error.
    – Patrick
    Commented Sep 18, 2017 at 16:43
0

Despite this question being more than 10 years old as I write this, I'm still running into similar XML parsing issues (PHP 8.1), which is why I ended up here. The answers already given are helpful, but each is incomplete, inconsistent or otherwise unsuitable for my problem, and I suspect for the original poster's too.

Inspecting internal XML parsing issues seems right, but there are 735 error codes (see https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-xmlerror.html), so a more adaptable solution seems appropriate.
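
To find out which of those codes a particular file actually triggers, something like the sketch below can help. The function name collect_xml_errors is invented for this example; it only uses the standard libxml_* functions and the public properties of LibXMLError.

```php
<?php
// Collect libxml errors for a string instead of emitting PHP warnings.
function collect_xml_errors(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);      // result ignored, only errors matter
    $errors = libxml_get_errors();    // array of LibXMLError objects

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

foreach (collect_xml_errors('<Author v="By &ccedil;" />') as $e) {
    printf(
        "code=%d level=%d line=%d col=%d %s\n",
        $e->code, $e->level, $e->line, $e->column, trim($e->message)
    );
}
```

Running this over a handful of the failing files should show which codes are worth adding as cases, rather than trying to cover the whole list up front.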

I used the word "inconsistent" above because the best of the other answers (@Adam Szmyd) mixed multibyte string handling with non-multibyte string handling.

The following code uses Adam's as the base and I reworked it for my situation, which I feel could be extended further depending on the problems actually being experienced. So, I'm not complete either - sorry!

The essence of this code is that it handles "each" (in my implementation, just 1) XML parsing error as a separate case. The error I was experiencing was an unrecognised HTML entity (&ccedil; - ç), so I use PHP entity replacement instead.

function load_invalid_xml($xml)
{
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments

    $sxe = simplexml_load_string($xml);

    if ($sxe !== false) {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos  = 0;

    // make string flat
    $xmlFlat = mb_ereg_replace( '(\r\n|\r|\n)', '', $xml );

    // Regenerate the error but using the flattened source so error offsets are directly relevant
    libxml_clear_errors();
    $xml_doc = @simplexml_load_string( $xmlFlat );

    foreach (libxml_get_errors() as $error)
    {
        $pos = $error->column - 1; // ->column appears to be 1 based, not 0 based

        switch( $error->code ) {

            case 26: // error undeclared entity
            case 27: // warning undeclared entity
                if ($pos >= 0) { // the PHP docs suggest this not always set (in which case ->column is == 0)

                    $left = mb_substr( $xmlFlat, 0, $pos );
                    $amp = mb_strrpos( $left, '&' );

                    if ($amp !== false) {

                        $entity = mb_substr( $left, $amp );
                        $fixed_xml .= mb_substr( $xmlFlat, $last_pos, $amp - $last_pos )
                            . html_entity_decode( $entity );
                        $last_pos = $pos;
                    }
                }
                break;

            default:
        }
    }
    $fixed_xml .= mb_substr($xmlFlat, $last_pos); // offsets refer to the flattened string

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}
2
  • Could you not just pre-emptively run the XML through preg_replace_callback and run html_entity_decode on any matches after filtering out the 4 allowed entities?
    – miken32
    Commented Apr 14, 2023 at 16:06
  • That is one approach, but it depends on the content I would say. I know that the XML I'm working with is a bit picky and processing encoded URLs multiple times causes issues with the URLs, hence taking a softly, softly approach with only "correcting" those things that caused the parser to fail. Commented Apr 16, 2023 at 13:52
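
For reference, the pre-emptive approach suggested in the first comment might be sketched like this. The function name decode_html_entities_for_xml is hypothetical, and the sketch assumes UTF-8 input and the HTML 4.01 entity table. XML predefines five entities (amp, lt, gt, quot, apos), and those are left untouched. As the reply notes, this rewrites every matching entity in the document, not only the ones the parser complained about, so it may not suit content such as encoded URLs.

```php
<?php
// Hypothetical helper: decode named HTML entities that XML does not define,
// leaving the five predefined XML entities untouched.
function decode_html_entities_for_xml(string $xml): string
{
    $xmlEntities = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlEntities): string {
            if (in_array($m[1], $xmlEntities, true)) {
                return $m[0]; // already valid XML, keep as-is
            }
            // Names html_entity_decode() does not know are returned unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
        },
        $xml
    );

    // preg_replace_callback() returns null on a regex error.
    return $result ?? $xml;
}

$sxe = simplexml_load_string(
    decode_html_entities_for_xml('<Author v="Fran&ccedil;ois &amp; co" />')
);
echo $sxe['v'], "\n";
```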
