Although this question is 10 years old as I write this, I'm still hitting similar XML parsing issues on PHP 8.1, which is how I ended up here. The answers already given are helpful, but each is incomplete, inconsistent or otherwise unsuitable for my problem, and I suspect for the original poster's too.
Inspecting libxml's internal parsing errors seems the right approach, but there are 735 error codes (see https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-xmlerror.html), so a solution that can be adapted one error code at a time seems more appropriate than a single catch-all fix.
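To see which of those codes you are actually dealing with, it helps to dump what libxml reports before deciding how to handle it. This is only a minimal sketch: the sample XML string is a made-up input, and the output format is my own choice.

```php
<?php
// Hypothetical sample input: &ccedil; is an HTML entity that plain XML does not declare
$xml = '<root><name>Fran&ccedil;ois</name></root>';

// Collect errors internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

$sxe = simplexml_load_string($xml);

// Each item is a LibXMLError object with level, code, column, message, file and line
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf(
        "code=%d level=%d line=%d column=%d message=%s\n",
        $error->code,
        $error->level,
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Leave the libxml error state as we found it
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

The code values printed here are what the switch statement in the function further down keys on.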
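I used the word "inconsistent" above because the best of the other answers (@Adam Szmyd) mixes multibyte string handling with non-multibyte string handling, so character offsets and byte offsets can disagree as soon as the XML contains a multibyte character.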
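A quick illustration of why that mixing matters, assuming UTF-8 input and the mbstring extension (the sample string is made up):

```php
<?php
// "\u{00E7}" is the character c-cedilla, which takes two bytes in UTF-8
$text = 'Fran' . "\u{00E7}" . 'ois &amp; co';

$bytePos = strpos($text, '&');                // offset counted in bytes
$charPos = mb_strpos($text, '&', 0, 'UTF-8'); // offset counted in characters

// Using a character offset with a byte-based function cuts in the wrong place
$wrong = substr($text, $charPos, 5);
$right = mb_substr($text, $charPos, 5, 'UTF-8');

var_dump($bytePos, $charPos, $wrong, $right);
```

The two offsets differ by one here because of the single two-byte character before the ampersand; with more multibyte characters the drift grows.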
The following code uses Adam's as the base, reworked for my situation. I think it could be extended further depending on the problems you are actually seeing, so my answer isn't complete either, sorry!
The essence of this code is that it handles each XML parsing error (in my implementation, just one) as a separate case. The error I was hitting was an unrecognised HTML entity (&ccedil; - ç), so I use PHP's entity decoding to replace it with the literal character.
function load_invalid_xml($xml)
{
    // Collect libxml errors internally instead of emitting PHP warnings
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments; passing one throws on PHP 8

    $sxe = simplexml_load_string($xml);
    if ($sxe !== false) { // an empty element is falsy, so compare against false explicitly
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos = 0;

    // make string flat
    $xmlFlat = mb_ereg_replace('(\r\n|\r|\n)', '', $xml);

    // Regenerate the errors using the flattened source so error offsets are directly relevant
    libxml_clear_errors();
    simplexml_load_string($xmlFlat);

    foreach (libxml_get_errors() as $error) {
        $pos = $error->column - 1; // ->column appears to be 1 based, not 0 based
        switch ($error->code) {
            case 26: // error undeclared entity
            case 27: // warning undeclared entity
                if ($pos >= 0) { // the PHP docs suggest this is not always set (in which case ->column == 0)
                    $left = mb_substr($xmlFlat, 0, $pos);
                    $amp = mb_strrpos($left, '&');
                    // skip entities already handled (the same one may be reported twice)
                    if ($amp !== false && $amp >= $last_pos) {
                        $entity = mb_substr($left, $amp);
                        $fixed_xml .= mb_substr($xmlFlat, $last_pos, $amp - $last_pos)
                            . html_entity_decode($entity);
                        $last_pos = $pos;
                    }
                }
                break;
            default:
                // other error codes are not handled (yet); add further cases as needed
                break;
        }
    }

    // Append the remainder of the flattened string, since the offsets refer to $xmlFlat, not $xml
    $fixed_xml .= mb_substr($xmlFlat, $last_pos);

    libxml_clear_errors();
    libxml_use_internal_errors($use_internal_errors);
    return simplexml_load_string($fixed_xml);
}
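The entity replacement at the heart of the function can be tried in isolation. This is just a sketch with a made-up input string, and the explicit flags and 'UTF-8' charset are my own choice (the function above relies on PHP's defaults):

```php
<?php
// Hypothetical input containing an HTML entity that XML does not know about
$broken = '<name>Fran&ccedil;ois</name>';
$entity = '&ccedil;';

// Decode the HTML entity to its literal UTF-8 character
$decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Swap the entity for the literal character and parse again
$fixed = str_replace($entity, $decoded, $broken);
$sxe = simplexml_load_string($fixed);

echo (string) $sxe, "\n";
```

Note that html_entity_decode() returns its input unchanged when it does not recognise the entity either, in which case the final parse will still fail and that case would need its own handling.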