I get the following error while trying to validate XML against a schema:

lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}attributeGroup', attribute 'ref': The QName value '{http://www.w3.org/XML/1998/namespace}specialAttrs' does not resolve to a(n) attribute group definition., line 15

The issue reproduces with lxml >= 6.0.0, and only on Linux (tested on Ubuntu 20 and 22).

lxml version 6.0.2 works well on Windows systems (10 and 11).

Below is a simplified example of my use case.

main.xml

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:xi="http://www.w3.org/2001/XInclude">
    <title>Main XML</title>
    <elements>
        <element name="main element" foo="main foo">This text is from main.xml</element>
        <xi:include href="include.xml" parse="xml" xpointer="xpointer(/elements/element)"/>
    </elements>
</root>

include.xml

<?xml version="1.0" encoding="UTF-8"?>
<elements>
    <element name="element1" foo="foo1">Text 1: This content is included from another file.</element>
    <element name="element2" foo="foo2">Text 2: This content is included from another file.</element>
    <element name="element3" foo="foo3">Text 3: This content is included from another file.</element>
</elements>

transform.xslt

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity transform: copy everything by default -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Match only <element> with name="element2" and override its foo and name attributes -->
    <xsl:template match="element[@name='element2']">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:attribute name="foo">spam</xsl:attribute>
            <xsl:attribute name="name">message99</xsl:attribute>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

schema.xsd

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2009/01/xml.xsd"/>
    <xs:element name="root">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="title" type="xs:string"/>
                <xs:element name="elements">
                    <xs:complexType>
                        <xs:sequence minOccurs="1" maxOccurs="unbounded">
                            <xs:element name="element" minOccurs="1" maxOccurs="unbounded">
                                <xs:complexType mixed="true">
                                    <xs:attribute name="name" type="xs:string" use="required"/>
                                    <xs:attribute name="foo" type="xs:string" use="required"/>
                                    <xs:attributeGroup ref="xml:specialAttrs"/>
                                </xs:complexType>
                            </xs:element>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

</xs:schema>

Line 15 in schema.xsd (the xs:attributeGroup reference to xml:specialAttrs) is needed when include.xml is not in the same directory as main.xml and is referenced via a relative path.

E.g. <xi:include href="../include.xml" parse="xml" xpointer="xpointer(/elements/element)"/>

In this case, the included elements will have an extra attribute added (xml:base): <element name="element1" foo="foo1" xml:base="../include.xml">Text 1: This content is included from another file.</element>

xml_parse.py

#!/usr/bin/env python3

import os
import lxml
from lxml import etree

print("Using lxml version {0}".format(lxml.__version__), end="\n\n")

tree = etree.parse("main.xml")
tree.xinclude()

# Apply transformations
if os.path.isfile("transform.xslt"):
    print("Applying transformation from transform.xslt")
    xslt = etree.parse("transform.xslt")
    transform = etree.XSLT(xslt)
    result = transform(tree)
    tree._setroot(result.getroot())

print(etree.tostring(tree, pretty_print=True).decode())

schema = etree.XMLSchema(etree.parse("schema.xsd")) # Load and parse the schema
if schema.validate(tree): # Validate
    print("XML is valid.")
else:
    print("XML is invalid!")
    for error in schema.error_log:
        print(error.message)

Below is the example output from my Ubuntu 20 machine:

bogey@machine:/opt/xml_schema$ python3 xml_parse.py
Using lxml version 6.0.2
Applying transformation from transform.xslt
<root xmlns:xi="http://www.w3.org/2001/XInclude">
<title>Main XML</title>
<elements>
<element name="main element" foo="main foo">This text is from main.xml</element>
<element name="element1" foo="foo1">Text 1: This content is included from another file.</element><element name="message99" foo="spam">Text 2: This content is included from another file.</element><element name="element3" foo="foo3">Text 3: This content is included from another file.</element>
</elements>
</root>

Traceback (most recent call last):
  File "/opt/xml_parse.py", line 20, in <module>
    schema = etree.XMLSchema(etree.parse("schema.xsd")) # Load and parse the schema
  File "src/lxml/xmlschema.pxi", line 90, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}attributeGroup', attribute 'ref': The QName value '{http://www.w3.org/XML/1998/namespace}specialAttrs' does not resolve to a(n) attribute group definition., line 15

bogey@machine:/opt/xml_schema$ pip install lxml==5.4.0
Defaulting to user installation because normal site-packages is not writeable
Collecting lxml==5.4.0
Downloading lxml-5.4.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.1/5.1 MB 12.2 MB/s eta 0:00:00
Installing collected packages: lxml
Attempting uninstall: lxml
Found existing installation: lxml 6.0.2
Uninstalling lxml-6.0.2:
Successfully uninstalled lxml-6.0.2
Successfully installed lxml-5.4.0

bogey@machine:/opt/xml_schema$ python3 xml_parse.py
Using lxml version 5.4.0
Applying transformation from transform.xslt
<root xmlns:xi="http://www.w3.org/2001/XInclude">
<title>Main XML</title>
<elements>
<element name="main element" foo="main foo">This text is from main.xml</element>
<element name="element1" foo="foo1">Text 1: This content is included from another file.</element><element name="message99" foo="spam">Text 2: This content is included from another file.</element><element name="element3" foo="foo3">Text 3: This content is included from another file.</element>
</elements>
</root>

XML is valid.

Output on Windows machine:

(venv310_win) PS C:\xml_schema> python .\xml_parse.py
Using lxml version 6.0.2
Applying transformation from transform.xslt
<root xmlns:xi="http://www.w3.org/2001/XInclude">
<title>Main XML</title>
<elements>
<element name="main element" foo="main foo">This text is from main.xml</element>
<element name="element1" foo="foo1">Text 1: This content is included from another file.</element><element name="message99" foo="spam">Text 2: This content is included from another file.</element><element name="element3" foo="foo3">Text 3: This content is included from another file.</element>
</elements>
</root>

XML is valid.

What's the deal? Any ideas would be appreciated. Thanks.

EDIT:

Windows

Python : sys.version_info(major=3, minor=11, micro=8, releaselevel='final', serial=0)
etree : (6, 0, 2, 0)
libxml used : (2, 11, 9)
libxml compiled : (2, 11, 9)
libxslt used : (1, 1, 39)
libxslt compiled : (1, 1, 39)

Linux

Python : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
etree : (6, 0, 0, 0)
libxml used : (2, 14, 4)
libxml compiled : (2, 14, 4)
libxslt used : (1, 1, 43)
libxslt compiled : (1, 1, 43)

  • Maybe there's a difference in the underlying libxml2 library. Check it with:
    import sys
    from lxml import etree
    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
    Commented Sep 26 at 15:19
  • On Windows libxml is 2.11.9, while on Linux I have 2.14.4. libxslt is 1.1.39 on Windows, while on Linux it's 1.1.43. Commented Sep 26 at 15:31
  • I get the same error on macOS. If I remove line 15 from schema.xsd, i.e. <xs:attributeGroup ref="xml:specialAttrs"/>, it runs and still produces the same output as with 5.4.0. Commented Sep 26 at 15:50
  • I need that line for the case where include.xml is in a different directory and is referenced via a relative path, e.g. <xi:include href="../include.xml" Commented Sep 26 at 16:02
  • Opened a bug with lxml as well (bugs.launchpad.net/lxml/+bug/2125776), but also created this post for faster answers, in case there's some quick fix :D Commented Sep 26 at 16:04

1 Answer

The right way

For security reasons, recent libxml2 versions require XML catalogs to resolve external resources. A custom catalog can be written as follows.

In catalog.xml, the entry name takes the schemaLocation value from the import in schema.xsd, and its uri attribute points to a local copy of the xsd file, which must be downloaded first. With this import in schema.xsd:

<xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/xml.xsd"/>

download the file and create the catalog:

wget "http://www.w3.org/2001/xml.xsd"

<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.0//EN"
                      "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="http://www.w3.org/2001/xml.xsd"
          uri="xml.xsd"/>
  <system systemId="http://www.w3.org/2001/xml.xsd"
          uri="xml.xsd"/>
  <uri name="http://www.w3.org/2001/xml.xsd"
          uri="xml.xsd"/>
</catalog>
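
A catalog like the one above could also be generated from Python. The following is only a sketch: the helper name write_catalog and its dict argument are made up for illustration and are not part of lxml or libxml2. It writes one system entry and one uri entry per mapping, mirroring the hand-written file.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(catalog_path, mappings):
    """Write an OASIS XML catalog (hypothetical helper).

    mappings: dict of {remote_url: local_file}. Each remote URL should be
    exactly the schemaLocation value used in the schema.
    """
    lines = ['<?xml version="1.0"?>', f'<catalog xmlns="{CATALOG_NS}">']
    for remote_url, local_file in mappings.items():
        # quoteattr escapes the value and adds the surrounding quotes
        name = quoteattr(remote_url)
        target = quoteattr(str(local_file))
        lines.append(f"  <system systemId={name} uri={target}/>")
        lines.append(f"  <uri name={name} uri={target}/>")
    lines.append("</catalog>")
    path = Path(catalog_path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For example, write_catalog("catalog.xml", {"http://www.w3.org/2001/xml.xsd": "xml.xsd"}) would produce entries equivalent to the system and uri lines shown above.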

The custom catalog.xml can be used with lxml as follows:

import os
import lxml
from lxml import etree

# Path to your XML Catalog file
catalog_file = "catalog.xml"
os.environ["XML_CATALOG_FILES"] = catalog_file

print("Using lxml version {0}".format(lxml.__version__), end="\n\n")

schema_tree = etree.parse("schema.xsd")
schema = etree.XMLSchema(etree=schema_tree)

tree = etree.parse("main.xml")
tree.xinclude()

# Apply transformations
if os.path.isfile("transform.xslt"):
    print("Applying transformation from transform.xslt")
    xslt = etree.parse("transform.xslt")
    transform = etree.XSLT(xslt)
    result = transform(tree)
    tree._setroot(result.getroot())

print(etree.tostring(tree, pretty_print=True).decode())

if schema.validate(tree): # Validate
    print("XML is valid.")
else:
    print("XML is invalid!")
    for error in schema.error_log:
        print(error.message)
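
The script above sets XML_CATALOG_FILES to a relative path, so it only works when run from the directory that contains catalog.xml. A slightly more defensive variant, sketched below, resolves the path first and fails early if the file is missing. The helper name configure_catalog is hypothetical, and the sketch assumes the path contains no spaces, since the variable can hold a whitespace-separated list of catalogs.

```python
import os
from pathlib import Path


def configure_catalog(catalog_path):
    """Point libxml2 at a catalog file via XML_CATALOG_FILES (hypothetical helper)."""
    catalog = Path(catalog_path).resolve()
    if not catalog.is_file():
        # Fail here rather than with a confusing schema error later on
        raise FileNotFoundError(f"catalog file not found: {catalog}")
    os.environ["XML_CATALOG_FILES"] = str(catalog)
    return catalog
```

It would be called once, for example configure_catalog("catalog.xml"), before the schema is parsed with etree.XMLSchema.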

Testing the catalog with xmllint

XML_CATALOG_FILES='catalog.xml' /home/lmc/Downloads/libxml2-v2.15.0/xmllint --noout --xinclude --schema schema.xsd main.xml 
main.xml validates

Running the script

python3.12 parse-so.py 
Using lxml version 6.0.0

Applying transformation from transform.xslt
<root xmlns:xi="http://www.w3.org/2001/XInclude">
[REDACTED]

XML is valid.

Alternative: edit xsd

This answer suggests removing schemaLocation from the xsd, but that does not fix the problem. Downloading a copy of xml.xsd and referencing it in schema.xsd does the trick:

wget "http://www.w3.org/2001/xml.xsd"

Change the import in schema.xsd to:

<xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
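
For larger schemas it may help to list which imports still point at remote locations and therefore need a local copy or a catalog entry. This is a rough sketch using the standard library's ElementTree rather than lxml, to keep it dependency-free; the function name find_remote_locations is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def find_remote_locations(xsd_path):
    """Return schemaLocation values of xs:import/xs:include that use http(s)."""
    root = ET.parse(xsd_path).getroot()
    remote = []
    for tag in ("import", "include"):
        for node in root.iter(f"{{{XSD_NS}}}{tag}"):
            location = node.get("schemaLocation", "")
            if location.startswith(("http://", "https://")):
                remote.append(location)
    return remote
```

Run against the original schema.xsd from the question, this should report the xml.xsd URL from the xs:import line; after switching to schemaLocation="xml.xsd" it should return an empty list.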

Note:
The latest xmllint tool from the libxml2 Linux package fails with the same error, so it's not an lxml bug:

/home/lmc/Downloads/libxml2-v2.15.0/xmllint --noout --xinclude --schema schema.xsd main.xml
I/O warning : failed to load "https://www.w3.org/2005/08/xml.xsd": No such file or directory
schema.xsd:3: element import: Schemas parser warning : Element '{http://www.w3.org/2001/XMLSchema}import': Failed to locate a schema at location 'https://www.w3.org/2005/08/xml.xsd'. Skipping the import.
schema.xsd:15: element attributeGroup: Schemas parser error : Element '{http://www.w3.org/2001/XMLSchema}attributeGroup', attribute 'ref': The QName value '{http://www.w3.org/XML/1998/namespace}specialAttrs' does not resolve to a(n) attribute group definition.
WXS schema schema.xsd failed to compile

It works when referencing a local xsd file:

<xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>

/home/lmc/Downloads/libxml2-v2.15.0/xmllint --noout --xinclude --schema schema.xsd main.xml 
main.xml validates
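
To guess in advance whether a given environment is affected, the libxml2 version that lxml reports can be compared against a threshold. The sketch below assumes, as a comment on this answer reports, that HTTP loading went away in libxml2 2.13.0. That threshold and the function name are assumptions, and a particular build may behave differently, so treat the result as a hint only.

```python
def http_loading_likely_unavailable(libxml_version, removed_in=(2, 13, 0)):
    """Heuristic: True if this libxml2 version probably cannot fetch http:// schemas.

    libxml_version is a tuple such as lxml's etree.LIBXML_VERSION.
    removed_in is an assumed threshold, not something queried from the library.
    """
    return tuple(libxml_version) >= tuple(removed_in)


# Typical use (requires lxml):
#   from lxml import etree
#   if http_loading_likely_unavailable(etree.LIBXML_VERSION):
#       print("Use a local xml.xsd or an XML catalog")
```

With the versions from the question, (2, 11, 9) on Windows gives False and (2, 14, 4) on Linux gives True, which matches the observed behavior.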

5 Comments

It works, indeed, if I use a local xml.xsd file.
You can accept the answer if that fixes the issue :-). I still think it's an lxml bug
Yeah, I applied this solution for now. Will wait to see if that bug I opened will get any attention. Might be a bug, might also be a bugfix for this, lol: bugs.launchpad.net/lxml/+bug/1234114 (although I tried to explicitly set no_network=False).
I agree, this looks like an lxml bug. Problems with loading the schema for the XML namespace often arise when people try to load a non-standard version from a non-standard location, but in this case (a) you're referencing something that's defined in the standard version, and (b) you're referencing it at a location where the standard version is found. So if anyone is loading a non-standard version from a non-standard location then it's lxml itself.
Sorry, forgot to update, libxml 2.13.0 removed HTTP support: discourse.gnome.org/t/libxml2-2-13-0-released/21529 My bug: gitlab.gnome.org/GNOME/libxml2/-/issues/990
