I have a file that looks like an XML file (it even has a <?xml version="1.0" encoding="utf-8"?> header). But when I open the file in Notepad++, there are random letters and some NULL characters before the header.

[Screenshot: file header as shown in Notepad++]

Further down the file, after the standard XML content, there are a lot of random characters (which I assume, based on the origin of my file, are the data from sensors).

[Screenshot: content after the XML body]

I'm trying to open it in Python to work with the data, but I can't find a way to do it. Normally, for an XML file, I would open it with ElementTree. I've tried to read the binary data with open, but the buffer returns a stream of data that only shows the characters as displayed in Notepad++.

I don't know if there is a solution for my problem, or if I need a dictionary to translate these random chars.

I would appreciate any help anyone could give me!

I've tried using ElementTree, Python's open function, and struct.unpack.

  • How was the file transferred? It looks like the app that transferred the file did not remove (or added) characters. Sometimes there are SOX (start character) and EOX (end character) markers that are used to find the beginning and end of the data. You would need to remove the extra characters or get the file from the source without them.
    – jdweng
    Commented Apr 10, 2024 at 18:54
  • @jdweng This data came directly from a data acquisition software. And it opens normally in this software, that's what I think is strange. Commented Apr 11, 2024 at 11:12
  • You are probably reading something that wasn't meant to be read outside the app. Binary Data is usually proprietary and without the documentation for the data you shouldn't be attempting to read the data.
    – jdweng
    Commented Apr 11, 2024 at 12:41

1 Answer


I suggest trying xml.parsers.expat, which should work after you remove the leading NUL characters. Consider the following example:

import xml.parsers.expat

enveloped_xml = '\x00\x00\x00<?xml version="1.0"?><outer outattr="outval"><inner inattr="inval">data</inner></outer>\xDE\xAD\xBE\xEF'

def start_element(name, attrs):
    print('Start element:', name, attrs)
def end_element(name):
    print('End element:', name)
def char_data(data):
    print('Character data:', repr(data))

p = xml.parsers.expat.ParserCreate()

p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

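try:
    # True marks this as the final chunk of input
    p.Parse(enveloped_xml.lstrip('\x00'), True)
except xml.parsers.expat.ExpatError:
    # raised at the trailing garbage, after the XML itself has been handled
    pass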

This gives the output:

Start element: outer {'outattr': 'outval'}
Start element: inner {'inattr': 'inval'}
Character data: 'data'
End element: inner
End element: outer

The code is based on the Example from the docs. Note that enveloped_xml is a str, so if you have bytes you need to .decode them first. The parser raises an error at the garbage after the XML, but only after the proper XML has been processed.
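
If the goal is to get an ElementTree object and also keep the binary payload, one possible approach is to work on the raw bytes: locate the XML declaration, locate the root element's closing tag, parse only the slice in between, and keep whatever comes before and after. The sketch below is only an illustration under assumptions: split_enveloped_xml is a hypothetical helper name, the root tag name ("outer" here, matching the example above) has to be known in advance, and the closing tag is assumed not to occur by chance inside the trailing binary data. How to interpret the remaining bytes still depends on the acquisition software's format.

```python
import xml.etree.ElementTree as ET


def split_enveloped_xml(raw: bytes, root_tag: str):
    """Split raw bytes into (prefix, parsed XML root, suffix).

    Hypothetical helper: assumes the XML starts at the first b"<?xml"
    and ends at the last closing tag of the given root element.
    """
    start = raw.find(b"<?xml")
    if start == -1:
        raise ValueError("no XML declaration found")

    closing = f"</{root_tag}>".encode("ascii")
    end = raw.rfind(closing)
    if end < start:  # also covers end == -1 (closing tag missing)
        raise ValueError("closing root tag not found after the declaration")
    end += len(closing)

    # ElementTree accepts bytes and honours the declared encoding
    root = ET.fromstring(raw[start:end])
    return raw[:start], root, raw[end:]


# Made-up sample data imitating the layout described in the question
sample = (
    b"AB\x00\x00"
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<outer outattr="outval"><inner inattr="inval">data</inner></outer>'
    b"\xde\xad\xbe\xef"
)

prefix, root, suffix = split_enveloped_xml(sample, "outer")
print(root.tag, root.find("inner").text)
print(prefix, suffix)
```

With a real file, raw would come from open(path, "rb").read(), and suffix would hold the bytes that follow the XML, which can then be examined separately (for example with struct, once the layout is known).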

  • I'll try to work it out from your answer, and if I succeed, I'll accept it as a solution. Thanks so much for your help!!! Commented Apr 11, 2024 at 11:21
