I'm parsing a number of text files in different formats: some are CSV, some are XML, and some are plain TXT. I have a case statement that checks whether a certain string is contained in the first 100 bytes of a file in order to identify it. I would like to replace this case statement with a hash-based solution because I will eventually cover 10 or more different file formats and want to avoid such a long case statement.

first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'

case first_bytes
when /camt.053.001/
   'camt053'
when /camt.052.001.08/
   'camt052'
when /"Auftragskonto";"Buchungstag";"Valutadatum";/
   'SpK'
end

I started with the following hash, but I'm not sure how to match against it.

file_types = {
    '/camt.053.001/' => 'camt053',
    '/camt.052.001/' => 'camt052',
    '/"Auftragskonto";"Buchungstag";"Valutadatum";/' => 'SpK'
}

file_types.keys.select { |key| first_bytes.match(key) }

This doesn't work. It produces an empty array.

  • While the answer below is accurate, it is also worth noting that your case statement returns a single value, e.g. "camt053", whereas your Hash approach will return an Array of the matching key(s) (Regexp), e.g. [/camt.053.001/]. Additionally, you should probably use String#match?, which simply returns true or false rather than constructing a MatchData object that you don't need. Commented Nov 26 at 19:59
  • This is correct. I'm not sure how best to get the single value. The following works but feels ugly: file_types[file_types.keys.select { |key| key.match(first_bytes) }.first] @engineersmnky Commented Nov 27 at 6:21
  • I suggest using an XML reader (for example, Nokogiri) when you want to extract and read values from an XML document. Commented Nov 27 at 6:56
  • Only some of the files are xml, many are csv and txt. I use an xml reader further down the line when I'm handling the xml files (and a csv reader for csv files) but at this point I need to identify what is what. Commented Nov 27 at 8:47

2 Answers

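You are using strings as keys in your hash instead of regular expression objects. String#match converts a String argument into a Regexp, so the key '/camt.053.001/' becomes a pattern that requires literal slashes around the text, and that never appears in first_bytes.
Here is the corrected hash declaration: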

file_types = {
    /camt.053.001/ => 'camt053',
    /camt.052.001/ => 'camt052',
    /"Auftragskonto";"Buchungstag";"Valutadatum";/ => 'SpK'
}
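
As the comments point out, the case statement returns a single code, while `select` on the keys returns an array of patterns. One way to get the single value is `Hash#find` combined with `String#match?`. This is only a minimal sketch: the constant `FILE_TYPE_PATTERNS` and the helper `detect_file_type` are names made up for illustration, and the dots are escaped here (an addition to the patterns above) so that they only match a literal ".".

```ruby
# Regexp keys map to type codes. Escaped dots match a literal "." only.
FILE_TYPE_PATTERNS = {
  /camt\.053\.001/ => 'camt053',
  /camt\.052\.001/ => 'camt052',
  /"Auftragskonto";"Buchungstag";"Valutadatum";/ => 'SpK'
}.freeze

# Hypothetical helper: returns the code of the first pattern that matches,
# or nil when no pattern matches.
def detect_file_type(first_bytes)
  # find yields [pattern, code] pairs and returns the first matching pair.
  pair = FILE_TYPE_PATTERNS.find { |pattern, _code| first_bytes.match?(pattern) }
  pair&.last
end

first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'
puts detect_file_type(first_bytes) # prints "camt053"
```

Because hashes keep insertion order, the first matching entry wins, which mirrors the top-to-bottom behaviour of the original case statement.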

I would use a single regex together with a hash whose keys are plain strings rather than regexes.
The text returned by the single regex match is then used as the key for the code lookup.

Demo : https://www.jdoodle.com/ia/1NBq

# Plain-string keys: each key must equal the exact text that one
# alternative of the regex below matches.
file_types = {
    'camt.053.001' => 'camt053',
    'camt.052.001.08' => 'camt052',
    '"Auftragskonto";"Buchungstag";"Valutadatum";' => 'SpK'
}

RxFileTypes = /(?:camt\.053\.001|camt\.052\.001\.08|"Auftragskonto";"Buchungstag";"Valutadatum";)/
first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'

match = first_bytes.match( RxFileTypes )

# match[0] is the matched text, which doubles as the hash key.
if match
   puts "File type code is :  " + file_types[ match[0] ]
end
    

Output

File type code is :  camt053    

Combine all the file type signatures into a single regex, separated by alternation.
The matched text is then the key into the file type code hash.

(?:
   camt \. 053 \. 001
 | camt \. 052 \. 001 \. 08
 | "Auftragskonto";"Buchungstag";"Valutadatum";
)

For additional file types, just add another alternative to the regex and a matching entry to the file type code hash.
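
To avoid keeping the regex and the hash in sync by hand, the alternation could also be generated from the hash keys with `Regexp.union`, which escapes plain strings automatically. This is a sketch of that variation, not part of the demo above; `FILE_TYPE_CODES`, `FILE_TYPE_RX` and `file_type_code` are hypothetical names.

```ruby
# Plain-string signatures mapped to type codes.
FILE_TYPE_CODES = {
  'camt.053.001' => 'camt053',
  'camt.052.001.08' => 'camt052',
  '"Auftragskonto";"Buchungstag";"Valutadatum";' => 'SpK'
}.freeze

# Regexp.union escapes each string and joins the results with "|".
FILE_TYPE_RX = Regexp.union(FILE_TYPE_CODES.keys)

# Hypothetical helper: String#[] with a Regexp returns the matched text
# (or nil), which is then used as the hash key.
def file_type_code(first_bytes)
  FILE_TYPE_CODES[first_bytes[FILE_TYPE_RX]]
end

first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'
puts file_type_code(first_bytes) # prints "camt053"
```

With this approach a new file type needs only one new hash entry. Alternatives are tried in order, so if one signature is a prefix of another, the longer one should come first in the hash.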
