Replace a long case statement involving regex with a hash

Question

I'm parsing a number of text files of different formats. Some are csv, some are xml and some even txt. I have a case statement that checks if a certain string is contained in the first 100 bytes of a file in order to identify it. I would like to replace this case statement with a hash-based solution because in the end I will cover 10 or more different file formats and I would like to avoid such a long case statement.

first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'

case first_bytes
when /camt.053.001/
   'camt053'
when /camt.052.001.08/
   'camt052'
when /"Auftragskonto";"Buchungstag";"Valutadatum";/
   'SpK'
end

I started with the following hash but I'm not sure how to match this.

file_types = {
    '/camt.053.001/' => 'camt053',
    '/camt.052.001/' => 'camt052',
    '/"Auftragskonto";"Buchungstag";"Valutadatum";/' => 'SpK'
}

file_types.keys.select { |key| first_bytes.match(key) }

This doesn't work. It produces an empty array.

The answer below is accurate, but note that your case statement returns a single value, e.g. "camt053", whereas your Hash approach returns an Array of the matching key(s) (Regexp), e.g. [/camt.053.001/]. You should also probably use String#match?, which simply returns true or false rather than constructing a MatchData object that you don't need.
– engineersmnky, Commented Nov 26 at 19:59
@engineersmnky This is correct. I'm not sure how best to get the single value. The following works but feels ugly: file_types[file_types.keys.select { |key| key.match(first_bytes) }.first]
– Ricky883249, Commented Nov 27 at 6:21
I suggest using an XML reader (for example, Nokogiri) when you want to extract and read values from an XML document.
– spickermann, Commented Nov 27 at 6:56
Only some of the files are xml; many are csv and txt. I use an xml reader further down the line when I'm handling the xml files (and a csv reader for csv files), but at this point I need to identify what is what.
– Ricky883249, Commented Nov 27 at 8:47

Jupit90 · Accepted Answer · 2025-11-26 18:41:23Z

3

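Your hash keys are strings, not regular expression objects. String#match converts a string argument into a Regexp, so the slashes in '/camt.053.001/' are treated as literal characters to search for, and nothing matches.
Use regex literals as the keys instead: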

file_types = {
    /camt.053.001/ => 'camt053',
    /camt.052.001/ => 'camt052',
    /"Auftragskonto";"Buchungstag";"Valutadatum";/ => 'SpK'
}
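
With Regexp keys in place, the select from the question still returns an array of keys rather than a single type code. Here is a minimal sketch of one way to get the single value, assuming the first matching entry should win, as in the original case statement. The helper name detect_file_type is hypothetical, and the dots are escaped here so that they match literally:

```ruby
file_types = {
  /camt\.053\.001/ => 'camt053',
  /camt\.052\.001/ => 'camt052',
  /"Auftragskonto";"Buchungstag";"Valutadatum";/ => 'SpK'
}

# Hypothetical helper: returns the code of the first pattern that matches,
# or nil when no pattern matches.
def detect_file_type(first_bytes, file_types)
  # Hash#find yields [key, value] pairs in insertion order and returns
  # the first pair for which the block is truthy (or nil).
  _pattern, code = file_types.find { |pattern, _code| first_bytes.match?(pattern) }
  code
end

first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'

puts detect_file_type(first_bytes, file_types)          # prints camt053
puts detect_file_type('plain text', file_types).inspect # prints nil
```

Because String#match? does not build a MatchData object, it fits this yes/no check; the hash keeps insertion order, so more specific patterns can be listed first.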


sln · Accepted Answer · 2025-11-26 22:20:35Z

I would use a single regex to find the match and a hash with plain string keys, so that
the text matched by the regex is the lookup key for the type code.

Demo : https://www.jdoodle.com/ia/1NBq

file_types = {
    'camt.053.001' => 'camt053',
    'camt.052.001.08' => 'camt052',  # key must equal the text matched by RxFileTypes
    '"Auftragskonto";"Buchungstag";"Valutadatum";' => 'SpK'
}

RxFileTypes = /(?:camt\.053\.001|camt\.052\.001\.08|"Auftragskonto";"Buchungstag";"Valutadatum";)/
first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'

match = first_bytes.match( RxFileTypes )

if match
   puts "File type code is :  " + file_types[ match[0] ]
end

Output

File type code is :  camt053

Combine the signatures of all file types into a single regex as alternatives.
The matched text is then the key into the file type code hash.

(?:
   camt \. 053 \. 001
 | camt \. 052 \. 001 \. 08
 | "Auftragskonto";"Buchungstag";"Valutadatum";
)

For additional file types, just add another alternative to the regex and a matching entry to the file type code hash.
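
To keep the regex and the hash from drifting apart as formats are added, the regex could also be generated from the hash keys. This is a minimal sketch, assuming every key is a literal string; Regexp.union escapes string arguments, so the dots match literally. The variable names are illustrative and not taken from the answer above:

```ruby
file_types = {
  'camt.053.001' => 'camt053',
  'camt.052.001.08' => 'camt052',
  '"Auftragskonto";"Buchungstag";"Valutadatum";' => 'SpK'
}

# Regexp.union escapes each string and joins them as alternatives.
rx_file_types = Regexp.union(file_types.keys)

first_bytes = '<?xml version="1.0" encoding="UTF-8"?><Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02"'

# String#[] with a Regexp returns the matched text, or nil if nothing matches.
matched = first_bytes[rx_file_types]
code = matched && file_types[matched]

puts "File type code is: #{code}" if code
```

With this arrangement a new format needs only one new hash entry, since the matched text is always one of the keys.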
