Using hexdump to figure out BOMs

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the encoding category.

Last Updated: 2021-05-16

After trying to read a CSV file and getting an encoding error I went down a rabbit hole about how to analyze and fix these issues.

1. Get the first n bytes (10) of a file

Why? Because those bytes contain useful hints you can analyze. Text files often start with an optional BOM (byte order mark): a short magic number at the very start of a text stream that signals to a program how the text is encoded.

$ hexdump -n10 Download.CSV
0000000 ef bb bf 22 44 61 74 65 22 2c

These bytes "ef bb bf" are the UTF-8 BOM.
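The same check can be sketched in Ruby. This is a minimal sketch, not a complete detector: `BOM_SIGNATURES` and `detect_bom` are names I made up for illustration, and the table only covers the UTF-8, UTF-16 and UTF-32 BOMs.

```ruby
# Hypothetical lookup table of BOM byte sequences.
# Longer signatures come first because the UTF-32LE BOM begins with the UTF-16LE BOM.
BOM_SIGNATURES = [
  ["UTF-32BE", "\x00\x00\xFE\xFF".b],
  ["UTF-32LE", "\xFF\xFE\x00\x00".b],
  ["UTF-8",    "\xEF\xBB\xBF".b],
  ["UTF-16BE", "\xFE\xFF".b],
  ["UTF-16LE", "\xFF\xFE".b],
].freeze

# Hypothetical helper: returns the name of the BOM the bytes start with, or nil.
def detect_bom(bytes)
  head = bytes.b # compare as raw bytes, whatever encoding the string is tagged with
  name, _signature = BOM_SIGNATURES.find { |_, signature| head.start_with?(signature) }
  name
end

first_bytes = "\xEF\xBB\xBF\"Date\",".b # the 10 bytes from the hexdump above
puts detect_bom(first_bytes).inspect     # "UTF-8"
puts detect_bom("abc").inspect           # nil
```

To run it against a real file, `File.binread("Download.CSV", 10)` should give the same first 10 bytes that `hexdump -n10` prints.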

2. Get the supposed encoding of a file and compare it to the bytes

$ file -I Download.CSV
Download.CSV: text/plain; charset=utf-8

According to Wikipedia, the UTF-8 BOM is represented by the hex byte sequence:

0xEF,0xBB,0xBF.

Is this the same as what I saw? Yes it is, except for the 0000000 at the start of the hexdump output. That 0000000 is not part of the CSV, though: it comes from hexdump and is just an offset.

# The offset column shows up even when reading just one byte
$ hexdump -n1 Download.CSV
0000000 ef
# hexdump shows 16 bytes per line;
# the left-hand column holds the offsets
$ hexdump -n33 Download.CSV
0000000 ef bb bf 22 44 61 74 65 22 2c 22 54 69 6d 65 22
0000010 2c 22 54 69 6d 65 5a 6f 6e 65 22 2c 22 4e 61 6d
0000020 65

The column of 0000000s on the left is the offset, in hexadecimal, of the first byte on that line (so 0000010 is byte 16 and 0000020 is byte 32). After it come up to 16 8-bit values per line.
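To convince myself of the offset arithmetic, here is a rough Ruby imitation of that layout. `hex_lines` is a hypothetical helper, and the real hexdump pads its columns differently, so treat this as a sketch of the idea only.

```ruby
# Hypothetical helper: a 7-digit hexadecimal offset, then up to `width` bytes per line.
def hex_lines(bytes, width = 16)
  bytes.bytes.each_slice(width).with_index.map do |chunk, row|
    offset = format("%07x", row * width) # offset of the first byte on this line
    hex = chunk.map { |byte| format("%02x", byte) }.join(" ")
    "#{offset} #{hex}"
  end
end

# The same 33 bytes as in the hexdump -n33 output above
sample = "\xEF\xBB\xBF\"Date\",\"Time\",\"TimeZone\",\"Name".b
puts hex_lines(sample)
# 0000000 ef bb bf 22 44 61 74 65 22 2c 22 54 69 6d 65 22
# 0000010 2c 22 54 69 6d 65 5a 6f 6e 65 22 2c 22 4e 61 6d
# 0000020 65
```

Each new line starts 16 bytes later, which is why the offsets step by 0x10.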

Back to my bug

The problem, it turned out, was that the Ruby CSV file opener does not handle BOMs automatically. (In general, BOMs break compatibility with code that expects plain ASCII.)
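One way around this is to ask Ruby to strip the BOM while opening the file, using the "BOM|UTF-8" encoding prefix in the mode string. The sketch below uses plain File rather than CSV to keep it dependency-free; `read_text` is a hypothetical helper and the temp file stands in for my real download. My understanding is that CSV accepts the same kind of mode string, but that is an assumption worth checking against your CSV version.

```ruby
require "tempfile"

# Hypothetical helper: read a whole file using the given mode string.
def read_text(path, mode)
  File.open(path, mode, &:read)
end

Tempfile.create("bom-mode") do |tmp|
  tmp.write("\u{FEFF}\"Date\",\"Time\"\n") # a CSV header preceded by a UTF-8 BOM
  tmp.flush

  plain    = read_text(tmp.path, "r:UTF-8")
  stripped = read_text(tmp.path, "r:BOM|UTF-8")

  puts plain.start_with?("\u{FEFF}")    # true: the BOM is still part of the text
  puts stripped.start_with?("\u{FEFF}") # false: Ruby consumed it while opening
end
```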

Here is how to write a BOM in Ruby (U+FEFF is the BOM code point; encoded as UTF-8 it becomes the bytes ef bb bf):

File.write("bom.txt", "\u{FEFF}abc")
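As a quick sanity check, the same code point turns into different bytes depending on the encoding, which is why the BOM can double as an encoding hint. `hex_bytes` is a hypothetical helper for printing.

```ruby
# Hypothetical helper: show a string's bytes as space-separated hex.
def hex_bytes(string)
  string.bytes.map { |byte| format("%02x", byte) }.join(" ")
end

bom = "\u{FEFF}"
puts hex_bytes(bom)                    # ef bb bf
puts hex_bytes(bom.encode("UTF-16BE")) # fe ff
puts hex_bytes(bom.encode("UTF-16LE")) # ff fe
```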

There is also a newer IO method (added in Ruby 2.7) for dealing with it:

IO#set_encoding_by_bom → encoding or nil

"Checks if IO starts with a BOM, and then consumes it and sets the external encoding. Returns the result encoding if found, or nil. If IO is not binmode or its encoding has been set already, an exception will be raised."
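A minimal sketch of how I would use it, assuming Ruby 2.7 or later. `read_skipping_bom` is a hypothetical helper; note the file is opened with "rb", since the method raises on an IO that is not in binmode.

```ruby
require "tempfile"

# Hypothetical helper: open in binary mode and let Ruby consume a BOM if there is one.
def read_skipping_bom(path)
  File.open(path, "rb") do |io|
    detected = io.set_encoding_by_bom # an Encoding, or nil when no BOM was found
    [detected, io.read]
  end
end

Tempfile.create("bom-demo") do |tmp|
  tmp.write("\u{FEFF}abc")
  tmp.flush

  detected, text = read_skipping_bom(tmp.path)
  puts detected.inspect # #<Encoding:UTF-8>
  puts text.inspect     # "abc"
end
```

When no BOM is found the encoding is left untouched, so the content comes back tagged as binary and you would still have to pick an encoding yourself.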

Be aware that two strings in the same program can diverge in encodings

I ran into further issues because my input was a StringIO and not a true file:

contents = stringio.read.sub("\xEF\xBB\xBF", '')
=> Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

Indeed, this shows that for an operation involving two strings, the encodings of both must be compatible. Here they are not:

"\xEF\xBB\xBF".encoding
=> #<Encoding:UTF-8>

stringio.read.encoding
=> #<Encoding:ASCII-8BIT>

# no match!!

Here's a fix:

stringio.set_encoding("utf-8")
contents = stringio.read.sub("\xEF\xBB\xBF", '')
=> works
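Here is the whole thing as a self-contained sketch. The StringIO below is built by hand from a binary-tagged string to stand in for my downloaded file, and `strip_utf8_bom` is a hypothetical helper wrapping the fix.

```ruby
require "stringio"

BOM = "\xEF\xBB\xBF" # tagged UTF-8, like any literal in a UTF-8 source file

# Hypothetical helper: retag the StringIO as UTF-8, then drop the BOM bytes.
def strip_utf8_bom(io)
  io.set_encoding("UTF-8")
  io.read.sub(BOM, "")
end

# Binary-tagged content, standing in for the downloaded CSV
raw = StringIO.new("\xEF\xBB\xBF\"Date\",\"Time\"\n".b)

begin
  raw.read.sub(BOM, "") # binary string vs UTF-8 pattern
rescue Encoding::CompatibilityError => e
  puts e.class # Encoding::CompatibilityError
end

raw.rewind # the failed attempt already consumed the content
puts strip_utf8_bom(raw) # "Date","Time"
```

The bytes never change here; only the encoding label on the string does, and that label is what the compatibility check looks at.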

Resources