Understand binary vs text data

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the encoding category.

Last Updated: 2021-05-16

I noticed that I got an error when writing a file with

File.open(filename, "w")

but it was fine when I used the binary variant:

File.open(filename, "wb")

The nature of the error was something like:

Encoding::UndefinedConversionError: "\xF0" from ASCII-8BIT to UTF-8

and I encountered it when trying to download PDF files stored in ActiveStorage.
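
The failure can be reproduced outside Rails with a minimal sketch. It assumes that an explicit UTF-8 external encoding ("w:UTF-8") stands in for the encoding defaults a Rails app sets up; the byte string and file name are made up for illustration:

```ruby
require "tmpdir"

# Bytes as they might arrive from a binary download. String#b tags them
# as ASCII-8BIT (binary); "\xF0" has no meaning as text on its own.
pdf_bytes = "%PDF-1.4\n\xF0\x9F".b

error = nil
written_size = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "demo.pdf") # hypothetical file name

  begin
    # Text mode with a UTF-8 external encoding: Ruby tries to transcode
    # the binary string to UTF-8 before writing it.
    File.open(path, "w:UTF-8") { |f| f.write(pdf_bytes) }
  rescue Encoding::UndefinedConversionError => e
    error = e
  end

  # Binary mode: the bytes are written untouched.
  File.open(path, "wb") { |f| f.write(pdf_bytes) }
  written_size = File.size(path)
end

puts error.class
puts written_size == pdf_bytes.bytesize
```

The text-mode write should fail with a conversion error like the one quoted above, while the binary-mode write stores every byte.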

What is the difference between w and wb?

The w mode opens a file as text and wb opens it as binary. The two kinds of file may look the same on the surface, but they encode data differently. While both binary and text files contain data stored as a series of bits (binary values of 1s and 0s), the bits in text files represent characters, while the bits in binary files represent whatever the file format defines (pixels, compressed data, numbers and so on).

Clearly, on a fundamental file-system level, every file is just a collection of bytes and could therefore be viewed as binary data. On the other hand, a distinction between "text" and "non-text" (hereafter: "binary") data seems helpful for programs like grep or diff, if only not to mess up the output of your terminal emulator when run on binary files that cannot be represented easily.

A file is called a "text file" if its content consists of an encoded sequence of Unicode code points (e.g. UTF-8, UTF-16, ASCII).

How can one quickly tell binary and text files apart? The trick is that binary data usually contains lots of null bytes (00), whereas a text file usually contains none.

Let's use xxd to show textual representations of the bytes, starting with a photo file:

$ xxd 1px-orange.png
00000000: 8950 4e47 0d0a 1a0a 0000 000d 4948 4452  .PNG........IHDR
00000010: 0000 0001 0000 0001 0103 0000 0025 db56  .............%.V
00000020: ca00 0000 0350 4c54 45ff 4d00 5c35 387f  .....PLTE.M.\58.
00000030: 0000 0001 7452 4e53 ccd2 3456 fd00 0000  ....tRNS..4V....
00000040: 0a49 4441 5478 9c63 6200 0000 0600 0336  .IDATx.cb......6
00000050: 377c a800 0000 0049 454e 44ae 4260 82    7|.....IEND.B`.

vs. a simple text file:

$ xxd robots.txt
00000000: 2320 5365 6520 6874 7470 3a2f 2f77 7777  # See http://www
00000010: 2e72 6f62 6f74 7374 7874 2e6f 7267 2f77  .robotstxt.org/w
00000020: 632f 6e6f 726f 626f 7473 2e68 746d 6c20  c/norobots.html
00000030: 666f 7220 646f 6375 6d65 6e74 6174 696f  for documentatio
... omitted...
00000180: 0a53 6974 656d 6170 3a20 6874 7470 3a2f  .Sitemap: http:/
00000190: 2f77 7777 2e6f 7862 7269 6467 656e 6f74  /www.oxbridgenot
000001a0: 6573 2e63 6f2e 756b 2f73 6974 656d 6170  es.co.uk/sitemap
000001b0: 312e 786d 6c2e 677a 0a                   1.xml.gz.

(xxd shows bytes just as hexdump does. xxd also gives an ASCII column.)

The presence of null bytes is used as a heuristic for detecting binary data. Roughly, the rule used by many programs is: a file is very likely to be a "text file" if the first 1024 bytes of its content do not contain any NUL bytes. This heuristic is imperfect - e.g. UTF-8 may legally contain NUL bytes (the encoding scheme does not forbid them), and UTF-16 text is full of them, because every ASCII-range character is stored with a zero byte.
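
The heuristic above can be sketched in a few lines of Ruby. The helper name probably_text? and its sample_size parameter are made up for illustration, and 1024 simply mirrors the figure quoted above:

```ruby
# Hypothetical helper: guesses "text" when the first `sample_size` bytes
# contain no NUL byte. Accepts nil (e.g. from reading an empty file).
def probably_text?(bytes, sample_size: 1024)
  sample = bytes.to_s.byteslice(0, sample_size)
  !sample.b.include?("\x00".b)
end

png_start = "\x89PNG\r\n\x1A\n\x00\x00\x00\rIHDR".b # first row of the xxd output
robots    = "# See http://www.robotstxt.org/\n"
utf16     = "hi".encode("UTF-16LE")                 # bytes: 68 00 69 00

puts probably_text?(png_start) # false
puts probably_text?(robots)    # true
puts probably_text?(utf16)     # false, although it is text
```

For a real file, the sample could come from something like File.binread(path, 1024). The UTF-16 case shows why this is only a heuristic.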

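Aside: notice the text "PNG" present in the PNG file? This is due to a magic number, which binary files use to signal their type. A PNG file starts with the eight-byte signature 89 50 4e 47 0d 0a 1a 0a, and the bytes 50 4e 47 within it spell "PNG" in ASCII.

On modern systems neither binary nor text files need an end-of-file marker, because the file system records each file's length. The EOF character (ASCII code 26, Ctrl-Z) is a legacy CP/M and DOS convention for text files, which some Windows text-mode APIs still honor.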
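
A magic-number check can be sketched like this. The full PNG signature is eight bytes long, and the helper name png? is hypothetical:

```ruby
# The 8-byte PNG signature, matching the start of the xxd output above.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Hypothetical helper: true when the bytes start with the PNG signature.
def png?(bytes)
  bytes.b.start_with?(PNG_SIGNATURE)
end

# First row of the xxd dump, turned back into raw bytes.
first_row = ["89504e470d0a1a0a0000000d49484452"].pack("H*")

puts png?(first_row)          # true
puts png?("# See http://www") # false
```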

Text file pros:

A small error in a text file can be recognized and fixed when seen, whereas a small error in a binary file can corrupt the whole file and is not easy to detect.

Text files are human readable and it's easy to diff/grep/otherwise work with them.

Binary pros

What exactly does opening a file in binary mode do?
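
In Ruby, one observable effect is on the encoding tag of the strings you read: binary mode yields ASCII-8BIT (raw bytes), while text mode tags the string with an external encoding. Binary mode also switches off newline conversion, which matters mainly on Windows. A minimal sketch, with a made-up file name and an explicit "r:UTF-8" standing in for a typical text-mode default:

```ruby
require "tmpdir"

text_encoding = nil
binary_encoding = nil
same_bytes = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.bin") # hypothetical file name
  File.binwrite(path, "caf\xC3\xA9\n")

  text   = File.open(path, "r:UTF-8") { |f| f.read }
  binary = File.open(path, "rb") { |f| f.read }

  text_encoding   = text.encoding   # UTF-8
  binary_encoding = binary.encoding # ASCII-8BIT, also known as BINARY
  # On Unix-like systems the bytes are identical; only the tag differs.
  same_bytes = text.bytes == binary.bytes
end

puts text_encoding
puts binary_encoding
puts same_bytes
```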

How to work with binary strings (e.g. read from files) in Ruby?

String#unpack takes a format string of directives that describe the layout of the bytes, e.g. "expect 8 bits per integer", and returns an array:

# `CC` directive: two 8-bit unsigned integers (lowercase `c` would be signed)
"\x34\x12".unpack('CC') # => [0x34, 0x12]

# `S` directive: one 16-bit unsigned int in native byte order
# (little-endian on x86 and ARM; use `v` to force little-endian)
"\x34\x12".unpack('S') # => [0x1234]

How to work with binary headers and parse them (e.g. in Ruby)?

We specify the byte size/type for each of the entries in the file specification, then unpack the start of the binary string accordingly:

# Define file header structure

FileHeader = Struct.new(
  :bfType,
  :bfSize,
  :bfReserved1,
  :bfReserved2,
  :bfOffbits
)

File.open("lena512.bmp", "rb") do |file|

  # Read 14 bytes, the size of the BMP file header
  binary = file.read(14)

  # Decode the binary data. Directives used:
  # A2 - string of 2 bytes, the signature "BM"
  # V  - bfSize, 4 bytes unsigned, little-endian
  # v  - bfReserved1, 2 bytes unsigned, little-endian
  # v  - bfReserved2, 2 bytes unsigned, little-endian
  # V  - bfOffbits, 4 bytes unsigned, little-endian
  # (`L` and `S` would use the machine's native byte order instead)

  data = binary.unpack("A2 V v v V")
  # returns an array with 5 entries
  file_header = FileHeader.new(*data)
end
# The block returns file_header, which inspects as:
#<struct FileHeader
  bfType="BM",
  bfSize=263222,
  bfReserved1=0,
  bfReserved2=0,
  bfOffbits=1078>
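
To try the same parsing without a BMP file at hand, a header can be built with Array#pack and decoded again. This is only a sketch: the struct name FileHeaderSketch is made up, the values are the ones shown above, and it uses the explicitly little-endian directives V and v because BMP stores its integers in little-endian order:

```ruby
FileHeaderSketch = Struct.new(
  :bfType, :bfSize, :bfReserved1, :bfReserved2, :bfOffbits
)

# a2 = 2-byte string, V = 32-bit unsigned LE, v = 16-bit unsigned LE
FORMAT = "a2 V v v V"

# Build the 14 header bytes from known values instead of reading a file.
raw = ["BM", 263222, 0, 0, 1078].pack(FORMAT)

header = FileHeaderSketch.new(*raw.unpack(FORMAT))

puts raw.bytesize     # 14
puts header.bfType    # BM
puts header.bfSize    # 263222
puts header.bfOffbits # 1078
```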

What happens if you get the sign wrong?

# Encode value as signed 2-byte integer with `s`
# Decode _un_signed with `S`
[-1024].pack("s").unpack("S")
# Wrong result! The same 16 bits read as unsigned give 65536 - 1024
# => [64512]

# Use correct decoding
[-1024].pack("s").unpack("s")
# => [-1024]

References