Understand seeking

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the unix category.

Last Updated: 2022-05-26

This is an interactive exploration of seeking:

File.open("test.txt", "w:UTF-8") do |f|
  f.write ""
end
File.open("test.txt").size
=> 0

File.open("test.txt", "w:UTF-8") do |f|
  f.write "a"
end

# So is file size the number of chars or the number of bytes? Let's do some
# experiments to figure out which.
puts File.open("test.txt", "r:UTF-8").size
=> 1

File.open("test.txt", "w:UTF-8") do |f|
  f.write "test \u00A9 foo"
end

puts File.open("test.txt", "r:UTF-8").size
=> 11 # even though "test \u00A9 foo".length is 10 (length-wise the \u00A9 counts as 1, but size-wise as 2)

# Therefore File.size gives bytes
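The same comparison can be made in one go. This is a minimal sketch that assumes a throwaway Tempfile instead of test.txt; the variable names are just for illustration.

```ruby
require "tempfile"

text = "test \u00A9 foo"

# Write the string to a temporary file and measure it on disk.
file_size = Tempfile.create("size-demo") do |f|
  f.write(text)
  f.flush
  File.size(f.path)
end

char_count = text.length    # counts characters
byte_count = text.bytesize  # counts bytes in the string's encoding (UTF-8 here)

puts "chars: #{char_count}, bytes: #{byte_count}, file size: #{file_size}"
```

The file size lines up with String#bytesize, not String#length.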

We saw the variable length encoding above.

Let's look closer at file bytes:

$ hexdump test.txt
0000000 74 65 73 74 20 c2 a9 20 66 6f 6f
000000b

Hexdump shows 11 bytes, with the © stored as the two bytes c2 a9. No BOM (byte order mark) was added. This is normal for UTF-8: its byte order is fixed, so a BOM is optional and usually omitted.
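To check for a BOM without reading hexdump output, you can compare the first three bytes of the file against EF BB BF. A rough sketch; utf8_bom? is a hypothetical helper of mine, not something Ruby ships with, and the second write prepends U+FEFF by hand to show what a BOM-carrying file looks like.

```ruby
require "tempfile"

# Hypothetical helper: true if the file starts with the three bytes EF BB BF.
def utf8_bom?(path)
  File.binread(path, 3) == "\xEF\xBB\xBF".b
end

without_bom, with_bom = Tempfile.create("bom-demo") do |f|
  # Plain UTF-8 text, as written in the experiment above.
  File.write(f.path, "test \u00A9 foo")
  plain = utf8_bom?(f.path)

  # Same text, but prefixed with U+FEFF, which UTF-8 encodes as EF BB BF.
  File.write(f.path, "\uFEFFtest \u00A9 foo")
  [plain, utf8_bom?(f.path)]
end

puts "plain file has BOM: #{without_bom}"
puts "prefixed file has BOM: #{with_bom}"
```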

Encoding::UTF_8

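File.open("test.txt", "w:utf-16le") do |f|
  f.write "test \u00A9 foo"
end

# kaboom... reading an ASCII-incompatible encoding such as UTF-16 needs binary mode
puts File.open("test.txt", "r:utf-16le").size
=> ArgumentError: ASCII incompatible encoding needs binmode

# now works with `rb`
puts File.open("test.txt", "rb:utf-16le").size
=> 20
# encoding affects size! 10 chars at 2 bytes each = 20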
File.open("test2.txt", "w:utf-8") do |f|
  f.write "test \u02A0 foo"
end
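The effect of the encoding on size can also be seen without touching the disk, by transcoding the string and counting bytes. A small sketch; the choice of UTF-32LE as a third encoding is mine, just to extend the comparison.

```ruby
text = "test \u00A9 foo"

# Transcode the same 10 characters and count the resulting bytes.
sizes = %w[UTF-8 UTF-16LE UTF-32LE].map do |name|
  [name, text.encode(name).bytesize]
end.to_h

sizes.each { |name, bytes| puts "#{name}: #{bytes} bytes" }
```

UTF-16LE spends 2 bytes on each of these characters and UTF-32LE spends 4, while UTF-8 only spends the extra byte on the ©.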

I opened Vim, typed test foo, and saved it as test3.txt. Running hexdump -c shows that a newline was added (even though I never typed one):

0000000   t   e   s   t       f   o   o  \n
0000009

This is expected: Unix tools treat the newline as a line terminator, so a text file should end with one. Vim did the right thing; some tools misbehave if it is missing.
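Checking for that final newline is itself a small seeking exercise: jump to one byte before the end and read it. A sketch under my own naming; ends_with_newline? is a hypothetical helper.

```ruby
require "tempfile"

# Hypothetical helper: true if the file's last byte is a newline.
def ends_with_newline?(path)
  return false if File.zero?(path)

  File.open(path, "rb") do |f|
    f.seek(-1, IO::SEEK_END) # position on the last byte
    f.read(1) == "\n"
  end
end

vim_style, ruby_style = Tempfile.create("newline-demo") do |f|
  File.write(f.path, "test foo\n") # what Vim saved
  with_newline = ends_with_newline?(f.path)

  File.write(f.path, "test foo")   # what f.write produces by itself
  [with_newline, ends_with_newline?(f.path)]
end

puts "Vim-style file ends with newline: #{vim_style}"
puts "Ruby-written file ends with newline: #{ruby_style}"
```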

How many bytes does the © character above take up?

File.open("test4.txt", "w:utf-8") do |f|
  # drop the ©, but preserve the space on either side of it
  f.write "test  foo"
end
=> 9
# => originally 11, so the © takes up 2 bytes.

This shows that UTF-8 does not take up 2 bytes for ASCII characters but only 1. I.e. it is variable width.
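To see the variable width directly, here is a quick sketch that prints the byte count of a few characters. The sample characters are my own picks, one for each possible UTF-8 width.

```ruby
samples = {
  "a" => "ASCII letter",
  "\u00A9" => "copyright sign",
  "\u20AC" => "euro sign",
  "\u{1F600}" => "emoji",
}

widths = samples.map do |char, label|
  # Show each byte as two hex digits, like hexdump does.
  hex = char.bytes.map { |b| format("%02x", b) }.join(" ")
  puts "#{label}: #{char.bytesize} byte(s) [#{hex}]"
  char.bytesize
end
```

Every one of these has a length of 1, but the byte sizes run from 1 to 4.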

Seek and you shall become

f = File.open("test3.txt") # the one created in VIM
f.read
=> "test foo\n"

# is it possible to seek past the length?
f.seek(100)
=> 0
# YES it is. But what happens when you read?
f.read
# nothing comes out
=> ""

# can you go back?
f.seek(1)
=> 0
f.read
# yes you can. The effect of seeking 1 byte from the start is to lop off the first byte
=> "est foo\n"
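A few extra probes make the past-the-end behaviour easier to see. This is only a sketch, assuming a Tempfile with the same 9 bytes as test3.txt; the variable names are mine.

```ruby
require "tempfile"

pos_after_seek, at_eof, data, chunk, size = Tempfile.create("eof-demo") do |f|
  f.write("test foo\n")
  f.flush

  f.seek(100) # well past the 9 bytes in the file
  [
    f.pos,             # the offset is wherever we asked to go
    f.eof?,            # nothing left to read from here
    f.read,            # read with no length returns an empty string at EOF
    f.read(4),         # read with a length returns nil at EOF
    File.size(f.path), # seeking alone does not grow the file
  ]
end

puts "pos: #{pos_after_seek}, eof: #{at_eof}, read: #{data.inspect}, read(4): #{chunk.inspect}, size: #{size}"
```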

# what happens if you lop off part of a non-ASCII character?
f = File.open("test.txt") # the UTF-8 version (re-create it if it still holds the UTF-16LE data from earlier)
f.read
# it has a copyright symbol
=> "test © foo"

f.seek(5);f.read
# seeking is just like before, so far...
=> "© foo"
# what happens if I seek halfway into a two-byte UTF-8 character?
f.seek(6);f.read
# bingo! it can't print it!
=> "\xA9 foo"
# This shows that seek operates byte-wise

Lesson

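Seek operates byte-wise, not character-wise. With a variable-width encoding like UTF-8, seeking into the middle of a multibyte character means the next read starts with bytes that are not valid in that encoding.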
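One way to guard against this, sketched below, is to step backwards from the requested offset until the byte under the cursor is no longer a UTF-8 continuation byte (bit pattern 10xxxxxx). seek_to_char_boundary is a hypothetical helper of mine, not part of Ruby's File API, and it assumes the file really is UTF-8.

```ruby
require "tempfile"

# Hypothetical helper: seek to `offset`, backing up to the start of the
# UTF-8 character that contains it. Returns the position actually used.
def seek_to_char_boundary(file, offset)
  pos = offset
  while pos > 0
    file.seek(pos)
    byte = file.getbyte # nil at or past end of file
    break if byte.nil? || (byte & 0xC0) != 0x80
    pos -= 1 # continuation byte: the character started earlier
  end
  file.seek(pos)
  pos
end

naive, aligned, aligned_pos = Tempfile.create("boundary-demo") do |tmp|
  File.write(tmp.path, "test \u00A9 foo")

  File.open(tmp.path, "r:UTF-8") do |f|
    f.seek(6) # lands on the second byte of the copyright sign
    broken = f.read

    pos = seek_to_char_boundary(f, 6)
    [broken, f.read, pos]
  end
end

puts "naive seek valid?   #{naive.valid_encoding?}"
puts "aligned seek valid? #{aligned.valid_encoding?} (from byte #{aligned_pos})"
```

String#valid_encoding? is a handy check here: it is false for the naive read and true once the seek has been pulled back to the start of the character.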