White space is not necessarily white space

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the encoding category.

Last Updated: 2024-10-12

A junior colleague had an HTML form that was oddly broken. The code looked perfect in the Chrome DevTools inspector, but the form wasn't functioning correctly (e.g. the default value didn't show in a form input and the input's value was not passed to JavaScript).

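When I inspected the code on my computer, I used the following vim setting to display characters that look like white space, but are not, as special characters. (This relies on the listchars option containing an nbsp: entry, which is the default in Neovim but not in Vim.)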

" can be disabled with :set list! (or :set nolist)
:set list

In order to figure out what exact character I was dealing with, I typed ga (mnemonic: get ascii) in normal mode when my cursor was over the character. In the status bar, the following was displayed:

< > 160, Hex 00a0, Oct 240, Digr NS

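In order, this shows: the character itself (between the angle brackets), its decimal code point (160), its hexadecimal value (00a0), its octal value (240), and the vim digraph that produces it (NS, for no-break space).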
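The same kind of breakdown can be reproduced outside vim. Here is a minimal Python sketch; the helper name `describe_char` is my own invention, not something vim provides. It reports the decimal, hex and octal values of a character along with its Unicode name:

```python
import unicodedata


def describe_char(ch: str) -> str:
    """Return a ga-style summary for a single character (hypothetical helper)."""
    code = ord(ch)
    # unicodedata.name raises ValueError for unnamed characters unless a default is given
    name = unicodedata.name(ch, "<unnamed>")
    return f"{code}, Hex {code:04x}, Oct {code:o}, Name {name}"


print(describe_char("\u00a0"))
# 160, Hex 00a0, Oct 240, Name NO-BREAK SPACE
```

The numbers match what ga printed, which confirms the culprit was a no-break space rather than an ordinary space (decimal 32).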

Using the hex value, I was able to get rid of these characters throughout the whole file with a search and replace:

:%s/\%u00a0/ /g

Generally speaking, one should not copy code from fancy editors due to the random crud they add in. Also, if you copy something from elsewhere (e.g. the internet), the encoding may not be what you think it is and you could be in for a surprise.

As for a general solution, one should always have a script to detect/remove odd whitespace. Here's one I found online:

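# Assume this is saved in a file called:
# `removewhitespace`

# The \U escapes need a UTF-8 locale (and bash 4.2 or later)
LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

# IFS= preserves leading/trailing whitespace; the || clause handles a final line with no newline
while IFS= read -r line || [ -n "$line" ]; do
  printf '%s\n' "${line//[$spaces]/ }"
done < "${1:-/dev/stdin}"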
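
The ${line//[$spaces]/ } part is bash parameter expansion (the doubled // replaces every match, a single / only the first). A simpler example:

  line="cats&dogs"
  # Notice that the variable name, `line`, doesn't have a $ in front of it when inside the ${} structure
  # i.e. the format is: ${stringToOperateOn/thingToRemove/thingToReplaceWith}
  echo "${line/&/*}"
  # => cats*dogs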
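For comparison, the same clean-up can be sketched in Python, which sidesteps the locale and bash-version concerns. This is only an illustration under my own assumptions: the names `ODD_SPACES` and `normalize_spaces` are hypothetical, and the code point list is simply copied from the shell script above, so it is not guaranteed to be exhaustive.

```python
# Code points copied from the shell script; an assumed, non-exhaustive list.
ODD_SPACES = (
    "\u00a0\u1680\u180e"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u200b\u202f\u205f\u3000\ufeff"
)

# str.translate expects a mapping from code point (int) to replacement string
TO_PLAIN_SPACE = {ord(ch): " " for ch in ODD_SPACES}


def normalize_spaces(text: str) -> str:
    """Replace every character in ODD_SPACES with an ordinary space."""
    return text.translate(TO_PLAIN_SPACE)


print(normalize_spaces('<input\u00a0value="x">'))
# <input value="x">
```

Like the shell version, this turns zero-width characters (U+200B, U+FEFF) into visible spaces. Depending on the file, deleting them outright may be the better choice.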

How can I make these "almost white space" characters more salient in my editor?

I added highlighting in vim:

" syntax match using a hex regex and store matches as `nonascii`
syntax match nonascii "[^\x00-\x7F]"

" highlight this nonascii group in a particular way
highlight nonascii guibg=Green ctermbg=2
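The same regex idea works outside the editor too, for example as a quick check before committing. Here is a rough Python sketch that reuses the character class from the vim rule; the function name `find_non_ascii` is made up for illustration. It reports where each non-ASCII character sits:

```python
import re

# Same character class as the vim syntax rule: anything outside 0x00-0x7F
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(text: str):
    """Yield (line_number, column, 'U+XXXX') per non-ASCII character, both 1-indexed."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NON_ASCII.finditer(line):
            yield line_no, match.start() + 1, f"U+{ord(match.group()):04X}"


sample = 'ok line\n<input\u00a0value="x">\n'
print(list(find_non_ascii(sample)))
# [(2, 7, 'U+00A0')]
```

Note that this flags every non-ASCII character, including legitimate ones such as accented letters, just as the vim highlight does.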

How can I enter these characters into a vim file on purpose?

What encoding does vim even use?

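You can check with :set encoding? In Neovim the internal encoding is always utf-8; in Vim the default comes from your locale, which on most modern systems also means utf-8. For an individual file, you can set the on-disk encoding with :set fileencoding. When it differs from the internal encoding, vim converts on write, falling back to iconv for conversions it cannot do itself.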
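To see why the on-disk encoding matters, it helps to look at the bytes. This small Python sketch is an illustration of the general idea, not of vim's internals. It encodes the same no-break space two ways:

```python
nbsp = "\u00a0"

utf8_bytes = nbsp.encode("utf-8")
latin1_bytes = nbsp.encode("latin-1")

print(utf8_bytes)    # b'\xc2\xa0'
print(latin1_bytes)  # b'\xa0'

# Reading UTF-8 bytes as latin-1 silently yields two characters instead of one
misread = utf8_bytes.decode("latin-1")
print(len(misread), ascii(misread))  # 2 '\xc2\xa0'
```

One character becomes two bytes in UTF-8 but one byte in latin-1, so a tool that guesses the wrong encoding ends up with a stray extra character in front of the no-break space.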

But is the encoding set by vim or by the file?

Some files indicate their encoding with a header such as a byte order mark (BOM), but plain text files often carry no marker at all. And even when a header is present, you can never be sure what encoding a file is really using.

For example, a file with the first three bytes 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.
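The ambiguity is easy to demonstrate. In the Python sketch below (the `raw` bytes are a made-up sample), the same three leading bytes are interpreted as a UTF-8 BOM and as ISO-8859-1 text:

```python
import codecs

raw = b"\xef\xbb\xbfhello"  # hypothetical file contents

print(raw.startswith(codecs.BOM_UTF8))  # True
print(repr(raw.decode("utf-8")))        # '\ufeffhello'
print(repr(raw.decode("utf-8-sig")))    # 'hello'
print(raw.decode("latin-1"))            # ï»¿hello
```

Both decodings succeed without error, which is exactly why the bytes alone cannot tell you which one the author intended. The "utf-8-sig" codec is Python's way of saying "treat a leading BOM as a marker and drop it".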

What are all the space-like characters in ASCII?

What is the point of a BOM?

Resources