This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the encoding category.
Last Updated: 2024-10-12
A junior colleague had an HTML form that was oddly broken. The code looked perfect in the Chrome DevTools inspector, but the form wasn't functioning correctly (e.g. the default value didn't show in a form input, and the input's value was not passed to JavaScript).
When I inspected the code on my computer, I used the following vim setting, which renders characters that look like ordinary whitespace (but aren't) as visible special characters:
" can be disabled with :set list! (or :set nolist)
:set list
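If you want to confirm from the shell that a file contains non-ASCII characters at all, grep works too. A minimal sketch, assuming GNU grep (for the -P flag) and a hypothetical filename form.html:
# Print every line that contains a character outside the 7-bit ASCII range
grep -nP '[^\x00-\x7F]' form.html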
In order to figure out what exact character I was dealing with, I typed ga
(mnemonic: get ascii) in normal mode when my cursor was over
the character. In the status bar, the following was displayed:
< > 160, Hex 00a0, Oct 240, Digr NS
In order, this shows: the character itself, its decimal value (160), its hexadecimal value (00a0), its octal value (240), and its vim digraph (NS, i.e. non-breaking space).
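Note that 00a0 is the Unicode code point, not what actually sits on disk: in a UTF-8 file the non-breaking space is stored as the two bytes 0xC2 0xA0. A quick way to verify this from the shell (assuming bash 4.2+ for the \u escape):
# Print a non-breaking space and dump its UTF-8 bytes
printf '\u00a0' | od -An -tx1
# => c2 a0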
By using these precise representations, I was able to get rid of these characters throughout the whole file using a search and replace:
:%s/\%u00a0/ /g
\%u
is the prefix needed to represent a hex code point in a vim regex.
Generally speaking, one should not copy code from fancy editors due to the random crud they add in. Also, if you copy something from elsewhere (e.g. the internet), the encoding may not be what you think it is and you could be in for a surprise.
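If you would rather do the cleanup from the shell instead of inside vim, the same substitution can be done on the raw UTF-8 bytes. A sketch assuming GNU sed (the \xNN escapes are a GNU extension) and the hypothetical filename form.html:
# Replace every non-breaking space (UTF-8 bytes c2 a0) with a normal space, in place
sed -i 's/\xc2\xa0/ /g' form.html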
As for a general solution, one should always have a script to detect and remove odd whitespace characters. Here's one I found online:
#!/usr/bin/env bash
# Assume this is saved in a file called `removewhitespace`
# Build a string containing the various exotic Unicode whitespace characters
LC_ALL=en_US.UTF-8
spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
# Replace any of those characters with a regular space, line by line
while read -r line; do
  echo "${line//[$spaces]/ }"
done < "${1:-/dev/stdin}"
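A quick way to convince yourself the script works - this sketch assumes bash 4.2+ for the \u escape and that the script is executable in the current directory:
# Feed in a line containing a non-breaking space, then dump the result byte by byte
printf 'foo\u00a0bar\n' | ./removewhitespace | od -An -c
# The non-breaking space (which od would show as the octal bytes 302 240) is now a plain space:
#    f   o   o       b   a   r  \n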
line="cats&dogs"
# Notice that the variable name, `line`, doesn't have a $ in front of it when inside the ${} structure
# i.e. the format is: ${stringToOperateOn/thingToRemove/thingToReplaceWith/}
echo "${line/&/*}"
=> "cats*dogs"
The last bit, ${1:-/dev/stdin}, means "use the parameter before the :- (here the first argument, $1) if it is set and not blank, otherwise default to what comes after it". It allows the program to act on a filename if one is given as the first argument, or on /dev/stdin otherwise. So you can call it either with removewhitespace myfile or with cat myfile | removewhitespace.
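A minimal sketch of how the :- default behaves (the variable name is just for illustration):
unset name
echo "${name:-fallback}"   # unset          => fallback
name=""
echo "${name:-fallback}"   # set but blank  => fallback (the colon makes blank count as unset)
name="given"
echo "${name:-fallback}"   # set, not blank => given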
Note that the \U00A0 escape in printf only works in bash version 4 and above. Luckily, it works fine with zsh.
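If you are stuck on an older bash, one workaround (a sketch, not tested on every version) is to build the characters from their raw UTF-8 bytes with \x escapes instead of \U code points:
# U+00A0 (non-breaking space) is the UTF-8 byte sequence 0xC2 0xA0
nbsp=$(printf '\xc2\xa0')
# U+200B (zero-width space) is 0xE2 0x80 0x8B
zwsp=$(printf '\xe2\x80\x8b')
spaces="${nbsp}${zwsp}"   # extend with further characters as needed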
I added highlighting in vim:
" syntax match using a hex regex and store matches as `nonascii`
syntax match nonascii "[^\x00-\x7F]"
" highlight this nonascii group in a particular way
highlight nonascii guibg=Green ctermbg=2
To type such a character directly in vim's insert mode, press Ctrl-v u followed by the hex code - e.g. Ctrl-v u 00a0 for a non-breaking space, or Ctrl-v u 00a9 for the copyright sign ©. (Similarly, Ctrl-v followed by <tab> inserts a literal tab character.)
So which encoding is vim actually using? You can check with :set encoding? - internally it is always utf-8. For an individual file, you can set the encoding with :set fileencoding - the conversion is done with iconv when the file is written.
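The same iconv can be called directly from the shell to convert a file between encodings - a sketch with hypothetical filenames:
# Convert a Latin-1 file to UTF-8
iconv -f ISO-8859-1 -t UTF-8 legacy.txt > legacy.utf8.txt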
But is the encoding set by vim or by the file itself? Some files indicate their encoding with a header, such as a byte-order mark. However, even when reading such a header, you can never be sure what encoding a file is really using. For example, a file whose first three bytes are 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.
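You can inspect those first bytes yourself and ask file(1) for its best guess - remembering it is only a guess. A sketch with a hypothetical filename (most modern versions of file support --mime-encoding):
# Show the first three bytes of the file in hex
head -c 3 suspect.html | od -An -tx1     # e.g. => ef bb bf
# Ask file to guess the encoding
file --mime-encoding suspect.html        # e.g. => suspect.html: utf-8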
ASCII itself has only 128 code points (it is a 7-bit code), so 160 is out of its range. Instead, 160 falls under "extended ASCII" - the 8-bit extensions of ASCII. There is no single official extended ASCII; there are many, and Unicode (whose first 128 code points coincide with ASCII) can loosely be considered one of them.
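To make that concrete: the single byte 0xA0 (decimal 160) is not valid UTF-8 on its own, but read as ISO-8859-1 it is exactly the non-breaking space from earlier. A sketch assuming iconv is installed (error wording varies by platform):
# Invalid as standalone UTF-8 - iconv rejects it
printf '\xa0' | iconv -f UTF-8 -t UTF-8
# iconv: illegal input sequence at position 0
# Read as ISO-8859-1 it converts to the UTF-8 bytes c2 a0 (U+00A0)
printf '\xa0' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
# => c2 a0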
Eventually, ISO standardized its own set of eight-bit ASCII extensions as ISO 8859. The most popular is ISO 8859-1, also called ISO Latin 1, which contains characters sufficient for the most common Western European languages. Variations were standardized for other languages as well: ISO 8859-2 for Eastern European languages and ISO 8859-5 for Cyrillic languages, for example.
Because the full English alphabet and the most-used characters in English are included in the seven-bit code points of ASCII, which are common to all encodings (even most proprietary encodings), English-language text is less damaged by interpreting it with the wrong encoding, but text in other languages can display as complete nonsense.