Sorting gotchas

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the algorithms category.

Last Updated: 2024-10-12

Consider text vs digits

e.g. I wanted to find the biggest files in the project_s repo. So I ran du

$ du project_s
192 ./.composer/cache/files/phpspec
16 ./.composer/cache/files/fideloper/proxy
1632 ./.composer/cache/files/maximebf
84496 ./.composer/cache/files
854328 ./.composer/cache/repo/https---repo.packagist.org
854328 ./.composer/cache/repo
938832 ./.composer/cache
938840 ./.composer
2810896 .

Next I sorted by piping into sort and sorting on first key

$ du project_s | sort -k 1
968 ./node_modules/jsdom/node_modules/acorn/dist
9696 ./node_modules/terser
971280 ./node_modules
976 ./node_modules/handlebars/dist/amd/handlebars/compiler
984 ./node_modules/array-includes/node_modules/es-abstract/2019
9912 ./node_modules/lodash
992 ./vendor/phpunit/phpunit/tests/end-to-end/regression

As you can see, the order was not what I expected, because sort compares lines as text by default rather than as numbers. Therefore I had to tell sort to sort numerically with sort -k 1 -n

$ du project_s | sort -k 1 -n
968 ./node_modules/jsdom/node_modules/acorn/dist
976 ./node_modules/handlebars/dist/amd/handlebars/compiler
984 ./node_modules/array-includes/node_modules/es-abstract/2019
992 ./vendor/phpunit/phpunit/tests/end-to-end/regression
9696 ./node_modules/terser
9912 ./node_modules/lodash
971280 ./node_modules

Note that this applies within vim too - e.g. :%!sort -n filters the whole buffer through sort -n
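
The difference between text and numeric comparison can be reproduced without du. This is a minimal sketch, assuming a POSIX shell; the four sizes are copied from the output above and the variable names are only for illustration. LC_ALL=C is set so that the text comparison is byte-by-byte regardless of the machine's locale.

```shell
# Four of the sizes from the du output above, deliberately out of order.
sizes=$(printf '%s\n' 9696 968 971280 992)

# Default sort compares the lines as text, character by character,
# so "9696" lands before "971280" and "992" lands last.
as_text=$(printf '%s\n' "$sizes" | LC_ALL=C sort)

# -n compares the number at the start of each line by value.
as_numbers=$(printf '%s\n' "$sizes" | LC_ALL=C sort -n)

printf 'text order:\n%s\n\nnumeric order:\n%s\n' "$as_text" "$as_numbers"
# text order:    968 9696 971280 992
# numeric order: 968 992 9696 971280
```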

Consider number of digits

e.g. I sorted a list of ep-N.mp4 file names with plain sort and got

  ep-1.mp4
  ep-10.mp4
  ep-12.mp4
  ep-2.mp4
  ep-25.mp4
  ep-29.mp4
  ep-3.mp4
  ep-30.mp4
  ep-36.mp4
  ep-37.mp4
  ep-38.mp4
  ep-39.mp4
  ep-4.mp4
  ep-40.mp4
  ep-5.mp4
  ep-6.mp4
  ep-7.mp4

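The above results stayed the same with sort -n: -n only reads a number at the start of each line, and since every name starts with "ep-" it finds none, treats every line as 0, and falls back to comparing the text.

And the text ordering itself has nothing to do with the prefixes - when they are removed and the names are sorted as plain text (without -n), we get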

  1.mp4
  10.mp4
  12.mp4
  2.mp4
  25.mp4
  29.mp4

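The issue appears to be the differing number of digits: text comparison goes character by character, so "10" sorts before "2" because "1" comes before "2". The right solution was sort -V, which is meant for version numbers and compares each run of digits by value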

  ep-1.mp4
  ep-2.mp4
  ep-3.mp4
  ep-4.mp4
  ep-5.mp4
  ep-6.mp4
  ep-7.mp4
  ep-8.mp4
  ep-9.mp4
  ep-10.mp4
  ep-12.mp4
  ep-13.mp4
  ep-14.mp4
  ep-25.mp4
  ep-26.mp4
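
The three behaviours can be compared side by side. This is a minimal sketch, assuming a sort that supports -V (GNU coreutils does; it is not part of POSIX); the four names are taken from the listings above and the variable names are only for illustration.

```shell
# A few of the episode names from above, deliberately out of order.
names=$(printf '%s\n' ep-10.mp4 ep-2.mp4 ep-1.mp4 ep-12.mp4)

# Plain text comparison: "ep-10" sorts before "ep-2" because "1" < "2".
plain=$(printf '%s\n' "$names" | LC_ALL=C sort)

# -n finds no number at the start of "ep-...", so every line counts as 0
# and sort falls back to the same text comparison.
numeric=$(printf '%s\n' "$names" | LC_ALL=C sort -n)

# -V (version sort) compares each run of digits by value.
version=$(printf '%s\n' "$names" | LC_ALL=C sort -V)

printf 'plain:\n%s\n\n-n:\n%s\n\n-V:\n%s\n' "$plain" "$numeric" "$version"
# plain and -n: ep-1.mp4 ep-10.mp4 ep-12.mp4 ep-2.mp4
# -V:           ep-1.mp4 ep-2.mp4 ep-10.mp4 ep-12.mp4
```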

From the docs: -V will give this ordering:

sort-1.022.tgz
sort-1.23.tgz
sort-1.23.1.tgz
sort-1.024.tgz
sort-1.024.003.
sort-1.024.003.tgz
sort-1.024.07.tgz
sort-1.024.009.tgz

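TLDR: For names that mix text and numbers (file names, version strings), -V is most likely what is needed. For lines that start with a plain number, like the du output, -n is enough.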

Consider case sensitivity

When I started capitalizing acronyms on the SemicolonAndSons website, I noticed that on the code diary pages titles like "CORS big picture" would come before "all you need to know about sessions" (i.e. before links to entries with an uncapitalized first letter in the title). The issue was that the sort algorithm compares character codes, and every uppercase letter has a lower code than even a lowercase "a". The solution was to sort on an upcased copy of each title (to create an even playing field)

entries.sort_by {|e| e[:name].upcase }
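
The same gotcha shows up in the shell, where -f plays roughly the role of .upcase. This is a minimal sketch using the two titles quoted above; the variable names are only for illustration. LC_ALL=C forces byte-by-byte comparison, since in other locales the default order may already ignore case.

```shell
# The two titles quoted above.
titles=$(printf '%s\n' 'all you need to know about sessions' 'CORS big picture')

# Byte-wise comparison: every uppercase ASCII letter sorts before every
# lowercase one, so "CORS ..." comes first.
case_sensitive=$(printf '%s\n' "$titles" | LC_ALL=C sort)

# -f folds lowercase to uppercase before comparing, so "all ..." comes first.
case_insensitive=$(printf '%s\n' "$titles" | LC_ALL=C sort -f)

printf 'case-sensitive:\n%s\n\ncase-insensitive:\n%s\n' "$case_sensitive" "$case_insensitive"
```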