Use fuzzy match algorithms to manage messy string data

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the data category.

Last Updated: 2024-03-02

I had hundreds of fields (law_disciplines, colleges etc.) where users were (previously) allowed to enter whatever they wanted, leading to lots of "almost the same" strings - e.g. University of London vs London University.

Eventually I wanted to streamline this to enable better on-site filters. Doing this manually would be too painful to consider, so I used a fuzzy match algorithm to give me an idea of closeness and automate the process (or as much of it as possible).

Here was the time-saving code (relying on a generic fuzzy match library):

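def best_match(needle, possibilities)
  # The library call was lost from the original; `fuzzy_lookup` is a hypothetical
  # stand-in returning [closest_string, score], with score between 0 and 1.
  match, score = fuzzy_lookup(needle, possibilities)
  puts "Possible issue: #{match}:#{needle}" if score < 0.5
  match # return the match so the caller can store it
end

colleges.each do |college|
  # `college.name` and `update` are assumed; the original argument was elided.
  college.update(new_name: best_match(college.name, valid_colleges))
end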

98% were perfectly matched, and the low scores indicated which records needed visiting by hand.
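
If you don't have a fuzzy match library to hand, the scoring step can be roughed out in plain Ruby. This is a minimal sketch, not what the library I used does internally: it assumes a Sørensen-Dice coefficient over character bigrams as the closeness measure, and the names `bigrams`, `dice_score` and `closest_with_score` are made up for illustration.

```ruby
require "set"

# Character bigrams of a lowercased string, with punctuation and
# runs of whitespace collapsed to single spaces.
def bigrams(str)
  chars = str.downcase.gsub(/[^a-z0-9]+/, " ").strip
  (0...chars.length - 1).map { |i| chars[i, 2] }.to_set
end

# Sørensen-Dice coefficient on the two bigram sets:
# 0.0 means no shared bigrams, 1.0 means identical bigram sets.
def dice_score(a, b)
  x = bigrams(a)
  y = bigrams(b)
  # Strings too short to have bigrams: fall back to a plain comparison.
  return (a.downcase == b.downcase ? 1.0 : 0.0) if x.empty? || y.empty?

  2.0 * (x & y).size / (x.size + y.size)
end

# Returns [closest_candidate, score] for the needle.
def closest_with_score(needle, possibilities)
  possibilities
    .map { |candidate| [candidate, dice_score(needle, candidate)] }
    .max_by { |_, score| score }
end

match, score = closest_with_score(
  "London University",
  ["University of London", "University of Leeds"]
)
puts "#{match} (#{score.round(2)})"
```

Bigram overlap is a reasonable fit for this kind of data because it mostly ignores word order, so "London University" still scores well against "University of London". A real library will usually be faster and handle more edge cases, so treat the 0.5 cut-off as something to tune against your own records.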