Use fuzzy match algorithms to manage messy string data

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the data category.

Last Updated: 2024-03-28

I had hundreds of fields (law_disciplines, colleges etc.) where users were (previously) allowed to enter whatever they wanted, leading to lots of "almost the same" strings - e.g. University of London vs London University.

Eventually I wanted to streamline this to enable better on-site filters. Doing this manually would have been too painful to consider, so I used a fuzzy match algorithm to give me an idea of closeness and automate the process (or as much of it as possible).
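To give a feel for what "closeness" can mean, here is a minimal sketch of one common measure: Dice's coefficient over character bigrams. It is an illustration only, not necessarily what any particular library computes internally, and `bigrams` and `dice_similarity` are hypothetical helper names.

```ruby
# Hypothetical helper: lowercase the string, drop anything that is not a
# letter or digit, and return the list of adjacent character pairs.
def bigrams(str)
  chars = str.downcase.gsub(/[^a-z0-9]/, "").chars
  chars.each_cons(2).map(&:join)
end

# Hypothetical helper: Dice's coefficient, 2 * shared / (size_a + size_b).
# Ranges from 0.0 (no bigrams in common) to 1.0 (same bigrams).
def dice_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  # Remove each matched bigram from the pool so repeats are not double counted.
  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    index = remaining.index(pair)
    remaining.delete_at(index) if index
  end

  (2.0 * shared) / (pairs_a.size + pairs_b.size)
end

puts dice_similarity("University of London", "London University")
puts dice_similarity("University of London", "Oxford Brookes")
```

Because the score ignores word order, reordered names like the two London variants come out close, while unrelated names score near zero.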

Here was the time-saving code (relying on a generic fuzzy match library):

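require 'fuzzy_match' # assumption: the fuzzy_match gem, which provides FuzzyMatch

# Returns the closest valid string to needle, printing a warning for weak matches.
def best_match(needle, possibilities)
  match, score = FuzzyMatch.new(possibilities).find_with_score(needle)
  # Guard against a nil score (nothing matched at all) as well as a low one
  puts "Possible issue: #{match}:#{needle}" if score.nil? || score < 0.5
  match
end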

# map (rather than each) so the id/old_name/new_name hashes are actually collected
mappings = colleges.map do |college|
  {
    id: college.id,
    old_name: college.name,
    new_name: best_match(college.name, valid_colleges)
  }
end

98% were matched perfectly, and the low scores showed which records needed checking by hand.
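If the score is kept alongside each mapping, the manual pass can be turned into an explicit review queue. The following is a rough sketch under assumptions: the `triage` helper, the `:score` key and the sample rows (including their scores) are all made up for illustration, and the 0.5 cutoff simply mirrors the one used above.

```ruby
# Hypothetical cutoff, mirroring the 0.5 used in best_match.
REVIEW_THRESHOLD = 0.5

# Hypothetical helper: splits results into rows safe to apply automatically
# and rows that need a human. Each result is a hash with :old_name,
# :new_name and :score keys.
def triage(results, threshold: REVIEW_THRESHOLD)
  needs_review, auto_accept = results.partition do |result|
    result[:score].nil? || result[:score] < threshold
  end
  # Worst matches first, treating a missing score as 0.0.
  needs_review = needs_review.sort_by { |result| result[:score] || 0.0 }
  { auto_accept: auto_accept, needs_review: needs_review }
end

# Made-up sample rows, purely for illustration.
results = [
  { old_name: "London University", new_name: "University of London", score: 0.88 },
  { old_name: "UCL", new_name: "University of London", score: 0.12 },
  { old_name: "???", new_name: nil, score: nil }
]

triaged = triage(results)
triaged[:needs_review].each do |result|
  puts "Review by hand: #{result[:old_name].inspect} -> #{result[:new_name].inspect}"
end
```

Sorting the review queue by score puts the least trustworthy matches at the top, which is where hand-checking pays off most.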