Holy Crap, Levenshtein Distance Exists

How I got to now

I have spent a decade-plus of working with school data, despite no formal training in the technology of data analysis. I keep running into the same problem: inconsistent name spellings. So many of our data sets lack a unique identifier and rely on student names to serve in this role. Though not ideal, this would still work, except that name spelling variations are common, as are differences in the number of students included on the roster. In order to figure out which students are missing, I would check for matches, but the different name spellings meant that I could not reliably find matches. This meant a lot of manual line-by-line hunts for mistakes.

Two years ago, I decided that data was an area I wanted to focus on in my development, and that I would need more than just a set of Excel tricks to be successful. I started learning Python on Code Academy, and reading books about Python and about data analysis and data science more broadly. This brought me to my current read, and thus far among my favorite books on the subject: Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. I’m planning on a longer post when I finish the book, but I read something yesterday and couldn’t help but start a post about something I read.

Levenshtein distance

Levenshtein distance is a measure of distance between two character strings. Distance, in a mathematical sense, is a way of thinking how dissimilar or far apart two words or strings are. Thus, if the distance is low, then the two words are very similar. Words that are exactly the same would have a distance of 0.

The reason this is so helpful, and why I’m excited to learn more about using Levenshtein distance in Python, is that I can use it to determine when two names are probably the same with a slight spelling variation. Take, for example, the case of a student whose name includes an apostrophe. When that students’ name is listed without the apostrophe, it is not a perfect match, but it does indicate the same student. I can use Levenshtein distance to determine if a name one list that lacks a match is close to a name on another list that also lacks a match, and then match them.

How I might use it

Here’s where I would love some feedback from my more coding-savvy readers. My vision for matching two lists of names would be to first iterate over them to find the perfect matches. Then, I would likely have a handful of unmatched names on both lists. I could then iterate over each unmatched item from one list and calculate its Levenshtein distance from the unmatched items on the other list. Presumably, the near-matches with one or two character differences would stand out in the data. I would still have to manually check for the match, but at least I will have saved the time searching for possible matches.

Overall, it’s amazing what you can learn when you actually read and research into a topic instead of just forging ahead blindly–then again, I also love the forging.

Brief Updates

The audiobook for Values-Driven Allowance should release any day now.

Leave a comment