Fuzzy Keyword Matching: Enhance the Power of Keyword Matching!
In real-world examples you often have to deal with typos and misspelled words. This is a big problem for many people who are new to Data Science. In this article I’ll show you how to solve it.
Why Keywords?
In the Data Science community it is often useful to filter documents based on keywords. These keywords could be extracted from a previous analysis, or they could be provided by knowledge/domain experts. The following example shows a list of keywords used to filter performance problems.
keywords = ["100% CPU", "CPU starvation", "crash", "hang", "hanging thread",
            "heap dump", "Heapdump", "high CPU", "hung thread", "memory leak",
            "OOM", "Out Of Memory", "threaddump"]
Exact Match: First of all, let’s look at the most common method. It is easy to write a simple algorithm that detects an exact match. For example:
sentence = "Outofmemory encountered Apptarget nodes encountering Outofmemory apptarget nodes jvm recovers itself full GC review dumps help identify the root cause outofmemory noticing failed event being generated same time OOM not able open FEM."
print("Outofmemory" in sentence)  # Returns True
But if there is no 100% match, it returns False. For example:
sentence = "Outofmemory encountered Apptarget nodes encountering Outofmemory apptarget nodes jvm recovers itself full GC review dumps help identify the root cause outofmemory noticing failed event being generated same time OOM not able open FEM."
print("Out of Memory" in sentence)  # Returns False
What is the solution? Of course you could write a regular expression for each keyword you are looking for, but that quickly becomes very time-consuming. A better solution is fuzzy string matching.
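To see why the per-keyword regex route gets tedious, here is a small sketch (the pattern below is an illustrative example, not taken from the article’s code): it hand-encodes just one spelling variation of a single keyword, namely “spaces optional, case ignored”, and you would need one such hand-written pattern for every keyword and every variation you can think of.

```python
import re

sentence = "Outofmemory encountered Apptarget nodes"

# One hand-written pattern per keyword: make the spaces optional, ignore case.
pattern = re.compile(r"out\s*of\s*memory", re.IGNORECASE)

print(bool(pattern.search(sentence)))  # True: matches "Outofmemory"
```

This works for the one variation we anticipated, but it cannot catch a genuine typo such as “Outofmemroy” — that is where fuzzy matching comes in.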
What is Fuzzy Matching?
Fuzzy matching is a technique for finding similar sequences in written or spoken text, and it is often used to compare sequences of natural language. In our context we want to use it to enhance our keyword search.
Some theory on string metrics.
Levenshtein
This is an algorithm that measures how many characters you have to insert, delete or replace (substitute) to turn one word into the other. For example, for “Fuzzy” and “Fuzy” the result is 1, because you only have to insert one “z” to get the same word. For “Fuzzy” and “Beer” the result is 5, because every character has to be changed.
Wikipedia: https://en.wikipedia.org/wiki/Levenshtein_distance
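The definition above can be sketched in a few lines of Python. This is a minimal pure-Python dynamic-programming version for illustration only; the python-Levenshtein package installed below provides an optimized C implementation of the same metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("Fuzzy", "Fuzy"))  # 1
print(levenshtein("Fuzzy", "Beer"))  # 5
```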
Hamming
The strings need to be of equal length, otherwise the metric is not defined. It counts the number of positions at which the two strings differ — in other words, the number of substitutions needed to turn one string into the other. For strings of equal length, the Hamming distance is an upper bound on the Levenshtein distance.
Wikipedia: https://en.wikipedia.org/wiki/Hamming_distance
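Hamming distance is even simpler to sketch — again a pure-Python illustration, not the library call used later:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming("karolin", "kathrin"))  # 3 (positions 2, 3 and 4 differ)
```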
Jaro-Winkler
Delivers a value between 0 and 1, where 0 indicates that the two strings are completely dissimilar and 1 indicates that they are identical.
Wikipedia: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
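For the curious, here is a pure-Python sketch of the metric, written from the standard textbook definition (match window, transpositions, prefix bonus); the python-Levenshtein package used below ships an optimized version, so treat this as illustration only:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity between two strings, in [0.0, 1.0]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        # A character matches if the same character sits within the window in t.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_order = [c for c, m in zip(s, s_matched) if m]
    t_order = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_order, t_order)) // 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 chars."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

The “MARTHA”/“MARHTA” pair is the classic example: six matching characters, one transposition (TH vs HT), and a shared three-character prefix push the score to about 0.961.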
A list of additional string metrics can be found here: https://en.wikipedia.org/wiki/String_metric
Let’s code!
Install the python-Levenshtein package: https://pypi.org/project/python-Levenshtein/
In the sample below, we want to verify that we can detect an “out of memory” problem, using the keyword “Out of Memory”. The sentence contains the keyword, but not as an exact match. Let’s experiment with the three metrics described above and see what we get.
import Levenshtein as lev

sentence = "Outofmemory encountered Apptarget nodes"

print(lev.distance("Out of Memory", "Outofmemory"))      # >>> 3
print(lev.jaro_winkler("Out of Memory", "Outofmemory"))  # >>> 0.9016083916083916
print(lev.hamming("Out of Memory", "Outofmemory"))       # raises ValueError: the strings are not of equal length
As you can see from the results, the Levenshtein distance is 3, and the Hamming method cannot produce a result at all because the strings have unequal lengths. Based on these results, neither of the two is very useful on its own to detect what we are searching for.
Let’s look at the Jaro-Winkler distance. It returned a similarity of about 0.90, indicating that the two sequences are very close to identical. This is a significant improvement over the previous two metrics towards our goal of matching the keyword in the sentence.
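Putting it all together, the idea can be wrapped into a small keyword filter. The sketch below is an illustrative design, not a fixed API: the function name `fuzzy_contains`, the 0.85 threshold, and the word-window sizes are all assumptions, and it uses the standard library’s difflib.SequenceMatcher ratio as the similarity measure so the example is fully self-contained (you could swap in jaro_winkler from python-Levenshtein instead).

```python
from difflib import SequenceMatcher

def fuzzy_contains(sentence: str, keyword: str, threshold: float = 0.85) -> bool:
    """Return True if some window of words in `sentence` is similar to `keyword`."""
    words = sentence.lower().split()
    key = keyword.lower()
    max_window = len(key.split()) + 1  # allow words to be fused or split
    for size in range(1, max_window + 1):
        for i in range(len(words) - size + 1):
            candidate = " ".join(words[i:i + size])
            if SequenceMatcher(None, key, candidate).ratio() >= threshold:
                return True
    return False

sentence = ("Outofmemory encountered Apptarget nodes encountering Outofmemory "
            "apptarget nodes jvm recovers itself full GC review dumps help identify "
            "the root cause outofmemory noticing failed event being generated same "
            "time OOM not able open FEM.")

print(fuzzy_contains(sentence, "Out of Memory"))  # True
```

The sliding window over one to four words is what lets the three-word keyword “Out of Memory” match the single fused token “Outofmemory”; an exact `in` check, as shown earlier, would miss it.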
Conclusion
I hope this article provides you with some useful ideas to improve your natural language projects.