Levenshtein distance concept in NLP

In my last blog we have discussed how we can use TF - IDF method implementation using python for more details refer [TF - IDF Implementation Using Python ]. In this blog we will discuss how to deal with spelling correction to do the stemming or lemmatization effectively. 

There is a concept known as "Edit Distance". "Edit distance is the minimum number of edits required to one string to another string". We can perform following types of operations
  • Insertion of a letter
  • Deletion of a letter
  • Modification of a letter
Let's take an example to understand this concept in more detail. "success" is written as "sucess". We have two strings one with length 7 [with correct spelling] and another with length 6 [ with incorrect spelling]. 

Step 1:  If the string of length M and N then we need to create the matrix of size (M+1) and (N+1). In our case we will create the matrix of size 7 X 8 as follows.



Step 2: Initialize the first row and first column staring from 0 to N -1 and 0 to M-1 respectively. In our case we initialize the first row from 0 to 7 and first column as 0 to 6.


Step 3: If the character at row and character at column is same then copy the diagonal value to that cell.
If the character at row and character at column is not same then check the value of preceding neighbors and put the minimum value of two plus 1.

In our example let's start with first row :
  • S in the first row is matching with S in column then this cell will be filled by 0.
  • S in the first row is not matching with U in column then this cell will be filled by min of its neighbor cell values plus 1 i.e. min of [0,1,2] + 1 = 1
Now perform this step all the characters in row and characters in the column. We will get the following matrix.



Step 4: Now note down the value of cell (M) and (N). In our case it is 1. This number provides the information that 1 edit is required to convert one string to another string. This number is know Edit Distance and sometimes it is also known as Levenshtein distance.



This can be calculated by performing the following steps in Python:

from nltk.metrics.distance import edit_distance
edit_distance("sucess", "success")

As a challenge I would encourage all of you to write a function in python to calculate the "Edit Distance" of two strings. In case you need any guidance please put in comments.

In this way we can find Levenshtein distance between two strings. In case you have any queries or clarifications please put in the comment section and I would love to help you in that.

Now we have discussed the concepts of Lexical processing of text. In my next blog we will start discussing the mathematical concepts required to understanding the machine learning models. 

Please do not forget to follow me on my blog page.

Stay safe and healthy.

Comments

Popular posts from this blog

Lexical Processing

Creating Bag-of-Words using Python