Stemming and Lemmatization

In my last blog we discussed the bag-of-words method for extracting features from text [refer this link: Feature Extraction - Text - Bag of Words]. The drawback of the bag-of-words method is the size of the BoW matrix caused by redundant tokens. If we use these redundant tokens to build a machine learning model, it will be inefficient and will not perform well.

To solve the redundant-token problem we can use "stemming" and "lemmatization".

Stemming 

Stemming makes sure that different variations of a word are represented by a single word. E.g. run, ran and running are all represented by the single word "run". The process of reducing the inflected forms of a word to its root, or stem, is called stemming.

Root/Stem Word | Variations of root word
Gain           | gained, gaining, gainful
Do             | do, done, doing, does

Martin Porter developed an algorithm that performs stemming by removing the common morphological endings from English words. The algorithm is known as the "Porter Stemmer". It is a rule-based technique: Porter defined a set of suffix-stripping rules that reduce a word to its root form. A few simplified examples:

Rule      | Word   | Root form
ED -> ''  | Worked | Work
IES -> Y  | Denies | Deny
ING -> '' | Doing  | Do
S -> ''   | Cats   | Cat
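
As a quick preview of the Python implementation (covered in detail in the next blog), here is a minimal sketch using NLTK's PorterStemmer, assuming NLTK is installed. Note that the actual Porter rules are a little more involved than the simplified table above; for instance, "denies" is reduced to "deni" rather than "deny".

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ["worked", "denies", "doing", "cats", "running"]:
    # print each word next to its stem, e.g. "worked -> work"
    print(word, "->", porter.stem(word))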

Still, there are some variations of words that the Porter stemming algorithm cannot handle. Words like foot and feet, teach and taught, or go and went cannot be reduced to a common root by suffix rules like the ones above; as the short check below shows, the stemmer leaves them unchanged. To solve this problem, let's discuss the lemmatization method.
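
Continuing the sketch above, we can confirm that no suffix rule applies to these irregular forms:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ["feet", "went", "taught"]:
    # no suffix rule matches, so each word comes back unchanged
    print(word, "->", porter.stem(word))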

Lemmatization

Here comes another method for reducing the inflected forms of a word, this time to its dictionary form, which is called the "lemma"; the process is called lemmatization. Instead of stripping suffixes, this process looks the word up in a dictionary of the language to reduce it to its root.

To give you an example, WordNet is a lexical database that can be used to lemmatize English words. The lemmatization process is computationally more expensive than stemming because it scans the dictionary to find the root form of each word.
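
Here is a minimal sketch using NLTK's WordNetLemmatizer; it needs the WordNet data downloaded once, and, unlike the stemmer, it helps to tell it the part of speech (the default is noun):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet database

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("feet"))            # foot
print(lemmatizer.lemmatize("went", pos="v"))   # go  (pos="v" treats the word as a verb)
print(lemmatizer.lemmatize("taught", pos="v")) # teach

The irregular forms that defeated the Porter stemmer are now mapped to their dictionary roots.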

So, using the above two methods, we can solve the problem of redundant tokens and use the cleaned data to build machine learning models, such as a spam/ham detector or a sentiment predictor, more effectively. The sketch below shows how lemmatization shrinks a vocabulary before the bag-of-words matrix is built.
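
This is a minimal, hypothetical sketch with a hand-picked token list; applying the lemmatizer twice (verb first, then noun) is a crude stand-in for proper part-of-speech tagging:

from nltk.stem import WordNetLemmatizer

tokens = ["run", "ran", "running", "runs", "cat", "cats"]
lemmatizer = WordNetLemmatizer()
# lemmatize as a verb, then as a noun - a crude heuristic in place of POS tagging
lemmas = [lemmatizer.lemmatize(lemmatizer.lemmatize(t, pos="v"), pos="n") for t in tokens]

print(sorted(set(tokens)))  # 6 distinct tokens
print(sorted(set(lemmas)))  # only 2 distinct tokens: ['cat', 'run']

A bag-of-words matrix built on the lemmas has far fewer columns than one built on the raw tokens.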

In my next blog we will discuss another method, called "TF-IDF", for converting text into matrix form, along with the full Python implementation of bag-of-words, stemming and lemmatization.

I hope you learned something today. Stay tuned and subscribe to the blog to get notifications. Do not forget to share your thoughts or comments. Stay safe and healthy, and let's keep learning.
