Stemming and Lemmatization
In my last blog, we discussed the bag-of-words method for extracting features from text [refer this link: Feature Extraction - Text - Bag of Words]. The drawback of the bag-of-words method is the size of the BoW matrix due to redundant tokens. If we use these redundant tokens to build a machine learning model, it will be inefficient and will not perform well.
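To make the drawback concrete, here is a minimal sketch of the problem (my own illustration; the use of scikit-learn's CountVectorizer here is an assumption, just one way to build a bag-of-words matrix):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["he runs fast", "he ran fast", "he is running fast"]
bow = CountVectorizer().fit(docs)

# "runs", "ran" and "running" each get their own column, even though
# they are all variations of the same word "run".
print(sorted(bow.vocabulary_))   # ['fast', 'he', 'is', 'ran', 'running', 'runs']

Three variations of "run" cost us three columns; on a real corpus, this redundancy blows up the matrix.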
To solve the redundant-token problem, we can use "stemming" and "lemmatization".
Stemming
The stemming technique makes sure that different variations of a word are represented by a single word; e.g., run, ran, and running are all represented by the single word "run". The process of reducing the inflected forms of a word to its root or stem word is called stemming.
Root/Stem Word | Variations of root word
Gain | Gained, gaining, gainful
Do | Do, done, doing, does
Mr. Martin Porter developed an algorithm for performing the stemming process by removing the common morphological endings from English words. The algorithm is known as "Porter Stemming". It is a rule-based technique: Porter defined a set of rules that reduce a word to its root form. E.g.:
Rule | Word | Root form
ED -> '' | Worked | Work
IES -> Y | Denies | Deny
ING -> '' | Doing | Do
S -> '' | Cats | Cat
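As a rough illustration of these rules, here is a minimal sketch using NLTK's PorterStemmer (one common implementation of the algorithm; install with pip install nltk). Note that the actual algorithm follows somewhat more detailed rules than the simplified table above; for example, it reduces "denies" to "deni":

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["worked", "denies", "doing", "cats"]:
    print(word, "->", stemmer.stem(word))

# Output:
# worked -> work
# denies -> deni   (the stem need not be a dictionary word)
# doing -> do
# cats -> cat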
Still, there are some variations of a word which cannot be handled by the Porter Stemming algorithm; e.g., word pairs like foot and feet, teach and taught, or go and went cannot be reduced to a common root form using the above method. To solve this problem, let's discuss the lemmatization method.
Lemmatization
Here comes another method for reducing the inflected form of a word to its dictionary form, which is called the "lemma"; the process is called lemmatization. This method is built on a dictionary of the language: it scans through the dictionary to reduce the word to its root.
To give you an example, WordNet is a lexical database that can be used to lemmatize words. The lemmatization process is computationally expensive, as it scans through the dictionary to find the root form of each word.
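For instance, here is a minimal sketch using NLTK's WordNetLemmatizer, which looks words up in the WordNet database (the WordNet data must be downloaded once, and the lemmatizer assumes a noun unless a part-of-speech hint is given):

import nltk
nltk.download("wordnet", quiet=True)   # fetch the WordNet data once

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("feet"))            # foot  (default part of speech is noun)
print(lemmatizer.lemmatize("went", pos="v"))   # go
print(lemmatizer.lemmatize("taught", pos="v")) # teach

Unlike the stemmer, the lemmatizer handles the irregular forms mentioned earlier (feet -> foot, went -> go, taught -> teach), but it needs the part-of-speech hint and the dictionary lookup to do so.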
So, using the above two methods, we can solve the problem of redundant tokens and use the data to build machine learning models, such as a spam/ham detector or a sentiment predictor, effectively.
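As a rough sketch (my own illustration, ahead of the full implementation in the next post), we can plug a stemmer into the bag-of-words step so that word variations share a single column:

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # stem every token produced by a simple whitespace split
    return [stemmer.stem(token) for token in text.split()]

docs = ["he runs fast", "he ran fast", "he is running fast"]
bow = CountVectorizer(tokenizer=stem_tokens, token_pattern=None).fit(docs)

# "runs" and "running" now collapse into the single column "run";
# the irregular form "ran" still survives, as discussed above.
print(sorted(bow.vocabulary_))   # ['fast', 'he', 'is', 'ran', 'run']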
In my next blog, we will discuss another method, called "TF-IDF", for converting text into matrix form, along with a Python implementation of bag-of-words, stemming, and lemmatization.
I hope you learned something today. Stay tuned and subscribe to the blog to get notifications. Do not forget to share your thoughts or comments. Stay safe and healthy, and let's keep learning.