Creating Bag-of-Words using Python

In my last two blogs, we discussed the bag-of-words method for extracting features from text [refer this link: Feature Extraction - Text - Bag of Words] and the stemming and lemmatization techniques used to avoid the redundant-token problem [refer this link: Stemming and Lemmatization]. Now it’s time to apply those concepts using Python and see things in action.

We are going to explore the following three options:

  • Bag-of-words after basic pre-processing (lower-casing, tokenization and stop-word removal)
  • Bag-of-words with stemming
  • Bag-of-words with lemmatization

Let’s consider the following three sentences:

  • Kesri is a good movie to watch with family as it has good stories about India freedom fight.
  • The success of a movie depends on the performance of the actors and story it has.
  • There are no new movies releasing this month due to corona virus.

Using the above three sentences, we will extract the bag-of-words by applying the concepts of tokenization, stemming and lemmatization. So let’s get started:

Step 1: Import the libraries: word_tokenize for tokenization, stopwords for stop words and CountVectorizer for creating the bag-of-words.

# load all necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')
nltk.download('stopwords')
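
As a quick optional sanity check, we can peek at the stop-word list that will be used in the pre-processing step:

# a few of the English stop words that will be removed from the sentences
print(stopwords.words("english")[:10])
print(len(stopwords.words("english")))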

Step 2: Let’s store each sentence in a list:

sentences = ["Kesri is a good movie to watch with family as it has good stories about India freedom fight.", 
             "The success of a movie depends on the performance of the actors and story it has.",
            "There are no new movies releasing this month due to corona virus."]
print(sentences)

Step 3: Now let’s define a function to perform the pre-processing of the text, such as converting all the text to lower case and removing the stop words from the sentences:

def preprocess(sentence):
    'changes sentence to lower case and removes stopwords'
 
    # change sentence to lower case
    sentence = sentence.lower()
 
    # tokenize into words
    words = word_tokenize(sentence)
 
    # remove stop words
    words = [word for word in words if word not in stopwords.words("english")]
 
    # join words to make sentence
    sentence = " ".join(words)
    #print(sentence)
    return sentence
 
sentences = [preprocess(sentence) for sentence in sentences]
print(sentences)
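
For example, after pre-processing, the first sentence should look roughly like "kesri good movie watch family good stories india freedom fight ." (the exact result depends on your NLTK stop-word list version). The trailing full stop survives because punctuation is not in the stop-word list, but CountVectorizer will ignore it later, since its default tokenizer only keeps tokens of two or more word characters.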

Step 4: We will create an object of CountVectorizer to create the bag-of-words for the sentences:

vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(sentences)
# prints the (row, column) index and count of each non-zero cell in the sparse matrix
print(bow_model) 

# print the full sparse matrix
print(bow_model.toarray())

We will get a BOW matrix with shape (3, 21):

print(bow_model.shape)
print(vectorizer.get_feature_names())
# note: in newer scikit-learn versions (>= 1.0) this method is called get_feature_names_out()

 ['actors', 'corona', 'depends', 'due', 'family', 'fight', 'freedom', 'good', 'india', 'kesri', 'month', 'movie', 'movies', 'new', 'performance', 'releasing', 'stories', 'story', 'success', 'virus', 'watch']
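
Optionally, if pandas is available in your environment, the same matrix can be viewed with the feature names as column headers, which makes the counts easier to read (just a convenience view, not required for the remaining steps):

import pandas as pd

# label each column of the dense BOW matrix with its corresponding token
bow_df = pd.DataFrame(bow_model.toarray(), columns=vectorizer.get_feature_names())
print(bow_df)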

Now, if we analyse the bag-of-words created above, we can clearly see that it has generated redundant tokens for movie and movies, and for story and stories. Because of this, the bag-of-words has extra tokens that carry the same meaning or context. Now let’s apply the stemming technique to get rid of this redundant-token problem.
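
A quick way to confirm this, using the vectorizer we just fitted, is to check its learned vocabulary directly:

# both surface forms end up as separate columns in the vocabulary
print('movie' in vectorizer.vocabulary_, 'movies' in vectorizer.vocabulary_)    # True True
print('story' in vectorizer.vocabulary_, 'stories' in vectorizer.vocabulary_)   # True True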

BOW - Stemming

Step 1: Let’s import the library for performing stemming, in addition to the libraries already imported:

from nltk.stem.porter import PorterStemmer
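
Just to illustrate what the Porter stemmer does to individual words before we plug it into the pre-processing function:

stemmer = PorterStemmer()
# the stemmer chops off word endings, which merges 'movie'/'movies' and 'story'/'stories'
print(stemmer.stem("movies"), stemmer.stem("movie"))    # movi movi
print(stemmer.stem("stories"), stemmer.stem("story"))   # stori stori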

Step 2: Let’s store each sentence in a list:

sentences = ["Kesri is a good movie to watch with family as it has good stories about India freedom fight.", 
             "The success of a movie depends on the performance of the actors and story it has.",
            "There are no new movies releasing this month due to corona virus."]
print(sentences)

Step 3: Now we will change the pre-processing function to include the stemming step as well:

def preprocess(sentence):
    'changes sentence to lower case, removes stopwords and applies stemming'

    # change sentence to lower case
    sentence = sentence.lower()

    # tokenize into words
    words = word_tokenize(sentence)

    # remove stop words
    words = [word for word in words if word not in stopwords.words("english")]

    # stem each remaining word
    stemmer = PorterStemmer()
    porter_stemmed = [stemmer.stem(word) for word in words]

    # join words to make sentence
    sentence = " ".join(porter_stemmed)
    #print(sentence)
    return sentence

sentences = [preprocess(sentence) for sentence in sentences]
print(sentences)

Step 4: We will create an object of CountVectorizer to create the bag-of-words model:

vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(sentences)
# prints the (row, column) index and count of each non-zero cell in the sparse matrix
print(bow_model)  
 
# print the full sparse matrix
print(bow_model.toarray())

We will get a BOW matrix with shape (3, 19):

print(bow_model.shape)
print(vectorizer.get_feature_names())

['actor', 'corona', 'depend', 'due', 'famili', 'fight', 'freedom', 'good', 'india', 'kesri', 'month', 'movi', 'new', 'perform', 'releas', 'stori', 'success', 'viru', 'watch']


Now if we analyse the bag-of-words created after including the stemming step, we can clearly see that it no longer generates redundant tokens for movie and movies, or for story and stories. But it has changed the spelling of story and stories to stori, and of movie and movies to movi, and other spellings have changed as well (family has become famili, virus has become viru, and so on).

BOW - Lemmatization

To avoid this spelling-distortion problem while still solving the redundant-token problem, we can use another technique: lemmatization.

Step 1: Let’s import the library for performing lemmatization, in addition to the libraries already imported:

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # download the WordNet data needed by the lemmatizer (run once)
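
Again, just to illustrate what the lemmatizer does to individual words (using its default noun mode) before we plug it into the pre-processing function:

wordnet_lemmatizer = WordNetLemmatizer()
# the lemmatizer maps plural forms back to valid dictionary words
print(wordnet_lemmatizer.lemmatize("movies"))     # movie
print(wordnet_lemmatizer.lemmatize("stories"))    # story
print(wordnet_lemmatizer.lemmatize("family"))     # family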

Step 2: Let’s store each sentence in a list:

sentences = ["Kesri is a good movie to watch with family as it has good stories about India freedom fight.", 
             "The success of a movie depends on the performance of the actors and story it has.",
             "There are no new movies releasing this month due to corona virus."]
print(sentences)

Step 3: Now we will change the pre-processing function to include the lemmatization step instead of stemming:

def preprocess(sentence):
    'changes sentence to lower case, removes stopwords and applies lemmatization'

    # change sentence to lower case
    sentence = sentence.lower()

    # tokenize into words
    words = word_tokenize(sentence)

    # remove stop words
    words = [word for word in words if word not in stopwords.words("english")]

    # lemmatize each remaining word
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized = [wordnet_lemmatizer.lemmatize(word) for word in words]

    # join words to make sentence
    sentence = " ".join(lemmatized)
    #print(sentence)
    return sentence

sentences = [preprocess(sentence) for sentence in sentences]
print(sentences)

Step 4: We will create an object of CountVectorizer to create the bag-of-words model:

vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(sentences)
# prints the (row, column) index and count of each non-zero cell in the sparse matrix
print(bow_model)  
# print the full sparse matrix
print(bow_model.toarray())

We will get a BOW matrix with shape (3, 19):

print(bow_model.shape)
print(vectorizer.get_feature_names())

['actor', 'corona', 'depends', 'due', 'family', 'fight', 'freedom', 'good', 'india', 'kesri', 'month', 'movie', 'new', 'performance', 'releasing', 'story', 'success', 'virus', 'watch']


Now if we analyse the bag-of-words created after including the lemmatization step, we can see that we have avoided the problem of redundant tokens while keeping the spelling of the words intact. Please try this hands-on, and if you face any problem or need clarification, please put it in the comments section.
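
To summarise the difference, here is a small side-by-side comparison of the two techniques on a few words from our sentences, assuming PorterStemmer and WordNetLemmatizer are still imported from the steps above:

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["movies", "stories", "family", "releasing"]:
    # stemming truncates to a crude root; lemmatization returns a real dictionary word
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))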


Next week, we will discuss another approach for creating a bag-of-words from text. If you find the information shared useful, please do follow my blog.


Happy weekend and keep learning... Stay safe and healthy.
