
Showing posts with the label FeatureExtraction

TF-IDF Implementation using Python

In my last blog we discussed how the TF-IDF method can be used to extract features from text, refer to TF-IDF Method. Now we will see how we can implement the TF-IDF concept using Python. Let's consider the same three sentences from the last blog to understand the TF-IDF implementation:

Kesri is a good movie to watch with family as it has good stories about India freedom fight.
The success of it depends on the performance of the actors and story it has.
There are no new movies releasing this month due to corona virus.

The first step is to import the necessary libraries to perform the text processing:

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

You must have already noticed that we have imported TfidfVectorizer to extract the text features using TF-IDF. The second step is to store the sentences in a list: documents = ["Kesri is a go...
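The excerpt cuts off here; a minimal sketch of how the remaining steps presumably go, applying scikit-learn's TfidfVectorizer to the same three sentences (variable names beyond documents are my own, not necessarily the post's):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# The three example sentences from the post.
documents = [
    "Kesri is a good movie to watch with family as it has good stories about India freedom fight.",
    "The success of it depends on the performance of the actors and story it has.",
    "There are no new movies releasing this month due to corona virus.",
]

# TfidfVectorizer tokenizes, drops English stop words, and computes
# the TF-IDF weight of every term in every document in one step.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# One row per sentence, one column per term
# (use get_feature_names() instead on scikit-learn < 1.0).
df = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(df.round(3))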

TF-IDF Method

In my last blog we discussed how to create a bag of words using Python [refer to this link: Creating Bag-of-Words using Python]. We have seen that the bag-of-words approach depends purely on the frequency of words. Now let's discuss another approach to convert textual data into matrix form, called TF-IDF [Term Frequency – Inverse Document Frequency]; it is the approach preferred by most data scientists and machine learning professionals. In this approach we consider a term relevant to a document if the term appears frequently in that document and is unique to it, i.e. the term should not appear in all the documents. So its frequency across all documents should be small, while its frequency within the specific document should be high. The TF-IDF score combines two quantities: the term frequency of a term (t) in a document (d), and the inverse document frequency of the term. Below are the formulas for calculating the and...
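The formulas themselves are cut off in this excerpt; the standard textbook definitions, which the full post presumably presents, are:

\[
\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d},
\qquad
\mathrm{IDF}(t) = \log \frac{N}{n_t},
\qquad
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
\]

where N is the total number of documents and n_t is the number of documents containing the term t. (Note that scikit-learn's TfidfVectorizer uses a smoothed variant, log((1 + N) / (1 + n_t)) + 1, and L2-normalizes each row, so its scores differ slightly from this formula.)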

Creating Bag-of-Words using Python

In my last two blogs we discussed the bag-of-words method for extracting features from text [refer to this link: Feature Extraction - Text - Bag of Words] and the stemming and lemmatization techniques for avoiding the redundant-token problem [refer to this link: Stemming and Lemmatization]. Now it's time to apply those concepts using Python and see things in action. We are going to explore three options below. Let's consider the following three sentences:

Kesri is a good movie to watch with family as it has good stories about India freedom fight.
The success of a movie depends on the performance of the actors and story it has.
There are no new movies releasing this month due to corona virus.

Using the above three sentences we will extract the bag-of-words by applying the concepts of tokenization, stemming and lemmatization. So let's get started. Step 1: Import the libraries: word_tokenize for tokenization, stopwords for stop words and CountVectorizer for creating the bag-of-words. #...
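The code is cut off here; a minimal sketch of the import step the excerpt starts, plus the bag-of-words step that presumably follows (the variable names are my own):

import pandas as pd
from nltk.tokenize import word_tokenize       # tokenization
from nltk.corpus import stopwords             # stop words
from sklearn.feature_extraction.text import CountVectorizer  # bag-of-words

documents = [
    "Kesri is a good movie to watch with family as it has good stories about India freedom fight.",
    "The success of a movie depends on the performance of the actors and story it has.",
    "There are no new movies releasing this month due to corona virus.",
]

# CountVectorizer tokenizes, removes English stop words, and counts
# how often each remaining term occurs in each sentence.
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(documents)

# One row per sentence, one column per token, values are raw counts.
df = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())
print(df)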

Stemming and Lemmatization

In my last blog we discussed the bag-of-words method for extracting features from text [refer to this link: Feature Extraction - Text - Bag of Words]. The drawback of the bag-of-words method is the size of the BoW matrix caused by redundant tokens; if we use these redundant tokens when building a machine learning model, the model will be inefficient or will not perform well. To solve the redundant-token problem we can use "Stemming" and "Lemmatization".

Stemming
The stemming technique makes sure that different variations of a word are represented by a single word. E.g. run, ran and running are all represented by the single word "run". So the process of reducing the inflected forms of a word to its root or stem word is called Stemming.

Root/Stem Word | Variations of root word
Gain           | Gained, gaining, gainful
Do             | Do, done, doing, does

Martin Porter developed an algorithm (the Porter Stemmer) for performing the stemming process to...
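To make the comparison concrete, here is a minimal sketch of stemming (Porter) versus lemmatization with NLTK, using the words from the table above; this is an illustration of the technique, not the full post's code:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # WordNet data is needed once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["gained", "gaining", "gainful", "done", "doing", "does", "running", "ran"]

for word in words:
    # Stemming strips suffixes by rule; lemmatization maps the word
    # to its dictionary form (here treating each word as a verb).
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))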

Extract Features from Text for Machine Learning: Bag-of-Words

In my last blog we saw how we can generate tokens [refer link: Tokenization]. Now it's time to discuss how we can convert textual data into a matrix form that machine learning algorithms can understand. Let's get started with the method known as "Bag-of-Words". The idea behind this method is that "any piece of text can be represented by the list of words or tokens used in it". Before we move forward, I just want us to revise the power law [reference link: Power Law] discussed earlier in my blog, where we saw that stopwords are not important and do not provide any useful information about a piece of text or document. So before converting the textual data into matrix form we should remove all the stopwords from the list of tokens. Let's understand bag of words in more detail by considering the following sentence: Tiger is the biggest wild animal in the cat family. If we generate tokens for the above sentence after removing the stopwords th...
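A minimal sketch of that tokenize-then-filter step on the example sentence (my own code, not necessarily the post's):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer model, needed once
nltk.download("stopwords")  # stop-word list, needed once

sentence = "Tiger is the biggest wild animal in the cat family."

tokens = word_tokenize(sentence.lower())
stop_words = set(stopwords.words("english"))

# Keep only alphabetic tokens that are not stop words.
bag = [t for t in tokens if t.isalpha() and t not in stop_words]
print(bag)  # ['tiger', 'biggest', 'wild', 'animal', 'cat', 'family']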