Posts

Stemming and Lemmatization

In my last blog we discussed the bag-of-words method for extracting features from text [refer this link: Feature Extraction - Text - Bag of Words]. The drawback of the bag-of-words method is the size of the BoW matrix due to redundant tokens. If we use these redundant tokens to build a machine learning model, it will be inefficient and will not perform well. To solve the redundant-token problem we can use "Stemming" and "Lemmatization".

Stemming

The stemming technique makes sure that different variations of a word are represented by a single word. E.g. run, ran and running are represented by the single word "run". So the process of reducing the inflected forms of a word to its root or stem word is called Stemming.

Root/Stem Word | Variations of root word
Gain           | Gained, gaining, gainful
Do             | Do, done, doing, does

Mr. Martin Porter developed an algorithm for performing the stemming process to...
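To get a feel for how stemming collapses variations of a word onto one stem, here is a toy suffix-stripping sketch. It is only an illustration: the real Porter stemmer applies ordered rewrite rules in several phases, and the suffix list and `crude_stem` helper below are made up for this example.

```python
# A toy suffix-stripping stemmer -- a simplified sketch, NOT the Porter
# algorithm. Longer suffixes are listed first so "ing" wins over "s".
SUFFIXES = ["ing", "ful", "ed", "es", "s"]

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["gained", "gaining", "gainful"]])
# -> ['gain', 'gain', 'gain']
```

In practice you would use a tested implementation such as NLTK's `PorterStemmer` rather than hand-rolled rules; the point here is only the idea of reducing inflected forms to a common stem.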

Extract Features from Text for Machine Learning: Bag-of-Words

In my last blog we saw how to generate tokens [refer link: Tokenization]. Now it's time to discuss how we can convert textual data into a matrix form that machine learning algorithms can understand. Let's get started with the method known as "Bag-of-Words". The idea behind this method is that "any piece of text can be represented by the list of words or tokens used in it." Before we move forward, I want us to revise the power law [reference link: Power Law] discussed earlier in my blog, where we saw that stopwords are not important and do not provide any useful information about a piece of text or document. So before converting the textual data into matrix form, we should remove all the stopwords from the list of tokens. Let's understand bag of words in more detail considering the following sentence: "Tiger is the biggest wild animal in the cat family." If we generate tokens for the above sentence after removing the stopwords th...
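The bag-of-words idea can be sketched in a few lines of plain Python: tokenize each document, drop stopwords, build a vocabulary, and count occurrences. The tiny stopword set below is only illustrative (real lists, e.g. NLTK's, are much longer), and `bag_of_words` is a name chosen for this example.

```python
from collections import Counter

STOPWORDS = {"is", "the", "in", "a", "an", "of"}  # tiny illustrative list

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of documents."""
    tokenized = [
        [w for w in doc.lower().split() if w not in STOPWORDS]
        for doc in documents
    ]
    # Vocabulary: every distinct non-stopword token, in sorted order.
    vocab = sorted({w for doc in tokenized for w in doc})
    # One row per document, one column per vocabulary word.
    matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
    return vocab, matrix

docs = ["Tiger is the biggest wild animal in the cat family"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # -> ['animal', 'biggest', 'cat', 'family', 'tiger', 'wild']
print(matrix)  # -> [[1, 1, 1, 1, 1, 1]]
```

For real work, scikit-learn's `CountVectorizer` does the same job with many more options, but the matrix it produces has exactly this shape: documents as rows, vocabulary terms as columns.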

Lexical Processing

In my previous blog, Word Frequency Distribution: Power Law, I explained basic concepts of lexical processing and how words are distributed in text data used by machine learning algorithms. In this article we will go through the high-level steps to process textual data for machine learning, and as part of this series I will explain the lexical processing of text using different types of tokenization features in Python.

In processing textual data for machine learning, the following steps are performed:

1. Lexical processing of text: converting raw text into words, sentences, paragraphs etc.
2. Syntactic processing of text: understanding the relationships among the words used in sentences.
3. Semantic processing of text: understanding the meaning of the text.

To do the lexical processing of text we perform tokenization and extraction of features from text. "Tokenization" is a technique that is used to split text into smaller elements. These elements can be characters...
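The different tokenization granularities mentioned above (sentences, words, characters) can be illustrated with the standard library alone. This is a naive sketch using regular expressions; libraries such as NLTK provide more robust `sent_tokenize` and `word_tokenize` functions.

```python
import re

text = ("Tokenization splits text into smaller elements. "
        "These elements can be words, sentences or characters.")

# Sentence tokenization: naive split after sentence-ending punctuation.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Word tokenization: pull out runs of alphabetic characters.
words = re.findall(r"[A-Za-z]+", text)

# Character tokenization: every character is its own token.
chars = list(text[:12])

print(len(sentences))  # -> 2
print(words[:3])       # -> ['Tokenization', 'splits', 'text']
print(chars)           # -> ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
```

Which granularity you choose depends on the downstream task; for bag-of-words feature extraction, word-level tokens are the usual unit.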

Power Law or Zipf’s Law: Word Frequency Distribution

This article will help you understand the basic concepts of lexical processing of text data before using it in any machine learning model.

Working with any type of data, be it numeric, textual or images, involves the following steps:

1. Explore: perform pre-processing of the data.
2. Understand the data.

As text is made up of words, sentences and paragraphs, exploring text data can start by analyzing the word frequency distribution. The famous linguist George Zipf carried out a simple exercise:

1. Count the number of times each word appears in the document.
2. Create a rank order on the frequency of each word: the most frequent word is given rank 1, the second most frequent word rank 2, and so on.

He repeated this exercise on many documents and found a specific pattern in which words are distributed in a document. Based on the pattern he observed, he gave a principle known as "Zipf's Law" or the "Power Law". Let's analyze the word freq...
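Zipf's counting exercise is easy to reproduce with `collections.Counter`: count each word, then sort by frequency to assign ranks. The short sample text here is made up for illustration.

```python
from collections import Counter

text = "the cat sat on the mat the cat ran and the dog sat on the mat"
counts = Counter(text.split())

# Rank words by frequency: rank 1 = most frequent, rank 2 = next, and so on.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
# The top line is: 1 the 5

# Zipf's law says frequency falls off roughly as C / rank: the rank-2 word
# appears about half as often as the rank-1 word, rank-3 about a third, etc.
```

On a toy sample the fit is crude, but on large documents stopwords like "the" dominate the top ranks, which is exactly why we drop them before building features.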