Extract Features from Text for Machine Learning: Bag-of-Words
In my last blog we saw how to generate tokens [Refer link: Tokenization]. Now it's time to discuss how we can convert textual data into a matrix form that machine learning algorithms can understand.
Let's get started with a method known as "Bag-of-Words". The idea behind this method is that any piece of text can be represented by the list of words or tokens used in it.
Before we move forward, I just want us to revisit the power law [Reference Link: Power Law] discussed earlier in my blog, where we saw that stopwords are not important and do not provide any useful information about a piece of text or document. Before converting textual data into matrix form, we should remove all stopwords from the list of tokens.
Let's understand bag of words in more detail by considering the following sentence:
Tiger is the biggest wild animal in the cat family.
If we generate tokens for the above sentence after removing the stopwords, we get the following list of words as tokens:
[Tiger, biggest, wild, animal, cat, family].
So the above list of words, or bag of words, still gives us meaningful information about the sentence even after the stopwords are removed.
Let's discuss how a bag of words can be used to convert textual data into matrix form.
Consider the following sentences, and we will see how to represent this data as a bag-of-words [BOW]:
- Tiger is the biggest wild animal in the cat family.
- Tiger has long and strong body. It has four legs, strong paws with sharp nails and one tail.
The bag of words for each of the above sentences is:
S1 = [Tiger, biggest, wild, animal, cat, family]
S2 = [Tiger, long, strong, body, four, legs, paws, sharp, nails, one, tail].
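The two lists above can be reproduced with a few lines of Python. This is only a minimal sketch: the `STOPWORDS` set is a small hand-picked list assumed for these two sentences, and `bag_of_words` is a hypothetical helper name, not something from a library. A real project would normally use a fuller stopword list.

```python
import re

# Assumed, hand-picked stopword list that only covers the two example
# sentences; it is not a complete English stopword list.
STOPWORDS = {"is", "the", "in", "has", "and", "it", "with"}


def bag_of_words(text):
    """Return the unique non-stopword tokens of `text`, in order of first appearance."""
    tokens = re.findall(r"[A-Za-z]+", text)  # split on anything that is not a letter
    bag = []
    for token in tokens:
        if token.lower() in STOPWORDS:
            continue  # drop stopwords
        if token not in bag:
            bag.append(token)  # keep each word only once
    return bag


s1 = bag_of_words("Tiger is the biggest wild animal in the cat family.")
s2 = bag_of_words(
    "Tiger has long and strong body. "
    "It has four legs, strong paws with sharp nails and one tail."
)
print("S1 =", s1)
print("S2 =", s2)
```

Note that "strong" occurs twice in the second sentence but appears only once in S2, because the bag keeps each word a single time.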
Once we have generated tokens for the sentences or documents, we write the tokens as the columns of a matrix and the sentence or document ids as the rows. Refer to the screenshot below.
[Figure: BOW representation of S1 and S2]
We can see that the unique tokens from S1 and S2 form the columns of the matrix, and S1 and S2, the sentence or document ids, form the rows.
The next step is to check which word appears in which document. Let's start with the first token, "Tiger". It appears in both S1 and S2, so we put 1 in the cells (S1, Tiger) and (S2, Tiger). Moving on to the next token, "biggest": it appears only in S1 and not in S2, so we put 1 in the cell (S1, biggest) and 0 in the cell (S2, biggest). We continue this process until all tokens are scanned. Once we finish, we get the following matrix representation of S1 and S2 using the BOW method.
[Figure: Matrix representation of textual data]
Now we have represented the text in matrix form, where each document is a row and each word of the vocabulary has its own column. These vocabulary words are known as the "features" of the text.
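The cell-by-cell filling described above can be sketched in plain Python. The variable names (`documents`, `vocabulary`, `matrix`) are my own choices for illustration, and the token lists are copied from S1 and S2 above.

```python
# Token lists for the two example sentences, as given in S1 and S2.
documents = {
    "S1": ["Tiger", "biggest", "wild", "animal", "cat", "family"],
    "S2": ["Tiger", "long", "strong", "body", "four", "legs",
           "paws", "sharp", "nails", "one", "tail"],
}

# Columns: every unique token, in order of first appearance.
vocabulary = []
for tokens in documents.values():
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Rows: one per document, 1 if the token appears in it, otherwise 0.
matrix = {
    doc_id: [1 if word in tokens else 0 for word in vocabulary]
    for doc_id, tokens in documents.items()
}

print(vocabulary)
for doc_id, row in matrix.items():
    print(doc_id, row)
```

With these two sentences the vocabulary has 16 words, so each document becomes a row of 16 ones and zeros, and only the "Tiger" column is 1 in both rows.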
You may have come across spam SMS messages. If we generate the bag-of-words for spam messages, many of them will contain words such as Win, Winner, Prize or Lottery. Whenever we receive a new message, we can scan it against the bag of words built from spam messages and identify whether the message is spam or ham.
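As a rough illustration of that scanning idea, here is a small sketch. The `SPAM_WORDS` set, the `looks_like_spam` helper and the `min_hits` threshold are all assumptions made up for this example. A real spam filter would learn from the full matrix of labelled messages rather than from a fixed word list.

```python
import re

# Hypothetical bag of words collected from known spam messages.
SPAM_WORDS = {"win", "winner", "prize", "lottery"}


def looks_like_spam(message, min_hits=1):
    """Flag a message if it shares at least `min_hits` words with SPAM_WORDS."""
    tokens = {t.lower() for t in re.findall(r"[A-Za-z]+", message)}
    return len(tokens & SPAM_WORDS) >= min_hits


print(looks_like_spam("You are the lucky winner of our lottery prize"))  # True
print(looks_like_spam("Are we still meeting for lunch tomorrow"))        # False
```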
Using raw tokens to build the matrix representation has the following drawback. Suppose the documents contain words like run, running; win, winner; drive, driving; or work, working, worked. A different token, and therefore a different column, is generated for each of these words even though they carry essentially the same information. This creates a huge matrix that is difficult for a machine learning algorithm to handle.
So how do we tackle this problem while generating features from text? I will cover this in my next blog. Stay tuned, and subscribe to the blog to get notified.
Do not forget to share your thoughts and comments, or suggest any topic you want me to cover.
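A quick way to see the drawback is to count the columns these word variants produce. The grouping below is only for illustration, using the words mentioned above.

```python
# Each inner list holds variants that carry essentially the same information.
word_groups = [
    ["run", "running"],
    ["win", "winner"],
    ["drive", "driving"],
    ["work", "working", "worked"],
]

# With plain tokens, every distinct spelling becomes its own column.
vocabulary = sorted({word for group in word_groups for word in group})

print(len(word_groups), "underlying ideas")
print(len(vocabulary), "columns in the matrix:", vocabulary)
```

Four ideas turn into nine columns here, and on a real corpus the gap grows with every extra word form.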
Happy Learning and Happy weekend....