TF-IDF Method


In my last blog we discussed how to create a bag of words using Python [refer to this link: Creating Bag-of-Words using Python]. We saw that the bag-of-words approach depends purely on the frequency of words. Now let’s discuss another approach to converting textual data into matrix format. It is called TF-IDF [Term Frequency – Inverse Document Frequency], and it is one of the approaches most widely used by data scientists and machine learning professionals.

In this approach we consider a term relevant to a document if it appears frequently in that document and is distinctive to it, i.e. the term should not appear in all the documents. So the term's frequency within the specific document should be high, while the number of documents in the whole collection that contain the term should be small.

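TF-IDF score is calculated as follows:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

$\text{TF}(t, d)$ = Term frequency of a term (t) in a document (d).

$\text{IDF}(t)$ = Inverse document frequency of a term (t).

Below are the formulas for calculating $\text{TF}(t, d)$ and $\text{IDF}(t)$:

$$\text{TF}(t, d) = \frac{f_{t,d}}{n_d}$$

$$\text{IDF}(t) = \log\frac{|D|}{|D_t|}$$

$f_{t,d}$ = Frequency of term (t) in a document (d) after pre-processing of the text, such as removing stop words or performing stemming or lemmatization.

$n_d$ = Total number of terms in the given document (d) after pre-processing of the text, such as removing stop words or performing stemming or lemmatization.

$|D|$ = Total number of documents.

$|D_t|$ = Total number of documents in which term (t) appears.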
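If it helps to see the definitions above as code, here is a minimal Python sketch. The function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are my own and do not come from any library. I have assumed the natural logarithm and that a term found in no document gets an IDF of 0; both are choices for illustration, not part of the definition.

```python
import math
from collections import Counter


def term_frequency(term, tokens):
    """Occurrences of `term` divided by the number of tokens in the document."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, documents):
    """log(number of documents / number of documents containing `term`)."""
    containing = sum(1 for tokens in documents if term in tokens)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document carries no weight
    return math.log(len(documents) / containing)  # assumption: natural log


def tf_idf(term, tokens, documents):
    """TF-IDF score of `term` for one document within a collection."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, documents)
```

Each document here is a list of already pre-processed tokens, and `documents` is a list of such lists.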

Let’s consider the following three sentences as document 1 (d1), document 2 (d2) and document 3 (d3) to get into the details:

1. Kesri is a good movie to watch with family as it has good stories about India freedom fight.

2. The success of it depends on the performance of the actors and story it has.

3. There are no new movies releasing this month due to corona virus.

Let’s calculate the term frequency of the term movie for d1. After pre-processing d1, we are left with the following words in the document:

'kesri good movie watch family good stories india freedom fight .'

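Frequency of the term movie in d1 = 1 [the term appears only once in the document]

Total number of terms in d1 = 10 [the document has 10 terms in total]

$\text{TF}(\text{movie}, d_1) = \frac{1}{10} = 0.1$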
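As a quick check of the term-frequency arithmetic for d1, the counting can be sketched in a few lines of Python. The variable names are my own, and dropping the trailing "." with `isalpha()` is an assumption I make so that the total matches the 10 terms counted in this post.

```python
from collections import Counter

# Pre-processed text of document 1, copied from this post.
d1_text = "kesri good movie watch family good stories india freedom fight ."

# Keep alphabetic tokens only, so the trailing "." is not counted as a term.
d1_tokens = [tok for tok in d1_text.split() if tok.isalpha()]

counts = Counter(d1_tokens)
tf_movie_d1 = counts["movie"] / len(d1_tokens)

print(len(d1_tokens))   # 10
print(counts["movie"])  # 1
print(tf_movie_d1)      # 0.1
```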

|D| = Total number of documents = 3

Total number of documents in which the term movie appears = 2 [d1 and d3]

$\text{IDF}(\text{movie}) = \log\frac{3}{2}$

Now we can calculate the TF-IDF score for the term movie for document 1:

$\text{TF-IDF}(\text{movie}, d_1) = 0.1 \times \log\frac{3}{2}$

After pre-processing d3, we are left with the following words in d3:

'new movie releasing month due corona virus .'

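Frequency of the term movie in d3 = 1 [the term appears only once in the document]

Total number of terms in d3 = 7 [the document has 7 terms in total]

$\text{TF}(\text{movie}, d_3) = \frac{1}{7} = 0.1429$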

 

|D| = Total number of documents = 3

Total number of documents in which the term movie appears = 2 [d1 and d3]

$\text{IDF}(\text{movie}) = \log\frac{3}{2}$

Now we can calculate the TF-IDF score for the term movie for document 3:

$\text{TF-IDF}(\text{movie}, d_3) = 0.1429 \times \log\frac{3}{2}$

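Now we can see that the word movie is more relevant to document 3 than to document 1, because its TF-IDF score for d3 is higher than its score for d1. The IDF of movie is the same for both documents, so the difference comes entirely from the term frequency: 1/7 for d3 versus 1/10 for d1.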
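To put numbers on this comparison, here is a small Python sketch using the counts from the worked example. The variable names are my own, and I have assumed the natural logarithm; a different base would rescale both scores by the same factor and leave the ordering unchanged.

```python
import math

# Counts taken from the worked example in this post.
total_documents = 3        # |D|
documents_with_movie = 2   # d1 and d3 contain "movie"

tf_d1 = 1 / 10  # "movie" appears once among the 10 terms of d1
tf_d3 = 1 / 7   # "movie" appears once among the 7 terms of d3

# Assumption: natural logarithm.
idf_movie = math.log(total_documents / documents_with_movie)

tfidf_d1 = tf_d1 * idf_movie
tfidf_d3 = tf_d3 * idf_movie

print(f"TF-IDF(movie, d1) = {tfidf_d1:.4f}")    # 0.0405
print(f"TF-IDF(movie, d3) = {tfidf_d3:.4f}")    # 0.0579
print("d3 scores higher:", tfidf_d3 > tfidf_d1)  # True
```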

Now, to create the matrix, we put the TF-IDF score of each term for each document in place of the raw frequency.
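A rough sketch of such a matrix in plain Python could look like the following. Please treat it as an illustration under assumptions: the token list for d2 is my own guess at its pre-processed form (this post only shows d1 and d3), the names `documents`, `vocabulary`, `idf` and `matrix` are made up for this example, and the natural logarithm is assumed.

```python
import math
from collections import Counter

documents = {
    "d1": "kesri good movie watch family good stories india freedom fight".split(),
    # Assumption: a guess at d2 after pre-processing; the post does not show it.
    "d2": "success depends performance actors story".split(),
    "d3": "new movie releasing month due corona virus".split(),
}

# One column per distinct term, in alphabetical order.
vocabulary = sorted({term for tokens in documents.values() for term in tokens})


def idf(term):
    """log(|D| / number of documents containing the term); natural log assumed."""
    containing = sum(1 for tokens in documents.values() if term in tokens)
    return math.log(len(documents) / containing)


# One row per document, holding the TF-IDF score of every vocabulary term.
matrix = {}
for name, tokens in documents.items():
    counts = Counter(tokens)
    matrix[name] = [counts[term] / len(tokens) * idf(term) for term in vocabulary]

# Show the "movie" column for each document.
movie_col = vocabulary.index("movie")
for name, row in matrix.items():
    print(name, round(row[movie_col], 4))
```

Terms that do not occur in a document get a score of 0 in that document's row, so d2 has 0 in the movie column.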

I would urge you to practise by calculating the TF-IDF score for the other terms and building the matrix from them. If you have any queries or need any clarification, please post in the comment section and I would love to help you.

Please do not forget to follow me on my blog page. Next week we will discuss how to create a TF-IDF matrix using Python and how to check for spelling errors in a document.

Happy weekend and keep learning... Stay safe and healthy.

