Posts

Essential Maths for ML – Part 3

In my last blog [ Essential Maths for ML - Part 2 ] we discussed the addition rule, the multiplication rule of probability, and conditional probability. In this blog, we will discuss Bayes' Theorem, which plays an important role in many machine learning algorithms.

Let A1, A2, A3 and A4 be mutually exclusive and exhaustive events of a random experiment, and let B be a common event, i.e. B is made up of these 4 mutually exclusive and exhaustive events:

P(B) = P(A1 ∩ B) + P(A2 ∩ B) + P(A3 ∩ B) + P(A4 ∩ B)
P(B) = Σ P(Ai ∩ B) ………… (1)

We already know from the concept of conditional probability that

P(A1 ∩ B) = P(B) * P(A1/B)
P(A1/B) = P(A1 ∩ B) / P(B) ……… (2)

Substituting the value of P(B) from eq. (1) into eq. (2), we get

P(A1/B) = P(A1 ∩ B) / Σ P(Ai ∩ B)
P(A1/B) = P(A1) * P(B/A1) / Σ P(Ai ∩ B)

So Bayes' Theorem states that if A1, A2, A3, …, An are n mutually exclusive and exhaustive events with prior probabilities P(A...
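As a quick numerical illustration of the theorem above, here is a minimal Python sketch with hypothetical numbers (my own example, not from the post): three sources A1, A2, A3 with prior probabilities 50%, 30% and 20%, and likelihoods P(B/Ai) of 1%, 2% and 3%.

```python
# Hypothetical priors P(Ai) and likelihoods P(B/Ai)
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.02, 0.03]

# Total probability: P(B) = Σ P(Ai) * P(B/Ai)
p_b = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior by Bayes' Theorem: P(A1/B) = P(A1) * P(B/A1) / P(B)
posterior_a1 = priors[0] * likelihoods[0] / p_b
print(round(posterior_a1, 4))  # -> 0.2941
```

Note how the posterior for A1 (about 0.29) is much smaller than its prior (0.5), because A1 has the lowest likelihood of producing B.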

Essential Maths for ML – Part 2

In my last blog [ Essential Maths for ML - Part 1 ] we discussed the various types of events. In this blog let's discuss other important concepts related to probability.

1. Rules of probability

a. Addition Rule: If A and B are any two events that are not mutually exclusive, then the probability of occurrence of either A or B is given by

P(A U B) = P(A) + P(B) - P(A ∩ B)

If A and B are mutually exclusive events, then the probability of occurrence of either A or B is given by

P(A U B) = P(A) + P(B)

b. Multiplication Rule: If A and B are two independent events, then the probability of occurrence of both A and B is given by

P(A ∩ B) = P(A) * P(B)

2. Conditional Probability

The conditional probability of occurrence of event A, given that event B has already occurred, is denoted by P(A/B), where A and B are dependent events. The probability of occurrence of both A and B is given by

P(A ∩ B) = P(B) * P(A/B), so
P(A/B) = P(A ∩ B) / P(B)

Let's also unde...
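The rules above can be verified numerically on a small sample space. Below is a short Python sketch (my own example, not from the original post) using a single fair die, with A = "even number" and B = "greater than 3":

```python
from fractions import Fraction

# Sample space of one fair die roll
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event A: even number
B = {4, 5, 6}   # event B: greater than 3

def P(event):
    # Probability as favourable outcomes / total outcomes
    return Fraction(len(event), len(S))

# Addition rule (A and B are not mutually exclusive)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Conditional probability: P(A/B) = P(A ∩ B) / P(B)
p_a_given_b = P(A & B) / P(B)
print(p_a_given_b)  # -> 2/3
```

Using `Fraction` keeps the probabilities exact, so the rules hold with equality rather than up to floating-point error.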

Essential Maths Concepts for ML – Part 1

Let's discuss some terms and their definitions related to statistics and probability. This will help us brush up on the concepts of probability that are essential for machine learning algorithms.

Definition of Probability

Probability is a numerical measure of the chance of occurrence of an event, say A. It is denoted by P(A) and is the ratio of the number of outcomes favourable to event A, say m, to the total number of outcomes of the experiment, say n:

P(A) = m/n

where m represents the number of favourable outcomes of event A and n is the total number of outcomes of the experiment.

Let's understand the term experiment in more detail. An operation that results in a definite set of outcomes is called an experiment. E.g. tossing a coin is an experiment, as it has two outcomes, Head or Tail, and their number is definite. Throwing a fair die is an experiment, as it has only 6 outcomes, which is a definite number.

Random Experiment

When the outcome of an ex...
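The formula P(A) = m/n can be illustrated with a tiny Python snippet (a made-up example, not from the post): tossing two fair coins and computing the probability of getting at least one head by enumerating all outcomes.

```python
from itertools import product

# Experiment: tossing two fair coins; enumerate all outcomes
outcomes = list(product("HT", repeat=2))        # n = 4 total outcomes
favourable = [o for o in outcomes if "H" in o]  # event A: at least one head, m = 3

p_a = len(favourable) / len(outcomes)           # P(A) = m / n
print(p_a)  # -> 0.75
```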

Levenshtein distance concept in NLP

In my last blog we discussed the TF-IDF implementation using Python; for more details refer to [ TF - IDF Implementation Using Python ]. In this blog we will discuss how to handle spelling correction so that stemming or lemmatization can be performed effectively.

There is a concept known as "edit distance": the minimum number of edits required to transform one string into another. We can perform the following types of operations:

- Insertion of a letter
- Deletion of a letter
- Modification of a letter

Let's take an example to understand this concept in more detail. Suppose "success" is written as "sucess". We have two strings, one of length 7 [with the correct spelling] and another of length 6 [with the incorrect spelling].

Step 1: If the strings are of length M and N, we need to create a matrix of size (M+1) x (N+1). In our case we will create a matrix of size 7 x 8 as follows.

Step 2: Initialize the first row and first column st...
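Since the remaining steps are cut off above, here is a minimal Python implementation of the matrix-based procedure, assuming the standard Levenshtein (dynamic programming) recurrence over insertions, deletions and modifications:

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum number of insertions, deletions
    and modifications needed to turn `source` into `target`."""
    m, n = len(source), len(target)
    # Step 1: create an (M+1) x (N+1) matrix; dp[i][j] holds the distance
    # between the first i letters of source and first j letters of target
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Step 2: initialize the first row and first column with 0, 1, 2, ...
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    # Fill the rest of the matrix cell by cell
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # modification
    return dp[m][n]

print(edit_distance("sucess", "success"))  # -> 1 (insert one 'c')
```

The bottom-right cell of the matrix holds the final answer; for "sucess" vs "success" a single insertion suffices, so the distance is 1.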

TF - IDF Implementation using Python

In my last blog we discussed how to use the TF-IDF method to extract features from text; refer to TF - IDF Method. Now we will see how we can implement the TF-IDF concept using Python. Let's consider the same three sentences from the last blog to understand the TF-IDF implementation in Python:

1. Kesri is a good movie to watch with family as it has good stories about India freedom fight.
2. The success of it depends on the performance of the actors and story it has.
3. There are no new movies releasing this month due to corona virus.

The first step is to import the necessary libraries for text processing:

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

You must have already noticed that we imported TfidfVectorizer to extract the text features using TF-IDF. The second step is to store the sentences in a list:

documents = ["Kesri is a go...

TF - IDF Method

In my last blog we discussed how to create a bag of words using Python [refer to this link: CreatingBag-of-Words using Python ]. We have seen that the bag-of-words approach depends purely on the frequency of words. Now let's discuss another approach to converting textual data into matrix format, called TF-IDF [Term Frequency – Inverse Document Frequency]; it is the approach preferred by most data scientists and machine learning professionals.

In this approach we consider a term relevant to a document if the term appears frequently in that document and is unique to it, i.e. the term should not appear in all the documents. So its frequency with respect to all documents should be small, while its frequency within the specific document should be high. The TF-IDF score is calculated from:

- the term frequency of a term (t) in a document (d), and
- the inverse document frequency of the term.

Below are the formulas for calculating the...
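The two quantities just described can be computed by hand on a toy corpus. The post's formulas are truncated above, so the sketch below assumes the common definitions tf = (count of term in document) / (document length) and idf = log(N / number of documents containing the term); the toy data is my own illustration:

```python
import math

# A toy corpus of tokenized documents (hypothetical example)
docs = [["good", "movie", "good", "story"],
        ["movie", "actors"],
        ["corona", "virus"]]

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the doc
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(N / number of docs containing term)
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "good" is frequent in doc 0 and unique to it -> higher score
print(round(tf_idf("good", docs[0], docs), 3))
# "movie" appears in two of the three docs -> lower score
print(round(tf_idf("movie", docs[0], docs), 3))
```

This matches the intuition above: a term that is frequent in one document but rare across the corpus gets the highest score.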

Creating Bag-of-Words using Python

In my last two blogs we discussed the bag-of-words method for extracting features from text [refer to this link: Feature Extraction - Text- Bag of word ] and the stemming and lemmatization techniques used to avoid the redundant-token problem [refer to this link: Stemming and Lemmatization ]. Now it's time to apply those concepts using Python and see things in action. We are going to explore the three options below. Let's consider the following three sentences:

1. Kesri is a good movie to watch with family as it has good stories about India freedom fight.
2. The success of a movie depends on the performance of the actors and story it has.
3. There are no new movies releasing this month due to corona virus.

Using the above three sentences we will extract the bag-of-words by applying the concepts of tokenization, stemming and lemmatization. So let's get started.

Step 1: Import the libraries: word_tokenize for tokenization, stopwords for stop words, and CountVectorizer for creating the bag-of-words.

#...