Lexical Processing

In my previous blog, Word Frequency Distribution: Power Law, I explained the basic concepts of lexical processing and how word frequencies are distributed in text data used for machine learning algorithms.

In this article we will go through the high-level steps to process textual data for machine learning. As part of this series, I will explain the lexical processing of text using the different types of tokenization features available in Python.

The following steps are performed when processing textual data for machine learning:

  1. Lexical Processing of text: Converting raw text into words, sentences, paragraphs etc.
  2. Syntactic Processing of text: Understanding the relationships among words used in the sentences.
  3. Semantic Processing of text: Understanding the meaning of the text.

To do the lexical processing of text, we perform:

  • Tokenization and
  • Extraction of features from text

“Tokenization” is a technique used to split text into smaller elements. These elements can be characters, words, sentences, or paragraphs, depending upon the type of application we are working on.

E.g., you must have heard about spam detectors; in a spam detector we break the message or email text into words in order to identify whether the message or email is spam or ham.

 This technique of splitting text into either words or sentences or paragraphs is called “tokenization”.
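
To make the spam example concrete, here is a minimal sketch (the trigger words and the looks_like_spam function below are purely hypothetical, just to illustrate why word-level tokens are useful):

# A toy spam check built on word-level tokens (illustrative only)
SPAM_TRIGGER_WORDS = {"free", "winner", "prize", "urgent"}

def looks_like_spam(message):
    # Split the message into lowercase word tokens (naive whitespace split)
    tokens = message.lower().split()
    # Flag the message if any token matches a known trigger word
    return any(token.strip(".,!?") in SPAM_TRIGGER_WORDS for token in tokens)

print(looks_like_spam("You are a WINNER! Claim your free prize now"))   # True
print(looks_like_spam("Let's meet for lunch tomorrow"))                 # False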

Python has a library called “NLTK” which provides different types of tokenizers; some of the most popular are:

Word Tokenizer: 
As the name suggests, it is used to split the text into words.
Let's consider the following text:
“We are entering a new world. The technologies of machine learning, speech recognition, and natural language understanding are reaching a nexus of capability.The end result is that we’ll soon have artificially intelligent assistants to help us in every aspect of our lives.”
If we generate the word tokens for the above text, it extracts all the words as tokens (including punctuation marks). Note that 'capability.The' comes out as a single token because there is no space after the full stop.
['We', 'are', 'entering', 'a', 'new', 'world', '.', 'The', 'technologies', 'of', 'machine', 'learning', ',', 'speech', 'recognition', ',', 'and', 'natural', 'language', 'understanding', 'are', 'reaching', 'a', 'nexus', 'of', 'capability.The', 'end', 'result', 'is', 'that', 'we', "'ll", 'soon', 'have', 'artificially', 'intelligent', 'assistants', 'to', 'help', 'us', 'in', 'every', 'aspect', 'of', 'our', 'lives', '.']

This can be implemented in Python as follows:

from nltk.tokenize import word_tokenize
# nltk.download('punkt')  # run once (after `import nltk`) if the punkt models are missing

text = "We are entering a new world. The technologies of machine learning, speech recognition, and natural language understanding are reaching a nexus of capability.The end result is that we'll soon have artificially intelligent assistants to help us in every aspect of our lives."

# Split the text into word-level tokens (punctuation marks become tokens too)
words = word_tokenize(text)
print(words)

Sentence Tokenizer: 

It is used to split the text into sentences. If we generate the sentence tokens for the above text, it splits it into the following two sentences; the third sentence is not separated because there is no space after the full stop in 'capability.The'.

'We are entering a new world.', 
"The technologies of machine learning, speech recognition, and natural language understanding are reaching a nexus of capability.The end result is that we'll soon have artificially intelligent assistants to help us in every aspect of our lives."
This can be implemented in Python as follows:

from nltk.tokenize import sent_tokenize

# Split the same text into sentence-level tokens
sentences = sent_tokenize(text)
print(sentences)
Tweet Tokenizer: 

It is used to properly extract tokens such as emojis, hashtags, and user mentions that we generally use when posting text on social media.

Consider the following tweet as an example.

message = "Another Record Of its own...#Tubelight gets its own emoji..FIRST EVER fr Hindi Cinema , Kmaal krte ho \
@BeingSalmanKhan\@kabirkhankk 👏👌✌"

The word tokens of the above message will be:

['Another', 'Record', 'Of', 'its', 'own', '...', '#', 'Tubelight', 'gets', 'its', 'own', 'emoji..FIRST', 'EVER', 'fr', 'Hindi', 'Cinema', ',', 'Kmaal', 'krte', 'ho', '@', 'BeingSalmanKhan\\', '@', 'kabirkhankk', '👏👌✌']

If we generate word tokens from the above message, the word tokenizer splits the hashtag and the user mentions into separate pieces ('#', 'Tubelight', '@', 'BeingSalmanKhan', ...) and clubs the emojis together into a single token. While doing text data analysis, emojis, hashtags, and mentions also give useful information about the sentiment of the user, so for this purpose we should use the tweet tokenizer.
Tokens using the tweet tokenizer:
['Another', 'Record', 'Of', 'its', 'own', '...', '#Tubelight', 'gets', 'its', 'own',
 'emoji', '..', 'FIRST', 'EVER', 'fr', 'Hindi', 'Cinema', ',', 'Kmaal', 'krte', 'ho', '@BeingSalmanKhan', '\\', '@kabirkhankk', '👏', '👌', '✌']

This can be implemented in Python as follows:

from nltk.tokenize import TweetTokenizer

# Tokenize the tweet while keeping hashtags, @mentions, and emojis intact
tknzr = TweetTokenizer()
print(tknzr.tokenize(message))
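
TweetTokenizer also accepts a couple of useful options: strip_handles drops the @mentions entirely, and reduce_len shortens runs of three or more repeated characters. A quick sketch (the sample tweet below is just an illustrative example):

# Drop @handles and squash elongated words while tokenizing
tweet_tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tweet_tknzr.tokenize("Waaaaay too good @BeingSalmanKhan #Tubelight 👏"))
# expected: ['Waaay', 'too', 'good', '#Tubelight', '👏']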
Regex Tokenizer:
It is used to write a custom tokenizer using regex patterns, according to the requirements of the application we are working on.
Let's take the above message as an example, where we want to extract the #tags or the users tagged using @; we need to write a custom tokenizer to extract this information, as shown below.
from nltk.tokenize import regexp_tokenize

The below pattern will extract all the words starting with # (here, '#Tubelight'):

pattern = r"#[\w]+"      # '#' followed by one or more word characters
print(regexp_tokenize(message, pattern))
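
Similarly, if we also want the users tagged with @, a pattern along the same lines can be used (a small sketch; the exact pattern depends on which characters you want to allow in a handle):

# '@' followed by one or more word characters picks up the user mentions
mention_pattern = r"@[\w]+"
print(regexp_tokenize(message, mention_pattern))
# expected: ['@BeingSalmanKhan', '@kabirkhankk']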
As we already know, machine learning algorithms understand data in mathematical form. In the next article I will cover a few approaches to convert text data into a machine-understandable form using tokens.

Kindly share any queries you would like me to address in the comments section.

Please do subscribe to the blog and share your comments.
