Lexical Processing
In this article we will go through the high-level steps of processing textual data for machine learning, and as part of this series I will explain the lexical processing of text using different types of tokenization available in Python.
When processing textual data for machine learning, the following steps are performed:
- Lexical processing of text: converting raw text into words, sentences, paragraphs, etc.
- Syntactic processing of text: understanding the relationships among the words used in a sentence.
- Semantic processing of text: understanding the meaning of the text.
For the lexical processing of text we perform:
- Tokenization, and
- Extraction of features from the text
Tokenization is a technique used to split text into smaller elements. These elements can be characters, words, sentences, or paragraphs, depending upon the type of application we are working on.
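For example, the token lists shown below were produced from the following text (reconstructed verbatim from the output, including the missing space after "capability."). Word-level tokenization can be done with NLTK's word_tokenize; a minimal sketch, with the variable names being my own:
from nltk.tokenize import word_tokenize

# nltk.download('punkt') may be needed the first time this is run
text = ("We are entering a new world. The technologies of machine learning, "
        "speech recognition, and natural language understanding are reaching "
        "a nexus of capability.The end result is that we'll soon have "
        "artificially intelligent assistants to help us in every aspect of "
        "our lives.")

words = word_tokenize(text)
print(words)
Word tokens of the text: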
['We', 'are', 'entering', 'a', 'new', 'world', '.', 'The', 'technologies', 'of', 'machine', 'learning', ',', 'speech', 'recognition', ',', 'and', 'natural', 'language', 'understanding', 'are', 'reaching', 'a', 'nexus', 'of', 'capability.The', 'end', 'result', 'is', 'that', 'we', "'ll", 'soon', 'have', 'artificially', 'intelligent', 'assistants', 'to', 'help', 'us', 'in', 'every', 'aspect', 'of', 'our', 'lives', '.']
Sentence tokens of the same text (note that the sentence tokenizer finds only two sentences, because the missing space in "capability.The" hides that sentence boundary):
'We are entering a new world.',
"The technologies of machine learning, speech recognition, and natural language understanding are reaching a nexus of capability.The end result is that we'll soon have artificially intelligent assistants to help us in every aspect of our lives."
This can be implemented in Python as follows:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
print(sentences)
Tweet Tokenizer:
It is used to correctly extract the emojis, hashtags, and user mentions that we generally use when posting text on social media.
Consider the following tweet as an example.
message = "Another Record Of its own...#Tubelight gets its own emoji..FIRST EVER fr Hindi Cinema , Kmaal krte ho \
@BeingSalmanKhan\@kabirkhankk 👏👌✌"
The word tokens of the above message will be:
['Another', 'Record', 'Of', 'its', 'own', '...', '#', 'Tubelight', 'gets', 'its', 'own', 'emoji..FIRST', 'EVER', 'fr', 'Hindi', 'Cinema', ',', 'Kmaal', 'krte', 'ho', '@', 'BeingSalmanKhan\\', '@', 'kabirkhankk', '👏👌✌']
As you can see, if we generate word tokens from the above message, the word tokenizer splits hashtags and mentions apart and is not able to extract the emojis cleanly. In text data analysis, emojis also give useful information about the sentiment of the user, so for this purpose we should use the tweet tokenizer.
Tokens using the tweet tokenizer:
['Another', 'Record', 'Of', 'its', 'own', '...', '#Tubelight', 'gets', 'its', 'own',
'emoji', '..', 'FIRST', 'EVER', 'fr', 'Hindi', 'Cinema', ',', 'Kmaal', 'krte', 'ho', '@BeingSalmanKhan', '\\', '@kabirkhankk', '👏', '👌', '✌']
This can be implemented in Python as follows:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize(message)
Regex Tokenizer:
It is used to write a custom tokenizer with regex patterns, according to the requirements of the application we are working on. Let's take the above message as an example: suppose we want to extract the #tags, or the users tagged using @, from a message. We need to write a custom tokenizer to extract this information.
from nltk.tokenize import regexp_tokenize
The below-mentioned pattern will extract all the words starting with #:
pattern = "#[\w]+"
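Putting it together, here is a minimal sketch of how regexp_tokenize can apply such patterns to the message above (the mention pattern "@[\w]+" is my addition, not from the original post):
from nltk.tokenize import regexp_tokenize

# extract hashtags: words starting with '#'
hashtags = regexp_tokenize(message, r"#[\w]+")   # ['#Tubelight']
# extract user mentions: words starting with '@' (assumed pattern)
mentions = regexp_tokenize(message, r"@[\w]+")   # ['@BeingSalmanKhan', '@kabirkhankk']
print(hashtags)
print(mentions)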
As we already know, machine learning algorithms understand data in mathematical form, so in the next article I will cover a few approaches to convert text data into a machine-understandable form using these tokens.
Kindly share any queries you would like me to address in the comments section, and please do subscribe to the blog.