Natural Language Processing
What is NLP?
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI)
and Computer Science that is concerned with the interactions between
computers and humans in natural language. The goal of NLP is to develop
algorithms and models that enable computers to understand, interpret,
generate, and manipulate human languages.
E.g., Spam filters: Gmail uses natural language processing (NLP) to discern
which emails are legitimate and which are spam. These spam filters look at
the text of each email you receive and analyze its meaning to decide
whether or not it is spam.
NLP tasks
Text and speech processing: This includes speech recognition, text-to-
speech processing, encoding (i.e., converting speech or text into a
machine-readable representation), etc.
Text classification: This includes sentiment analysis, in which the machine
analyzes the emotions, attitudes, and sarcasm expressed in a text and
classifies it accordingly.
Language generation: This includes tasks such as machine translation,
summarization, and essay writing, which aim to produce coherent and
fluent text.
Language interaction: This includes tasks such as dialogue systems, voice
assistants, and chatbots, which aim to enable natural communication
between humans and computers.
Word Clouds
Word clouds are a popular visualization technique in data science used to
represent text data. They are useful for summarizing large amounts of text
by highlighting the most frequently occurring words: each word is
artistically depicted at a size proportional to its count.
E.g., a word cloud in which each word is sized by its count (figure omitted).
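As a minimal sketch, assuming the third-party wordcloud package and matplotlib are installed, such a figure could be produced roughly as follows (the word counts below are purely illustrative):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Illustrative (made-up) word counts; in practice these come from your corpus.
counts = {"data": 100, "science": 80, "Python": 60, "analytics": 40, "NLP": 30}

# Render each word at a size proportional to its count.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(counts)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()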
This looks neat but doesn’t really tell us anything. A more interesting
approach might be to scatter them so that horizontal position indicates
posting popularity and vertical position indicates résumé popularity, which
produces a visualization that conveys a few insights.
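A minimal sketch of that scatter-style layout with matplotlib; the (posting popularity, résumé popularity) scores below are made up purely for illustration:

import matplotlib.pyplot as plt

# Hypothetical data: (word, posting popularity, resume popularity).
data = [("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50),
        ("R", 50, 40), ("machine learning", 80, 20), ("statistics", 20, 60)]

def text_size(total):
    """Scale the font size with combined popularity: 8pt at 0, 28pt at 200."""
    return 8 + total / 200 * 20

# Place each word at (posting popularity, resume popularity),
# sized by how popular it is overall.
for word, posting_popularity, resume_popularity in data:
    plt.text(posting_popularity, resume_popularity, word,
             ha="center", va="center",
             size=text_size(posting_popularity + resume_popularity))

plt.xlabel("Posting popularity")
plt.ylabel("Résumé popularity")
plt.axis([0, 100, 0, 100])
plt.show()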
N-gram model
N-gram language models are a type of statistical language model used in
natural language processing (NLP).
They are based on the concept of predicting the probability of a word (or
sequence of words) occurring in a given context based on the history of
preceding words.
In an N-gram model, "N" refers to the number of consecutive words
considered together; each word is predicted from the N-1 words that
precede it. For example:
-> Unigram (N=1): Each word is modeled on its own, with no preceding
context.
-> Bigram (N=2): Each word is predicted from the single preceding word.
-> Trigram (N=3): Each word is predicted from the two preceding words.
-> N-gram (N>3): Each word is predicted from the N-1 preceding words.
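As a rough sketch of the idea, the following bigram (N=2) example counts which word follows which in a tiny made-up corpus, then generates text by repeatedly sampling the next word given only the previous one:

import random
from collections import defaultdict

# A tiny, made-up corpus for illustration.
corpus = "data science is fun and data science is useful and data is everywhere"
words = corpus.split()

# Bigram model: for each word, collect the words observed to follow it.
transitions = defaultdict(list)
for prev, curr in zip(words, words[1:]):
    transitions[prev].append(curr)

def generate(start, length=8):
    """Generate text by sampling each next word given only the previous word."""
    current = start
    output = [current]
    for _ in range(length):
        followers = transitions.get(current)
        if not followers:          # dead end: no observed continuation
            break
        current = random.choice(followers)
        output.append(current)
    return " ".join(output)

print(generate("data"))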
We'll use the Requests and Beautiful Soup libraries to retrieve the data, but
there are a couple of issues:
i) The first is that the apostrophes (’) in the text are actually the Unicode
character u"\u2019". We'll create a helper function to replace them with
normal apostrophes:
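A minimal version of such a helper might look like this:

def fix_unicode(text):
    """Replace the Unicode right single quotation mark with a normal ASCII apostrophe."""
    return text.replace(u"\u2019", "'")

# e.g. fix_unicode("it\u2019s") returns "it's"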