
Practical 09: Text Pre-Processing

Aim:- To clean the text data and make it ready to feed to the model.

Theory:- Tokenization

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.

Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:

1. Character tokens: s-m-a-r-t-e-r

2. Subword tokens: smart-er
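
All three granularities can be reproduced with plain Python. The snippet below is a minimal sketch: the word and character splits follow the definitions above, while the subword split uses a hand-written suffix list purely for illustration (real subword tokenizers such as BPE or WordPiece learn their splits from data).

# Word tokens: split on space, the delimiter assumed above
sentence = "Never give up"
word_tokens = sentence.split(" ")        # ['Never', 'give', 'up']

# Character tokens: every character becomes a token
char_tokens = list("smarter")            # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokens: illustrative only; peel off a known suffix when present
def toy_subword(word, suffixes=("er", "ing", "est")):
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], suffix]
    return [word]

subword_tokens = toy_subword("smarter")  # ['smart', 'er']
print(word_tokens, char_tokens, subword_tokens)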

The True Reasons behind Tokenization

As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.

For example, Transformer based models – the State of The Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP, such as RNN, GRU, and LSTM, also process the raw text at the token level.

Hence, Tokenization is the foremost step while modeling text data. Tokenization is performed on the corpus to obtain tokens. The resulting tokens are then used to prepare a vocabulary. Vocabulary refers to the set of unique tokens in the corpus. Remember that a vocabulary can be constructed by considering each unique token in the corpus or by considering only the top K most frequently occurring words.
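
As a rough sketch of the two vocabulary-building options just mentioned (the toy corpus and the value of K are made up for illustration):

from collections import Counter

corpus_tokens = ["never", "give", "up", "never", "give", "in"]

# Option 1: every unique token in the corpus becomes a vocabulary entry
vocab_all = set(corpus_tokens)
print(vocab_all)                       # {'never', 'give', 'up', 'in'}

# Option 2: keep only the top K most frequently occurring tokens
K = 3
vocab_topk = [token for token, _ in Counter(corpus_tokens).most_common(K)]
print(vocab_topk)                      # e.g. ['never', 'give', 'up']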

Word Tokenization

Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter. Depending upon the delimiter, different word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe come under word tokenization.
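
A quick illustration of how the choice of delimiter changes the word-level tokens (the example sentence is made up); it contrasts a plain whitespace split with a regular expression that also strips punctuation:

import re

text = "Never give up, never give in."

# Delimiter = whitespace: punctuation stays attached to the neighbouring word
tokens_space = text.split()
print(tokens_space)    # ['Never', 'give', 'up,', 'never', 'give', 'in.']

# Delimiter = any non-word character: punctuation is dropped
tokens_regex = re.findall(r"\w+", text)
print(tokens_regex)    # ['Never', 'give', 'up', 'never', 'give', 'in']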

Character Tokenization

Character Tokenization splits a piece of text into a set of characters. It overcomes some of the drawbacks of Word Tokenization:

• Character tokenizers handle OOV (out-of-vocabulary) words coherently by preserving the information of the word: the OOV word is broken down into characters and represented in terms of these characters.

• It also limits the size of the vocabulary. Want to take a guess at the size of the vocabulary? 26, since the vocabulary only needs to contain the unique set of characters (for lowercase English text).
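
A minimal sketch of these two points, assuming lowercase English text with no digits or punctuation (the word "blorptastic" is a made-up OOV example):

import string

# Character-level vocabulary: the 26 lowercase English letters
char_to_id = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
print(len(char_to_id))              # 26

# A word missing from any word-level vocabulary is still representable
oov_word = "blorptastic"            # made-up out-of-vocabulary word
char_ids = [char_to_id[ch] for ch in oov_word.lower()]
print(char_ids)                     # one id per character, no <UNK> token needed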

Text classification with TensorFlow Hub: Movie reviews


This notebook classifies movie reviews as positive or negative using the text of the review.
This is an example of binary—or two-class—classification, an important and widely
applicable kind of machine learning problem.
The tutorial demonstrates the basic application of transfer learning with TensorFlow Hub and
Keras.

Loss function and optimizer


A model needs a loss function and an optimizer for training. Since this is a binary
classification problem and the model outputs logits (a single-unit layer with a linear
activation), you'll use the binary_crossentropy loss function.
This isn't the only choice for a loss function; you could, for instance, choose
mean_squared_error. But, generally, binary_crossentropy is better for dealing with
probabilities—it measures the "distance" between probability distributions, or in our case,
between the ground-truth distribution and the predictions.
Later, when you are exploring regression problems (say, to predict the price of a house),
you'll see how to use another loss function called mean squared error.
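
As a small side check (not part of the practical itself), the sketch below shows what from_logits=True means: the loss applies the sigmoid internally, so feeding raw logits with from_logits=True gives the same value as feeding sigmoid probabilities with from_logits=False. The labels and logits here are made up.

import tensorflow as tf

labels = tf.constant([[1.0], [0.0]])
logits = tf.constant([[2.0], [-1.0]])   # raw model outputs, no activation applied

bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
bce_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)

loss_from_logits = bce_logits(labels, logits)
loss_from_probs = bce_probs(labels, tf.sigmoid(logits))
print(float(loss_from_logits), float(loss_from_probs))   # the two values agree (up to numerical precision)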

Code :-

from keras.preprocessing.text import Tokenizer
# In older Keras versions pad_sequences lives in keras.preprocessing.sequence:
# from keras.preprocessing.sequence import pad_sequences
from keras.utils import pad_sequences
import numpy as np

maxlen = 10        # maximum sequence length (not used below; pad_sequences is given explicit lengths)
#training_samples = 20
#validation_samples = 100
max_words = 10     # keep only the most frequent words in the vocabulary

# Word-level tokenizer whose vocabulary is capped at max_words entries
Model_tokenizer = Tokenizer(num_words=max_words)

texts = ["This is a girl.", "Girl is tall", "A tall boy is here"]

# "Train" the tokenizer: build the word index from the corpus
Model_tokenizer.fit_on_texts(texts)

# Use the fitted tokenizer to map each sentence to a sequence of word indices
sequences = Model_tokenizer.texts_to_sequences(texts)
print(sequences)

# Sentences containing unseen words: out-of-vocabulary words are simply dropped
text_new = ["this is my house", "the girl and boy", "The house is small"]
sequences_new = Model_tokenizer.texts_to_sequences(text_new)
print(sequences_new)

# Reverse mapping: word indices back to text (indices outside the vocabulary, such as 23, are skipped)
seq_try = [[5, 1, 2, 23], [3, 6]]
text_try = Model_tokenizer.sequences_to_texts(seq_try)
print(text_try)

print(Model_tokenizer.document_count)   # number of documents the tokenizer was fit on
print(Model_tokenizer.get_config())     # full tokenizer configuration, including the word index

# Pad (or truncate) every sequence to a fixed length so they can be batched
data = pad_sequences(sequences, maxlen=6)
print(data)

data_new = pad_sequences(sequences_new, maxlen=4)
print(data_new)

Text classification with TensorFlow Hub: Movie reviews

import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

Download the IMDB dataset

# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

Explore the data

train_examples_batch, train_labels_batch = next(iter(train_data.batch(4)))


train_examples_batch

Build the model

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:4])
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Loss function and optimizer

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

Train the model

len(train_data)

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=2,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Evaluate the model

results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

Conclusion
Text preprocessing involves transforming text into a clean and consistent format that can then be fed into a model for further analysis and learning. Text preprocessing techniques may be general, so that they apply to many types of applications, or specialized for a specific task.
Experiment Number | Date of Performance | Grade | Teacher's Sign
