DL Practical 09: Text Preprocessing
Aim:- To clean text data and make it ready to be fed to a deep learning model.
Theory:- Tokenization
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens
can be either words, characters, or subwords. Hence, tokenization can be broadly classified into
three types – word, character, and subword tokenization.
The most common way of forming tokens is based on space. Assuming space as a delimiter,
tokenizing the sentence “Never give up” results in 3 tokens – Never-give-up. As each token is a
word, this is an example of word tokenization.
Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:
its character tokens are s-m-a-r-t-e-r, and its subword tokens are smart-er.
As tokens are the building blocks of natural language, the most common way of processing
raw text happens at the token level.
For example, Transformer-based models – the state-of-the-art (SOTA) deep learning
architectures in NLP – process the raw text at the token level. Similarly, the most popular deep
learning architectures for NLP, such as RNN, GRU, and LSTM, also process the raw text at the
token level.
Hence, tokenization is the foremost step while modeling text data. Tokenization is
performed on the corpus to obtain tokens. These tokens are then used to prepare a
vocabulary, which refers to the set of unique tokens in the corpus. Remember that a
vocabulary can be constructed either by considering every unique token in the corpus or by
considering only the top K most frequently occurring words.
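For example, a minimal sketch of building a vocabulary from word-level tokens is given below; the sample corpus and the use of Python's collections.Counter are illustrative assumptions, not part of this practical's dataset:
from collections import Counter

# Illustrative toy corpus; any list of raw sentences works the same way
corpus = ["never give up", "never stop learning", "learning never stops"]

# Word-level tokenization using space as the delimiter
tokens = [word for sentence in corpus for word in sentence.split()]

# Vocabulary: the set of unique tokens in the corpus
vocabulary = set(tokens)
print(vocabulary)

# Alternatively, keep only the top K most frequently occurring words
K = 3
top_k_vocabulary = [word for word, _ in Counter(tokens).most_common(K)]
print(top_k_vocabulary)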
Word Tokenization
Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text
into individual words based on a certain delimiter. Depending upon the delimiter, different word-
level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe come
under word-level tokenization.
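A minimal word-tokenization sketch follows; the sample sentence is an illustrative assumption, and the Keras Tokenizer shown here is the same class used in the code section later:
from tensorflow.keras.preprocessing.text import Tokenizer

sentence = "Never give up on learning"  # illustrative example sentence

# Space-delimited word tokenization
print(sentence.split())

# Keras Tokenizer: lowercases the text, strips punctuation, and builds a word index
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts([sentence])
print(word_tokenizer.word_index)  # e.g. {'never': 1, 'give': 2, 'up': 3, 'on': 4, 'learning': 5}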
Character Tokenization
Character Tokenization splits a piece of text into a set of characters. It overcomes the drawbacks
of word tokenization:
• It handles out-of-vocabulary (OOV) words coherently by preserving the information of
the word. It breaks down the OOV word into characters and represents the word in
terms of these characters.
• It also limits the size of the vocabulary. Want to take a guess at the size of the
vocabulary? 26, since the vocabulary contains only the unique set of characters (see the sketch below).
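A minimal character-tokenization sketch, reusing the word “smarter” from the theory above; the use of the Keras Tokenizer in character mode is an illustrative choice:
from tensorflow.keras.preprocessing.text import Tokenizer

word = "smarter"

# Plain character tokenization
print(list(word))  # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Keras Tokenizer with char_level=True builds a character-level vocabulary
char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts([word])
print(char_tokenizer.word_index)
print(char_tokenizer.texts_to_sequences([word]))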
Code :-
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 10
#training_samples = 20
#validation_samples = 100
max_words = 10

# Example texts (illustrative placeholders for the actual text data)
texts = ["never give up", "never stop learning", "learning never stops"]
text_new = ["keep learning and never give up"]

#model training: fit the tokenizer on the training texts to build the word index
Model_tokenizer = Tokenizer(num_words=max_words)
Model_tokenizer.fit_on_texts(texts)

# Convert the texts into sequences of integer token indices
sequences = Model_tokenizer.texts_to_sequences(texts)
sequences_new = Model_tokenizer.texts_to_sequences(text_new)
print(sequences_new)

# Pad (or truncate) the sequences to a fixed length
data = pad_sequences(sequences, maxlen=6)
print(data)
data_new = pad_sequences(sequences_new, maxlen=4)
print(data_new)
import os
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

# Take a small batch of raw reviews to inspect the embedding layer's output
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))

# Pretrained 50-dimensional NNLM text embedding from TensorFlow Hub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:4])

# Classifier: embedding layer, one hidden layer, and a single-logit output
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

len(train_data)  # number of training examples (15,000)

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=2,
                    validation_data=validation_data.batch(512),
                    verbose=1)
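As a follow-up, and assuming the training above has completed, the model can be evaluated on the held-out test split; the exact loss and accuracy values will vary from run to run:
# Evaluate the trained classifier on the 25,000 test reviews
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))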
Conclusion
Text preprocessing involves transforming text into a clean and consistent format that can then
be fed into a model for further analysis and learning. Text preprocessing techniques may be
general so that they are applicable to many types of applications, or they can be specialized
for a specific task.
Experiment Number | Date of Performance | Grade | Teacher's Sign