DL Practical 09: Text Preprocessing
Aim:- To clean text data and make it ready to be fed to a deep learning model.
Theory:- Tokenization
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens
can be either words, characters, or subwords. Hence, tokenization can be broadly classified into
three types – word, character, and subword tokenization.
The most common way of forming tokens is based on space. Assuming space as a delimiter,
tokenizing the sentence “Never give up” results in 3 tokens – Never-give-up. As each token is a
word, this is an example of word tokenization.
Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:
its character tokens are s-m-a-r-t-e-r, and its subword tokens are smart-er.
As tokens are the building blocks of natural language, the most common way of processing
raw text happens at the token level.
For example, Transformer-based models – the state-of-the-art (SOTA) deep learning
architectures in NLP – process the raw text at the token level. Similarly, the most popular deep
learning architectures for NLP, such as RNN, GRU, and LSTM, also process the raw text at the
token level.
Hence, tokenization is the foremost step while modeling text data. Tokenization is
performed on the corpus to obtain tokens. These tokens are then used to prepare a
vocabulary, which refers to the set of unique tokens in the corpus. Remember that a
vocabulary can be constructed either by considering every unique token in the corpus or by
considering only the top K most frequently occurring words.
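For example, a minimal sketch of building a vocabulary from word-level tokens is given below; the sample corpus and the use of Python's collections.Counter are illustrative assumptions, not part of this practical's dataset:
from collections import Counter

# Illustrative toy corpus; any list of raw sentences works the same way
corpus = ["never give up", "never stop learning", "learning never stops"]

# Word-level tokenization using space as the delimiter
tokens = [word for sentence in corpus for word in sentence.split()]

# Vocabulary: the set of unique tokens in the corpus
vocabulary = set(tokens)
print(vocabulary)

# Alternatively, keep only the top K most frequently occurring words
K = 3
top_k_vocabulary = [word for word, _ in Counter(tokens).most_common(K)]
print(top_k_vocabulary)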
Word Tokenization
Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text
into individual words based on a certain delimiter. Depending upon the delimiter, different word-
level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe come
under word-level tokenization.
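A minimal word-tokenization sketch follows; the sample sentence is an illustrative assumption, and the Keras Tokenizer shown here is the same class used in the code section later:
from tensorflow.keras.preprocessing.text import Tokenizer

sentence = "Never give up on learning"  # illustrative example sentence

# Space-delimited word tokenization
print(sentence.split())

# Keras Tokenizer: lowercases the text, strips punctuation, and builds a word index
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts([sentence])
print(word_tokenizer.word_index)  # e.g. {'never': 1, 'give': 2, 'up': 3, 'on': 4, 'learning': 5}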
Character Tokenization
Character Tokenization splits a piece of text into a set of characters. It overcomes the drawbacks
of word tokenization:
• It handles out-of-vocabulary (OOV) words coherently by preserving the information of
the word. It breaks down the OOV word into characters and represents the word in
terms of these characters.
• It also limits the size of the vocabulary. Want to take a guess at the size of the
vocabulary? 26, since the vocabulary contains only the unique set of characters (see the sketch below).
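A minimal character-tokenization sketch, reusing the word “smarter” from the theory above; the use of the Keras Tokenizer in character mode is an illustrative choice:
from tensorflow.keras.preprocessing.text import Tokenizer

word = "smarter"

# Plain character tokenization
print(list(word))  # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Keras Tokenizer with char_level=True builds a character-level vocabulary
char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts([word])
print(char_tokenizer.word_index)
print(char_tokenizer.texts_to_sequences([word]))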
Code :-
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 10
#training_samples = 20
#validation_samples = 100
max_words = 10

# Example texts (illustrative placeholders for the actual text data)
texts = ["never give up", "never stop learning", "learning never stops"]
text_new = ["keep learning and never give up"]

#model training: fit the tokenizer on the training texts to build the word index
Model_tokenizer = Tokenizer(num_words=max_words)
Model_tokenizer.fit_on_texts(texts)

# Convert the texts into sequences of integer token indices
sequences = Model_tokenizer.texts_to_sequences(texts)
sequences_new = Model_tokenizer.texts_to_sequences(text_new)
print(sequences_new)

# Pad (or truncate) the sequences to a fixed length
data = pad_sequences(sequences, maxlen=6)
print(data)
data_new = pad_sequences(sequences_new, maxlen=4)
print(data_new)
import os
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

# Take a small batch of raw reviews to inspect the embedding layer's output
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))

# Pretrained 50-dimensional NNLM text embedding from TensorFlow Hub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:4])

# Classifier: embedding layer, one hidden layer, and a single-logit output
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

len(train_data)  # number of training examples (15,000)

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=2,
                    validation_data=validation_data.batch(512),
                    verbose=1)
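As a follow-up, and assuming the training above has completed, the model can be evaluated on the held-out test split; the exact loss and accuracy values will vary from run to run:
# Evaluate the trained classifier on the 25,000 test reviews
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))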
Conclusion
Text preprocessing involves transforming text into a clean and consistent format that can then
be fed into a model for further analysis and learning. Text preprocessing techniques may be
general so that they are applicable to many types of applications, or they can be specialized
for a specific task.
Experiment Number | Date of Performance | Grade | Teacher's Sign