Assignment Text Classification Using Hugging Face
Start by loading the IMDb dataset from the 🤗 Datasets library and inspecting an example:
from datasets import load_dataset

imdb = load_dataset("imdb")
imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV
are usually underfunded, under-appreciated and misunderstood. I tried to like this,
I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the
original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that
doesn't match the background, and painfully one-dimensional characters cannot be
overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who
think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While
US viewers might like emotion and character development, sci-fi is a genre that
does not take itself seriously (cf. Star Trek). It may treat important issues, yet
not as a serious philosophy. It's really difficult to care about the characters
here as they are not simply foolish, just missing a spark of life. Their actions
and reactions are wooden and predictable, often painful to watch. The makers of
Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\"
otherwise people would not continue watching. Roddenberry's ashes must be turning
in their orbit as this dull, cheap, poorly edited (watching it without advert
breaks really brings this home) trudging Trabant of a show lumbers into space.
Spoiler. So, kill off a main character. And then bring him back as another actor.
Jeeez! Dallas all over again.",
}
There are two fields in each example: text, the movie review text, and label, a value that is either 0 for a negative review or 1 for a positive review.
Preprocess
The next step is to load a DistilBERT tokenizer to preprocess the text field:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
To apply the preprocessing function over the entire dataset, use 🤗 Datasets map
function. You can speed up map by setting batched=True to process multiple elements
of the dataset at once:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
PyTorch
Now create a batch of examples using DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
TensorFlow
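The TensorFlow collator code is missing from this assignment; a minimal sketch of the equivalent step, assuming you want the same dynamic padding but with TensorFlow tensors:
from transformers import DataCollatorWithPadding

# Dynamically pad each batch to its longest sequence and return tf tensors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")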
Evaluate
Including a metric during training is often helpful for evaluating your model’s
performance. You can quickly load an evaluation method with the 🤗 Evaluate library.
For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn
more about how to load and compute a metric):
import evaluate
accuracy = evaluate.load("accuracy")
Then create a function that passes your predictions and labels to compute to
calculate the accuracy:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
Your compute_metrics function is ready to go now, and you'll return to it when you
set up your training.
Train
Before you start training your model, create a map of the expected ids to their
labels with id2label and label2id:
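The mapping code itself is not shown in this assignment; a minimal sketch, assuming the two IMDb classes are named NEGATIVE and POSITIVE (the inference example later prints 'POSITIVE'):
# Map class ids to human-readable labels and back (the label names are an assumption)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}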
PyTorch
If you aren’t familiar with finetuning a model with the Trainer, take a look at the
basic tutorial here!
You're ready to start training your model now! Load DistilBERT with
AutoModelForSequenceClassification along with the number of expected labels, and the
label mappings:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
Define your training hyperparameters in TrainingArguments, then pass them to Trainer
along with the model, dataset, tokenizer, data collator, and compute_metrics function,
and call train() to finetune the model:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
Trainer applies dynamic padding by default when you pass tokenizer to it. In this
case, you don’t need to specify a data collator explicitly.
Once training is completed, share your model to the Hub with the push_to_hub()
method so everyone can use your model:
trainer.push_to_hub()
TensorFlow
If you aren’t familiar with finetuning a model with Keras, take a look at the basic
tutorial here!
To finetune a model in TensorFlow, start by setting up an optimizer function,
learning rate schedule, and some training hyperparameters:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
Then you can load DistilBERT with TFAutoModelForSequenceClassification along with
the number of expected labels, and the label mappings:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
Configure the model for training with compile:
import tensorflow as tf
model.compile(optimizer=optimizer)
The last two things to set up before you start training are to compute the accuracy
from the predictions, and to provide a way to push your model to the Hub. Both are
done by using Keras callbacks.
Pass your compute_metrics function to KerasMetricCallback:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
Specify where to push your model and tokenizer in the PushToHubCallback:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",
    tokenizer=tokenizer,
)
Then bundle your callbacks together:
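The bundling and fit() call are not shown here; a minimal sketch of what that step typically looks like, assuming the train and validation sets defined above and reusing num_epochs from the optimizer setup:
callbacks = [metric_callback, push_to_hub_callback]

# Finetune with fit(); the callbacks report accuracy each epoch and push the model to the Hub
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=callbacks)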
For a more in-depth example of how to finetune a model for text classification,
take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you’ve finetuned a model, you can use it for inference!
text = "This was a masterpiece. Not completely faithful to the books, but
enthralling from beginning to end. Might be my favorite of the three."
The simplest way to try out your finetuned model for inference is to use it in a
pipeline(). Instantiate a pipeline for sentiment analysis with your model, and pass
your text to it:
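The pipeline call itself is not included in this assignment; a minimal sketch, assuming your finetuned model was pushed to the Hub as "stevhliu/my_awesome_model" (substitute your own repository name):
from transformers import pipeline

# Build a sentiment-analysis pipeline from the finetuned checkpoint and classify the text
classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(text)
You can also manually replicate the results of the pipeline, framework by framework, as shown below.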
PyTorch
Tokenize the text and return PyTorch tensors:
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")
Pass your inputs to the model and return the logits:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
with torch.no_grad():
    logits = model(**inputs).logits
Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
'POSITIVE'
TensorFlow
Tokenize the text and return TensorFlow tensors:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")
Pass your inputs to the model and return the logits:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits
Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:
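The final code is cut off in this assignment; a minimal sketch of that last step, mirroring the PyTorch version above and reusing the tensorflow import from the training section:
# Take the highest-scoring class id and convert it back to its text label
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]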