Assignment Text Classification Using Hugging Face


Load IMDb dataset

Start by loading the IMDb dataset from the Datasets library:

from datasets import load_dataset

imdb = load_dataset("imdb")

Then take a look at an example:

imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV
are usually underfunded, under-appreciated and misunderstood. I tried to like this,
I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the
original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that
doesn't match the background, and painfully one-dimensional characters cannot be
overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who
think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While
US viewers might like emotion and character development, sci-fi is a genre that
does not take itself seriously (cf. Star Trek). It may treat important issues, yet
not as a serious philosophy. It's really difficult to care about the characters
here as they are not simply foolish, just missing a spark of life. Their actions
and reactions are wooden and predictable, often painful to watch. The makers of
Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\"
otherwise people would not continue watching. Roddenberry's ashes must be turning
in their orbit as this dull, cheap, poorly edited (watching it without advert
breaks really brings this home) trudging Trabant of a show lumbers into space.
Spoiler. So, kill off a main character. And then bring him back as another actor.
Jeeez! Dallas all over again.",
}

There are two fields in this dataset:

text: the movie review text.

label: a value that is either 0 for a negative review or 1 for a positive review.
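To make the record layout concrete, here is a small sketch that tallies labels over a few hand-written records shaped like the IMDb examples. The three records and the label_names mapping are made up for illustration; only the text and label field names come from the dataset.

```python
from collections import Counter

# Hypothetical records with the same two fields as an IMDb example.
records = [
    {"text": "A dull, predictable show.", "label": 0},
    {"text": "Enthralling from beginning to end.", "label": 1},
    {"text": "Wooden acting and cheap sets.", "label": 0},
]

# Readable names for the two label values (0 = negative, 1 = positive).
label_names = {0: "negative", 1: "positive"}

# Count how many reviews fall under each label.
label_counts = Counter(label_names[r["label"]] for r in records)
print(label_counts)
```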

Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the text field:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences to be no
longer than DistilBERT’s maximum input length:

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map
function. You can speed up map by setting batched=True to process multiple elements
of the dataset at once:

tokenized_imdb = imdb.map(preprocess_function, batched=True)
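What batched=True changes is the shape of the argument: the function receives a dictionary of lists rather than a single example. The sketch below mimics that contract in plain Python. toy_tokenize, toy_preprocess, and MAX_LENGTH are hypothetical stand-ins (a whitespace splitter and a tiny length limit), not the DistilBERT tokenizer.

```python
MAX_LENGTH = 4  # hypothetical limit; DistilBERT's real limit is 512 tokens


def toy_tokenize(text):
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def toy_preprocess(examples):
    """Receive a batch (dict of lists) and return new columns as lists."""
    # Tokenize each text, then truncate to at most MAX_LENGTH tokens.
    tokens = [toy_tokenize(t)[:MAX_LENGTH] for t in examples["text"]]
    return {"tokens": tokens}


# A batch as a batched map would pass it: one list per column.
batch = {"text": ["I really tried to like this show", "Great fun"], "label": [0, 1]}
result = toy_preprocess(batch)
print(result["tokens"])
```

Because the function handles a whole list per call, the per-call overhead is paid once per batch instead of once per review.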

Now create a batch of examples using DataCollatorWithPadding. It’s more efficient
to dynamically pad the sentences to the longest length in a batch during collation,
instead of padding the whole dataset to the maximum length.

PyTorch

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

TensorFlow

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
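As a rough illustration of dynamic padding, the sketch below pads each batch only to its own longest sequence. pad_batch, PAD_ID, and the token ids are hypothetical; this is not how DataCollatorWithPadding is implemented, only the idea behind it.

```python
PAD_ID = 0  # hypothetical padding token id


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches whose longest sequences differ in length.
short_batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
long_batch = [[101, 102], [101, 1, 2, 3, 4, 5, 6, 102]]

padded_short = pad_batch(short_batch)
padded_long = pad_batch(long_batch)
print(len(padded_short["input_ids"][0]), len(padded_long["input_ids"][0]))
```

The short batch is padded to 5 tokens and the long one to 8, whereas padding the whole dataset up front would force both to the global maximum.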

Evaluate

Including a metric during training is often helpful for evaluating your model’s
performance. You can quickly load an evaluation method with the 🤗 Evaluate library.
For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn
more about how to load and compute a metric):

import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute() to
calculate the accuracy:

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # pick the highest-scoring class for each example
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you’ll return to it when you
set up your training.
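To see what the argmax-then-compare step amounts to, here is a dependency-free sketch of the same calculation. toy_accuracy, argmax, and the logits are hypothetical; the real function delegates to the 🤗 Evaluate accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews: columns are [NEGATIVE, POSITIVE].
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.1, 0.4]]
labels = [0, 1, 0, 0]

metrics = toy_accuracy(logits, labels)
print(metrics)
```

Three of the four predictions match their labels here, so the sketch reports an accuracy of 0.75.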

Train

Before you start training your model, create a map of the expected ids to their
labels with id2label and label2id:

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
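Since the two dictionaries must stay exact inverses of each other, one option is to derive the second from the first. This is only a sketch of that idea; round_trip_ok is a hypothetical name for the consistency check.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Derive the reverse mapping instead of typing it by hand.
label2id = {label: idx for idx, label in id2label.items()}

# Round-trip check: every id maps to a label and back to the same id.
round_trip_ok = all(label2id[id2label[i]] == i for i in id2label)
print(label2id, round_trip_ok)
```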

PyTorch
If you aren’t familiar with finetuning a model with the Trainer, take a look at the
basic tutorial here!

You're ready to start training your model now! Load DistilBERT with
AutoModelForSequenceClassification along with the number of expected labels and
the label mappings:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required
parameter is output_dir, which specifies where to save your model. You’ll push this
model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging
Face to upload your model). At the end of each epoch, the Trainer will evaluate the
accuracy and save the training checkpoint.

Pass the training arguments to Trainer along with the model, dataset, tokenizer,
data collator, and compute_metrics function.

Call train() to finetune your model.

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Trainer applies dynamic padding by default when you pass tokenizer to it. In this
case, you don’t need to specify a data collator explicitly.
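For a sense of how long this run is, the arithmetic below estimates the number of optimizer steps from the hyperparameters above. It assumes a single device and no gradient accumulation, and uses the 25,000 examples in the IMDb train split; with several GPUs the step count would be smaller.

```python
import math

num_train_examples = 25_000  # size of the IMDb train split
per_device_train_batch_size = 16
num_train_epochs = 2

# Assumes one device and no gradient accumulation; the last partial batch
# still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 1563 3126
```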

Once training is completed, share your model to the Hub with the push_to_hub()
method so everyone can use your model:

trainer.push_to_hub()

TensorFlow


If you aren’t familiar with finetuning a model with Keras, take a look at the basic
tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function,
learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)

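The step arithmetic and the resulting schedule can be sketched in plain Python. This assumes the 25,000-example IMDb train split and a learning rate that decays linearly to zero with no warmup, which is what create_optimizer produces with its default settings as far as I know; linear_decay_lr is a hypothetical helper, not a library function.

```python
init_lr = 2e-5
batch_size = 16
num_epochs = 5
num_train_examples = 25_000  # size of the IMDb train split

batches_per_epoch = num_train_examples // batch_size
total_train_steps = batches_per_epoch * num_epochs


def linear_decay_lr(step):
    """Learning rate after `step` updates, assuming linear decay to zero."""
    progress = min(step, total_train_steps) / total_train_steps
    return init_lr * (1.0 - progress)


print(batches_per_epoch, total_train_steps)  # 1562 7810
for epoch in (0, 1, 3, 5):
    print(epoch, linear_decay_lr(epoch * batches_per_epoch))
```

Because the schedule is stretched over total_train_steps, the learning rate only reaches zero if training actually runs for num_epochs epochs.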
Then you can load DistilBERT with TFAutoModelForSequenceClassification along with
the number of expected labels, and the label mappings:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Configure the model for training with compile:

import tensorflow as tf

# No loss argument: Transformers models fall back to a built-in task loss.
model.compile(optimizer=optimizer)

The last two things to set up before you start training are to compute the accuracy
from the predictions and to provide a way to push your model to the Hub. Both are
done by using Keras callbacks.

Pass your compute_metrics function to KerasMetricCallback:

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn=compute_metrics, eval_dataset=tf_validation_set
)

Specify where to push your model and tokenizer in the PushToHubCallback:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",
    tokenizer=tokenizer,
)

Then bundle your callbacks together:

callbacks = [metric_callback, push_to_hub_callback]


Finally, you’re ready to start training your model! Call fit with your training and
validation datasets, the number of epochs, and your callbacks to finetune the
model:

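# Train for num_epochs so the run matches the length the learning rate schedule was built for.
model.fit(
    x=tf_train_set,
    validation_data=tf_validation_set,
    epochs=num_epochs,
    callbacks=callbacks,
)

Once training is completed, your model is automatically uploaded to the Hub so
everyone can use it!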

For a more in-depth example of how to finetune a model for text classification,
take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you’ve finetuned a model, you can use it for inference!

Grab some text you’d like to run inference on:

text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

The simplest way to try out your finetuned model for inference is to use it in a
pipeline(). Instantiate a pipeline for sentiment analysis with your model, and pass
your text to it:

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(text)
[{'label': 'POSITIVE', 'score': 0.9994940757751465}]

You can also manually replicate the results of the pipeline if you’d like:

PyTorch
Tokenize the text and return PyTorch tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the logits:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
'POSITIVE'

TensorFlow
Tokenize the text and return TensorFlow tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")

Pass your inputs to the model and return the logits:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits

Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:

predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
'POSITIVE'
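The manual steps return only a label, while the pipeline also reports a score. For a two-class model like this one, that score is typically the softmax probability of the winning class. The sketch below shows the conversion in plain Python; the logits are hypothetical values, not output from the finetuned model.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-3.5, 4.1]  # hypothetical logits for one review

probs = softmax(logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[predicted_class_id], "score": probs[predicted_class_id]}
print(prediction)
```

The argmax of the probabilities is the same as the argmax of the logits, which is why the manual steps can skip the softmax when only the label is needed.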
