Assignment Text Classification Using Hugging Face
Start by loading the IMDb dataset from the 🤗 Datasets library and inspecting an example:
from datasets import load_dataset

imdb = load_dataset("imdb")
imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV
are usually underfunded, under-appreciated and misunderstood. I tried to like this,
I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the
original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that
doesn't match the background, and painfully one-dimensional characters cannot be
overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who
think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While
US viewers might like emotion and character development, sci-fi is a genre that
does not take itself seriously (cf. Star Trek). It may treat important issues, yet
not as a serious philosophy. It's really difficult to care about the characters
here as they are not simply foolish, just missing a spark of life. Their actions
and reactions are wooden and predictable, often painful to watch. The makers of
Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\"
otherwise people would not continue watching. Roddenberry's ashes must be turning
in their orbit as this dull, cheap, poorly edited (watching it without advert
breaks really brings this home) trudging Trabant of a show lumbers into space.
Spoiler. So, kill off a main character. And then bring him back as another actor.
Jeeez! Dallas all over again.",
}
There are two fields in each example: text, the movie review text, and label, a value that is either 0 for a negative review or 1 for a positive review.
Preprocess
The next step is to load a DistilBERT tokenizer to preprocess the text field:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
To apply the preprocessing function over the entire dataset, use 🤗 Datasets map
function. You can speed up map by setting batched=True to process multiple elements
of the dataset at once:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
PyTorch
Now create a batch of examples using DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
TensorFlow
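The TensorFlow collator code is missing from this assignment; a minimal sketch of the equivalent step, assuming you want the same dynamic padding but with TensorFlow tensors:
from transformers import DataCollatorWithPadding

# Dynamically pad each batch to its longest sequence and return tf tensors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")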
Evaluate
Including a metric during training is often helpful for evaluating your model’s
performance. You can quickly load an evaluation method with the 🤗 Evaluate library.
For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn
more about how to load and compute a metric):
import evaluate
accuracy = evaluate.load("accuracy")
Then create a function that passes your predictions and labels to compute to
calculate the accuracy:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
Your compute_metrics function is ready to go now, and you'll return to it when you
set up your training.
Train
Before you start training your model, create a map of the expected ids to their
labels with id2label and label2id:
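The mapping code itself is not shown in this assignment; a minimal sketch, assuming the two IMDb classes are named NEGATIVE and POSITIVE (the inference example later prints 'POSITIVE'):
# Map class ids to human-readable labels and back (the label names are an assumption)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}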
PyTorch
If you aren’t familiar with finetuning a model with the Trainer, take a look at the
basic tutorial here!
You're ready to start training your model now! Load DistilBERT with
AutoModelForSequenceClassification along with the number of expected labels, and the
label mappings:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
Define your training hyperparameters in TrainingArguments, then pass them to Trainer
along with the model, dataset, tokenizer, data collator, and compute_metrics function,
and call train() to finetune the model:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
Trainer applies dynamic padding by default when you pass tokenizer to it. In this
case, you don’t need to specify a data collator explicitly.
Once training is completed, share your model to the Hub with the push_to_hub()
method so everyone can use your model:
trainer.push_to_hub()
TensorFlow
If you aren’t familiar with finetuning a model with Keras, take a look at the basic
tutorial here!
To finetune a model in TensorFlow, start by setting up an optimizer function,
learning rate schedule, and some training hyperparameters:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
Then you can load DistilBERT with TFAutoModelForSequenceClassification along with
the number of expected labels, and the label mappings:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
Configure the model for training with compile:
import tensorflow as tf
model.compile(optimizer=optimizer)
The last two things to set up before you start training are to compute the accuracy
from the predictions, and to provide a way to push your model to the Hub. Both are
done by using Keras callbacks.
Pass your compute_metrics function to KerasMetricCallback:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
Specify where to push your model and tokenizer in the PushToHubCallback:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",
    tokenizer=tokenizer,
)
Then bundle your callbacks together:
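The bundling and fit() call are not shown here; a minimal sketch of what that step typically looks like, assuming the train and validation sets defined above and reusing num_epochs from the optimizer setup:
callbacks = [metric_callback, push_to_hub_callback]

# Finetune with fit(); the callbacks report accuracy each epoch and push the model to the Hub
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=callbacks)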
For a more in-depth example of how to finetune a model for text classification,
take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you’ve finetuned a model, you can use it for inference!
text = "This was a masterpiece. Not completely faithful to the books, but
enthralling from beginning to end. Might be my favorite of the three."
The simplest way to try out your finetuned model for inference is to use it in a
pipeline(). Instantiate a pipeline for sentiment analysis with your model, and pass
your text to it:
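The pipeline call itself is not included in this assignment; a minimal sketch, assuming your finetuned model was pushed to the Hub as "stevhliu/my_awesome_model" (substitute your own repository name):
from transformers import pipeline

# Build a sentiment-analysis pipeline from the finetuned checkpoint and classify the text
classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(text)
You can also manually replicate the results of the pipeline, framework by framework, as shown below.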
PyTorch
Tokenize the text and return PyTorch tensors:
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")
Pass your inputs to the model and return the logits:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
with torch.no_grad():
    logits = model(**inputs).logits
Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
'POSITIVE'
TensorFlow
Tokenize the text and return TensorFlow tensors:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")
Pass your inputs to the model and return the logits:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits
Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:
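The final code is cut off in this assignment; a minimal sketch of that last step, mirroring the PyTorch version above and reusing the tensorflow import from the training section:
# Take the highest-scoring class id and convert it back to its text label
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]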