Transformers documentation
OpenAI GPT2
Overview
The OpenAI GPT-2 model was proposed in Language Models are Unsupervised Multitask Learners by Alec
Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever from OpenAI. It’s a
causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40
GB of text data.
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of
8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the
previous words within some text. The diversity of the dataset causes this simple goal to contain naturally
occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with
more than 10X the parameters and trained on more than 10X the amount of data.
Write With Transformer is a webapp created and hosted by Hugging Face showcasing the generative
capabilities of several models. GPT-2 is one of them and is available in five different sizes: small,
medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.
This model was contributed by thomwolf. The original code can be found here.
Usage tips
• GPT-2 is a model with absolute position embeddings, so it’s usually advised to pad the inputs on
the right rather than the left.
• GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at
predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate
syntactically coherent text, as can be observed in the run_generation.py example script.
• The model can take the past_key_values (for PyTorch) or past (for TF) as input; these are the
previously computed key/value attention pairs. Using this (past_key_values or past) value prevents
the model from re-computing values that were already computed during text generation (see the
sketch after this list). For PyTorch, see the past_key_values argument of the GPT2Model.forward()
method, or for TF the past argument of the TFGPT2Model.call() method, for more information on its usage.
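A minimal sketch of incremental decoding with past_key_values in PyTorch (variable names are illustrative, not from the library):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2LMHeadModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog", return_tensors="pt")
>>> out = model(**inputs, use_cache=True)
>>> past = out.past_key_values  # cached key/value pairs for every layer
>>> next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
>>> # feed only the new token plus the cache; earlier positions are not recomputed
>>> out2 = model(input_ids=next_token, past_key_values=past, use_cache=True)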
Usage example
The generate() method can be used to generate text using the GPT-2 model.
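For instance, a minimal sampling sketch with the small openai-community/gpt2 checkpoint (the prompt and sampling settings are illustrative):
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> prompt = "GPT2 is a model developed by OpenAI."
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]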
Flash Attention 2
Flash Attention 2 is a faster, optimized version of the attention score computation which relies on
CUDA kernels.
Installation
First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible
hardware can be found in the official documentation. If your hardware is not compatible with Flash
Attention 2, you can still benefit from attention kernel optimizations through Better Transformer
support covered above.
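Next, install the latest version of Flash Attention 2. The usual command is shown below (assuming a CUDA toolchain that matches your PyTorch build):
pip install -U flash-attn --no-build-isolation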
Usage
To load and run a model using Flash Attention 2, cast it to half-precision (e.g. torch.float16 ), since this
results in almost no degradation to generation quality but significantly lower memory usage and faster
inference:
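A minimal loading sketch (assumptions: the flash-attn package is installed and a CUDA device is available; the prompt is illustrative):
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda"  # Flash Attention 2 requires a CUDA device
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained(
...     "openai-community/gpt2", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
... ).to(device)
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to(device)
>>> generated = model.generate(**inputs, max_new_tokens=20)
>>> tokenizer.decode(generated[0], skip_special_tokens=True)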
Expected speedups
Below is an expected speedup diagram that compares pure inference time between the native
implementation in transformers using the gpt2 checkpoint and the Flash Attention 2 version of the model,
using a sequence length of 512.
Speedups vary depending on the inputs and the hardware in use; see the official documentation or the GPU Inference
page for more information.
Using Scaled Dot Product Attention (SDPA)
SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set
attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.
For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or
torch.bfloat16 ).
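For example (a sketch; requires torch>=2.1.1):
>>> import torch
>>> from transformers import AutoModelForCausalLM
>>> # explicitly request the SDPA attention implementation, loading in half-precision
>>> model = AutoModelForCausalLM.from_pretrained(
...     "openai-community/gpt2", torch_dtype=torch.float16, attn_implementation="sdpa"
... )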
On a local benchmark (rtx3080ti-16GB, PyTorch 2.2.1, OS Ubuntu 22.04) using float16 with gpt2-large,
we saw the following speedups during training and inference.
[Benchmark tables: training and inference, comparing Eager vs. SDPA by batch size and sequence
length. Columns: per-token latency Eager (ms), per-token latency SDPA (ms), speedup (%), memory
Eager (MB), memory SDPA (MB), memory saved (%).]
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with
GPT2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull
Request and we’ll review it! The resource should ideally demonstrate something new instead of
duplicating an existing resource.
Text Generation
• A blog on How to generate text: using different decoding methods for language generation with
Transformers with GPT-2.
• A blog on Faster Text Generation with TensorFlow and XLA with GPT-2.
• A blog on How to train a Language Model with Megatron-LM with a GPT-2 model.
• A notebook on how to finetune GPT2 to generate lyrics in the style of your favorite artist. 🌎
• A notebook on how to finetune GPT2 to generate tweets in the style of your favorite Twitter user. 🌎
• Causal language modeling chapter of the 🤗 Hugging Face Course.
• GPT2LMHeadModel is supported by this causal language modeling example script, text generation
example script, and notebook.
GPT2Config
Parameters
• vocab_size ( int , optional, defaults to 50257) — Vocabulary size of the GPT-2 model. Defines the
number of different tokens that can be represented by the inputs_ids passed when calling
GPT2Model or TFGPT2Model.
• n_positions ( int , optional, defaults to 1024) — The maximum sequence length that this model
might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
• n_embd ( int , optional, defaults to 768) — Dimensionality of the embeddings and hidden states.
• n_layer ( int , optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
• n_head ( int , optional, defaults to 12) — Number of attention heads for each attention layer in the
Transformer encoder.
• n_inner ( int , optional) — Dimensionality of the inner feed-forward layers. None will set it to 4
times n_embd .
This is the configuration class to store the configuration of a GPT2Model or a TFGPT2Model. It is used
to instantiate a GPT-2 model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the GPT-2
openai-community/gpt2 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs.
Read the documentation from PretrainedConfig for more information.
Example:
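For instance, instantiating a model (with random weights) from the default configuration:
>>> from transformers import GPT2Config, GPT2Model
>>> # Initializing a GPT2 configuration
>>> configuration = GPT2Config()
>>> # Initializing a model (with random weights) from the configuration
>>> model = GPT2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config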
GPT2Tokenizer
Parameters
• errors ( str , optional, defaults to "replace" ) — Paradigm to follow when decoding bytes to UTF-8.
See bytes.decode for more information.
• unk_token ( str , optional, defaults to "<|endoftext|>" ) — The unknown token. A token that is
not in the vocabulary cannot be converted to an ID and is set to be this token instead.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a
word will be encoded differently depending on whether it is at the beginning of the sentence (without
space) or not:
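For instance (the token IDs shown are those produced by the standard openai-community/gpt2 vocabulary):
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]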
You can get around that behavior by passing add_prefix_space=True when instantiating this
tokenizer or when you call it on some text, but since the model was not pretrained this way, it might
yield a decrease in performance.
When used with is_split_into_words=True , this tokenizer will add a space before each word
(even the first one).
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users
should refer to this superclass for more information regarding those methods.
GPT2TokenizerFast
Parameters
• tokenizer_file ( str , optional) — Path to tokenizers file (generally has a .json extension) that
contains everything needed to load the tokenizer.
• unk_token ( str , optional, defaults to "<|endoftext|>" ) — The unknown token. A token that is
not in the vocabulary cannot be converted to an ID and is set to be this token instead.
• bos_token ( str , optional, defaults to "<|endoftext|>" ) — The beginning of sequence token.
• add_prefix_space ( bool , optional, defaults to False ) — Whether or not to add an initial space to
the input. This allows the leading word to be treated just like any other word (the GPT-2 tokenizer
detects the beginning of words by the preceding space).
Construct a “fast” GPT-2 tokenizer (backed by HuggingFace’s tokenizers library). Based on byte-level
Byte-Pair-Encoding.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a
word will be encoded differently depending on whether it is at the beginning of the sentence (without
space) or not:
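For instance (again with the standard openai-community/gpt2 vocabulary):
>>> from transformers import GPT2TokenizerFast
>>> tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]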
You can get around that behavior by passing add_prefix_space=True when instantiating this
tokenizer, but since the model was not pretrained this way, it might yield a decrease in performance.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
class transformers.models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
( loss: Optional = None, mc_loss: Optional = None, logits: FloatTensor = None,
mc_logits: FloatTensor = None, past_key_values: Optional = None, hidden_states:
Optional = None, attentions: Optional = None )
Parameters
Base class for outputs of models predicting if two sentences are consecutive or not.
class transformers.models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
Parameters
• hidden_states ( tuple(tf.Tensor) , optional, returned when output_hidden_states=True is
passed or when config.output_hidden_states=True ) — Hidden-states of the model at the output
of each layer plus the initial embedding outputs.
Base class for outputs of models predicting if two sentences are consecutive or not.
GPT2Model
class transformers.GPT2Model
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
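A typical usage sketch (feature extraction with the small openai-community/gpt2 checkpoint):
>>> from transformers import AutoTokenizer, GPT2Model
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2Model.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state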
GPT2LMHeadModel
class transformers.GPT2LMHeadModel
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
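A short sketch of language modeling with labels (the loss is the shifted next-token cross-entropy):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2LMHeadModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits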
GPT2DoubleHeadsModel
class transformers.GPT2DoubleHeadsModel
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head and a multiple-choice classification
head on top, e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language
modeling head has its weights tied to the input embeddings; the classification head takes as
input the hidden state at a specified classification token index in the input sequence.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
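A short setup sketch first (assuming the standard openai-community/gpt2 checkpoint; a [CLS] token is added to the vocabulary for the multiple-choice head):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2DoubleHeadsModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2DoubleHeadsModel.from_pretrained("openai-community/gpt2")
>>> # add a [CLS] token to the vocabulary (it should also be fine-tuned)
>>> num_added_tokens = tokenizer.add_special_tokens({"cls_token": "[CLS]"})
>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))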
>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]
>>> input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # batch size 1, 2 choices
>>> mc_token_ids = torch.tensor([cls_token_location])  # batch size 1
>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
GPT2ForQuestionAnswering
class transformers.GPT2ForQuestionAnswering
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT-2 Model transformer with a span classification head on top for extractive question-
answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span
start logits and span end logits ).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
This example uses a random model, as the real ones are all very big. To get proper results,
you should use openai-community/gpt2 instead of the random checkpoint. If you get out-
of-memory when loading that checkpoint, you can try adding device_map="auto" in the
from_pretrained call.
Example:
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
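A sketch of how this example typically continues (setup included for completeness; the answer span is recovered from the start/end logits):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2ForQuestionAnswering
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2ForQuestionAnswering.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()
>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)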
GPT2ForSequenceClassification
class transformers.GPT2ForSequenceClassification
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a sequence classification head on top (linear layer).
Since it does classification on the last token, it needs to know the position of the last token. If
a pad_token_id is defined in the configuration, it finds the last token that is not a padding
token in each row. If no pad_token_id is defined, it simply takes the last value in each row of
the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of
input_ids , it does the same (takes the last value in each row of the batch).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
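Example (a sketch using a GPT-2 checkpoint fine-tuned with a sequence classification head; microsoft/DialogRPT-updown is one such community checkpoint):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2ForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
>>> model = GPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]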
GPT2ForTokenClassification
class transformers.GPT2ForTokenClassification
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
GPT2 Model with a token classification head on top (a linear layer on top of the hidden-states
output) e.g. for Named-Entity-Recognition (NER) tasks.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
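A setup sketch first (the checkpoint name is an assumption here: a community GPT-2 model fine-tuned for token classification; substitute your own):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2ForTokenClassification
>>> # assumes a GPT-2 checkpoint fine-tuned for token classification
>>> tokenizer = AutoTokenizer.from_pretrained("brad1805/gpt2-finetuned-comments")
>>> model = GPT2ForTokenClassification.from_pretrained("brad1805/gpt2-finetuned-comments")
>>> inputs = tokenizer(
...     "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
... )
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_token_class_ids = logits.argmax(-1)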
>>> # Note that tokens are classified rather than input words, which means that
>>> # there might be more predicted token classes than words.
>>> # Multiple token classes might account for the same word.
>>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
>>> predicted_tokens_classes
['Lead', 'Lead', 'Lead', 'Position', 'Lead', 'Lead', 'Lead', 'Lead', 'Lead', ...]
TFGPT2Model
class transformers.TFGPT2Model
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
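A typical usage sketch:
>>> from transformers import AutoTokenizer, TFGPT2Model
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = TFGPT2Model.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> outputs = model(inputs)
>>> last_hidden_states = outputs.last_hidden_state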
TFGPT2LMHeadModel
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
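A short sketch of computing next-token logits:
>>> from transformers import AutoTokenizer, TFGPT2LMHeadModel
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = TFGPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> outputs = model(inputs)
>>> logits = outputs.logits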
TFGPT2DoubleHeadsModel
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head and a multiple-choice classification
head on top, e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language
modeling head has its weights tied to the input embeddings; the classification head takes as
input the hidden state at a specified classification token index in the input sequence.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Examples:
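A short setup sketch first (assuming the standard openai-community/gpt2 checkpoint; a [CLS] token is added to the vocabulary for the multiple-choice head):
>>> import tensorflow as tf
>>> from transformers import AutoTokenizer, TFGPT2DoubleHeadsModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = TFGPT2DoubleHeadsModel.from_pretrained("openai-community/gpt2")
>>> # add a [CLS] token to the vocabulary (it should also be fine-tuned)
>>> num_added_tokens = tokenizer.add_special_tokens({"cls_token": "[CLS]"})
>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))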
>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]
>>> input_ids = tf.constant(encoded_choices)[None, :]  # batch size 1, 2 choices
>>> mc_token_ids = tf.constant([cls_token_location])  # batch size 1
>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
TFGPT2ForSequenceClassification
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a sequence classification head on top (linear layer).
Since it does classification on the last token, it needs to know the position of the last token. If
a pad_token_id is defined in the configuration, it finds the last token that is not a padding
token in each row. If no pad_token_id is defined, it simply takes the last value in each row of
the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of
input_ids , it does the same (takes the last value in each row of the batch).
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
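A sketch mirroring the PyTorch example above (microsoft/DialogRPT-updown is one community checkpoint with a classification head):
>>> import tensorflow as tf
>>> from transformers import AutoTokenizer, TFGPT2ForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
>>> model = TFGPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> logits = model(**inputs).logits
>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])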
TFSequenceClassifierOutputWithPast
class transformers.modeling_tf_outputs.TFSequenceClassifierOutputWithPast
Parameters
• past_key_values — Contains pre-computed hidden-states (key and values in the attention blocks)
that can be used (see past_key_values input) to speed up sequential decoding.
TFGPT2Tokenizer
class transformers.TFGPT2Tokenizer
( vocab: Dict, merges: List, max_length: int = None, pad_token_id: int = None )
Parameters
This is an in-graph tokenizer for GPT2. It should be initialized similarly to other tokenizers, using
the from_pretrained() method. It can also be initialized with the from_tokenizer()
method, which imports settings from an existing standard tokenizer object.
In-graph tokenizers, unlike other Hugging Face tokenizers, are actually Keras layers and are
designed to be run when the model is called, rather than during preprocessing. As a result, they
have somewhat more limited options than standard tokenizer classes. They are most useful
when you want to create an end-to-end model that goes straight from tf.string inputs to
outputs.
from_pretrained
Examples:
from transformers import TFGPT2Tokenizer
tf_tokenizer = TFGPT2Tokenizer.from_pretrained("openai-community/gpt2")
from_tokenizer
Parameters
• tokenizer (GPT2Tokenizer) — An existing standard tokenizer whose settings are imported.
Examples:
from transformers import AutoTokenizer, TFGPT2Tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tf_tokenizer = TFGPT2Tokenizer.from_tokenizer(tokenizer)
FlaxGPT2Model
class transformers.FlaxGPT2Model
( config: GPT2Config, input_shape: Tuple = (1, 1), seed: int = 0, dtype: dtype
= <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data type of the
computation. Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer
to the Flax documentation for all matters related to general usage and behavior.
Parameters
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
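A typical usage sketch (note the NumPy tensors for Flax):
>>> from transformers import AutoTokenizer, FlaxGPT2Model
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = FlaxGPT2Model.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state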
FlaxGPT2LMHeadModel
class transformers.FlaxGPT2LMHeadModel
( config: GPT2Config, input_shape: Tuple = (1, 1), seed: int = 0, dtype: dtype
= <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data type of the
computation. Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer
to the Flax documentation for all matters related to general usage and behavior.
Parameters
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
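A short sketch of computing next-token logits:
>>> from transformers import AutoTokenizer, FlaxGPT2LMHeadModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = FlaxGPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(**inputs)
>>> next_token_logits = outputs.logits[:, -1]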