Transformers documentation
OpenAI GPT2
Overview
The OpenAI GPT-2 model was proposed in Language Models are Unsupervised Multitask Learners by Alec
Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever from OpenAI. It’s a
causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40
GB of text data.
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of
8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the
previous words within some text. The diversity of the dataset causes this simple goal to contain naturally
occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with
more than 10X the parameters and trained on more than 10X the amount of data.
Write With Transformer is a webapp created and hosted by Hugging Face showcasing the generative
capabilities of several models. GPT-2 is one of them and is available in five different sizes: small,
medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.
This model was contributed by thomwolf. The original code can be found here.
Usage tips
• GPT-2 is a model with absolute position embeddings, so it’s usually advised to pad the inputs on
the right rather than the left.
• GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at
predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate
syntactically coherent text, as can be observed in the run_generation.py example script.
• The model can take the past_key_values (for PyTorch) or past (for TF) as input; these are the
previously computed key/value attention pairs. Using this (past_key_values or past) value prevents
the model from re-computing values that were already computed during text generation (see the
sketch after this list). For PyTorch, see the past_key_values argument of the GPT2Model.forward()
method, or for TF the past argument of the TFGPT2Model.call() method, for more information on its usage.
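A minimal sketch of incremental decoding with past_key_values in PyTorch (variable names are illustrative, not from the library):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2LMHeadModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog", return_tensors="pt")
>>> out = model(**inputs, use_cache=True)
>>> past = out.past_key_values  # cached key/value pairs for every layer
>>> next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
>>> # feed only the new token plus the cache; earlier positions are not recomputed
>>> out2 = model(input_ids=next_token, past_key_values=past, use_cache=True)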
Usage example
The generate() method can be used to generate text using the GPT-2 model.
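For instance, a minimal sampling sketch with the small openai-community/gpt2 checkpoint (the prompt and sampling settings are illustrative):
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> prompt = "GPT2 is a model developed by OpenAI."
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]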
Flash Attention 2
Flash Attention 2 is a faster, optimized version of the attention score computation which relies on
CUDA kernels.
Installation
First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible
hardware can be found in the official documentation. If your hardware is not compatible with Flash
Attention 2, you can still benefit from attention kernel optimizations through Better Transformer
support covered above.
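Next, install the latest version of Flash Attention 2. The usual command is shown below (assuming a CUDA toolchain that matches your PyTorch build):
pip install -U flash-attn --no-build-isolation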
Usage
To load and run a model using Flash Attention 2, cast it to half-precision (e.g. torch.float16 ), since this
results in almost no degradation to generation quality but significantly lower memory usage and faster
inference:
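A minimal loading sketch (assumptions: the flash-attn package is installed and a CUDA device is available; the prompt is illustrative):
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda"  # Flash Attention 2 requires a CUDA device
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained(
...     "openai-community/gpt2", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
... ).to(device)
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to(device)
>>> generated = model.generate(**inputs, max_new_tokens=20)
>>> tokenizer.decode(generated[0], skip_special_tokens=True)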
Expected speedups
Below is an expected speedup diagram that compares pure inference time between the native
implementation in transformers using the gpt2 checkpoint and the Flash Attention 2 version of the model,
using a sequence length of 512.
Speedups vary depending on the inputs and the hardware in use; see the official documentation or the GPU Inference
page for more information.
Using Scaled Dot Product Attention (SDPA)
SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set
attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.
For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or
torch.bfloat16 ).
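For example (a sketch; requires torch>=2.1.1):
>>> import torch
>>> from transformers import AutoModelForCausalLM
>>> # explicitly request the SDPA attention implementation, loading in half-precision
>>> model = AutoModelForCausalLM.from_pretrained(
...     "openai-community/gpt2", torch_dtype=torch.float16, attn_implementation="sdpa"
... )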
On a local benchmark (rtx3080ti-16GB, PyTorch 2.2.1, OS Ubuntu 22.04) using float16 with gpt2-large,
we saw the following speedups during training and inference.
[Benchmark tables: training and inference, comparing Eager vs. SDPA by batch size and sequence
length. Columns: per-token latency Eager (ms), per-token latency SDPA (ms), speedup (%), memory
Eager (MB), memory SDPA (MB), memory saved (%).]
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with
GPT2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull
Request and we’ll review it! The resource should ideally demonstrate something new instead of
duplicating an existing resource.
Text Generation
• A blog on How to generate text: using different decoding methods for language generation with
Transformers with GPT-2.
• A blog on Faster Text Generation with TensorFlow and XLA with GPT-2.
• A blog on How to train a Language Model with Megatron-LM with a GPT-2 model.
• A notebook on how to finetune GPT2 to generate lyrics in the style of your favorite artist. 🌎
• A notebook on how to finetune GPT2 to generate tweets in the style of your favorite Twitter user. 🌎
• Causal language modeling chapter of the 🤗 Hugging Face Course.
• GPT2LMHeadModel is supported by this causal language modeling example script, text generation
example script, and notebook.
GPT2Config
Parameters
• vocab_size ( int , optional, defaults to 50257) — Vocabulary size of the GPT-2 model. Defines the
number of different tokens that can be represented by the inputs_ids passed when calling
GPT2Model or TFGPT2Model.
• n_positions ( int , optional, defaults to 1024) — The maximum sequence length that this model
might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
• n_embd ( int , optional, defaults to 768) — Dimensionality of the embeddings and hidden states.
• n_layer ( int , optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
• n_head ( int , optional, defaults to 12) — Number of attention heads for each attention layer in the
Transformer encoder.
• n_inner ( int , optional) — Dimensionality of the inner feed-forward layers. None will set it to 4
times n_embd .
This is the configuration class to store the configuration of a GPT2Model or a TFGPT2Model. It is used
to instantiate a GPT-2 model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the GPT-2
openai-community/gpt2 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs.
Read the documentation from PretrainedConfig for more information.
Example:
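For instance, instantiating a model (with random weights) from the default configuration:
>>> from transformers import GPT2Config, GPT2Model
>>> # Initializing a GPT2 configuration
>>> configuration = GPT2Config()
>>> # Initializing a model (with random weights) from the configuration
>>> model = GPT2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config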
GPT2Tokenizer
Parameters
• errors ( str , optional, defaults to "replace" ) — Paradigm to follow when decoding bytes to UTF-8.
See bytes.decode for more information.
• unk_token ( str , optional, defaults to "<|endoftext|>" ) — The unknown token. A token that is
not in the vocabulary cannot be converted to an ID and is set to be this token instead.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a
word will be encoded differently depending on whether it is at the beginning of the sentence (without
space) or not:
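For instance (the token IDs shown are those produced by the standard openai-community/gpt2 vocabulary):
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]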
You can get around that behavior by passing add_prefix_space=True when instantiating this
tokenizer or when you call it on some text, but since the model was not pretrained this way, it might
yield a decrease in performance.
When used with is_split_into_words=True , this tokenizer will add a space before each word
(even the first one).
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users
should refer to this superclass for more information regarding those methods.
GPT2TokenizerFast
Parameters
• tokenizer_file ( str , optional) — Path to tokenizers file (generally has a .json extension) that
contains everything needed to load the tokenizer.
• unk_token ( str , optional, defaults to "<|endoftext|>" ) — The unknown token. A token that is
not in the vocabulary cannot be converted to an ID and is set to be this token instead.
• bos_token ( str , optional, defaults to "<|endoftext|>" ) — The beginning of sequence token.
• add_prefix_space ( bool , optional, defaults to False ) — Whether or not to add an initial space to
the input. This allows the leading word to be treated just like any other word (the GPT-2 tokenizer
detects the beginning of words by the preceding space).
Construct a “fast” GPT-2 tokenizer (backed by HuggingFace’s tokenizers library). Based on byte-level
Byte-Pair-Encoding.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a
word will be encoded differently depending on whether it is at the beginning of the sentence (without
space) or not:
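For instance (again with the standard openai-community/gpt2 vocabulary):
>>> from transformers import GPT2TokenizerFast
>>> tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]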
You can get around that behavior by passing add_prefix_space=True when instantiating this
tokenizer, but since the model was not pretrained this way, it might yield a decrease in performance.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
class transformers.models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
( loss: Optional = None, mc_loss: Optional = None, logits: FloatTensor = None,
mc_logits: FloatTensor = None, past_key_values: Optional = None, hidden_states:
Optional = None, attentions: Optional = None )
Parameters
Base class for outputs of models predicting if two sentences are consecutive or not.
class transformers.models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
Parameters
• hidden_states ( tuple(tf.Tensor) , optional, returned when output_hidden_states=True is
passed or when config.output_hidden_states=True ) — Hidden-states of the model at the output
of each layer plus the initial embedding outputs.
Base class for outputs of models predicting if two sentences are consecutive or not.
GPT2Model
class transformers.GPT2Model
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
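A typical usage sketch (feature extraction with the small openai-community/gpt2 checkpoint):
>>> from transformers import AutoTokenizer, GPT2Model
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2Model.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state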
GPT2LMHeadModel
class transformers.GPT2LMHeadModel
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
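A short sketch of language modeling with labels (the loss is the shifted next-token cross-entropy):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2LMHeadModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits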
GPT2DoubleHeadsModel
class transformers.GPT2DoubleHeadsModel
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head and a multiple-choice classification
head on top, e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language
modeling head has its weights tied to the input embeddings; the classification head takes as
input the hidden state at a specified classification token index in the input sequence.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
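A short setup sketch first (assuming the standard openai-community/gpt2 checkpoint; a [CLS] token is added to the vocabulary for the multiple-choice head):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2DoubleHeadsModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2DoubleHeadsModel.from_pretrained("openai-community/gpt2")
>>> # add a [CLS] token to the vocabulary (it should also be fine-tuned)
>>> num_added_tokens = tokenizer.add_special_tokens({"cls_token": "[CLS]"})
>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))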
>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]
>>> input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # batch size 1, 2 choices
>>> mc_token_ids = torch.tensor([cls_token_location])  # batch size 1
>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
GPT2ForQuestionAnswering
class transformers.GPT2ForQuestionAnswering
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT-2 Model transformer with a span classification head on top for extractive question-
answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span
start logits and span end logits ).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
This example uses a random model, as the real ones are all very big. To get proper results,
you should use openai-community/gpt2 instead of the random checkpoint. If you get out-
of-memory when loading that checkpoint, you can try adding device_map="auto" in the
from_pretrained call.
Example:
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
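A sketch of how this example typically continues (setup included for completeness; the answer span is recovered from the start/end logits):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2ForQuestionAnswering
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = GPT2ForQuestionAnswering.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()
>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)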
GPT2ForSequenceClassification
class transformers.GPT2ForSequenceClassification
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a sequence classification head on top (linear layer).
Since it does classification on the last token, it needs to know the position of the last token. If
a pad_token_id is defined in the configuration, it finds the last token that is not a padding
token in each row. If no pad_token_id is defined, it simply takes the last value in each row of
the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of
input_ids , it does the same (takes the last value in each row of the batch).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
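Example (a sketch using a GPT-2 checkpoint fine-tuned with a sequence classification head; microsoft/DialogRPT-updown is one such community checkpoint):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2ForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
>>> model = GPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]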
GPT2ForTokenClassification
class transformers.GPT2ForTokenClassification
( config )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
GPT2 Model with a token classification head on top (a linear layer on top of the hidden-states
output) e.g. for Named-Entity-Recognition (NER) tasks.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the
input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
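A setup sketch first (the checkpoint name is an assumption here: a community GPT-2 model fine-tuned for token classification; substitute your own):
>>> import torch
>>> from transformers import AutoTokenizer, GPT2ForTokenClassification
>>> # assumes a GPT-2 checkpoint fine-tuned for token classification
>>> tokenizer = AutoTokenizer.from_pretrained("brad1805/gpt2-finetuned-comments")
>>> model = GPT2ForTokenClassification.from_pretrained("brad1805/gpt2-finetuned-comments")
>>> inputs = tokenizer(
...     "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
... )
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_token_class_ids = logits.argmax(-1)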
>>> # Note that tokens are classified rather than input words, which means that
>>> # there might be more predicted token classes than words.
>>> # Multiple token classes might account for the same word.
>>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
>>> predicted_tokens_classes
['Lead', 'Lead', 'Lead', 'Position', 'Lead', 'Lead', 'Lead', 'Lead', 'Lead', ...]
TFGPT2Model
class transformers.TFGPT2Model
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
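A typical usage sketch:
>>> from transformers import AutoTokenizer, TFGPT2Model
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = TFGPT2Model.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> outputs = model(inputs)
>>> last_hidden_states = outputs.last_hidden_state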
TFGPT2LMHeadModel
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
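A short sketch of computing next-token logits:
>>> from transformers import AutoTokenizer, TFGPT2LMHeadModel
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = TFGPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> outputs = model(inputs)
>>> logits = outputs.logits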
TFGPT2DoubleHeadsModel
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling head and a multiple-choice classification
head on top, e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language
modeling head has its weights tied to the input embeddings; the classification head takes as
input the hidden state at a specified classification token index in the input sequence.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Examples:
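A short setup sketch first (assuming the standard openai-community/gpt2 checkpoint; a [CLS] token is added to the vocabulary for the multiple-choice head):
>>> import tensorflow as tf
>>> from transformers import AutoTokenizer, TFGPT2DoubleHeadsModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = TFGPT2DoubleHeadsModel.from_pretrained("openai-community/gpt2")
>>> # add a [CLS] token to the vocabulary (it should also be fine-tuned)
>>> num_added_tokens = tokenizer.add_special_tokens({"cls_token": "[CLS]"})
>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))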
>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]
>>> input_ids = tf.constant(encoded_choices)[None, :]  # batch size 1, 2 choices
>>> mc_token_ids = tf.constant([cls_token_location])  # batch size 1
>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
TFGPT2ForSequenceClassification
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a sequence classification head on top (linear layer).
Since it does classification on the last token, it needs to know the position of the last token. If
a pad_token_id is defined in the configuration, it finds the last token that is not a padding
token in each row. If no pad_token_id is defined, it simply takes the last value in each row of
the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of
input_ids , it does the same (takes the last value in each row of the batch).
This model inherits from TFPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the
TF 2.0 documentation for all matters related to general usage and behavior.
• a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
Parameters
If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
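A sketch mirroring the PyTorch example above (microsoft/DialogRPT-updown is one community checkpoint with a classification head):
>>> import tensorflow as tf
>>> from transformers import AutoTokenizer, TFGPT2ForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
>>> model = TFGPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> logits = model(**inputs).logits
>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])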
TFSequenceClassifierOutputWithPast
class transformers.modeling_tf_outputs.TFSequenceClassifierOutputWithPast
Parameters
• past_key_values — Contains pre-computed hidden-states (key and values in the attention blocks)
that can be used (see past_key_values input) to speed up sequential decoding.
TFGPT2Tokenizer
class transformers.TFGPT2Tokenizer
( vocab: Dict, merges: List, max_length: int = None, pad_token_id: int = None )
Parameters
This is an in-graph tokenizer for GPT2. It should be initialized similarly to other tokenizers, using
the from_pretrained() method. It can also be initialized with the from_tokenizer()
method, which imports settings from an existing standard tokenizer object.
In-graph tokenizers, unlike other Hugging Face tokenizers, are actually Keras layers and are
designed to be run when the model is called, rather than during preprocessing. As a result, they
have somewhat more limited options than standard tokenizer classes. They are most useful
when you want to create an end-to-end model that goes straight from tf.string inputs to
outputs.
from_pretrained
Examples:
from transformers import TFGPT2Tokenizer
tf_tokenizer = TFGPT2Tokenizer.from_pretrained("openai-community/gpt2")
from_tokenizer
Parameters
• tokenizer (GPT2Tokenizer) — An existing standard tokenizer whose settings are imported.
Examples:
from transformers import AutoTokenizer, TFGPT2Tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tf_tokenizer = TFGPT2Tokenizer.from_tokenizer(tokenizer)
FlaxGPT2Model
class transformers.FlaxGPT2Model
( config: GPT2Config, input_shape: Tuple = (1, 1), seed: int = 0, dtype: dtype
= <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data type of the
computation. Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer
to the Flax documentation for all matters related to general usage and behavior.
Parameters
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
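A typical usage sketch (note the NumPy tensors for Flax):
>>> from transformers import AutoTokenizer, FlaxGPT2Model
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = FlaxGPT2Model.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state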
FlaxGPT2LMHeadModel
class transformers.FlaxGPT2LMHeadModel
( config: GPT2Config, input_shape: Tuple = (1, 1), seed: int = 0, dtype: dtype
= <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data type of the
computation. Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer
to the Flax documentation for all matters related to general usage and behavior.
Parameters
Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:
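A short sketch of computing next-token logits:
>>> from transformers import AutoTokenizer, FlaxGPT2LMHeadModel
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = FlaxGPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(**inputs)
>>> next_token_logits = outputs.logits[:, -1]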