OpenAI GPT2


Overview

The OpenAI GPT-2 model was proposed in Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever from OpenAI. It’s a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.

The abstract from the paper is the following:

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of
8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the
previous words within some text. The diversity of the dataset causes this simple goal to contain naturally
occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with
more than 10X the parameters and trained on more than 10X the amount of data.

Write With Transformer is a webapp created and hosted by Hugging Face showcasing the generative
capabilities of several models. GPT-2 is one of them and is available in five different sizes: small,
medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.

This model was contributed by thomwolf. The original code can be found here.

Usage tips

GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on
the right rather than the left.
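A minimal sketch of right padding for batched prompts, assuming the gpt2 checkpoint and reusing the EOS token as padding (GPT-2 ships without a dedicated pad token, so this choice is an assumption rather than part of the checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the padding token
tokenizer.padding_side = "right"           # the default, shown explicitly here

batch = tokenizer(["short", "a somewhat longer prompt"], padding=True, return_tensors="pt")
print(batch["attention_mask"])  # zeros mark the right-padded positions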

GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text, as can be observed in the run_generation.py example script.

The model can take the past_key_values (for PyTorch) or past (for TF) as input, which are the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see the past_key_values argument of the GPT2Model.forward() method, or for TF the past argument of the TFGPT2Model.call() method, for more information on its usage.
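As a rough PyTorch sketch (the prompt, step count, and greedy decoding loop are illustrative, not part of the library API surface), the cache returned by one forward pass can be fed back so that only the newest token is re-encoded:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("GPT-2 reuses its cache", return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(20):  # greedy decoding, 20 new tokens
        # once a cache exists, only the most recent token needs to be passed
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        outputs = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))

In practice generate() handles this caching automatically; the loop above only makes the mechanism visible.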

Enabling the scale_attn_by_inverse_layer_idx and reorder_and_upcast_attn flags will apply the training stability improvements from Mistral (for PyTorch only).
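A short sketch of enabling those flags when building a configuration (the model here is randomly initialized; load pretrained weights as needed):

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    scale_attn_by_inverse_layer_idx=True,  # scale attention weights by 1/(layer_idx + 1)
    reorder_and_upcast_attn=True,          # upcast the attention computation for numerical stability
)
model = GPT2LMHeadModel(config)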

Usage example

The generate() method can be used to generate text using the GPT2 model.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")

>>> prompt = "GPT2 is a model developed by OpenAI."

>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

>>> gen_tokens = model.generate(
...     input_ids,
...     do_sample=True,
...     temperature=0.9,
...     max_length=100,
... )
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]

Using Flash Attention 2

Flash Attention 2 is a faster, optimized version of the attention scores computation which relies on CUDA kernels.

Installation

First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible
hardware can be found in the official documentation. If your hardware is not compatible with Flash
Attention 2, you can still benefit from attention kernel optimisations through Better Transformer
support covered above.

Next, install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation
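As a quick, assumption-laden sanity check (it assumes an NVIDIA GPU and a working PyTorch install, and is not an official compatibility test), you can inspect the GPU’s compute capability before installing:

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # FlashAttention 2 generally targets Ampere (compute capability 8.x) or newer GPUs;
    # consult the official documentation for the authoritative list.
    print(f"Compute capability: {major}.{minor}")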

Usage

To load a model using Flash Attention 2, we can pass the argument attn_implementation="flash_attention_2" to .from_pretrained. We’ll also load the model in half-precision (e.g. torch.float16 ), since it results in almost no degradation to generation quality but significantly lower memory usage and faster inference:

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> device = "cuda"  # the device to load the model onto

>>> model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")

>>> prompt = "def hello_world():"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
>>> model.to(device)

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]

Expected speedups

Below is an expected speedup diagram comparing pure inference time between the native implementation in transformers using the gpt2 checkpoint and the Flash Attention 2 version of the model, for a sequence length of 512.

Using Scaled Dot Product Attention (SDPA)


PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional . This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set
attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16, attn_implementation="sdpa")
...

For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or
torch.bfloat16 ).

On a local benchmark (rtx3080ti-16GB, PyTorch 2.2.1, OS Ubuntu 22.04) using float16 with gpt2-large,
we saw the following speedups during training and inference.

Training

| Batch size | Seq len | Time per batch (Eager - s) | Time per batch (SDPA - s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 0.039 | 0.032 | 23.042 | 3482.32 | 3494.62 | -0.352 |
| 1 | 256 | 0.073 | 0.059 | 25.15 | 3546.66 | 3552.6 | -0.167 |
| 1 | 512 | 0.155 | 0.118 | 30.96 | 4230.1 | 3665.59 | 15.4 |
| 1 | 1024 | 0.316 | 0.209 | 50.839 | 8682.26 | 4881.09 | 77.875 |
| 2 | 128 | 0.07 | 0.06 | 15.324 | 3557.8 | 3545.91 | 0.335 |
| 2 | 256 | 0.143 | 0.122 | 16.53 | 3901.5 | 3657.68 | 6.666 |
| 2 | 512 | 0.267 | 0.213 | 25.626 | 7062.21 | 4876.47 | 44.822 |
| 2 | 1024 | OOM | 0.404 | / | OOM | 8096.35 | SDPA does not OOM |
| 4 | 128 | 0.134 | 0.128 | 4.412 | 3675.79 | 3648.72 | 0.742 |
| 4 | 256 | 0.243 | 0.217 | 12.292 | 6129.76 | 4871.12 | 25.839 |
| 4 | 512 | 0.494 | 0.406 | 21.687 | 12466.6 | 8102.64 | 53.858 |
| 4 | 1024 | OOM | 0.795 | / | OOM | 14568.2 | SDPA does not OOM |

Inference

| Batch size | Seq len | Per token latency Eager (ms) | Per token latency SDPA (ms) | Speedup (%) | Mem Eager (MB) | Mem SDPA (MB) | Mem saved (%) |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 7.991 | 6.968 | 14.681 | 1685.2 | 1701.32 | -0.947 |
| 1 | 256 | 8.462 | 7.199 | 17.536 | 1745.49 | 1770.78 | -1.428 |
| 1 | 512 | 8.68 | 7.853 | 10.529 | 1907.69 | 1921.29 | -0.708 |
| 1 | 768 | 9.101 | 8.365 | 8.791 | 2032.93 | 2068.12 | -1.701 |
| 2 | 128 | 9.169 | 9.001 | 1.861 | 1803.84 | 1811.4 | -0.418 |
| 2 | 256 | 9.907 | 9.78 | 1.294 | 1907.72 | 1921.44 | -0.714 |
| 2 | 512 | 11.519 | 11.644 | -1.071 | 2176.86 | 2197.75 | -0.951 |
| 2 | 768 | 13.022 | 13.407 | -2.873 | 2464.3 | 2491.06 | -1.074 |
| 4 | 128 | 10.097 | 9.831 | 2.709 | 1942.25 | 1985.13 | -2.16 |
| 4 | 256 | 11.599 | 11.398 | 1.764 | 2177.28 | 2197.86 | -0.937 |
| 4 | 512 | 14.653 | 14.45 | 1.411 | 2753.16 | 2772.57 | -0.7 |
| 4 | 768 | 17.846 | 17.617 | 1.299 | 3327.04 | 3343.97 | -0.506 |

Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with
GPT2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull
Request and we’ll review it! The resource should ideally demonstrate something new instead of
duplicating an existing resource.

Text Generation

A blog on how to Finetune a non-English GPT-2 Model with Hugging Face.

A blog on How to generate text: using different decoding methods for language generation with
Transformers with GPT-2.

A blog on Training CodeParrot 🦜 from Scratch, a large GPT-2 model.

A blog on Faster Text Generation with TensorFlow and XLA with GPT-2.

A blog on How to train a Language Model with Megatron-LM with a GPT-2 model.

A notebook on how to finetune GPT2 to generate lyrics in the style of your favorite artist. 🌎

A notebook on how to finetune GPT2 to generate tweets in the style of your favorite Twitter user.
🌎
Causal language modeling chapter of the 🤗 Hugging Face Course.

GPT2LMHeadModel is supported by this causal language modeling example script, text generation example script, and notebook.

TFGPT2LMHeadModel is supported by this causal language modeling example script and notebook.

FlaxGPT2LMHeadModel is supported by this causal language modeling example script and notebook.

Text classification task guide

Token classification task guide

Causal language modeling task guide

GPT2Config

class transformers.GPT2Config < source >


( vocab_size = 50257, n_positions = 1024, n_embd = 768, n_layer = 12, n_head = 12,
n_inner = None, activation_function = 'gelu_new', resid_pdrop = 0.1, embd_pdrop =
0.1, attn_pdrop = 0.1, layer_norm_epsilon = 1e-05, initializer_range = 0.02,
summary_type = 'cls_index', summary_use_proj = True, summary_activation = None,
summary_proj_to_labels = True, summary_first_dropout = 0.1, scale_attn_weights =
True, use_cache = True, bos_token_id = 50256, eos_token_id = 50256,
scale_attn_by_inverse_layer_idx = False, reorder_and_upcast_attn = False, **kwargs
)

Parameters

• vocab_size ( int , optional, defaults to 50257) — Vocabulary size of the GPT-2 model. Defines the
number of different tokens that can be represented by the inputs_ids passed when calling
GPT2Model or TFGPT2Model.

• n_positions ( int , optional, defaults to 1024) — The maximum sequence length that this model
might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

• n_embd ( int , optional, defaults to 768) — Dimensionality of the embeddings and hidden states.

• n_layer ( int , optional, defaults to 12) — Number of hidden layers in the Transformer encoder.

• n_head ( int , optional, defaults to 12) — Number of attention heads for each attention layer in the
Transformer encoder.
• n_inner ( int , optional) — Dimensionality of the inner feed-forward layers. None will set it to 4
times n_embd

This is the configuration class to store the configuration of a GPT2Model or a TFGPT2Model. It is used
to instantiate a GPT-2 model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the GPT-2
openai-community/gpt2 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs.
Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import GPT2Config, GPT2Model

>>> # Initializing a GPT2 configuration


>>> configuration = GPT2Config()

>>> # Initializing a model (with random weights) from the configuration


>>> model = GPT2Model(configuration)

>>> # Accessing the model configuration


>>> configuration = model.config

GPT2Tokenizer

class transformers.GPT2Tokenizer < source >

( vocab_file, merges_file, errors = 'replace', unk_token = '<|endoftext|>',


bos_token = '<|endoftext|>', eos_token = '<|endoftext|>', pad_token = None,
add_prefix_space = False, add_bos_token = False, **kwargs )

Parameters

• vocab_file ( str ) — Path to the vocabulary file.

• merges_file ( str ) — Path to the merges file.

• errors ( str , optional, defaults to "replace" ) — Paradigm to follow when decoding bytes to UTF-8.
See bytes.decode for more information.

• unk_token ( str , optional, defaults to "<|endoftext|>" ) — The unknown token. A token that is
not in the vocabulary cannot be converted to an ID and is set to be this token instead.

• bos_token ( str , optional, defaults to "<|endoftext|>" ) — The beginning of sequence token.

• eos_token ( str , optional, defaults to "<|endoftext|>" ) — The end of sequence token.


• pad_token ( str , optional) — The token used for padding, for example when batching sequences of
different lengths.

Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

>>> from transformers import GPT2Tokenizer


>>> tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]

>>> tokenizer(" Hello world")["input_ids"]


[18435, 995]

You can get around that behavior by passing add_prefix_space=True when instantiating this
tokenizer or when you call it on some text, but since the model was not pretrained this way, it might
yield a decrease in performance.

When used with is_split_into_words=True , this tokenizer will add a space before each word
(even the first one).
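A brief sketch of both behaviors (the sentence and the pre-tokenized words are only illustrative):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2", add_prefix_space=True)

# Regular text now gets a leading space, matching how mid-sentence words were seen in pretraining.
print(tokenizer("Hello world")["input_ids"])

# Pre-tokenized input: a space is added before each word, including the first one.
print(tokenizer(["Hello", "world"], is_split_into_words=True)["input_ids"])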

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users
should refer to this superclass for more information regarding those methods.

save_vocabulary < source >

( save_directory: str, filename_prefix: Optional = None )

GPT2TokenizerFast

class transformers.GPT2TokenizerFast < source >

( vocab_file = None, merges_file = None, tokenizer_file = None, unk_token =


'<|endoftext|>', bos_token = '<|endoftext|>', eos_token = '<|endoftext|>',
add_prefix_space = False, **kwargs )

Parameters

• vocab_file ( str , optional) — Path to the vocabulary file.

• merges_file ( str , optional) — Path to the merges file.

• tokenizer_file ( str , optional) — Path to tokenizers file (generally has a .json extension) that
contains everything needed to load the tokenizer.

• unk_token ( str , optional, defaults to "<|endoftext|>" ) — The unknown token. A token that is
not in the vocabulary cannot be converted to an ID and is set to be this token instead.
• bos_token ( str , optional, defaults to "<|endoftext|>" ) — The beginning of sequence token.

• eos_token ( str , optional, defaults to "<|endoftext|>" ) — The end of sequence token.

• add_prefix_space ( bool , optional, defaults to False ) — Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word. (The GPT2 tokenizer detects the beginning of words by the preceding space.)

Construct a “fast” GPT-2 tokenizer (backed by HuggingFace’s tokenizers library). Based on byte-level
Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

>>> from transformers import GPT2TokenizerFast

>>> tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")


>>> tokenizer("Hello world")["input_ids"]
[15496, 995]

>>> tokenizer(" Hello world")["input_ids"]


[18435, 995]

You can get around that behavior by passing add_prefix_space=True when instantiating this
tokenizer, but since the model was not pretrained this way, it might yield a decrease in performance.

When used with is_split_into_words=True , this tokenizer needs to be instantiated with add_prefix_space=True .
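For example, a minimal sketch with pre-tokenized input (the words here are only illustrative):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2", add_prefix_space=True)
enc = tokenizer(["Hello", "world"], is_split_into_words=True)
print(enc["input_ids"])
print(enc.word_ids())  # maps each token back to the word it came from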

This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.

GPT2 specific outputs

class transformers.models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput < source >
( loss: Optional = None, mc_loss: Optional = None, logits: FloatTensor = None,
mc_logits: FloatTensor = None, past_key_values: Optional = None, hidden_states:
Optional = None, attentions: Optional = None )

Parameters

• loss ( torch.FloatTensor of shape (1,) , optional, returned when labels is provided) —


Language modeling loss.

• mc_loss ( torch.FloatTensor of shape (1,) , optional, returned when mc_labels is provided) —


Multiple choice classification loss.

• logits ( torch.FloatTensor of shape (batch_size, num_choices, sequence_length,


config.vocab_size) ) — Prediction scores of the language modeling head (scores for each
vocabulary token before SoftMax).

• mc_logits ( torch.FloatTensor of shape (batch_size, num_choices) ) — Prediction scores of


the multiple choice classification head (scores for each choice before SoftMax).
• past_key_values ( Tuple[Tuple[torch.Tensor]] , optional, returned when use_cache=True is passed or when config.use_cache=True ) — Tuple of length config.n_layers , containing tuples of tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) . Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
Base class for outputs of the GPT-2 model with a language modeling head and a multiple-choice classification head.

class transformers.models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput < source >

( logits: tf.Tensor = None, mc_logits: tf.Tensor = None, past_key_values:


List[tf.Tensor] | None = None, hidden_states: Tuple[tf.Tensor] | None = None,
attentions: Tuple[tf.Tensor] | None = None )

Parameters

• logits ( tf.Tensor of shape (batch_size, num_choices, sequence_length,


config.vocab_size) ) — Prediction scores of the language modeling head (scores for each
vocabulary token before SoftMax).

• mc_logits ( tf.Tensor of shape (batch_size, num_choices) ) — Prediction scores of the


multiple choice classification head (scores for each choice before SoftMax).

• past_key_values ( List[tf.Tensor] , optional, returned when use_cache=True is passed or when config.use_cache=True ) — List of tf.Tensor of length config.n_layers , with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head) . Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.

• hidden_states ( tuple(tf.Tensor) , optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True ) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size) . Hidden-states of the model at the output of each layer plus the initial embedding outputs.

Base class for outputs of the TensorFlow GPT-2 model with a language modeling head and a multiple-choice classification head.

Pytorch

GPT2Model

class transformers.GPT2Model < source >

( config )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward < source >

( input_ids: Optional = None, past_key_values: Optional = None,


attention_mask: Optional = None, token_type_ids: Optional = None,
position_ids: Optional = None, head_mask: Optional = None, inputs_embeds:
Optional = None, encoder_hidden_states: Optional = None,
encoder_attention_mask: Optional = None, use_cache: Optional = None,
output_attentions: Optional = None, output_hidden_states: Optional = None,
return_dict: Optional = None ) →
transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions or
tuple(torch.FloatTensor)

Parameters

• input_ids ( torch.LongTensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0][0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and


PreTrainedTokenizer.call() for details.

What are input IDs?

• past_key_values ( Tuple[Tuple[torch.Tensor]] of length config.n_layers ) — Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see the past_key_values output). Can be used to speed up sequential decoding.
The GPT2Model forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, GPT2Model


>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")


>>> model = GPT2Model.from_pretrained("openai-community/gpt2")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")


>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

GPT2LMHeadModel
class transformers.GPT2LMHeadModel < source >

( config )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward < source >

( input_ids: Optional = None, past_key_values: Optional = None,


attention_mask: Optional = None, token_type_ids: Optional = None,
position_ids: Optional = None, head_mask: Optional = None, inputs_embeds:
Optional = None, encoder_hidden_states: Optional = None,
encoder_attention_mask: Optional = None, labels: Optional = None, use_cache:
Optional = None, output_attentions: Optional = None, output_hidden_states:
Optional = None, return_dict: Optional = None ) →
transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or
tuple(torch.FloatTensor)

Parameters

• input_ids ( torch.LongTensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0][0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.
If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and


PreTrainedTokenizer.call() for details.

What are input IDs?

The GPT2LMHeadModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> import torch


>>> from transformers import AutoTokenizer, GPT2LMHeadModel

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")


>>> model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")


>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits

GPT2DoubleHeadsModel

class transformers.GPT2DoubleHeadsModel < source >

( config )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.
The GPT2 Model transformer with a language modeling and a multiple-choice classification head on top, e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to the input embeddings; the classification head takes as input the input of a specified classification token index in the input sequence.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward < source >

( input_ids: Optional = None, past_key_values: Optional = None,


attention_mask: Optional = None, token_type_ids: Optional = None,
position_ids: Optional = None, head_mask: Optional = None, inputs_embeds:
Optional = None, mc_token_ids: Optional = None, labels: Optional = None,
mc_labels: Optional = None, use_cache: Optional = None, output_attentions:
Optional = None, output_hidden_states: Optional = None, return_dict: Optional
= None, **kwargs ) →
transformers.models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput or
tuple(torch.FloatTensor)

Parameters

• input_ids ( torch.LongTensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0][0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and


PreTrainedTokenizer.call() for details.

What are input IDs?

• past_key_values ( Tuple[Tuple[torch.Tensor]] of length config.n_layers ) — Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see the past_key_values output). Can be used to speed up sequential decoding.
The GPT2DoubleHeadsModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> import torch


>>> from transformers import AutoTokenizer, GPT2DoubleHeadsModel

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")


>>> model = GPT2DoubleHeadsModel.from_pretrained("openai-community/gpt2")

>>> # Add a [CLS] to the vocabulary (we should train it also!)


>>> num_added_tokens = tokenizer.add_special_tokens({"cls_token": "[CLS]"})
>>> # Update the model embeddings with the new vocabulary size
>>> embedding_layer = model.resize_token_embeddings(len(tokenizer))

>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]

>>> input_ids = torch.tensor(encoded_choices).unsqueeze(0) # Batch size: 1


>>> mc_token_ids = torch.tensor([cls_token_location]) # Batch size: 1

>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)


>>> lm_logits = outputs.logits
>>> mc_logits = outputs.mc_logits

GPT2ForQuestionAnswering

class transformers.GPT2ForQuestionAnswering < source >

( config )

Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The GPT-2 Model transformer with a span classification head on top for extractive question-
answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span
start logits and span end logits ).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward < source >

( input_ids: Optional = None, attention_mask: Optional = None,


token_type_ids: Optional = None, position_ids: Optional = None, head_mask:
Optional = None, inputs_embeds: Optional = None, start_positions: Optional =
None, end_positions: Optional = None, output_attentions: Optional = None,
output_hidden_states: Optional = None, return_dict: Optional = None ) →
transformers.modeling_outputs.QuestionAnsweringModelOutput or
tuple(torch.FloatTensor)

Parameters

• input_ids ( torch.LongTensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0][0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and


PreTrainedTokenizer.call() for details.

What are input IDs?

• past_key_values ( Tuple[Tuple[torch.Tensor]] of length config.n_layers ) — Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see the past_key_values output). Can be used to speed up sequential decoding.
The GPT2ForQuestionAnswering forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

This example uses the base openai-community/gpt2 checkpoint, whose question-answering head is randomly initialized, so the predictions below are not meaningful; to get proper results you should use a checkpoint fine-tuned for extractive question answering. If you get out-of-memory when loading a checkpoint, you can try adding device_map="auto" in the from_pretrained call.

Example:

>>> from transformers import AutoTokenizer, GPT2ForQuestionAnswering


>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")


>>> model = GPT2ForQuestionAnswering.from_pretrained("openai-community/gpt2")

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

>>> inputs = tokenizer(question, text, return_tensors="pt")


>>> with torch.no_grad():
... outputs = model(**inputs)

>>> answer_start_index = outputs.start_logits.argmax()


>>> answer_end_index = outputs.end_logits.argmax()

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

>>> # target is "nice puppet"


>>> target_start_index = torch.tensor([14])
>>> target_end_index = torch.tensor([15])

>>> outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)


>>> loss = outputs.loss

GPT2ForSequenceClassification
class transformers.GPT2ForSequenceClassification < source >

( config )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The GPT2 Model transformer with a sequence classification head on top (linear layer).

GPT2ForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-1) do.

Since it does classification on the last token, it needs to know the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids , it does the same (takes the last value in each row of the batch).
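A minimal sketch of batched classification that defines a padding token so the model can locate the last non-padding token (reusing EOS as padding and num_labels=2 are assumptions here, not part of the pretrained checkpoint):

from transformers import AutoTokenizer, GPT2ForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("openai-community/gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # lets the model find the last real token per row

inputs = tokenizer(["a short text", "a slightly longer text"], padding=True, return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch_size, num_labels)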

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward < source >

( input_ids: Optional = None, past_key_values: Optional = None,


attention_mask: Optional = None, token_type_ids: Optional = None,
position_ids: Optional = None, head_mask: Optional = None, inputs_embeds:
Optional = None, labels: Optional = None, use_cache: Optional = None,
output_attentions: Optional = None, output_hidden_states: Optional = None,
return_dict: Optional = None ) →
transformers.modeling_outputs.SequenceClassifierOutputWithPast or
tuple(torch.FloatTensor)
Parameters

• input_ids ( torch.LongTensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0][0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and


PreTrainedTokenizer.call() for details.

What are input IDs?

• past_key_values ( Tuple[Tuple[torch.Tensor]] of length config.n_layers ) — Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see the past_key_values output). Can be used to speed up sequential decoding.

The GPT2ForSequenceClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example of single-label classification:

>>> import torch


>>> from transformers import AutoTokenizer, GPT2ForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")


>>> model = GPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

>>> with torch.no_grad():


... logits = model(**inputs).logits

>>> predicted_class_id = logits.argmax().item()

>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`


>>> num_labels = len(model.config.id2label)
>>> model = GPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown", num_labels=num_labels)

>>> labels = torch.tensor([1])


>>> loss = model(**inputs, labels=labels).loss

Example of multi-label classification:

>>> import torch


>>> from transformers import AutoTokenizer, GPT2ForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")


>>> model = GPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

>>> with torch.no_grad():


... logits = model(**inputs).logits

>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]

>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`


>>> num_labels = len(model.config.id2label)
>>> model = GPT2ForSequenceClassification.from_pretrained(
...     "microsoft/DialogRPT-updown", num_labels=num_labels, problem_type="multi_label_classification"
... )

>>> labels = torch.sum(
...     torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss

GPT2ForTokenClassification

class transformers.GPT2ForTokenClassification < source >

( config )

Parameters
• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

GPT2 Model with a token classification head on top (a linear layer on top of the hidden-states
output) e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward < source >

( input_ids: Optional = None, past_key_values: Optional = None,


attention_mask: Optional = None, token_type_ids: Optional = None,
position_ids: Optional = None, head_mask: Optional = None, inputs_embeds:
Optional = None, labels: Optional = None, use_cache: Optional = None,
output_attentions: Optional = None, output_hidden_states: Optional = None,
return_dict: Optional = None ) →
transformers.modeling_outputs.TokenClassifierOutput or
tuple(torch.FloatTensor)

Parameters

• input_ids ( torch.LongTensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0][0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input_ids that do not have their past calculated should
be passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and


PreTrainedTokenizer.call() for details.

What are input IDs?

• past_key_values ( Tuple[Tuple[torch.Tensor]] of length config.n_layers ) — Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see the past_key_values output). Can be used to speed up sequential decoding.
The GPT2ForTokenClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, GPT2ForTokenClassification


>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("brad1141/gpt2-finetuned-comp2


>>> model = GPT2ForTokenClassification.from_pretrained("brad1141/gpt2-finetu

>>> inputs = tokenizer(
...     "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
... )

>>> with torch.no_grad():


... logits = model(**inputs).logits

>>> predicted_token_class_ids = logits.argmax(-1)

>>> # Note that tokens are classified rather than input words, which means that
>>> # there might be more predicted token classes than words.
>>> # Multiple token classes might account for the same word
>>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
>>> predicted_tokens_classes
['Lead', 'Lead', 'Lead', 'Position', 'Lead', 'Lead', 'Lead', 'Lead', 'Lead', ...]

>>> labels = predicted_token_class_ids


>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
0.25

TensorFlow

TFGPT2Model
class transformers.TFGPT2Model < source >

( config, *inputs, **kwargs )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when
passing inputs to models and layers. Because of this support, when using methods like
model.fit() things should “just work” for you - just pass your inputs and labels in any
format that model.fit() supports! If, however, you want to use the second format outside of
Keras methods like fit() and predict() , such as when creating your own layers or models
with the Keras Functional API, there are three possibilities you can use to gather all the
input Tensors in the first positional argument:

•a single Tensor with input_ids only and nothing else: model(input_ids)


•a list of varying length with one or several input Tensors IN THE ORDER given in the
docstring: model([input_ids, attention_mask]) or model([input_ids,
attention_mask, token_type_ids])

•a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!

call < source >

( input_ids: TFModelInputType | None = None, past_key_values:


Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, attention_mask:
np.ndarray | tf.Tensor | None = None, token_type_ids: np.ndarray | tf.Tensor
| None = None, position_ids: np.ndarray | tf.Tensor | None = None, head_mask:
np.ndarray | tf.Tensor | None = None, inputs_embeds: np.ndarray | tf.Tensor |
None = None, encoder_hidden_states: np.ndarray | tf.Tensor | None = None,
encoder_attention_mask: np.ndarray | tf.Tensor | None = None, use_cache:
Optional[bool] = None, output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] =
None, training: Optional[bool] = False ) →
transformers.modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions
or tuple(tf.Tensor)

Parameters

• input_ids ( Numpy array or tf.Tensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.call() and


PreTrainedTokenizer.encode() for details.

What are input IDs?

• past_key_values ( List[tf.Tensor] of length config.n_layers ) — Contains pre-computed hidden states (key and values in the attention blocks) as computed by the model (see the past_key_values output). Can be used to speed up sequential decoding.

The TFGPT2Model forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.
Example:

>>> from transformers import AutoTokenizer, TFGPT2Model


>>> import tensorflow as tf

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")


>>> model = TFGPT2Model.from_pretrained("openai-community/gpt2")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")


>>> outputs = model(inputs)

>>> last_hidden_states = outputs.last_hidden_state

TFGPT2LMHeadModel

class transformers.TFGPT2LMHeadModel < source >

( config, *inputs, **kwargs )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when
passing inputs to models and layers. Because of this support, when using methods like
model.fit() things should “just work” for you - just pass your inputs and labels in any
format that model.fit() supports! If, however, you want to use the second format outside of
Keras methods like fit() and predict() , such as when creating your own layers or models
with the Keras Functional API, there are three possibilities you can use to gather all the
input Tensors in the first positional argument:

•a single Tensor with input_ids only and nothing else: model(input_ids)


•a list of varying length with one or several input Tensors IN THE ORDER given in the
docstring: model([input_ids, attention_mask]) or model([input_ids,
attention_mask, token_type_ids])

•a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!

call < source >

( input_ids: TFModelInputType | None = None, past_key_values:


Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, attention_mask:
np.ndarray | tf.Tensor | None = None, token_type_ids: np.ndarray | tf.Tensor
| None = None, position_ids: np.ndarray | tf.Tensor | None = None, head_mask:
np.ndarray | tf.Tensor | None = None, inputs_embeds: np.ndarray | tf.Tensor |
None = None, encoder_hidden_states: np.ndarray | tf.Tensor | None = None,
encoder_attention_mask: np.ndarray | tf.Tensor | None = None, use_cache:
Optional[bool] = None, output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] =
None, labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool]
= False ) →
transformers.modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions or
tuple(tf.Tensor)

Parameters

• input_ids ( Numpy array or tf.Tensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.call() and PreTrainedTokenizer.encode() for details.

What are input IDs?

The TFGPT2LMHeadModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, TFGPT2LMHeadModel


>>> import tensorflow as tf

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")


>>> model = TFGPT2LMHeadModel.from_pretrained("openai-community/gpt2")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")


>>> outputs = model(inputs)
>>> logits = outputs.logits

TFGPT2DoubleHeadsModel

class transformers.TFGPT2DoubleHeadsModel < source >

( config, *inputs, **kwargs )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The GPT2 Model transformer with a language modeling and a multiple-choice classification head on top, e.g. for RocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to the input embeddings; the classification head takes as input the input of a specified classification token index in the input sequence.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when
passing inputs to models and layers. Because of this support, when using methods like
model.fit() things should “just work” for you - just pass your inputs and labels in any
format that model.fit() supports! If, however, you want to use the second format outside of
Keras methods like fit() and predict() , such as when creating your own layers or models
with the Keras Functional API, there are three possibilities you can use to gather all the
input Tensors in the first positional argument:

•a single Tensor with input_ids only and nothing else: model(input_ids)


•a list of varying length with one or several input Tensors IN THE ORDER given in the
docstring: model([input_ids, attention_mask]) or model([input_ids,
attention_mask, token_type_ids])

•a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!

call < source >


( input_ids: TFModelInputType | None = None, past_key_values:
Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, attention_mask:
np.ndarray | tf.Tensor | None = None, token_type_ids: np.ndarray | tf.Tensor
| None = None, position_ids: np.ndarray | tf.Tensor | None = None, head_mask:
np.ndarray | tf.Tensor | None = None, inputs_embeds: np.ndarray | tf.Tensor |
None = None, mc_token_ids: np.ndarray | tf.Tensor | None = None, use_cache:
Optional[bool] = None, output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] =
None, training: Optional[bool] = False ) →
transformers.models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput or
tuple(tf.Tensor)

Parameters

• input_ids ( Numpy array or tf.Tensor of shape (batch_size, input_ids_length) ) —


input_ids_length = sequence_length if past_key_values is None else
past_key_values[0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.call() and


PreTrainedTokenizer.encode() for details.

What are input IDs?

• past_key_values ( List[tf.Tensor] of length config.n_layers ) — Contains pre-computed hidden states (key and values in the attention blocks) as computed by the model (see the past_key_values output). Can be used to speed up sequential decoding.

The TFGPT2DoubleHeadsModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Examples:

>>> import tensorflow as tf


>>> from transformers import AutoTokenizer, TFGPT2DoubleHeadsModel

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")


>>> model = TFGPT2DoubleHeadsModel.from_pretrained("openai-community/gpt2")
>>> # Add a [CLS] to the vocabulary (we should train it also!)
>>> num_added_tokens = tokenizer.add_special_tokens({"cls_token": "[CLS]"})

>>> embedding_layer = model.resize_token_embeddings(


... len(tokenizer)
... ) # Update the model embeddings with the new vocabulary size

>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]
>>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]

>>> input_ids = tf.constant(encoded_choices)[None, :]  # Batch size: 1, number of choices: 2


>>> mc_token_ids = tf.constant([cls_token_location]) # Batch size: 1

>>> outputs = model(input_ids, mc_token_ids=mc_token_ids)


>>> lm_prediction_scores, mc_prediction_scores = outputs[:2]

TFGPT2ForSequenceClassification

class transformers.TFGPT2ForSequenceClassification < source >

( config, *inputs, **kwargs )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

The GPT2 Model transformer with a sequence classification head on top (linear layer).

TFGPT2ForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-1) do.

Since it does classification on the last token, it needs to know the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids , it does the same (takes the last value in each row of the batch).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when
passing inputs to models and layers. Because of this support, when using methods like
model.fit() things should “just work” for you - just pass your inputs and labels in any
format that model.fit() supports! If, however, you want to use the second format outside of
Keras methods like fit() and predict() , such as when creating your own layers or models
with the Keras Functional API, there are three possibilities you can use to gather all the
input Tensors in the first positional argument:

•a single Tensor with input_ids only and nothing else: model(input_ids)


•a list of varying length with one or several input Tensors IN THE ORDER given in the
docstring: model([input_ids, attention_mask]) or model([input_ids,
attention_mask, token_type_ids])

•a dictionary with one or several input Tensors associated to the input names given in the
docstring: model({"input_ids": input_ids, "token_type_ids":
token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to worry
about any of this, as you can just pass inputs like you would to any other Python function!
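
For instance, a brief sketch of the call styles described above (reusing the checkpoint from the example further below):

from transformers import AutoTokenizer, TFGPT2ForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
model = TFGPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")
enc = tokenizer("Hello, my dog is cute", return_tensors="tf")

out_kw = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])  # keyword arguments
out_single = model(enc["input_ids"])                                               # a single tensor
out_list = model([enc["input_ids"], enc["attention_mask"]])                        # list, in docstring order
out_dict = model({"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]})  # dict keyed by name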

call < source >

( input_ids: TFModelInputType | None = None, past_key_values:
Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, attention_mask:
np.ndarray | tf.Tensor | None = None, token_type_ids: np.ndarray | tf.Tensor
| None = None, position_ids: np.ndarray | tf.Tensor | None = None, head_mask:
np.ndarray | tf.Tensor | None = None, inputs_embeds: np.ndarray | tf.Tensor |
None = None, use_cache: Optional[bool] = None, output_attentions:
Optional[bool] = None, output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None, labels: np.ndarray | tf.Tensor | None =
None, training: Optional[bool] = False ) →
transformers.modeling_tf_outputs.TFSequenceClassifierOutputWithPast or
tuple(tf.Tensor)

Parameters

• input_ids ( Numpy array or tf.Tensor of shape (batch_size, input_ids_length) ) —
input_ids_length = sequence_length if past_key_values is None else
past_key_values[0].shape[-2] ( sequence_length of input past key value states).
Indices of input sequence tokens in the vocabulary.

If past_key_values is used, only input IDs that do not have their past calculated should be
passed as input_ids .

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.call() and
PreTrainedTokenizer.encode() for details.

What are input IDs?

• past_key_values ( List[tf.Tensor] of length config.n_layers ) — Contains pre-
computed hidden states (key and values in the attention blocks) as computed by the model
(see the past_key_values output below). Can be used to speed up sequential decoding.

The TFGPT2ForSequenceClassification forward method overrides the __call__ special
method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, TFGPT2ForSequenceClassification
>>> import tensorflow as tf

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
>>> model = TFGPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")

>>> logits = model(**inputs).logits

>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])

>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = TFGPT2ForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown", num_labels=num_labels)

>>> labels = tf.constant(1)
>>> loss = model(**inputs, labels=labels).loss

TFSequenceClassifierOutputWithPast

class transformers.modeling_tf_outputs.TFSequenceClassifierOutputWithPast < source >

( loss: tf.Tensor | None = None, logits: tf.Tensor = None, past_key_values:
List[tf.Tensor] | None = None, hidden_states: Tuple[tf.Tensor] | None = None,
attentions: Tuple[tf.Tensor] | None = None )

Parameters

• loss ( tf.Tensor of shape (batch_size, ) , optional, returned when labels is provided) —
Classification (or regression if config.num_labels==1) loss.

• logits ( tf.Tensor of shape (batch_size, config.num_labels) ) — Classification (or
regression if config.num_labels==1) scores (before SoftMax).

• past_key_values ( List[tf.Tensor] , optional, returned when use_cache=True is passed or
when config.use_cache=True ) — List of tf.Tensor of length config.n_layers , with
each tensor of shape (2, batch_size, num_heads, sequence_length,
embed_size_per_head) .

Contains pre-computed hidden-states (key and values in the attention blocks) that can be used
(see past_key_values input) to speed up sequential decoding.

• hidden_states ( tuple(tf.Tensor) , optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True ) — Hidden-states of the model at the
output of each layer plus the initial embedding outputs.

Base class for outputs of sentence classification models.
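
As a minimal sketch of what the past_key_values field enables, here shown with TFGPT2LMHeadModel since sequential decoding is most natural for the language-modeling head (the caching mechanism is the same one exposed by this output class):

import tensorflow as tf
from transformers import AutoTokenizer, TFGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = TFGPT2LMHeadModel.from_pretrained("openai-community/gpt2")

inputs = tokenizer("Hello, my dog", return_tensors="tf")
out = model(**inputs, use_cache=True)

# For the next step, feed only the newly chosen token plus the cached key/value states,
# instead of re-running the model over the whole prefix.
next_token = tf.argmax(out.logits[:, -1, :], axis=-1, output_type=tf.int32)[:, None]
out = model(input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)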

TFGPT2Tokenizer

class transformers.TFGPT2Tokenizer < source >

( vocab: Dict, merges: List, max_length: int = None, pad_token_id: int = None )

Parameters

• vocab (Dict[str, int]) — Vocabulary dict for Byte Pair Tokenizer

• merges (List[str]) — Merges list for Byte Pair Tokenizer

This is an in-graph tokenizer for GPT2. It should be initialized similarly to other tokenizers, using
the from_pretrained() method. It can also be initialized with the from_tokenizer()
method, which imports settings from an existing standard tokenizer object.

In-graph tokenizers, unlike other Hugging Face tokenizers, are actually Keras layers and are
designed to be run when the model is called, rather than during preprocessing. As a result, they
have somewhat more limited options than standard tokenizer classes. They are most useful
when you want to create an end-to-end model that goes straight from tf.string inputs to
outputs.
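
A minimal end-to-end sketch, assuming the EOS token id (50256) is reused for padding and an arbitrary max_length of 16 so the in-graph tokenizer emits dense, padded tensors, going from tf.string inputs to logits:

import tensorflow as tf
from transformers import AutoTokenizer, TFGPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
# Assumptions: pad with the EOS token id and truncate/pad to an arbitrary max_length of 16.
tf_tokenizer = TFGPT2Tokenizer.from_tokenizer(tokenizer, pad_token_id=50256, max_length=16)
model = TFGPT2LMHeadModel.from_pretrained("openai-community/gpt2")

texts = tf.constant(["Hello, my dog is cute"])
tokenized = tf_tokenizer(texts)  # in-graph: returns dense input_ids and attention_mask
logits = model(
    input_ids=tokenized["input_ids"], attention_mask=tokenized["attention_mask"]
).logits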

from_config < source >

( config )

Parameters

• config (Dict) — Dictionary with the keys returned by get_config .

Creates TFGPT2Tokenizer from configurations
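
For example, a small round-trip sketch (get_config is the standard Keras serialization hook referenced above):

from transformers import AutoTokenizer, TFGPT2Tokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tf_tokenizer = TFGPT2Tokenizer.from_tokenizer(tokenizer)

config = tf_tokenizer.get_config()              # plain dict (vocab, merges, ...)
restored = TFGPT2Tokenizer.from_config(config)  # rebuild an equivalent in-graph tokenizer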

from_pretrained < source >

( pretrained_model_name_or_path: Union, *init_inputs, **kwargs )


Parameters

• pretrained_model_name_or_path (Union[str, os.PathLike]) — Path to pretrained model

Creates TFGPT2Tokenizer from pretrained GPT2Tokenizer

Examples:

from transformers import TFGPT2Tokenizer

tf_tokenizer = TFGPT2Tokenizer.from_pretrained("openai-community/gpt2")

from_tokenizer < source >

( tokenizer: GPT2Tokenizer, *args, **kwargs )

Parameters

• tokenizer (GPT2Tokenizer) —

Creates TFGPT2Tokenizer from GPT2Tokenizer

Examples:

from transformers import AutoTokenizer, TFGPT2Tokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tf_tokenizer = TFGPT2Tokenizer.from_tokenizer(tokenizer)

JAX Hide JAX content

FlaxGPT2Model

class transformers.FlaxGPT2Model < source >

( config: GPT2Config, input_shape: Tuple = (1, 1), seed: int = 0, dtype: dtype
= <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data type of the
computation. Can be one of jax.numpy.float32 , jax.numpy.float16 (on GPUs) and
jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs or
TPUs. If specified, all the computation will be performed with the given dtype .

Note that this only specifies the dtype of the computation and does not influence the dtype
of model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
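
For example, a short sketch of half-precision inference (whether bfloat16 helps depends on the hardware):

import jax.numpy as jnp
from transformers import FlaxGPT2Model

# Run the computation in bfloat16; the stored parameters stay float32 unless cast as well.
model = FlaxGPT2Model.from_pretrained("openai-community/gpt2", dtype=jnp.bfloat16)
model.params = model.to_bf16(model.params)  # optionally also cast the parameters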

The bare GPT2 Model transformer outputting raw hidden-states without any specific head on
top.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).

This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer
to the Flax documentation for all matter related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

• Just-In-Time (JIT) compilation (a jitted forward pass is sketched after this list)
• Automatic Differentiation
• Vectorization
• Parallelization
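
A small sketch of the first point, jitting the forward pass (the tokenizer call stays outside the compiled function):

import jax
from transformers import AutoTokenizer, FlaxGPT2Model

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = FlaxGPT2Model.from_pretrained("openai-community/gpt2")

@jax.jit
def forward(input_ids, attention_mask):
    # The model parameters are closed over; only the array arguments are traced.
    return model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
hidden_states = forward(inputs["input_ids"], inputs["attention_mask"])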

__call__ < source >

( input_ids, attention_mask = None, position_ids = None,
encoder_hidden_states: Optional = None, encoder_attention_mask: Optional =
None, params: dict = None, past_key_values: dict = None, dropout_rng: PRNGKey
= None, train: bool = False, output_attentions: Optional = None,
output_hidden_states: Optional = None, return_dict: Optional = None ) →
transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttenti
ons or tuple(torch.FloatTensor)

Parameters

• input_ids ( numpy.ndarray of shape (batch_size, input_ids_length) ) —
input_ids_length = sequence_length . Indices of input sequence tokens in the
vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and
PreTrainedTokenizer.call() for details.

What are input IDs?

• attention_mask ( numpy.ndarray of shape (batch_size, sequence_length) , optional)
— Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1] :

• 1 for tokens that are not masked,
• 0 for tokens that are masked.

The FlaxGPT2PreTrainedModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, FlaxGPT2Model

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = FlaxGPT2Model.from_pretrained("openai-community/gpt2")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="jax")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

FlaxGPT2LMHeadModel
class transformers.FlaxGPT2LMHeadModel < source >

( config: GPT2Config, input_shape: Tuple = (1, 1), seed: int = 0, dtype: dtype
= <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

• config (GPT2Config) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the from_pretrained() method to load the model weights.

• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data type of the
computation. Can be one of jax.numpy.float32 , jax.numpy.float16 (on GPUs) and
jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs or
TPUs. If specified, all the computation will be performed with the given dtype .

Note that this only specifies the dtype of the computation and does not influence the dtype
of model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

The GPT2 Model transformer with a language modeling head on top (linear layer with weights
tied to the input embeddings).

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving,
resizing the input embeddings, pruning heads, etc.).

This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer
to the Flax documentation for all matter related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

• Just-In-Time (JIT) compilation
• Automatic Differentiation
• Vectorization
• Parallelization

__call__ < source >

( input_ids, attention_mask = None, position_ids = None,
encoder_hidden_states: Optional = None, encoder_attention_mask: Optional =
None, params: dict = None, past_key_values: dict = None, dropout_rng: PRNGKey
= None, train: bool = False, output_attentions: Optional = None,
output_hidden_states: Optional = None, return_dict: Optional = None ) →
transformers.modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions or
tuple(torch.FloatTensor)

Parameters

• input_ids ( numpy.ndarray of shape (batch_size, input_ids_length) ) —
input_ids_length = sequence_length . Indices of input sequence tokens in the
vocabulary.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and
PreTrainedTokenizer.call() for details.

What are input IDs?

• attention_mask ( numpy.ndarray of shape (batch_size, sequence_length) , optional)
— Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1] :

• 1 for tokens that are not masked,
• 0 for tokens that are masked.

The FlaxGPT2PreTrainedModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should
call the Module instance afterwards instead of this since the former takes care of running
the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = FlaxGPT2LMHeadModel.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(**inputs)

>>> # retrieve logits for next token
>>> next_token_logits = outputs.logits[:, -1]
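
A hypothetical continuation of the example above, greedily picking the most likely next token (for full generation, the generate() method is the usual route):

>>> import jax.numpy as jnp

>>> next_token_id = int(jnp.argmax(next_token_logits, axis=-1)[0])
>>> print(tokenizer.decode([next_token_id]))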
