To create a model that performs well, you need to train it using a specific set of variables
called parameters. The process of finding the ideal values for these parameters is called
training: the model learns its parameter values through successive training iterations.
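To make the idea of parameters and training iterations concrete, here is a toy sketch in Python (invented data, a single-parameter model, and plain gradient descent; nothing here resembles how GPT-3 is actually trained): the parameter value improves a little on each iteration until it settles near its ideal value.

```python
# Toy sketch only: fitting a single parameter w by gradient descent,
# to show how a parameter value is refined over successive iterations.
# The data and learning rate are invented for the example.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # y is roughly 2 * x

w = 0.0              # the parameter, starting from an arbitrary value
learning_rate = 0.01

for iteration in range(200):                 # successive training iterations
    grad = 0.0
    for x, y in data:
        error = w * x - y                    # prediction error on one sample
        grad += 2 * error * x                # gradient of squared error w.r.t. w
    w -= learning_rate * grad / len(data)    # nudge w toward a better value

print(f"learned parameter: {w:.3f}")         # settles near the ideal value ~2.0
```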
A deep learning model takes a lot of time to find these ideal parameters. Training is a
lengthy process that, depending on the task, can last from a few hours to a few months and
requires a tremendous amount of computing power. Reusing some of that long learning
process for other tasks would help significantly, and this is where pre-trained models
come in.
In keeping with Gladwell’s 10,000-hours theory, a pre-trained model is like the first skill
you develop: it helps you acquire other skills faster. For example, mastering the craft of
solving math problems can allow you to acquire the skill of solving engineering problems faster.
A pre-trained model is trained (by you or someone else) on a more general task and can
then be fine-tuned for different tasks. Instead of creating a brand-new model to address
your problem, you can start from a pre-trained model that has already been trained on a
more general one. The pre-trained model can be adapted to your specific needs by
providing additional training on a tailored dataset, a process called fine-tuning. This
approach is faster and more efficient than building a model from scratch and typically
yields better performance.
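For a concrete feel of the workflow, here is a minimal sketch assuming the open source Hugging Face transformers library; the checkpoint name, labels, and example texts are placeholders for illustration, not the setup used for GPT-3. A small pre-trained model is loaded and given a little additional training on a tiny, tailored dataset.

```python
# A minimal sketch of the pre-train + fine-tune idea, assuming the
# Hugging Face `transformers` library. The checkpoint name, labels, and
# texts below are placeholders, not OpenAI's or the book's setup.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"          # a small pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# The tailored dataset: a handful of labeled examples for the specific task.
texts = ["I loved this product", "Terrible experience, would not recommend"]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)
train_dataset = [
    {"input_ids": encodings["input_ids"][i],
     "attention_mask": encodings["attention_mask"][i],
     "labels": labels[i]}
    for i in range(len(texts))
]

# Additional training on top of the pre-trained weights, i.e. fine-tuning.
args = TrainingArguments(output_dir="finetuned-model", num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```

The general language knowledge already lives in the pre-trained weights; the few lines above only nudge them toward the narrower task.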
In machine learning, a model is trained on a dataset. The size and type of data samples
vary depending on the task you want to solve. GPT-3 is pre-trained on a corpus of text
from five datasets: Common Crawl, WebText2, Books1, Books2, and Wikipedia.
Common Crawl
The Common Crawl corpus comprises petabytes of data, including raw web page
data, metadata, and text data collected over eight years of web crawling. OpenAI
researchers use a curated, filtered version of this dataset.
WebText2
WebText2 is an expanded version of the WebText dataset, an internal OpenAI
corpus created by scraping particularly high-quality web pages. To vet for quality,
the authors scraped all outbound links from Reddit that received at least three
karma (an indicator of whether other users found the link interesting, educational,
or just funny). WebText contains 40 gigabytes of text from these 45 million links,
spanning over 8 million documents.
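As a hypothetical illustration of that filtering rule (invented data, not OpenAI's collection pipeline), the snippet below keeps only links whose submissions reached the karma threshold.

```python
# Hypothetical illustration of the WebText-style quality filter described
# above: keep only outbound links whose Reddit submissions earned at least
# three karma. The data structure and values are invented for the example.
submissions = [
    {"url": "https://example.com/long-read", "karma": 12},
    {"url": "https://example.org/low-effort-post", "karma": 1},
]

quality_links = [s["url"] for s in submissions if s["karma"] >= 3]
print(quality_links)  # only community-vetted links are kept for scraping
```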