
Craw4LLM

This repo contains the code for the paper "Craw4LLM: Efficient Web Crawling for LLM Pretraining".

Prerequisites

  1. Request the ClueWeb22 dataset.
  2. Create a virtual environment with python >= 3.10 and install the following requirements (a minimal setup sketch follows this list):

numpy
tqdm
fasttext
pyyaml
wandb

  3. Download the DCLM fastText classifier to fasttext_scorers/.
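For reference, a minimal setup might look like the following (assuming the packages above are installed directly with pip; the environment name is an arbitrary example):

python3.10 -m venv .venv
source .venv/bin/activate
pip install numpy tqdm fasttext pyyaml wandb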

Important

To run the crawler efficiently, the ClueWeb22 data should be placed on an SSD.

Run the Crawler

To run a (simulated) crawl, first create a YAML configuration file under configs/, then run the following command:

python crawl.py crawl --config <path_to_your_config_file>
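For example, if the Craw4LLM configuration below is saved as configs/seed_10k_crawl_20m_dclm_fasttext.yaml (an illustrative file name, not one that necessarily ships with the repo):

python crawl.py crawl --config configs/seed_10k_crawl_20m_dclm_fasttext.yaml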

Craw4LLM

Create a yaml file in configs/ with the following content:

cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_dclm_fasttext
num_selected_docs_per_iter: 10000
num_workers: 16  # set to a number that fits your machine
save_state_every: -1  # set to a positive number to save the state (queue & visited set) of the crawler every certain steps
max_num_docs: 20000000
selection_method: dclm_fasttext_score
order: desc  # desc for descending, asc for ascending
wandb: true  # set to false to disable wandb logging
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_dclm_fasttext
rating_methods:
    - 
        type: length
    - 
        type: fasttext_score
        rater_name: dclm_fasttext_score
        model_path: fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin

Documents are scored by all scorers listed in rating_methods. The configuration above defines a length scorer, which scores a document by its length, and a fasttext_score scorer, which scores a document with the DCLM fastText model. The final ranking is determined by selection_method, here set to dclm_fasttext_score, the rater_name of the fasttext_score scorer.
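Conceptually, each iteration then keeps the top documents according to selection_method and order, up to num_selected_docs_per_iter. A minimal sketch of that selection step (illustrative only; the function and variable names are hypothetical, not the repo's actual API):

def select_docs(candidates, scores, selection_method="dclm_fasttext_score",
                order="desc", num_selected_docs_per_iter=10000):
    """Rank candidate doc ids by the named scorer and keep the top-k for this iteration.

    candidates: iterable of document ids currently in the crawl frontier
    scores: dict mapping doc_id -> {rater_name: score} for all rating_methods
    """
    ranked = sorted(
        candidates,
        key=lambda doc_id: scores[doc_id][selection_method],
        reverse=(order == "desc"),  # "desc" keeps the highest-scoring documents
    )
    return ranked[:num_selected_docs_per_iter]

With order: asc, the same sketch would instead keep the lowest-scoring documents.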

Baseline Crawlers

Random Crawler

cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_random
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: random_score
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_random
rating_methods:
    - 
        type: random_score

Indegree-based Crawler

cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_indegree
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: inlink_count
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_indegree
rating_methods:
    - 
        type: inlink_count

Pretraining and Evaluation

After running the crawler, the ids of the crawled documents are written to the output_dir specified in the configuration file. Run the following command to fetch the document texts:

python fetch_docs.py --input_dir <document_ids_dir> --output_dir <document_texts_dir> --num_workers <num_workers>
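For example, with the output_dir from the Craw4LLM configuration above (the texts directory name is just an illustrative choice):

python fetch_docs.py --input_dir crawl_results/seed_10k_crawl_20m_dclm_fasttext --output_dir crawl_results/seed_10k_crawl_20m_dclm_fasttext_texts --num_workers 16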

Then you can use the DCLM framework to run LLM pretraining and evaluation.

Miscellaneous

Browse the Data

Run the following command to print a document and its outlinks given its id:

python access_data.py <path_to_clueweb22> <document_id>
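Document ids here are ClueWeb22 ids; an illustrative call might look like the following (the id below is a made-up example of the ClueWeb22 id format, not a real document):

python access_data.py <path_to_clueweb22> clueweb22-en0000-00-00000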
