
This tutorial introduces LLM finetuning with Hugging Face.

0. Preparation

- Set up the AWS CLI

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html

- Set up a Python environment with Hugging Face Transformers

https://huggingface.co/docs/transformers/ja/installation

https://hub.docker.com/r/huggingface/transformers-pytorch-gpu
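If you want a quick sanity check of the environment (a minimal sketch; package versions and GPU availability depend on your setup), you can run:

```python
# Quick environment check: verify that PyTorch, Transformers, and Datasets are
# installed and report whether a CUDA-capable GPU is visible.
import torch
import transformers
import datasets

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("CUDA available:", torch.cuda.is_available())
```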

1. Download the dataset

Here, we use only one file.

```bash
aws s3 cp s3://abeja-cc-ja/common_crawl_01.jsonl data/
```

If you want to use all of the data, it is recommended to use the `aws s3 sync` command instead (for example, `aws s3 sync s3://abeja-cc-ja/ data/`).

2. Check the dataset

You can inspect the data after downloading it:

```bash
head -n 1 data/common_crawl_01.jsonl
```

You should see output like the following:

```json
{"url":"https:\/\/xxxxx/","content":"### 今日のご飯\n こんにちは、今日のご飯についてご紹介します。..."}
```
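Each line is a JSON object with `url` and `content` fields. As a minimal sketch (assuming the file was downloaded to `data/` as above), you can also inspect a record in Python:

```python
import json

# Read the first record and show its keys and a preview of the Japanese text.
with open("data/common_crawl_01.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(record.keys())            # dict_keys(['url', 'content'])
print(record["content"][:200])  # first 200 characters of the article text
```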

3. Training

```python
import json

import torch
from datasets import Dataset
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Function to load a JSONL file into a `datasets.Dataset`
def load_jsonl_dataset(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    data = [json.loads(line) for line in lines]
    return Dataset.from_list(data)

# Path to the dataset file
# For a small-scale check, we sampled the data with
# `head -n 100000 common_crawl_01.jsonl > sampled_common_crawl_01.jsonl`
train_file_path = 'sampled_common_crawl_01.jsonl'

# Load the dataset
train_dataset = load_jsonl_dataset(train_file_path)

# Define tokenizer and model
model_name = 'abeja/gpt2-large-japanese'  # Example using a Japanese GPT-2 model
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
model.to(device)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse the EOS token

# Preprocess the dataset
def preprocess_function(examples):
    # Tokenize the text and truncate/pad to the maximum length
    return tokenizer(examples['content'], truncation=True, padding='max_length', max_length=128)

train_dataset = train_dataset.map(preprocess_function, batched=True)

# Data collator for causal language modeling (mlm=False);
# it also builds the `labels` field from `input_ids` so the Trainer can compute the loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define training arguments
# (no eval_dataset is passed to the Trainer, so evaluation is left disabled)
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    dataloader_pin_memory=False,
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

# Train the model
trainer.train()
```
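After training finishes, you will usually want to save the finetuned model. The following is a minimal sketch (the output directory name and the prompt are just examples):

```python
# Save the finetuned model and tokenizer (directory name is an example)
output_dir = "./finetuned-gpt2-ja"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# Quick generation check with the finetuned model
prompt = "今日のご飯"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```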

Because recent LLMs are so large, distributed training across multiple machine nodes has become common. For distributed training, libraries such as Megatron-LM and DeepSpeed are used.

However, even in such cases, this dataset should be just as easy to handle.
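For example, when working with the full set of JSONL files, you can stream them with the `datasets` library instead of loading everything into memory. This is a minimal sketch that assumes the files follow the `common_crawl_XX.jsonl` naming and were synced into `data/`:

```python
from datasets import load_dataset

# Stream the JSONL files lazily instead of loading them all into memory
# (the glob pattern is an assumption based on the file naming above).
streamed = load_dataset(
    "json",
    data_files="data/common_crawl_*.jsonl",
    split="train",
    streaming=True,
)

# Peek at a few records
for example in streamed.take(3):
    print(example["url"])
```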








