--- |
|
|
library_name: transformers |
|
|
datasets: |
|
|
- fanlino/lol-champion-qa |
|
|
language: |
|
|
- ko |
|
|
base_model: |
|
|
- google/gemma-2-2b-it |
|
|
--- |
|
|
|
|
|
# Model Card for gemma-2-2b-it Fine-Tuned on League of Legends Champion Q&A (Korean)
|
|
|
|
|
A Korean-language question-answering model for League of Legends champion lore, fine-tuned from google/gemma-2-2b-it on the fanlino/lol-champion-qa dataset.
|
|
|
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
|
|
|
|
|
This model is a fine-tuned version of google/gemma-2-2b-it designed to answer questions about champions from the online game League of Legends. It was trained on a custom dataset built from champion stories and lore, and it generates its responses in Korean. A minimal usage sketch follows the model details below.
|
|
|
|
|
- **Developed by:** Dohyun Kim, Jongbong Lee, Jaehoon Kim |
|
|
- **Model type:** LLM Finetuned Model |
|
|
- **Language(s) (NLP):** Korean |
|
|
- **Finetuned from model:** google/gemma-2-2b-it
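
## How to Get Started with the Model

The snippet below is a minimal inference sketch. The model id is a placeholder because this card does not state the final repository name, and the sample question and generation settings are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with this model's actual Hub id or a local checkpoint path.
model_id = "<this-model-repo-or-local-path>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Questions use the Gemma 2 chat format (see Training Procedure below).
messages = [{"role": "user", "content": "아트록스가 태어난 곳은 어디인가?"}]  # "Where was Aatrox born?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```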
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
|
|
The dataset was created by scraping champion lore from the official League of Legends Universe site and converting the content into a Q&A format with the help of large language models. The resulting dataset is available at [fanlino/lol-champion-qa](https://huggingface.co/datasets/fanlino/lol-champion-qa). The scraping step is reproduced below.
|
|
|
|
|
```python
import csv

import requests
from bs4 import BeautifulSoup
|
|
# List of champions |
|
|
champions = [ |
|
|
"aatrox", "ahri", "akali", "akshan", "alistar", "amumu", "anivia", "annie", "aphelios", "ashe", |
|
|
"aurelionsol", "azir", "bard", "belveth", "blitzcrank", "brand", "braum", "caitlyn", "camille", |
|
|
"cassiopeia", "chogath", "corki", "darius", "diana", "drmundo", "draven", "ekko", "elise", |
|
|
"evelynn", "ezreal", "fiddlesticks", "fiora", "fizz", "galio", "gangplank", "garen", "gnar", |
|
|
"gragas", "graves", "gwen", "hecarim", "heimerdinger", "illaoi", "irelia", "ivern", "janna", |
|
|
"jarvaniv", "jax", "jayce", "jhin", "jinx", "kaisa", "kalista", "karma", "karthus", "kassadin", |
|
|
"katarina", "kayle", "kayn", "kennen", "khazix", "kindred", "kled", "kogmaw", "leblanc", "leesin", |
|
|
"leona", "lillia", "lissandra", "lucian", "lulu", "lux", "malphite", "malzahar", "maokai", |
|
|
"masteryi", "milio", "missfortune", "mordekaiser", "morgana", "naafiri", "nami", "nasus", |
|
|
"nautilus", "neeko", "nidalee", "nilah", "nocturne", "nunu", "olaf", "orianna", "ornn", |
|
|
"pantheon", "poppy", "pyke", "qiyana", "quinn", "rakan", "rammus", "reksai", "rell", "renataglasc", |
|
|
"renekton", "rengar", "riven", "rumble", "ryze", "samira", "sejuani", "senna", "seraphine", "sett", |
|
|
"shaco", "shen", "shyvana", "singed", "sion", "sivir", "skarner", "sona", "soraka", "swain", |
|
|
"sylas", "syndra", "tahmkench", "taliyah", "talon", "taric", "teemo", "thresh", "tristana", |
|
|
"trundle", "tryndamere", "twistedfate", "twitch", "udyr", "urgot", "varus", "vayne", "veigar", |
|
|
"velkoz", "vex", "vi", "viego", "viktor", "vladimir", "volibear", "warwick", "monkeyking", "xayah", |
|
|
"xerath", "xinzhao", "yasuo", "yone", "yorick", "yuumi", "zac", "zed", "ziggs", "zilean", "zoe", "zyra" |
|
|
] |
|
|
|
|
|
print(f"The total number of champions: {len(champions)}") |
|
|
|
|
|
# Base URL for the champion story in Korean |
|
|
base_url = "https://universe.leagueoflegends.com/ko_KR/story/champion/" |
|
|
|
|
|
# Function to scrape the Korean name and background story of a champion |
|
|
def scrape_champion_data(champion): |
|
|
url = base_url + champion + "/" |
|
|
response = requests.get(url) |
|
|
|
|
|
if response.status_code == 200: |
|
|
soup = BeautifulSoup(response.content, 'html.parser') |
|
|
|
|
|
# Extract the Korean name from the <title> tag |
|
|
korean_name = soup.find('title').text.split('-')[0].strip() |
|
|
|
|
|
# Extract the background story from the meta description |
|
|
meta_description = soup.find('meta', {'name': 'description'}) |
|
|
if meta_description: |
|
|
background_story = meta_description.get('content').replace('\n', ' ').strip() |
|
|
else: |
|
|
background_story = "No background story available" |
|
|
|
|
|
return korean_name, background_story |
|
|
else: |
|
|
return None, None |
|
|
|
|
|
# Open the CSV file for writing |
|
|
with open("champion_bs.csv", "w", newline='', encoding='utf-8') as csvfile: |
|
|
# Define the column headers |
|
|
fieldnames = ['url-name', 'korean-name', 'background-story'] |
|
|
|
|
|
# Create a CSV writer object |
|
|
writer = csv.DictWriter(csvfile, fieldnames=fieldnames) |
|
|
|
|
|
# Write the header |
|
|
writer.writeheader() |
|
|
|
|
|
# Scrape data for each champion and write to CSV |
|
|
for champion in champions: |
|
|
korean_name, background_story = scrape_champion_data(champion) |
|
|
if korean_name and background_story: |
|
|
writer.writerow({ |
|
|
'url-name': champion, |
|
|
'korean-name': korean_name, |
|
|
'background-story': background_story |
|
|
}) |
|
|
print(f"Scraped data for {champion}: {korean_name}") |
|
|
else: |
|
|
print(f"Failed to scrape data for {champion}") |
|
|
|
|
|
print("Data scraping complete. Saved to champion_bs.csv") |
|
|
``` |
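
The scraped stories were then turned into question-answer pairs with the help of a large language model, as mentioned above. The exact model, prompt, and post-processing are not documented here, so the sketch below only illustrates the general shape of that step; the `openai` client, model name, prompt wording, and JSON output format are all assumptions.

```python
import csv
import json

from openai import OpenAI  # assumes an OpenAI-compatible endpoint and API key are configured

client = OpenAI()

def story_to_qa(korean_name: str, story: str, n_pairs: int = 5) -> list[dict]:
    """Ask an LLM to turn one champion's lore into Korean Q&A pairs (illustrative only)."""
    prompt = (
        f"The following is the background story of the League of Legends champion "
        f"{korean_name}, written in Korean:\n\n{story}\n\n"
        f"Write {n_pairs} question-answer pairs about this story, in Korean. "
        'Respond with a JSON array of objects with keys "q" and "a".'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# Build Q&A rows from the CSV produced by the scraping script above.
qa_rows = []
with open("champion_bs.csv", encoding="utf-8") as csvfile:
    for row in csv.DictReader(csvfile):
        qa_rows.extend(story_to_qa(row["korean-name"], row["background-story"]))
```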
|
|
|
|
|
|
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
|
|
**Environment Setup** |
|
|
|
|
|
The model was fine-tuned with QLoRA to keep memory usage and compute requirements low. The environment loads the base model with 4-bit quantization via torch, transformers, and bitsandbytes, and LoRA (Low-Rank Adaptation) adapters are applied to selected layers of the model to adapt it to the task.
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig |
|
|
|
|
|
model_id = "google/gemma-2-2b-it" |
|
|
|
|
|
bnb_config = BitsAndBytesConfig( |
|
|
load_in_4bit=True, |
|
|
bnb_4bit_use_double_quant=True, |
|
|
bnb_4bit_quant_type="nf4", |
|
|
bnb_4bit_compute_dtype=torch.bfloat16 |
|
|
) |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
    quantization_config=bnb_config,
|
|
device_map="auto", |
|
|
    attn_implementation="eager",  # eager attention is recommended for Gemma 2
|
|
) |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) |
|
|
``` |
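
One step that the excerpt above does not show, but that is commonly added when fine-tuning a 4-bit model with LoRA, is preparing the quantized model for k-bit training (it enables gradient checkpointing and upcasts a few layers for numerical stability). Treat the call below as an optional, common-practice addition rather than part of the documented procedure.

```python
from peft import prepare_model_for_kbit_training

# Common QLoRA preparation step: enables gradient checkpointing and casts
# layer norms (and the output head) to higher precision for stable training.
model = prepare_model_for_kbit_training(model)
```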
|
|
|
|
|
**QLoRA Setting** |
|
|
|
|
|
```python |
|
|
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model
|
|
|
|
|
def find_linear_layers(model): |
|
|
linear_layers = set() |
|
|
for name, module in model.named_modules(): |
|
|
if isinstance(module, bnb.nn.Linear4bit): |
|
|
names = name.split('.') |
|
|
layer_name = names[-1] |
|
|
if layer_name != 'lm_head': |
|
|
linear_layers.add(layer_name) |
|
|
return list(linear_layers) |
|
|
|
|
|
lora_target_modules = find_linear_layers(model) |
|
|
|
|
|
lora_config = LoraConfig( |
|
|
r=64, |
|
|
lora_alpha=32, |
|
|
target_modules=lora_target_modules, |
|
|
lora_dropout=0.05, |
|
|
bias="none", |
|
|
task_type="CAUSAL_LM" |
|
|
) |
|
|
|
|
|
model = get_peft_model(model, lora_config) |
|
|
``` |
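
As a quick sanity check after attaching the adapters, you can print how many parameters are actually trainable; with LoRA only a small fraction of the 2B weights should be updated.

```python
# Reports trainable vs. total parameters; only the LoRA adapter weights should be trainable.
model.print_trainable_parameters()
```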
|
|
|
|
|
**Loading Training Datasets** |
|
|
|
|
|
To prepare the training data, the champion stories were converted into a question-answer format. The dataset was structured using a chat-style template to ensure compatibility with the Gemma 2 model's architecture.
|
|
|
|
|
```python
import pandas as pd
from datasets import Dataset
|
|
data = [ |
|
|
{ "q": "λλΆλΆμ νλ©Έμκ° μκ³ μλ νμ€ μ°¨μμ 무μμΈκ°?", "a": "λλΆλΆμ νλ©Έμλ λ¬Όμ§ μΈκ³λΌλ νλμ νμ€ μ°¨μλ§ μκ³ μλ€." }, |
|
|
{ "q": "μ€λ‘λΌκ° μ λμμ μ λ³΄λΈ κ³³μ μ΄λμΈκ°?", "a": "μ€λ‘λΌλ λΈλ€Όλ λΆμ‘±μ κ³ ν₯μ΄μ μΈλ΄ λ§μμΈ μ무μ°μμ μ λμμ μ 보λλ€." },
|
|
{ "q": "μ€λ‘λΌκ° μμ μ μ΄ν΄ν΄μ€ μ μΌν κ°μ‘± ꡬμ±μμ λꡬμΈκ°?", "a": "μ€λ‘λΌμ μ΄λͺ¨ν λ¨Έλ νλΆμ°κ° μ€λ‘λΌλ₯Ό μ§μ¬μΌλ‘ λ°μλ€μλ€." }, |
|
|
...] |
|
|
|
|
|
qa_df = pd.DataFrame(data, columns=["q", "a"]) |
|
|
dataset = Dataset.from_pandas(qa_df)
|
|
``` |
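
Equivalently, the published dataset can be loaded directly from the Hugging Face Hub instead of being rebuilt in memory. The split and column names below are assumptions based on the in-memory format above; check the dataset card for the exact layout.

```python
from datasets import load_dataset

# Assumed split and column names ("train", "q", "a"); see the dataset card to confirm.
dataset = load_dataset("fanlino/lol-champion-qa", split="train")
print(dataset[0])
```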
|
|
|
|
|
We use Gemma 2's chat format template.
|
|
|
|
|
|
|
|
```
|
|
<start_of_turn>user |
|
|
{Question}<end_of_turn>
|
|
<start_of_turn>model |
|
|
{Answer} |
|
|
<end_of_turn> |
|
|
``` |
|
|
|
|
|
We then write a function that applies this template to each example in the dataset.
|
|
|
|
|
```python |
|
|
def format_chat_prompt(example): |
|
|
chat_data = [ |
|
|
{"role": "user", "content": example["q"]}, |
|
|
{"role": "assistant", "content": example["a"]} |
|
|
] |
|
|
example["text"] = tokenizer.apply_chat_template(chat_data, tokenize=False) |
|
|
return example |
|
|
|
|
|
dataset = dataset.map(format_chat_prompt, num_proc=4) |
|
|
``` |
|
|
|
|
|
Applying the template to an example produces text like the following.
|
|
``` |
|
|
<bos> |
|
|
<start_of_turn>user |
|
|
μνΈλ‘μ€κ° νμ΄λ κ³³μ μ΄λμΈκ°?<end_of_turn> |
|
|
<start_of_turn>model |
|
|
μνΈλ‘μ€λ μ리λ§μμ νμ΄λ¬λ€.<end_of_turn>
|
|
``` |
|
|
|
|
|
**Training Model** |
|
|
|
|
|
The model was then trained using the SFTTrainer class, with settings such as a batch size of 1, 10 gradient accumulation steps, and 10 epochs. The optimizer used was paged_adamw_32bit. |
|
|
|
|
|
```python |
|
|
from transformers import TrainingArguments
|
|
from trl import SFTTrainer |
|
|
|
|
|
# Training arguments |
|
|
training_args = TrainingArguments( |
|
|
    output_dir=OUTPUT_MODEL_PATH,  # OUTPUT_MODEL_PATH is defined elsewhere in the notebook
|
|
per_device_train_batch_size=1, # steps_per_epoch = ceil(total_samples / (batch_size * gradient_accumulation_steps)) |
|
|
gradient_accumulation_steps=10, # total_samples means len(dataset) |
|
|
num_train_epochs=10, |
|
|
learning_rate=2e-4, |
|
|
fp16=False, |
|
|
bf16=False, |
|
|
logging_steps=len(dataset)//10, |
|
|
optim="paged_adamw_32bit", |
|
|
logging_dir="./logs", |
|
|
save_strategy="epoch", |
|
|
evaluation_strategy="no", |
|
|
do_eval=False, |
|
|
group_by_length=True, |
|
|
report_to="none" |
|
|
) |
|
|
|
|
|
# Initialize trainer |
|
|
trainer = SFTTrainer( |
|
|
model=model, |
|
|
train_dataset=dataset, |
|
|
peft_config=lora_config, |
|
|
dataset_text_field="text", |
|
|
max_seq_length=512, |
|
|
tokenizer=tokenizer, |
|
|
args=training_args, |
|
|
packing=False, |
|
|
) |
|
|
|
|
|
# Train the model |
|
|
trainer.train() |
|
|
``` |
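
After training finishes, the LoRA adapter weights can be saved alongside the tokenizer. This is standard PEFT/TRL usage rather than something shown in the original notebook; to obtain a standalone checkpoint, the adapter would additionally need to be merged into a full-precision copy of the base model.

```python
# Save the (small) LoRA adapter weights and the tokenizer for later reuse.
trainer.model.save_pretrained(OUTPUT_MODEL_PATH)
tokenizer.save_pretrained(OUTPUT_MODEL_PATH)
```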
|
|
|
|
|
**Testing Model** |
|
|
|
|
|
We created a helper function that wraps a question in the chat format and generates a response.
|
|
|
|
|
|
|
|
```python |
|
|
def generate_response(prompt, model, tokenizer, temperature=0.1): |
|
|
formatted_prompt=f"""<start_of_turn>user |
|
|
{prompt}<end_of_turn> |
|
|
<start_of_turn>model |
|
|
""" |
|
|
inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda") |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=256, |
|
|
do_sample=temperature > 0, |
|
|
temperature=temperature |
|
|
) |
|
|
return tokenizer.decode(outputs[0], skip_special_tokens=False) |
|
|
``` |
|
|
|
|
|
**Question** |
|
|
```python |
|
|
prompt = "μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νμ΄?"
|
|
response = generate_response(prompt, model, tokenizer) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
**Expected Answer**
|
|
``` |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μμ μ§ν€κΈ° μν΄ ν μ μλ κ²μ 무μμ΄λ ν΄μ£Όλ¦¬λΌ μ½μνλ€. |
|
|
``` |
|
|
|
|
|
**Result (Fine-tuned Model)**
|
|
``` |
|
|
<bos><start_of_turn>user |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νμ΄?<end_of_turn>
|
|
<start_of_turn>model |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μμ μ§ν€κΈ° μν΄ ν μ μλ κ²μ 무μμ΄λ ν΄μ£Όλ¦¬λΌ μ½μνλ€.<end_of_turn> |
|
|
``` |
|
|
|
|
|
|
|
|
**Result (Base Model)**
|
|
``` |
|
|
<bos><start_of_turn>user |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νμ΄?<end_of_turn>
|
|
<start_of_turn>model |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μνν**λ¬΄μ¨ μ½μμ νλμ§**μ λν μ 보λ μμ§ μλ €μ§μ§ μμμ΅λλ€.
|
|
|
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννμ½μμ νλμ§μ λν μ΄μΌκΈ°λ λͺ κ°μ§ μ ννλ λ°κ³Ό κ΄λ ¨λ κ²μΌλ‘ 보μλλ€.
|
|
|
|
|
* **μμ°λ λ¦¬μ¨ μ:** μ΄κ²μ 2023λ1μμ μΆμλ μμ°λ λ¦¬μ¨ μμ μ΄λ¦μλλ€.
|
|
* **μ‘°μ΄:** μ΄κ²μ 2023λ1μμ μΆμλ μμ°λ λ¦¬μ¨ μμ μ΄λ¦μλλ€.
|
|
|
|
|
μ΄λ¬ν λ°λ€μ ν₯λ―Έλ‘μ§λ§, μ€μ λ‘ μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νλμ§μ λν μ νν μ 보λ μμ§ μλ €μ§μ§ μμμ΅λλ€.
|
|
|
|
|
|
|
|
<end_of_turn> |
|
|
``` |
|
|
|
|
|
In contrast, the base model's response was vague and partly hallucinated, highlighting the improvement gained from fine-tuning.
|
|
|
|
|
#### Summary |
|
|
|
|
|
The code discussed above can be found at the following link: [lol_lore.ipynb](https://github.com/star-bits/mlb-gemma/blob/main/lol_lore.ipynb) |