--- |
|
|
library_name: transformers |
|
|
datasets: |
|
|
- fanlino/lol-champion-qa |
|
|
language: |
|
|
- ko |
|
|
base_model: |
|
|
- google/gemma-2-2b-it |
|
|
--- |
|
|
|
|
|
# Model Card for gemma-2-2b-it Fine-Tuned on League of Legends Champion Q&A (Korean)
|
|
|
|
|
A Korean-language question-answering model for League of Legends champion lore, fine-tuned from google/gemma-2-2b-it on the fanlino/lol-champion-qa dataset.
|
|
|
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
|
|
|
|
|
This model is a fine-tuned version of google/gemma-2-2b-it designed to answer questions about champions from the online game League of Legends. It was trained on a custom dataset built from champion stories and lore, and it generates its responses in Korean. A minimal usage sketch follows the model details below.
|
|
|
|
|
- **Developed by:** Dohyun Kim, Jongbong Lee, Jaehoon Kim |
|
|
- **Model type:** LLM Finetuned Model |
|
|
- **Language(s) (NLP):** Korean |
|
|
- **Finetuned from model:** google/gemma-2-2b-it
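
## How to Get Started with the Model

The snippet below is a minimal inference sketch. The model id is a placeholder because this card does not state the final repository name, and the sample question and generation settings are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with this model's actual Hub id or a local checkpoint path.
model_id = "<this-model-repo-or-local-path>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Questions use the Gemma 2 chat format (see Training Procedure below).
messages = [{"role": "user", "content": "아트록스가 태어난 곳은 어디인가?"}]  # "Where was Aatrox born?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```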
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
|
|
The dataset was created by scraping champion lore from the official League of Legends Universe site and converting the content into a Q&A format with the help of large language models. The resulting dataset is available at [fanlino/lol-champion-qa](https://huggingface.co/datasets/fanlino/lol-champion-qa). The scraping step is reproduced below.
|
|
|
|
|
```python
import csv

import requests
from bs4 import BeautifulSoup
|
|
# List of champions |
|
|
champions = [ |
|
|
"aatrox", "ahri", "akali", "akshan", "alistar", "amumu", "anivia", "annie", "aphelios", "ashe", |
|
|
"aurelionsol", "azir", "bard", "belveth", "blitzcrank", "brand", "braum", "caitlyn", "camille", |
|
|
"cassiopeia", "chogath", "corki", "darius", "diana", "drmundo", "draven", "ekko", "elise", |
|
|
"evelynn", "ezreal", "fiddlesticks", "fiora", "fizz", "galio", "gangplank", "garen", "gnar", |
|
|
"gragas", "graves", "gwen", "hecarim", "heimerdinger", "illaoi", "irelia", "ivern", "janna", |
|
|
"jarvaniv", "jax", "jayce", "jhin", "jinx", "kaisa", "kalista", "karma", "karthus", "kassadin", |
|
|
"katarina", "kayle", "kayn", "kennen", "khazix", "kindred", "kled", "kogmaw", "leblanc", "leesin", |
|
|
"leona", "lillia", "lissandra", "lucian", "lulu", "lux", "malphite", "malzahar", "maokai", |
|
|
"masteryi", "milio", "missfortune", "mordekaiser", "morgana", "naafiri", "nami", "nasus", |
|
|
"nautilus", "neeko", "nidalee", "nilah", "nocturne", "nunu", "olaf", "orianna", "ornn", |
|
|
"pantheon", "poppy", "pyke", "qiyana", "quinn", "rakan", "rammus", "reksai", "rell", "renataglasc", |
|
|
"renekton", "rengar", "riven", "rumble", "ryze", "samira", "sejuani", "senna", "seraphine", "sett", |
|
|
"shaco", "shen", "shyvana", "singed", "sion", "sivir", "skarner", "sona", "soraka", "swain", |
|
|
"sylas", "syndra", "tahmkench", "taliyah", "talon", "taric", "teemo", "thresh", "tristana", |
|
|
"trundle", "tryndamere", "twistedfate", "twitch", "udyr", "urgot", "varus", "vayne", "veigar", |
|
|
"velkoz", "vex", "vi", "viego", "viktor", "vladimir", "volibear", "warwick", "monkeyking", "xayah", |
|
|
"xerath", "xinzhao", "yasuo", "yone", "yorick", "yuumi", "zac", "zed", "ziggs", "zilean", "zoe", "zyra" |
|
|
] |
|
|
|
|
|
print(f"The total number of champions: {len(champions)}") |
|
|
|
|
|
# Base URL for the champion story in Korean |
|
|
base_url = "https://universe.leagueoflegends.com/ko_KR/story/champion/" |
|
|
|
|
|
# Function to scrape the Korean name and background story of a champion |
|
|
def scrape_champion_data(champion): |
|
|
url = base_url + champion + "/" |
|
|
response = requests.get(url) |
|
|
|
|
|
if response.status_code == 200: |
|
|
soup = BeautifulSoup(response.content, 'html.parser') |
|
|
|
|
|
# Extract the Korean name from the <title> tag |
|
|
korean_name = soup.find('title').text.split('-')[0].strip() |
|
|
|
|
|
# Extract the background story from the meta description |
|
|
meta_description = soup.find('meta', {'name': 'description'}) |
|
|
if meta_description: |
|
|
background_story = meta_description.get('content').replace('\n', ' ').strip() |
|
|
else: |
|
|
background_story = "No background story available" |
|
|
|
|
|
return korean_name, background_story |
|
|
else: |
|
|
return None, None |
|
|
|
|
|
# Open the CSV file for writing |
|
|
with open("champion_bs.csv", "w", newline='', encoding='utf-8') as csvfile: |
|
|
# Define the column headers |
|
|
fieldnames = ['url-name', 'korean-name', 'background-story'] |
|
|
|
|
|
# Create a CSV writer object |
|
|
writer = csv.DictWriter(csvfile, fieldnames=fieldnames) |
|
|
|
|
|
# Write the header |
|
|
writer.writeheader() |
|
|
|
|
|
# Scrape data for each champion and write to CSV |
|
|
for champion in champions: |
|
|
korean_name, background_story = scrape_champion_data(champion) |
|
|
if korean_name and background_story: |
|
|
writer.writerow({ |
|
|
'url-name': champion, |
|
|
'korean-name': korean_name, |
|
|
'background-story': background_story |
|
|
}) |
|
|
print(f"Scraped data for {champion}: {korean_name}") |
|
|
else: |
|
|
print(f"Failed to scrape data for {champion}") |
|
|
|
|
|
print("Data scraping complete. Saved to champion_bs.csv") |
|
|
``` |
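
The scraped stories were then turned into question-answer pairs with the help of a large language model, as mentioned above. The exact model, prompt, and post-processing are not documented here, so the sketch below only illustrates the general shape of that step; the `openai` client, model name, prompt wording, and JSON output format are all assumptions.

```python
import csv
import json

from openai import OpenAI  # assumes an OpenAI-compatible endpoint and API key are configured

client = OpenAI()

def story_to_qa(korean_name: str, story: str, n_pairs: int = 5) -> list[dict]:
    """Ask an LLM to turn one champion's lore into Korean Q&A pairs (illustrative only)."""
    prompt = (
        f"The following is the background story of the League of Legends champion "
        f"{korean_name}, written in Korean:\n\n{story}\n\n"
        f"Write {n_pairs} question-answer pairs about this story, in Korean. "
        'Respond with a JSON array of objects with keys "q" and "a".'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# Build Q&A rows from the CSV produced by the scraping script above.
qa_rows = []
with open("champion_bs.csv", encoding="utf-8") as csvfile:
    for row in csv.DictReader(csvfile):
        qa_rows.extend(story_to_qa(row["korean-name"], row["background-story"]))
```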
|
|
|
|
|
|
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
|
|
**Environment Setup** |
|
|
|
|
|
The model was fine-tuned with QLoRA to keep memory usage and compute requirements low. The environment loads the base model with 4-bit quantization via torch, transformers, and bitsandbytes, and LoRA (Low-Rank Adaptation) adapters are applied to selected layers of the model to adapt it to the task.
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig |
|
|
|
|
|
model_id = "google/gemma-2-2b-it" |
|
|
|
|
|
bnb_config = BitsAndBytesConfig( |
|
|
load_in_4bit=True, |
|
|
bnb_4bit_use_double_quant=True, |
|
|
bnb_4bit_quant_type="nf4", |
|
|
bnb_4bit_compute_dtype=torch.bfloat16 |
|
|
) |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
    quantization_config=bnb_config,
|
|
device_map="auto", |
|
|
    attn_implementation="eager",  # eager attention is recommended for Gemma 2
|
|
) |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) |
|
|
``` |
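
One step that the excerpt above does not show, but that is commonly added when fine-tuning a 4-bit model with LoRA, is preparing the quantized model for k-bit training (it enables gradient checkpointing and upcasts a few layers for numerical stability). Treat the call below as an optional, common-practice addition rather than part of the documented procedure.

```python
from peft import prepare_model_for_kbit_training

# Common QLoRA preparation step: enables gradient checkpointing and casts
# layer norms (and the output head) to higher precision for stable training.
model = prepare_model_for_kbit_training(model)
```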
|
|
|
|
|
**QLoRA Setting** |
|
|
|
|
|
```python |
|
|
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model
|
|
|
|
|
def find_linear_layers(model): |
|
|
linear_layers = set() |
|
|
for name, module in model.named_modules(): |
|
|
if isinstance(module, bnb.nn.Linear4bit): |
|
|
names = name.split('.') |
|
|
layer_name = names[-1] |
|
|
if layer_name != 'lm_head': |
|
|
linear_layers.add(layer_name) |
|
|
return list(linear_layers) |
|
|
|
|
|
lora_target_modules = find_linear_layers(model) |
|
|
|
|
|
lora_config = LoraConfig( |
|
|
r=64, |
|
|
lora_alpha=32, |
|
|
target_modules=lora_target_modules, |
|
|
lora_dropout=0.05, |
|
|
bias="none", |
|
|
task_type="CAUSAL_LM" |
|
|
) |
|
|
|
|
|
model = get_peft_model(model, lora_config) |
|
|
``` |
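
As a quick sanity check after attaching the adapters, you can print how many parameters are actually trainable; with LoRA only a small fraction of the 2B weights should be updated.

```python
# Reports trainable vs. total parameters; only the LoRA adapter weights should be trainable.
model.print_trainable_parameters()
```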
|
|
|
|
|
**Loading Training Datasets** |
|
|
|
|
|
To prepare the training data, the champion stories were converted into a question-answer format. The dataset was structured using a chat-style template to ensure compatibility with the Gemma 2 model's architecture.
|
|
|
|
|
```python
import pandas as pd
from datasets import Dataset
|
|
data = [ |
|
|
{ "q": "λλΆλΆμ νλ©Έμκ° μκ³ μλ νμ€ μ°¨μμ 무μμΈκ°?", "a": "λλΆλΆμ νλ©Έμλ λ¬Όμ§ μΈκ³λΌλ νλμ νμ€ μ°¨μλ§ μκ³ μλ€." }, |
|
|
{ "q": "μ€λ‘λΌκ° μ λμμ μ λ³΄λΈ κ³³μ μ΄λμΈκ°?", "a": "μ€λ‘λΌλ λΈλ€Όλ λΆμ‘±μ κ³ ν₯μ΄μ μΈλ΄ λ§μμΈ μ무μ°μμ μ λμμ μ 보λλ€." },
|
|
{ "q": "μ€λ‘λΌκ° μμ μ μ΄ν΄ν΄μ€ μ μΌν κ°μ‘± ꡬμ±μμ λꡬμΈκ°?", "a": "μ€λ‘λΌμ μ΄λͺ¨ν λ¨Έλ νλΆμ°κ° μ€λ‘λΌλ₯Ό μ§μ¬μΌλ‘ λ°μλ€μλ€." }, |
|
|
...] |
|
|
|
|
|
qa_df = pd.DataFrame(data, columns=["q", "a"]) |
|
|
dataset = Dataset.from_pandas(qa_df)
|
|
``` |
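
Equivalently, the published dataset can be loaded directly from the Hugging Face Hub instead of being rebuilt in memory. The split and column names below are assumptions based on the in-memory format above; check the dataset card for the exact layout.

```python
from datasets import load_dataset

# Assumed split and column names ("train", "q", "a"); see the dataset card to confirm.
dataset = load_dataset("fanlino/lol-champion-qa", split="train")
print(dataset[0])
```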
|
|
|
|
|
We use Gemma 2's chat format template.
|
|
|
|
|
|
|
|
```
|
|
<start_of_turn>user |
|
|
{Question}<end_of_turn>
|
|
<start_of_turn>model |
|
|
{Answer} |
|
|
<end_of_turn> |
|
|
``` |
|
|
|
|
|
We then write a function that applies this template to each example in the dataset.
|
|
|
|
|
```python |
|
|
def format_chat_prompt(example): |
|
|
chat_data = [ |
|
|
{"role": "user", "content": example["q"]}, |
|
|
{"role": "assistant", "content": example["a"]} |
|
|
] |
|
|
example["text"] = tokenizer.apply_chat_template(chat_data, tokenize=False) |
|
|
return example |
|
|
|
|
|
dataset = dataset.map(format_chat_prompt, num_proc=4) |
|
|
``` |
|
|
|
|
|
Applying the template to an example produces text like the following.
|
|
``` |
|
|
<bos> |
|
|
<start_of_turn>user |
|
|
μνΈλ‘μ€κ° νμ΄λ κ³³μ μ΄λμΈκ°?<end_of_turn> |
|
|
<start_of_turn>model |
|
|
μνΈλ‘μ€λ μ리λ§μμ νμ΄λ¬λ€.<end_of_turn>
|
|
``` |
|
|
|
|
|
**Training Model** |
|
|
|
|
|
The model was then trained using the SFTTrainer class, with settings such as a batch size of 1, 10 gradient accumulation steps, and 10 epochs. The optimizer used was paged_adamw_32bit. |
|
|
|
|
|
```python |
|
|
from transformers import TrainingArguments
|
|
from trl import SFTTrainer |
|
|
|
|
|
# Training arguments |
|
|
training_args = TrainingArguments( |
|
|
    output_dir=OUTPUT_MODEL_PATH,  # OUTPUT_MODEL_PATH is defined elsewhere in the notebook
|
|
per_device_train_batch_size=1, # steps_per_epoch = ceil(total_samples / (batch_size * gradient_accumulation_steps)) |
|
|
gradient_accumulation_steps=10, # total_samples means len(dataset) |
|
|
num_train_epochs=10, |
|
|
learning_rate=2e-4, |
|
|
fp16=False, |
|
|
bf16=False, |
|
|
logging_steps=len(dataset)//10, |
|
|
optim="paged_adamw_32bit", |
|
|
logging_dir="./logs", |
|
|
save_strategy="epoch", |
|
|
evaluation_strategy="no", |
|
|
do_eval=False, |
|
|
group_by_length=True, |
|
|
report_to="none" |
|
|
) |
|
|
|
|
|
# Initialize trainer |
|
|
trainer = SFTTrainer( |
|
|
model=model, |
|
|
train_dataset=dataset, |
|
|
peft_config=lora_config, |
|
|
dataset_text_field="text", |
|
|
max_seq_length=512, |
|
|
tokenizer=tokenizer, |
|
|
args=training_args, |
|
|
packing=False, |
|
|
) |
|
|
|
|
|
# Train the model |
|
|
trainer.train() |
|
|
``` |
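
After training finishes, the LoRA adapter weights can be saved alongside the tokenizer. This is standard PEFT/TRL usage rather than something shown in the original notebook; to obtain a standalone checkpoint, the adapter would additionally need to be merged into a full-precision copy of the base model.

```python
# Save the (small) LoRA adapter weights and the tokenizer for later reuse.
trainer.model.save_pretrained(OUTPUT_MODEL_PATH)
tokenizer.save_pretrained(OUTPUT_MODEL_PATH)
```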
|
|
|
|
|
**Testing Model** |
|
|
|
|
|
We created a helper function that wraps a question in the chat format and generates a response.
|
|
|
|
|
|
|
|
```python |
|
|
def generate_response(prompt, model, tokenizer, temperature=0.1): |
|
|
formatted_prompt=f"""<start_of_turn>user |
|
|
{prompt}<end_of_turn> |
|
|
<start_of_turn>model |
|
|
""" |
|
|
inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda") |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=256, |
|
|
do_sample=temperature > 0, |
|
|
temperature=temperature |
|
|
) |
|
|
return tokenizer.decode(outputs[0], skip_special_tokens=False) |
|
|
``` |
|
|
|
|
|
**Question** |
|
|
```python |
|
|
prompt = "μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νμ΄?"
|
|
response = generate_response(prompt, model, tokenizer) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
**Expected Answer**
|
|
``` |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μμ μ§ν€κΈ° μν΄ ν μ μλ κ²μ 무μμ΄λ ν΄μ£Όλ¦¬λΌ μ½μνλ€. |
|
|
``` |
|
|
|
|
|
**Result (Fine-tuned Model)**
|
|
``` |
|
|
<bos><start_of_turn>user |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νμ΄?<end_of_turn>
|
|
<start_of_turn>model |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μμ μ§ν€κΈ° μν΄ ν μ μλ κ²μ 무μμ΄λ ν΄μ£Όλ¦¬λΌ μ½μνλ€.<end_of_turn> |
|
|
``` |
|
|
|
|
|
|
|
|
**Result (Base Model)**
|
|
``` |
|
|
<bos><start_of_turn>user |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νμ΄?<end_of_turn>
|
|
<start_of_turn>model |
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μνν**λ¬΄μ¨ μ½μμ νλμ§**μ λν μ 보λ μμ§ μλ €μ§μ§ μμμ΅λλ€.
|
|
|
|
|
μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννμ½μμ νλμ§μ λν μ΄μΌκΈ°λ λͺ κ°μ§ μ ννλ λ°κ³Ό κ΄λ ¨λ κ²μΌλ‘ 보μλλ€.
|
|
|
|
|
* **μμ°λ λ¦¬μ¨ μ:** μ΄κ²μ 2023λ1μμ μΆμλ μμ°λ λ¦¬μ¨ μμ μ΄λ¦μλλ€.
|
|
* **μ‘°μ΄:** μ΄κ²μ 2023λ1μμ μΆμλ μμ°λ λ¦¬μ¨ μμ μ΄λ¦μλλ€.
|
|
|
|
|
μ΄λ¬ν λ°λ€μ ν₯λ―Έλ‘μ§λ§, μ€μ λ‘ μ‘°μ΄λ μμ°λ λ¦¬μ¨ μννλ¬΄μ¨ μ½μμ νλμ§μ λν μ νν μ 보λ μμ§ μλ €μ§μ§ μμμ΅λλ€.
|
|
|
|
|
|
|
|
<end_of_turn> |
|
|
``` |
|
|
|
|
|
In contrast, the base model's response was vague and partly hallucinated, highlighting the improvement gained from fine-tuning.
|
|
|
|
|
#### Summary |
|
|
|
|
|
The code discussed above can be found at the following link: [lol_lore.ipynb](https://github.com/star-bits/mlb-gemma/blob/main/lol_lore.ipynb) |