---
language: ko
tags:
- causal-lm
- byteetm
license: mit
datasets:
- roneneldan/TinyStories
- HAERAE-HUB/KOREAN-WEBTEXT
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.8
    top_k: 200
inference_providers:
- cpu
- gpu
- t4
- a10g
library_name: transformers
widget:
- text: "오늘은 날씨가"
---

# ByteETM-Korean: Small Byte-Level Text Decoder LM

- A 133 MB byte-level causal LM trained on Korean web text.
- Training data: roneneldan/TinyStories and a subset of HAERAE-HUB/KOREAN-WEBTEXT.
- Final validation perplexity on HAERAE-HUB/KOREAN-WEBTEXT ≈ 3.4 (see the perplexity sketch below).

## Example

```python
# %% ByteETM inference (byte-based generation)
import torch
from transformers import AutoModelForCausalLM

# 1️⃣ Load the model
repo_id = "idah4/byteetm-korean-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True
).to(device).eval()

# 2️⃣ Byte-level encoder / decoder: the "tokenizer" is just UTF-8 bytes
def encode_bytes(text: str) -> torch.Tensor:
    return torch.tensor([list(text.encode("utf-8"))], dtype=torch.long, device=device)

def decode_bytes(ids: torch.Tensor) -> str:
    # Drop ids outside the byte range (e.g. special tokens) before decoding;
    # errors="ignore" discards a trailing sequence cut off mid-character.
    seq = [i for i in ids.tolist() if 0 <= i < 256]
    return bytes(seq).decode("utf-8", errors="ignore")

# 3️⃣ Text generation
@torch.no_grad()
def generate_text(prompt: str, max_new_tokens=200, temperature=0.8, top_k=200):
    input_ids = encode_bytes(prompt)
    out = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # required: temperature/top_k are ignored under greedy decoding
        temperature=temperature,
        top_k=top_k,
    )
    return decode_bytes(out[0])

# 4️⃣ Demo
prompt = "오늘은 날씨가 좋아서"
print(generate_text(prompt, max_new_tokens=150, temperature=0.9, top_k=150))
```
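
## Fallback: manual sampling loop

If the custom model class behind `trust_remote_code` does not wire up `model.generate`, the same top-k/temperature sampling can be done by hand. The sketch below is a minimal, hypothetical fallback: it assumes the model follows the standard causal-LM forward contract (`model(ids).logits` with shape `(batch, seq_len, vocab_size)`) and reuses the `encode_bytes` / `decode_bytes` helpers from the example above.

```python
# Minimal manual top-k / temperature sampling, one byte per step.
# Assumption: model(ids).logits follows the standard causal-LM contract
# with shape (batch, seq_len, vocab_size); verify against the model class.
@torch.no_grad()
def sample_bytes(prompt: str, max_new_tokens=150, temperature=0.9, top_k=150):
    ids = encode_bytes(prompt)
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :] / temperature  # next-byte logits
        k = min(top_k, logits.size(-1))
        vals, idx = torch.topk(logits, k)                   # keep the k best bytes
        probs = torch.softmax(vals, dim=-1)
        next_id = idx.gather(-1, torch.multinomial(probs, 1))
        ids = torch.cat([ids, next_id], dim=-1)
    return decode_bytes(ids[0])

print(sample_bytes("오늘은 날씨가 좋아서"))
```

Because generation proceeds byte by byte, a single Hangul syllable takes three sampling steps (it is 3 bytes in UTF-8); `decode_bytes` silently drops a final character that was cut off mid-sequence.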
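
## Measuring byte-level perplexity

The reported validation perplexity can be reproduced in spirit from the model's built-in loss. This is a sketch, assuming the custom model class computes the usual shifted cross-entropy when `labels` are passed (as standard `transformers` causal LMs do); the exact held-out split behind ppl ≈ 3.4 is not specified in this card.

```python
import math

# Byte-level perplexity = exp(mean negative log-likelihood per byte).
# Assumption: passing labels=ids yields the standard shifted CE loss.
@torch.no_grad()
def byte_perplexity(text: str) -> float:
    ids = encode_bytes(text)       # (1, num_bytes)
    out = model(ids, labels=ids)   # HF causal LMs shift labels internally
    return math.exp(out.loss.item())

print(byte_perplexity("오늘은 날씨가 맑고 바람이 선선하다."))
```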