A 0.6B parameter draft (speculative decoding) model for use with Kimi-K2-Instruct.
See Kimi-K2-Instruct-DRAFT-0.6B-v3.0-GGUF for the models in GGUF format for use with llama.cpp.
Extending the context above 32k
The current config.json is set for a context length of up to 32k tokens. To enable YaRN for longer contexts, add a "rope_scaling" section to config.json, e.g.:
To extend the context to 64k:
"max_position_embeddings": 65536,
...
"rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},
To extend the context to 128k:
"max_position_embeddings": 131072,
...
"rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},
NOTE: Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when you actually need to process long contexts.
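Both examples above follow the same pattern: the YaRN factor is the target context length divided by the original 32768. Here is a minimal Python sketch of applying that to an already-loaded config.json. The helper name add_yarn_scaling is hypothetical and not part of any library.

```python
import json


def add_yarn_scaling(config: dict, target_ctx: int, original_ctx: int = 32768) -> dict:
    """Return a copy of `config` with a YaRN rope_scaling section for `target_ctx` tokens.

    Hypothetical helper for illustration only; it just reproduces the
    hand-edited fragments shown in the text.
    """
    if target_ctx <= original_ctx:
        raise ValueError("target_ctx must be larger than the original context length")
    patched = dict(config)
    patched["max_position_embeddings"] = target_ctx
    patched["rope_scaling"] = {
        "factor": target_ctx / original_ctx,
        "original_max_position_embeddings": original_ctx,
        "type": "yarn",
    }
    return patched


# Stub config standing in for the real config.json contents.
example = add_yarn_scaling({"max_position_embeddings": 32768}, 65536)
print(json.dumps(example, indent=2))
```

Given the note above about static-YaRN, it is probably safest to keep an unpatched copy of config.json for short-context use.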
How this model was created
1. The initial model was created from Qwen2.5-0.5B-Instruct using transplant-vocab:
python ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./Kimi-K2-Instruct \
    ./Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED \
    --trust-remote-code \
    --override "[BOS]" "<|endoftext|>" \
    --override "[EOS]" "<|im_end|>" \
    --override "<|im_end|>" "<|im_end|>" \
    --override "<|im_user|>" "<|im_start|>user" \
    --override "<|im_assistant|>" "<|im_start|>assistant" \
    --override "<|start_header_id|>" "<|im_start|>" \
    --override "<|end_header_id|>" "<|im_end|>" \
    --override "[EOT]" "<|endoftext|>" \
    --override "<|im_system|>" "<|im_start|>system" \
    --override "<|tool_calls_section_begin|>" "<tool_call>" \
    --override "<|tool_calls_section_end|>" "</tool_call>" \
    --override "<|tool_call_begin|>" "<tool_call>" \
    --override "<|tool_call_argument_begin|>" "<tool_call>" \
    --override "<|tool_call_end|>" "</tool_call>" \
    --override "<|im_middle|>" "\\n" \
    --override "[UNK]" "<|endoftext|>" \
    --override "[PAD]" "<|endoftext|>"
Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Kimi-K2-Instruct'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Kimi-K2-Instruct'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'... Done.
Input model configuration:
- Target vocabulary size : 163840 (used = 163840, unused = 0)
- Donor vocabulary size : 151936
- Donor num layers : 24 (tied embeddings = True)
- Donor hidden size : 896
- Donor attention heads : 14
- Donor intermediate size : 4864 (ratio = 1:5.4)
- Donor total parameters : 494032768 (0.49B)
-- Embedding parameters : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)
Processing 3 automatic token overrides:
✔ 'bos_token_id' : 163584 '[BOS]' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 163585 '[EOS]' → [151645] '<|im_end|>'
✔ 'pad_token_id' : 163839 '[PAD]' → [151643] '<|endoftext|>'
Processing 17 manual token overrides:
✔ 163584 : '[BOS]' → [151643] '<|endoftext|>'
✔ 163585 : '[EOS]' → [151645] '<|im_end|>'
✔ 163586 : '<|im_end|>' → [151645] '<|im_end|>'
✔ 163587 : '<|im_user|>' → [151644, 872] '<|im_start|>user'
✔ 163588 : '<|im_assistant|>' → [151644, 77091] '<|im_start|>assistant'
✔ 163590 : '<|start_header_id|>' → [151644] '<|im_start|>'
✔ 163591 : '<|end_header_id|>' → [151645] '<|im_end|>'
✔ 163593 : '[EOT]' → [151643] '<|endoftext|>'
✔ 163594 : '<|im_system|>' → [151644, 8948] '<|im_start|>system'
✔ 163595 : '<|tool_calls_section_begin|>' → [151657] '<tool_call>'
✔ 163596 : '<|tool_calls_section_end|>' → [151658] '</tool_call>'
✔ 163597 : '<|tool_call_begin|>' → [151657] '<tool_call>'
✔ 163598 : '<|tool_call_argument_begin|>' → [151657] '<tool_call>'
✔ 163599 : '<|tool_call_end|>' → [151658] '</tool_call>'
✔ 163601 : '<|im_middle|>' → [198] '\n'
✔ 163838 : '[UNK]' → [151643] '<|endoftext|>'
✔ 163839 : '[PAD]' → [151643] '<|endoftext|>'
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...
Transplanting tokens: 100%|████████████████████████████████████████████████████████████| 163840/163840 [01:08<00:00, 2406.47token/s]
Transplant mappings:
- 1 to 1 : 95449 (58%)
- 2 to 1 : 61938 (38%)
- 3 to 1 : 4995 (3%)
- 4 to 1 : 980 (0.6%)
- 5 to 1 : 147 (0.09%)
- 6 to 1 : 52 (0.032%)
- 7 to 1 : 15 (0.0092%)
- 8 to 1 : 17 (0.01%)
- 9 to 1 : 2 (0.0012%)
- 10 to 1 : 5 (0.0031%)
- 11 to 1 : 1 (0.00061%)
- 13 to 1 : 239 (0.15%)
Head initialized with:
- Copies : 95449 (58%)
- Means : 68391 (42%)
- Zeros : 0 (0%)
Output model configuration:
- Output vocabulary size : 163840
- Output num layers : 24 (tied embeddings = False)
- Output hidden size : 896
- Output attention heads : 14
- Output intermediate size : 4864 (ratio = 1:5.4)
- Output total parameters : 651499392 (0.65B)
-- Embedding parameters : 293601280 (0.29B)
-- Non-embedding parameters : 357898112 (0.36B)
Saving model and tokenizer to 'Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED' folder
[2025-08-07 15:47:15,620] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Patching 'torch_dtype' in 'Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype
Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
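The figures in the log above can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the log. It shows that the jump from 0.49B to 0.65B parameters comes entirely from the larger, untied embedding matrices, and that every target token not mapped 1 to 1 received a mean-initialised head row.

```python
# All constants are copied from the transplant log; nothing here is measured.
hidden_size = 896
donor_vocab = 151936
target_vocab = 163840
non_embedding = 357898112

# Donor model: tied embeddings, so a single vocab x hidden matrix is counted.
donor_embedding = donor_vocab * hidden_size
donor_total = donor_embedding + non_embedding

# Output model: untied, so embed_tokens and lm_head are counted separately.
output_embedding = 2 * target_vocab * hidden_size
output_total = output_embedding + non_embedding

# "N to 1" counts from the log: N donor tokens per target token.
mappings = {1: 95449, 2: 61938, 3: 4995, 4: 980, 5: 147, 6: 52,
            7: 15, 8: 17, 9: 2, 10: 5, 11: 1, 13: 239}
copies = mappings[1]                     # 1-to-1 mappings are plain copies
means = sum(mappings.values()) - copies  # everything else is reported as a mean

print(donor_total, output_total, copies, means)
```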
NOTE: Due to the non-standard tokenizer, this needs the --trust-remote-code option.
NOTE: I had to manually delete "pad_token_id": 163839 from config.json to get it to match the tokenizer when used as a draft model in llama.cpp.
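The manual deletion can also be scripted. This is a small sketch, assuming config.json is plain JSON with "pad_token_id" as a top-level key. The function name drop_pad_token_id is hypothetical.

```python
import json
from pathlib import Path


def drop_pad_token_id(config_path):
    """Remove "pad_token_id" from a config.json file and return the removed value.

    Hypothetical helper; returns None if the key was not present.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    removed = config.pop("pad_token_id", None)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return removed


# Usage (path is an example, adjust to your output folder):
# drop_pad_token_id("Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED/config.json")
```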
2. The following datasets were used to create a fine-tuning dataset of ~2.3B tokens:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct (NOTE: output field only)
formatted just between [EOS] tags.
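One plausible reading of "formatted just between [EOS] tags" is that each raw text is wrapped in [EOS] markers with no chat template applied. The sketch below illustrates that reading only. The function names are hypothetical, and the training pipeline may well insert the tags itself.

```python
EOS = "[EOS]"


def wrap_sample(text: str) -> str:
    """Place one raw training text between [EOS] tags (no chat template).

    Assumption: this is one interpretation of the formatting described
    in the text, not a confirmed reproduction of the author's script.
    """
    return f"{EOS}{text}{EOS}"


def wrap_instruct_record(record: dict) -> str:
    """For Everything_Instruct-style records, keep only the "output" field."""
    return wrap_sample(record["output"])


# Invented record, for illustration only.
sample = wrap_instruct_record({"instruction": "Say hi", "output": "Hello!"})
print(sample)
```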
3. The model was then trained using qlora-pipe-lite for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step):
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'models/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED'
output_dir = 'finetuned'
# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================
full_fine_tune = true
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 5e-5
# ======================
# TRAINING CONFIGURATION
# ======================
sequence_len = 32768
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step
# =====================
# DATASET CONFIGURATION
# =====================
[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
drop_tails = true
[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
drop_tails = true
[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
drop_tails = true
NOTE: Due to the non-standard tokenizer, the --trust-remote-code option needs to be passed on the deepspeed call to train.py.
I used six RTX A6000 GPUs over three nodes, hence the batch size of 60 (6 GPUs x 10 gradient accumulation steps = 60).
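The batch-size arithmetic can be spelled out as follows. This sketch assumes one 32k sequence per GPU per micro-step, which is what the 6 x 10 = 60 figure implies. The step count is only a rough estimate, since the dataset size is approximate and drop_tails discards partial sequences.

```python
num_gpus = 6
gradient_accumulation_steps = 10
sequence_len = 32768
dataset_tokens = 2.3e9  # "~2.3B tokens" from the text, approximate

# Assumption: one sequence per GPU per micro-step.
batch_size = num_gpus * gradient_accumulation_steps
tokens_per_step = batch_size * sequence_len

# Rough number of optimizer steps in one epoch.
approx_steps = dataset_tokens / tokens_per_step

print(f"batch size      : {batch_size}")
print(f"tokens per step : {tokens_per_step:,}")
print(f"approx. steps   : {approx_steps:.0f}")
```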
4. Fixing the TikToken / SentencePiece tokenizer mismatch in llama.cpp
I had to temporarily hack this change into convert_hf_to_gguf.py:
@ModelBase.register("Qwen2Model", "Qwen2ForCausalLM", "Qwen2AudioForConditionalGeneration")
class Qwen2Model(TextModel):
    model_arch = gguf.MODEL_ARCH.QWEN2
    #def set_vocab(self):
    #    try:
    #        self._set_vocab_sentencepiece()
    #    except FileNotFoundError:
    #        self._set_vocab_gpt2()
    def set_vocab(self):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
        tokpre = self.get_vocab_base_pre(tokenizer)
        # Build merges list using the approach similar to HunYuanMoE
        merges = []
        vocab = {}
        mergeable_ranks = tokenizer.model._mergeable_ranks
        for token, rank in mergeable_ranks.items():
            vocab[QwenModel.token_bytes_to_string(token)] = rank
            if len(token) == 1:
                continue
            merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
            if len(merged) == 2:
                merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))
        # Build token list
        vocab_size = self.hparams["vocab_size"]
        special_tokens = tokenizer.special_tokens
        reverse_vocab = {id_ : encoded_tok for encoded_tok, id_ in {**vocab, **special_tokens}.items()}
        tokens: list[str] = []
        toktypes: list[int] = []
        for i in range(vocab_size):
            if i not in reverse_vocab:
                tokens.append(f"[PAD{i}]")
                toktypes.append(gguf.TokenType.UNUSED)
            else:
                token = reverse_vocab[i]
                tokens.append(token)
                if i in special_tokens.values():
                    toktypes.append(gguf.TokenType.CONTROL)
                else:
                    toktypes.append(gguf.TokenType.NORMAL)
        self.gguf_writer.add_tokenizer_model("gpt2")
        self.gguf_writer.add_tokenizer_pre(tokpre)
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)
        self.gguf_writer.add_token_merges(merges)
        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
        special_vocab.add_to_gguf(self.gguf_writer)
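The key step in the hack above is recovering a GPT-2 style merges list from TikToken's rank table. To show the idea in isolation, here is a self-contained sketch that reimplements the rank-limited BPE split on an invented toy vocabulary. It mirrors the logic of the QwenModel.bpe helper as I understand it, and should be read as an approximation rather than the exact llama.cpp code.

```python
def bpe(mergeable_ranks, token, max_rank=None):
    """Split `token` (bytes) by greedily applying merges with rank < max_rank."""
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = min_rank = None
        # Find the adjacent pair with the lowest merge rank.
        for i, (left, right) in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(left + right)
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx, min_rank = i, rank
        # Stop when nothing can be merged, or only merges at/after max_rank remain.
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


def recover_merges(mergeable_ranks):
    """Return the (left, right) pair that produces each multi-byte token."""
    merges = []
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # single bytes are base symbols, not merges
        pair = bpe(mergeable_ranks, token, max_rank=rank)
        if len(pair) == 2:
            merges.append(pair)
    return merges


# Toy rank table, invented for illustration.
toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
merges = recover_merges(toy_ranks)
print(merges)
```

Limiting the split to merges ranked below the token's own rank is what forces each token to stop at exactly the two pieces that were merged to create it.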
This then let me run:
~/llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Kimi-K2-Instruct-DRAFT-0.6B-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B
and then it quantized OK:
~/llama.cpp/build/bin/llama-quantize Kimi-K2-Instruct-DRAFT-0.6B-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B-Q4_0.gguf Q4_0 44
