ArtusDev committed (verified) · Commit d161a9c · Parent: cf831d8

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,227 @@
---
base_model:
- mistralai/Mistral-Nemo-Base-2407
license: apache-2.0
tags:
- writing
- creative-writing
---

# Koto 22B (Pretrained)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/cnBQlWjMTKGLOKMudPBVj.png)

Koto-22B-PT is a [depth-upscaled](https://arxiv.org/abs/2312.15166) version of Mistral-Nemo-Base-2407, healed and trained on almost a billion tokens of creative writing data.

## Usage

This model is not intended for use outside of raw text completion settings, such as co-writing. Instruct formats will *not* work. Multi-turn roleplay will *not* work.

It was trained at a 32k sequence length, but since not all samples were that long, we expect ~16k of effective context in the best case.

We found that a temperature of 1.5-1.55 and a min_p of 0.05-0.1 worked best, but YMMV!

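As a rough sketch of the settings above, raw completion with those samplers might look like this in 🤗 Transformers. The model path is a placeholder for wherever you keep the weights, and `min_p` needs a reasonably recent Transformers release.

```python
# Minimal completion-mode sketch (assumptions: a local path to the weights,
# a Transformers version with min_p sampling, and enough VRAM for bf16).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/Koto-22B-PT"  # placeholder, not a canonical repo id
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

# Plain prose prompt: no chat template, since this is a completion-only model.
prompt = "The lighthouse keeper had not spoken to another soul in three winters."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.5,  # card suggests 1.5-1.55
    min_p=0.05,       # card suggests 0.05-0.1
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
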
## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- thank you to [@takeshimaxfj](https://x.com/takeshimaxfj) on Twitter for drawing the art used in the model card!
- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
- thanks to curse for testing and ideas
- thanks to toasty for some of the data and ideas
- thanks to everyone else in allura for moral support

ilya <3

## Technical Appendix

<details>

### Training Notes

This model was trained over the course of ~14 hours on an 8xB200 node. We used 8-bit AdamW and the REX LR scheduler, as well as both gradient clipping and weight decay for regularization.

There *was* a very odd loss spike ~60% of the way through training, but it recovered and the model seems fine? So? Eh? If it works, it works :3

### WandB

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/6XFFhkQD8lUFGerBrOAyd.png)

### Finetuning Notes

ChatML tokens have already been added to this model, in case you prefer to tune with that chat format. Please do not re-add them, so that the vocab size stays unchanged for (possible) usage on hosts like Featherless.

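For reference, ChatML wraps each turn in the `<|im_start|>` and `<|im_end|>` tokens mentioned above. A small sketch of how a downstream finetune's prompts would render:

```python
# Reference only: how ChatML-formatted text looks. The base Koto-22B-PT model
# is completion-only; this matters only if you finetune with the ChatML format.
def render_chatml(messages):
    """Render a list of {'role': ..., 'content': ...} dicts as ChatML text."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(render_chatml([
    {"role": "system", "content": "You are a co-writing assistant."},
    {"role": "user", "content": "Continue the scene at the lighthouse."},
]))
```
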
### Axolotl Config
```yaml
## model
base_model: allura-forge/nemo-upscaled-2
#tokenizer_use_mistral_common: true

## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: dataset_preparedss
val_set_size: 0.0
output_dir: ./Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: NeMo-Upscale
wandb_entity:
wandb_watch:
wandb_name: Pretrain-22B
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: rex
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero3_bf16.json
weight_decay: 0.0025
fsdp:
fsdp_config:
special_tokens:
  pad_token: <pad>
```

### Mergekit Config
```yaml
dtype: bfloat16
merge_method: passthrough

slices:
  # untouched intro
  - sources:
      - layer_range: [0, 8]
        model: mistralai/Mistral-Nemo-Base-2407

  - sources:
      - layer_range: [8, 12]
        model: mistralai/Mistral-Nemo-Base-2407
  # 8–16 baseline
  - sources:
      - layer_range: [8, 16]
        model: mistralai/Mistral-Nemo-Base-2407
  # 8–16 duplicate with projections nulled
  - sources:
      - layer_range: [8, 16]
        model: mistralai/Mistral-Nemo-Base-2407
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  # 16–24 duplicate
  - sources:
      - layer_range: [16, 24]
        model: mistralai/Mistral-Nemo-Base-2407
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
  # 16–24 baseline
  - sources:
      - layer_range: [16, 24]
        model: mistralai/Mistral-Nemo-Base-2407
  # 16–24 duplicate
  - sources:
      - layer_range: [16, 24]
        model: mistralai/Mistral-Nemo-Base-2407
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  # 24–32 baseline
  - sources:
      - layer_range: [24, 32]
        model: mistralai/Mistral-Nemo-Base-2407
  # 24–32 duplicate
  - sources:
      - layer_range: [24, 32]
        model: mistralai/Mistral-Nemo-Base-2407
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  # untouched tail
  - sources:
      - layer_range: [32, 40]
        model: mistralai/Mistral-Nemo-Base-2407
```
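As a sanity check, these slices stack 8 + 4 + (8 × 8) = 76 layers, up from Mistral-Nemo's 40, which matches the `num_hidden_layers: 76` reported in config.json below.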

</details>
config.json ADDED
@@ -0,0 +1,37 @@
{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 76,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.54.1",
  "use_cache": false,
  "vocab_size": 131074,
  "quantization_config": {
    "quant_method": "exl3",
    "version": "0.0.5",
    "bits": 4.5,
    "head_bits": 6,
    "calibration": {
      "rows": 100,
      "cols": 2048
    },
    "out_scales": "auto"
  }
}
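As a rough sanity check on the "22B" in the model name, the shapes above are enough for a back-of-the-envelope parameter count (assuming standard Mistral-style GQA attention and SwiGLU MLP weights, and ignoring norm parameters):

```python
# Back-of-the-envelope parameter count from the config values above.
hidden, layers, inter = 5120, 76, 14336
heads, kv_heads, head_dim, vocab = 32, 8, 128, 131074

attn = hidden * heads * head_dim          # q_proj
attn += 2 * hidden * kv_heads * head_dim  # k_proj + v_proj
attn += heads * head_dim * hidden         # o_proj
mlp = 3 * hidden * inter                  # gate_proj + up_proj + down_proj
embeddings = 2 * vocab * hidden           # untied input + output embeddings

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")  # ~22.1B
```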
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "transformers_version": "4.54.1",
  "use_cache": false
}
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:399b609ebfcc6511f2717ac34cfbb57cd32e6641096e76531a293e5b947d37d6
size 8559303976
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2ea78f97c4d8016ff26b7102912bf4dd386d490aa46a9cceeebcf5ec3727cde4
size 4957196440
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
quantization_config.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:48130f8c042761b84abbfbf10ad07efa7c26108a14e7a2a0402daa06e447a47a
size 17078668
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff