---
license: mit
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- gpt2
- historical
- london
- slm
- small-language-model
- text-generation
- history
- english
- safetensors
---

# London Historical LLM – Small Language Model (SLM)

A compact GPT-2 Small model (~117M parameters) **trained from scratch** on historical London texts (1500–1850). It is fast to run on CPU and also supports NVIDIA (CUDA) and AMD (ROCm) GPUs.

> **Note**: This model was **trained from scratch**, not fine-tuned from an existing model.

> This page includes simple **virtual-env setup**, **install choices for CPU/CUDA/ROCm**, and an **auto-device inference** example so anyone can get going quickly.

---

## 🔎 Model Description

This is the **Small Language Model (SLM)** version of the London Historical LLM, **trained from scratch** with the GPT-2 Small architecture on historical London texts and a custom historical tokenizer, rather than fine-tuned from an existing model.

### Key Features
- ~117M parameters (vs ~354M in the full model)
- Custom historical tokenizer (≈30k vocab)
- London-specific context awareness and historical language patterns (e.g., *thou, thee, hath*)
- Lower memory footprint and faster inference on commodity hardware
- **Trained from scratch**, not fine-tuned from existing models

---

## 🧪 Intended Use & Limitations

**Use cases:** historical-style narrative generation, prompt-based exploration of London themes (1500–1850), creative writing aids.
**Limitations:** may produce anachronisms or historically inaccurate statements, and a model of this size reasons less reliably than larger LLMs. Validate outputs before downstream use.

---

## 🐍 Set up a virtual environment (Linux/macOS/Windows)

> Virtual environments isolate project dependencies. Official Python docs: `venv`.

**Check Python & pip**
```bash
# Linux/macOS
python3 --version && python3 -m pip --version
```

```powershell
# Windows (PowerShell)
python --version; python -m pip --version
```

**Create the env**

```bash
# Linux/macOS
python3 -m venv helloLondon
```

```powershell
# Windows (PowerShell)
python -m venv helloLondon
```

```cmd
:: Windows (Command Prompt)
python -m venv helloLondon
```

> **Note**: You can name your virtual environment anything you like, e.g., `.venv`, `my_env`, `london_env`.

**Activate**

```bash
# Linux/macOS
source helloLondon/bin/activate
```

```powershell
# Windows (PowerShell)
.\helloLondon\Scripts\Activate.ps1
```

```cmd
:: Windows (CMD)
.\helloLondon\Scripts\activate.bat
```

> If PowerShell blocks activation (*"running scripts is disabled"*), set the policy then retry activation:

```powershell
Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
# or just for this session:
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
```

---

## 📦 Install libraries

Upgrade the basics, then install the Hugging Face libraries:

```bash
python -m pip install -U pip setuptools wheel
python -m pip install "transformers" "accelerate" "safetensors"
```

---

## Install **one** PyTorch variant (CPU / NVIDIA / AMD)

Use **one** of the commands below. For the most accurate command for your OS/accelerator and version, prefer PyTorch's **Get Started** selector.

### A) CPU-only (Linux/Windows/macOS)

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

### B) NVIDIA GPU (CUDA)

Pick the CUDA series that matches your system (examples below):

```bash
# CUDA 12.6
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# CUDA 12.4
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

### C) AMD GPU (ROCm, **Linux-only**)

Install the ROCm build matching your ROCm runtime (examples):

```bash
# ROCm 6.3
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

# ROCm 6.2 (incl. 6.2.x)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4

# ROCm 6.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
```

**Quick sanity check**

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
PY
```

---

## 🚀 Inference (auto-detect device)

This snippet picks the best device (CUDA/ROCm if available, else CPU) and uses sensible generation defaults for this SLM.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "bahree/london-historical-slm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# ROCm builds also report through torch.cuda, so this covers NVIDIA and AMD
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,  # passes input_ids and attention_mask
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🧪 **Testing Your Model**

### **Quick Testing (10 Automated Prompts)**
```bash
# Test with 10 automated historical prompts
python 06_inference/test_published_models.py --model_type slm
```

**Expected Output:**
```
🧪 Testing SLM Model: bahree/london-historical-slm
============================================================
📂 Loading model...
✅ Model loaded in 8.91 seconds
📊 Model Info:
   Type: SLM
   Description: Small Language Model (117M parameters)
   Device: cuda
   Vocabulary size: 30,000
   Max length: 512

🎯 Testing generation with 10 prompts...
[10 automated tests with historical text generation]
```

### **Interactive Testing**
```bash
# Interactive mode for custom prompts
python 06_inference/inference_unified.py --published --model_type slm --interactive

# Single prompt test
python 06_inference/inference_unified.py --published --model_type slm --prompt "In the year 1834, I walked through the streets of London and witnessed"
```

**Need more headroom later?** Load with 🤗 Accelerate and `device_map="auto"` to spread layers across available devices/CPU automatically.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bahree/london-historical-slm"
tok = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" shards the model across available GPUs/CPU (requires accelerate)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```

---

## 🪟 Windows Terminal one-liners

**PowerShell**

```powershell
python -c "from transformers import AutoTokenizer,AutoModelForCausalLM; m='bahree/london-historical-slm'; t=AutoTokenizer.from_pretrained(m); model=AutoModelForCausalLM.from_pretrained(m); p='In the year 1834, I walked through the streets of London and witnessed'; i=t(p,return_tensors='pt'); print(t.decode(model.generate(i['input_ids'],max_new_tokens=50,do_sample=True)[0],skip_special_tokens=True))"
```

**Command Prompt (CMD)**

```cmd
python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; m='bahree/london-historical-slm'; t=AutoTokenizer.from_pretrained(m); model=AutoModelForCausalLM.from_pretrained(m); p='In the year 1834, I walked through the streets of London and witnessed'; i=t(p, return_tensors='pt'); print(t.decode(model.generate(i['input_ids'], max_new_tokens=50, do_sample=True)[0], skip_special_tokens=True))"
```

---

## 💡 Basic Usage (Python)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-slm")
model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm")

# GPT-2 has no pad token by default; reuse EOS for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

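Prefer the high-level `pipeline` API? The same generation, with the sampling settings mirrored from above, looks roughly like this:

```python
from transformers import pipeline

# The pipeline wraps tokenizer/model loading and decoding in one object
generator = pipeline("text-generation", model="bahree/london-historical-slm")

result = generator(
    "In the year 1834, I walked through the streets of London and witnessed",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.2,
)
print(result[0]["generated_text"])
```
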
---

## 🧰 Example Prompts

* **Tudor (1558):** "On this day in 1558, Queen Mary has died and …"
* **Stuart (1666):** "The Great Fire of London has consumed much of the city, and …"
* **Georgian/Victorian:** "As I journeyed through the streets of London, I observed …"
* **London specifics:** "Parliament sat in Westminster Hall …", "The Thames flowed dark and mysterious …"

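To try several of these prompts in one go, here is a minimal sketch using the `pipeline` API (sampled output will vary from run to run):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="bahree/london-historical-slm")

prompts = [
    "On this day in 1558, Queen Mary has died and",
    "The Great Fire of London has consumed much of the city, and",
    "As I journeyed through the streets of London, I observed",
]

for p in prompts:
    out = generator(p, max_new_tokens=40, do_sample=True, temperature=0.8)
    print("\n" + out[0]["generated_text"])
```
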
---

## 🛠️ Training Details

* **Architecture:** GPT-2 Small (12 layers, hidden size 768)
* **Params:** ~117M
* **Tokenizer:** custom historical tokenizer (~30k vocab) with London-specific and historical tokens
* **Data:** historical London corpus (1500–1850)
* **Steps:** 30,000 (extended training for better convergence)
* **Batch/LR:** 32, 3e-4 (optimized for segmented data)
* **Hardware:** 2× GPUs with Distributed Data Parallel
* **Final Training Loss:** 1.395 (a 43% improvement over the 20K-step checkpoint)
* **Model FLOPs Utilization (MFU):** 3.5%
* **Training Method:** **trained from scratch**, not fine-tuned
* **Context Length:** 256 tokens (optimized for historical text segments)
* **Status:** ✅ **Published and tested**, ready for use

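The actual training pipeline lives in the GitHub repository linked below. Purely as an illustration (not the project's real script), the headline hyperparameters above map onto 🤗 `TrainingArguments` roughly like this; the 16-per-device split and mixed precision are assumptions:

```python
from transformers import TrainingArguments

# Illustrative sketch only; see the GitHub repo for the real training code.
args = TrainingArguments(
    output_dir="london-historical-slm",
    max_steps=30_000,                # 30K steps, as reported above
    learning_rate=3e-4,
    per_device_train_batch_size=16,  # assumption: 16 x 2 GPUs = effective batch 32
    bf16=True,                       # assumption: mixed-precision training
    logging_steps=100,
    save_steps=5_000,
)
```
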
---

## 🔤 Historical Tokenizer

* Compact 30k vocab targeting 1500–1850 English
* Tokens for **year/date/name/place/title**, plus **thames**, **westminster**, etc.; includes **thou/thee/hath/doth** style markers

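You can inspect how the tokenizer segments period vocabulary directly; the exact splits depend on the published vocab, so treat the output as illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bahree/london-historical-slm")

print("vocab size:", tok.vocab_size)
# Fewer sub-word pieces per word suggests better historical coverage
for word in ["thou", "hath", "Westminster", "Thames"]:
    print(word, "->", tok.tokenize(word))
```
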
---

## ⚠️ Troubleshooting

* **`ImportError: AutoModelForCausalLM requires the PyTorch library`**
  → Install PyTorch with the correct accelerator variant (see CPU/CUDA/ROCm above, or use the official selector).

* **AMD GPU not used**
  → Ensure you installed a ROCm build and you're on Linux (`pip install ... --index-url https://download.pytorch.org/whl/rocmX.Y`); ROCm wheels are Linux-only. Verify with `torch.cuda.is_available()` and check the device name, as in the check after this list.

* **Running out of VRAM**
  → Try smaller batch/sequence lengths, or load with `device_map="auto"` via 🤗 Accelerate to offload layers to CPU/disk.

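A quick way to confirm which accelerator your PyTorch build actually targets (useful for the GPU bullets above):

```python
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)       # None on ROCm/CPU builds
print("ROCm (HIP) runtime:", torch.version.hip)  # None on CUDA/CPU builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    # ROCm devices also surface through the torch.cuda API
    print("device:", torch.cuda.get_device_name(0))
```
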
---

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{london-historical-slm,
  title  = {London Historical LLM - Small Language Model: A Compact GPT-2 for Historical Text Generation},
  author = {Amit Bahree},
  year   = {2025},
  url    = {https://huggingface.co/bahree/london-historical-slm}
}
```

---

## Repository

The complete source code, training scripts, and documentation for this model are available on GitHub:

**🔗 [https://github.com/bahree/helloLondon](https://github.com/bahree/helloLondon)**

This repository includes:
- Complete data collection pipeline for 1500–1850 historical English
- Custom tokenizer optimized for historical text
- Training infrastructure with GPU optimization
- Evaluation and deployment tools
- Comprehensive documentation and examples

### Quick Start with Repository
```bash
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python 06_inference/test_published_models.py --model_type slm
```

---

## 🧾 License

MIT (see [LICENSE](https://github.com/bahree/helloLondon/blob/main/LICENSE) in the repo).