Commit ab8c355 (verified) by omarkamali · 1 Parent(s): c53f797

Upload all models and assets for ang (latest)

Files changed (40)
  1. README.md +93 -89
  2. models/embeddings/aligned/ang_128d.bin +1 -1
  3. models/embeddings/aligned/ang_128d.projection.npy +1 -1
  4. models/embeddings/aligned/ang_32d.bin +1 -1
  5. models/embeddings/aligned/ang_32d.projection.npy +1 -1
  6. models/embeddings/aligned/ang_64d.bin +1 -1
  7. models/embeddings/aligned/ang_64d.projection.npy +1 -1
  8. models/embeddings/monolingual/ang_128d.bin +1 -1
  9. models/embeddings/monolingual/ang_32d.bin +1 -1
  10. models/embeddings/monolingual/ang_64d.bin +1 -1
  11. models/subword_markov/ang_markov_ctx1_subword.parquet +2 -2
  12. models/subword_markov/ang_markov_ctx2_subword.parquet +2 -2
  13. models/subword_markov/ang_markov_ctx3_subword.parquet +2 -2
  14. models/subword_markov/ang_markov_ctx4_subword.parquet +2 -2
  15. models/subword_ngram/ang_2gram_subword.parquet +2 -2
  16. models/subword_ngram/ang_3gram_subword.parquet +2 -2
  17. models/subword_ngram/ang_4gram_subword.parquet +2 -2
  18. models/subword_ngram/ang_5gram_subword.parquet +2 -2
  19. models/tokenizer/ang_tokenizer_16k.model +1 -1
  20. models/tokenizer/ang_tokenizer_32k.model +1 -1
  21. models/tokenizer/ang_tokenizer_64k.model +1 -1
  22. models/tokenizer/ang_tokenizer_8k.model +1 -1
  23. models/word_markov/ang_markov_ctx1_word.parquet +2 -2
  24. models/word_markov/ang_markov_ctx2_word.parquet +2 -2
  25. models/word_markov/ang_markov_ctx3_word.parquet +2 -2
  26. models/word_markov/ang_markov_ctx4_word.parquet +2 -2
  27. models/word_ngram/ang_2gram_word.parquet +2 -2
  28. models/word_ngram/ang_3gram_word.parquet +2 -2
  29. models/word_ngram/ang_4gram_word.parquet +2 -2
  30. models/word_ngram/ang_5gram_word.parquet +2 -2
  31. visualizations/embedding_alignment_quality.png +0 -0
  32. visualizations/embedding_isotropy.png +0 -0
  33. visualizations/embedding_norms.png +0 -0
  34. visualizations/embedding_similarity.png +2 -2
  35. visualizations/embedding_tsne_multilingual.png +2 -2
  36. visualizations/ngram_perplexity.png +0 -0
  37. visualizations/performance_dashboard.png +2 -2
  38. visualizations/position_encoding_comparison.png +2 -2
  39. visualizations/tsne_sentences.png +2 -2
  40. visualizations/tsne_words.png +2 -2
README.md CHANGED
@@ -36,7 +36,7 @@ metrics:
36
  value: 4.012
37
  - name: best_isotropy
38
  type: isotropy
39
- value: 0.8005
40
  - name: vocabulary_size
41
  type: vocab
42
  value: 0
@@ -99,32 +99,32 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
- **Sample 1:** `Valladolid is burg on Spēnum. Valladolid hæfþ 319,943 lēoda. on Castillan`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
- | 8k | `▁val lad ol idisburg ▁onspēnum .val ... (+18 more)` | 28 |
107
- | 16k | `▁val lad ol idisburg ▁onspēnum .val ... (+17 more)` | 27 |
108
- | 32k | `▁val ladol idisburg ▁onspēnum .val ladol ... (+14 more)` | 24 |
109
- | 64k | `▁valladolidis ▁burgonspēnum .valladolid ▁hæfþ3 ... (+10 more)` | 20 |
110
 
111
- **Sample 2:** `Cicġan () is burg on Cile. Þǣr oneardiaþ 161,953 lēoda (þæs gēares). His mearc i...`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
- | 8k | `▁c ic ġ an () ▁isburgoncile . ... (+33 more)` | 43 |
116
- | 16k | `▁cic ġan ▁()isburgoncile . ▁þǣr ▁oneardiaþ ... (+31 more)` | 41 |
117
- | 32k | `▁cic ġan()isburgoncile . ▁þǣroneardiaþ ... (+30 more)` | 40 |
118
- | 64k | `▁cicġan()isburgoncile . ▁þǣroneardiaþ ▁ ... (+28 more)` | 38 |
119
 
120
- **Sample 3:** `Welwīc () is þorp in þæm East Þriding, se is Eoferƿicscire dǣl, on Englum. Heo h...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
- | 8k | `▁wel wīc () ▁is ▁þorp ▁in ▁þæmeast ▁þriding , ... (+21 more)` | 31 |
125
- | 16k | `▁wel wīc () ▁is ▁þorpin ▁þæmeast ▁þriding , ... (+21 more)` | 31 |
126
- | 32k | `▁wel wīc () ▁is ▁þorpin ▁þæmeast ▁þriding , ... (+21 more)` | 31 |
127
- | 64k | `▁welwīc()is ▁þorpin ▁þæmeast ▁þriding , ▁se ... (+20 more)` | 30 |
128
 
129
 
130
  ### Key Findings
@@ -274,27 +274,27 @@ Below are text samples generated from each word-based Markov chain model:
274
 
275
  **Context Size 1:**
276
 
277
- 1. `and strætham is eoferƿicscire dǣl on wyrse ġearum ac bowser jr americanisc auto maker grady thomas`
278
- 2. `on pictocusċīre`
279
- 3. `is burg þes geþoftede rīce and gescyldnesse kowane mutum yana da vinci chapter 93 dead rǣdinge`
280
 
281
  **Context Size 2:**
282
 
283
- 1. `on þǣm ġerēfscipe þæs fōresittende sam dōnde swā þurh sūþsið fǣmneland wǣre and ǣrende genōg land fo...`
284
- 2. `in þǣm geānlǣhtum rīcum fram þǣm sericus gārsecge hit stent is 60 mīle geddoburg be sūðƿesten on`
285
- 3. `in þæm suþernan dæle þæs geānedan cynerīces grēatre brytene cynerīce ƿæs þēodland in ƿesternre europ...`
286
 
287
  **Context Size 3:**
288
 
289
- 1. `td valign top efencasere mid numeriane murdered td valign top td valign top td 269 td valign`
290
- 2. `is þorp in þæm east þriding se is eoferƿicscire dǣl on englum on eoferwicscīre þæs geānedan cynerīce...`
291
- 3. `eoferwicscīre þæs geānedan cynerīces on wēalum þrēoscyte is hēo and hæfþ twegen ecgas onmang beorgum...`
292
 
293
  **Context Size 4:**
294
 
295
  1. `on eoferwicscīre þæs geānedan cynerīces`
296
- 2. `is eoferƿicscire dǣl on englum hit hæfþ 318 buend on eoferwicscīre þæs geānedan cynerīces`
297
- 3. `eoferƿicscire dǣl on englum on eoferwicscīre þæs geānedan cynerīces`
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
@@ -303,27 +303,27 @@ Below are text samples generated from each subword-based Markov chain model:
303
 
304
  **Context Size 1:**
305
 
306
- 1. `_bor.rm_frsh_ded`
307
- 2. `es_a_urīþofophob`
308
- 3. `nftaner,_a_gavst`
309
 
310
  **Context Size 2:**
311
 
312
- 1. `e_wærr_ofiret_be_`
313
- 2. `and:_gēac_(144_75`
314
- 3. `n_en_polan_on_þæs`
315
 
316
  **Context Size 3:**
317
 
318
- 1. `and_womericanada_r`
319
- 2. `nd_þā_illinge._þā_`
320
- 3. `an_60_1,000_ause_ƿ`
321
 
322
  **Context Size 4:**
323
 
324
- 1. `and_(oððe_stre_in_n`
325
- 2. `_and_scot_ƿæs_þēodi`
326
- 3. `_on_mererīca,_a_cli`
327
 
328
 
329
  ### Key Findings
@@ -428,18 +428,18 @@ Below are text samples generated from each subword-based Markov chain model:
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
- | **mono_32d** | 32 | 0.8005 🏆 | 0.3579 | N/A | N/A |
432
- | **mono_64d** | 64 | 0.4615 | 0.3163 | N/A | N/A |
433
- | **mono_128d** | 128 | 0.1318 | 0.3128 | N/A | N/A |
434
- | **aligned_32d** | 32 | 0.8005 | 0.3430 | 0.0360 | 0.2720 |
435
- | **aligned_64d** | 64 | 0.4615 | 0.3200 | 0.0780 | 0.3300 |
436
- | **aligned_128d** | 128 | 0.1318 | 0.2993 | 0.0840 | 0.3840 |
437
 
438
  ### Key Findings
439
 
440
- - **Best Isotropy:** mono_32d with 0.8005 (more uniform distribution)
441
- - **Semantic Density:** Average pairwise similarity of 0.3249. Lower values indicate better semantic separation.
442
- - **Alignment Quality:** Aligned models achieve up to 8.4% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
444
 
445
  ---
@@ -461,17 +461,19 @@ These are the most productive prefixes and suffixes identified by sampling the v
461
  #### Productive Prefixes
462
  | Prefix | Examples |
463
  |--------|----------|
464
- | `-ge` | geƿinnes, geendebyrded, genered |
465
 
466
  #### Productive Suffixes
467
  | Suffix | Examples |
468
  |--------|----------|
469
- | `-e` | ohthere, tide, crulande |
470
- | `-an` | iuliscan, weardan, ċiriċeburnan |
471
- | `-es` | geƿinnes, laurentes, yankees |
472
- | `-um` | stōwum, sǣfōrum, betweonum |
473
- | `-de` | tide, crulande, īeglande |
474
- | `-ng` | hūselhālgung, āmang, rising |
 
 
475
 
476
  ### 6.3 Bound Stems (Lexical Roots)
477
 
@@ -479,18 +481,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
479
 
480
  | Stem | Cohesion | Substitutability | Examples |
481
  |------|----------|------------------|----------|
482
- | `mani` | 2.03x | 43 contexts | amani, maniġ, maniȝ |
483
- | `enne` | 1.97x | 48 contexts | fenne, agenne, etenne |
484
- | `ster` | 1.84x | 59 contexts | buster, easter, noster |
485
- | `wear` | 1.97x | 43 contexts | weard, wearþ, wearð |
486
- | `unge` | 1.85x | 46 contexts | tunge, tunges, hunger |
487
- | `tion` | 2.26x | 19 contexts | action, motion, nation |
488
- | `inga` | 1.78x | 34 contexts | þinga, minga, ðinga |
489
- | `ning` | 1.70x | 35 contexts | cyning, ininga, cining |
490
- | `aste` | 1.76x | 27 contexts | ēaste, easte, taste |
491
- | `ynin` | 2.28x | 11 contexts | cynin, cyning, cyninȝ |
492
- | `afod` | 1.85x | 18 contexts | heafod, ƿafode, hēafod |
493
- | `nisc` | 1.56x | 27 contexts | denisc, cinisc, dēnisc |
494
 
495
  ### 6.4 Affix Compatibility (Co-occurrence)
496
 
@@ -498,12 +500,14 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam
498
 
499
  | Prefix | Suffix | Frequency | Examples |
500
  |--------|--------|-----------|----------|
501
- | `-ge` | `-e` | 75 words | gearore, gegaderode |
502
- | `-ge` | `-de` | 27 words | gegaderode, gelytlode |
503
- | `-ge` | `-um` | 20 words | getremincum, gearum |
504
- | `-ge` | `-an` | 19 words | gearan, gesecan |
505
- | `-ge` | `-es` | 18 words | gearƿes, geferscipes |
506
- | `-ge` | `-ng` | 7 words | geswutelung, geþrang |
 
 
507
 
508
  ### 6.5 Recursive Morpheme Segmentation
509
 
@@ -511,21 +515,21 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
511
 
512
  | Word | Suggested Split | Confidence | Stem |
513
  |------|-----------------|------------|------|
514
- | gereordes | **`ge-reord-es`** | 6.0 | `reord` |
515
- | geƿealdes | **`ge-ƿeald-es`** | 6.0 | `ƿeald` |
516
- | gehealdan | **`ge-heald-an`** | 6.0 | `heald` |
517
- | gefeohtes | **`ge-feoht-es`** | 6.0 | `feoht` |
518
- | foresittendlican | **`foresittendlic-an`** | 4.5 | `foresittendlic` |
519
- | hundgēares | **`hundgēar-es`** | 4.5 | `hundgēar` |
520
- | ƿīntrēoƿum | **`ƿīntrēoƿ-um`** | 4.5 | `ƿīntrēoƿ` |
521
- | cræftigum | **`cræftig-um`** | 4.5 | `cræftig` |
522
- | bryttiscan | **`bryttisc-an`** | 4.5 | `bryttisc` |
523
- | speliġendhūses | **`speliġendhūs-es`** | 4.5 | `speliġendhūs` |
524
- | russiscum | **`russisc-um`** | 4.5 | `russisc` |
525
- | regollicum | **`regollic-um`** | 4.5 | `regollic` |
526
- | atlantiscan | **`atlantisc-an`** | 4.5 | `atlantisc` |
527
- | ƿealdende | **`ƿealden-de`** | 4.5 | `ƿealden` |
528
- | gegaderunge | **`ge-gaderunge`** | 4.5 | `gaderunge` |
529
 
530
  ### 6.6 Linguistic Interpretation
531
 
@@ -759,4 +763,4 @@ MIT License - Free for academic and commercial use.
759
  ---
760
  *Generated by Wikilangs Models Pipeline*
761
 
762
- *Report Date: 2026-01-03 14:10:12*
 
36
  value: 4.012
37
  - name: best_isotropy
38
  type: isotropy
39
+ value: 0.7896
40
  - name: vocabulary_size
41
  type: vocab
42
  value: 0
 
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
+ **Sample 1:** `Grēat Coldūn () is þorp in þæm East Þriding, se is Eoferƿicscire dǣl, on Englum....`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
+ | 8k | `▁grēat ▁c old ūn()is ▁þorpin ▁þæmeast ... (+15 more)` | 25 |
107
+ | 16k | `▁grēat ▁c old ūn()is ▁þorpin ▁þæmeast ... (+15 more)` | 25 |
108
+ | 32k | `▁grēat ▁cold ūn()is ▁þorpin ▁þæmeast ▁þriding ... (+14 more)` | 24 |
109
+ | 64k | `▁grēatcold ūn()is ▁þorpin ▁þæmeast ▁þriding ... (+14 more)` | 24 |
110
 
111
+ **Sample 2:** `Lingua Franca Nova is gehugod sprǣc. Utweardlice bendas elefen.org gereord`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
+ | 8k | `▁l ing uafranc anov a isgeh ug ... (+11 more)` | 21 |
116
+ | 16k | `▁l ing uafranc a novaisgeh ug od ... (+10 more)` | 20 |
117
+ | 32k | `▁ling uafrancanovaisgehugodsprǣc . ▁utweardlicebendas ... (+5 more)` | 15 |
118
+ | 64k | `▁linguafrancanovaisgehugodsprǣc . ▁utweardlicebendasele ... (+4 more)` | 14 |
119
 
120
+ **Sample 3:** `Andreas Iǣxcūn ƿæs se seofoða Foresittend þāra Geānlǣhtra Rīca, fram þǣm gēare ō...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
+ | 8k | `▁andreasi ǣ x c ūn ▁ƿæs se ▁seof oða ... (+17 more)` | 27 |
125
+ | 16k | `▁andreasiǣx c ūn ▁ƿæs se ▁seofoðaforesittend ▁þāra ▁geānlǣhtra ... (+14 more)` | 24 |
126
+ | 32k | `▁andreasiǣx c ūn ▁ƿæs se ▁seofoðaforesittend ▁þāra ▁geānlǣhtra ... (+14 more)` | 24 |
127
+ | 64k | `▁andreasiǣxcūn ▁ƿæs se ▁seofoðaforesittend ▁þārageānlǣhtra ▁rīca , ... (+12 more)` | 22 |
128
 
129
 
130
  ### Key Findings
 
274
 
275
  **Context Size 1:**
276
 
277
+ 1. `and bedældede hine in þǣm geānedum rīcum þā protest sang rocc and sīþe hrēðcyninges hām to`
278
+ 2. `on francum in þæm miclum burgum and his ƿæter hit hê hê willgesweostor shes laid back`
279
+ 3. `is unesco æfter déaðe drepe þrōƿade heorosƿeng heardn ond sēo hēafodmearc iesuitisces rǣses it was f...`
280
 
281
  **Context Size 2:**
282
 
283
+ 1. `on þǣm fylle þǣm þe nāhwæþer ne þā ġeānedan land sculon ne ǣniġ land sceal ætfōn oþþe`
284
+ 2. `in þǣm indiscum lande uttar pradesh þæt land þæt ƿæs corēan independence activist politicians and jo...`
285
+ 3. `in þæm east þriding se is eoferƿicscire dǣl on englum hit hæfþ 11 351 būendas on eoferwicscīre`
286
 
287
  **Context Size 3:**
288
 
289
+ 1. `td valign top ualentinianus ii td valign top td to 297 td valign top co emperor with honorius`
290
+ 2. `is þorp in soria on castile and leóne in spēonlande and þorpas on sorie`
291
+ 3. `eoferwicscīre þæs geānedan cynerīces and hēafodman þæs behealdenda hēapes siþðan mǣdmōnaþ he is gebē...`
292
 
293
  **Context Size 4:**
294
 
295
  1. `on eoferwicscīre þæs geānedan cynerīces`
296
+ 2. `is eoferƿicscire dǣl on englalande on eoferwicscīre þæs geānedan cynerīces`
297
+ 3. `eoferƿicscire dǣl on englum mid grēatum hǣþfelda ġesċieppaþ hie þone burgsċipe of hǣþfelda on eoferw...`
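For readers unfamiliar with how samples like these arise, here is a minimal sketch of a word-level Markov chain with a configurable context size. It is not the Wikilangs pipeline's implementation: the names `train_markov` and `generate`, the toy corpus, and the uniform sampling over observed successors are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def train_markov(tokens, context_size):
    """Map each context (a tuple of `context_size` tokens) to the tokens seen after it."""
    table = defaultdict(list)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        table[context].append(tokens[i + context_size])
    return table

def generate(table, seed_context, max_tokens, rng):
    """Extend `seed_context` by sampling successors until a dead end or `max_tokens`."""
    out = list(seed_context)
    n = len(seed_context)
    while len(out) < max_tokens:
        successors = table.get(tuple(out[-n:]))
        if not successors:
            break  # this context never occurred in training, so stop
        out.append(rng.choice(successors))
    return out

# Toy corpus (hypothetical), loosely assembled from words in the samples above.
corpus = (
    "is þorp in þæm east þriding se is eoferƿicscire dǣl on englum "
    "is burg on englum on eoferwicscīre þæs geānedan cynerīces"
).split()

rng = random.Random(0)
for n in (1, 2, 4):
    table = train_markov(corpus, n)
    sample = generate(table, tuple(corpus[:n]), max_tokens=12, rng=rng)
    print(n, " ".join(sample))
```

With a longer context each table entry tends to have fewer observed successors, so the output follows the training text more closely. That is consistent with the Context Size 4 samples above, which repeat the phrase `on eoferwicscīre þæs geānedan cynerīces` almost verbatim.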
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
 
303
 
304
  **Context Size 1:**
305
 
306
+ 1. `_htofunes_anōre_`
307
+ 2. `e_c_weaþǣfyn_sca`
308
+ 3. `n_þeal_wun_berie`
309
 
310
  **Context Size 2:**
311
 
312
+ 1. `e_of_fi_94oðbe_tw`
313
+ 2. `an_thoseadand_īeg`
314
+ 3. `n_nīƿ_mesprytt,_þ`
315
 
316
  **Context Size 3:**
317
 
318
+ 1. `and_und_ofher_mā_s`
319
+ 2. `nd_titutede_him._h`
320
+ 3. `an_asscran_betwa_ǣ`
321
 
322
  **Context Size 4:**
323
 
324
+ 1. `and_belalan_(mother`
325
+ 2. `_and_ġecosta_tƿiste`
326
+ 3. `_on_þā_habbað_nofgo`
327
 
328
 
329
  ### Key Findings
 
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
+ | **mono_32d** | 32 | 0.7896 | 0.3585 | N/A | N/A |
432
+ | **mono_64d** | 64 | 0.4746 | 0.3175 | N/A | N/A |
433
+ | **mono_128d** | 128 | 0.1353 | 0.3004 | N/A | N/A |
434
+ | **aligned_32d** | 32 | 0.7896 🏆 | 0.3555 | 0.0300 | 0.2480 |
435
+ | **aligned_64d** | 64 | 0.4746 | 0.3090 | 0.0860 | 0.3400 |
436
+ | **aligned_128d** | 128 | 0.1353 | 0.3041 | 0.1280 | 0.4020 |
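The R@1 and R@10 columns in the table above are recall-at-k scores. The README does not state how the pipeline computes them, so the following is only a sketch of one common definition: for each source vector with a known target, rank all target vectors by cosine similarity and count a hit if the known target is among the top k. The function `recall_at_k` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(sources, targets, gold, k):
    """Fraction of source vectors whose gold target ranks in the top k by cosine similarity."""
    hits = 0
    for src, gold_idx in zip(sources, gold):
        ranked = sorted(range(len(targets)),
                        key=lambda j: cosine(src, targets[j]),
                        reverse=True)
        if gold_idx in ranked[:k]:
            hits += 1
    return hits / len(sources)

# Toy 2-d vectors (hypothetical data, not taken from the models).
targets = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
sources = [(0.9, 0.1), (0.6, 0.8), (0.1, 0.9)]
gold = [0, 1, 1]  # index of the correct target for each source

print(recall_at_k(sources, targets, gold, k=1))
print(recall_at_k(sources, targets, gold, k=2))
```

Under this definition, the 0.1280 reported for `aligned_128d` would mean that 12.8% of evaluation pairs retrieve the correct translation at rank 1.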
437
 
438
  ### Key Findings
439
 
440
+ - **Best Isotropy:** aligned_32d with 0.7896 (more uniform distribution)
441
+ - **Semantic Density:** Average pairwise similarity of 0.3242. Lower values indicate better semantic separation.
442
+ - **Alignment Quality:** Aligned models achieve up to 12.8% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
444
 
445
  ---
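The summary figures in the updated Key Findings can be checked against the updated table. The sketch below assumes the reported semantic density is the unweighted mean over all six models; the README does not say so explicitly, but the numbers agree under that assumption.

```python
# Values copied from the updated embedding table (new side of the diff).
semantic_density = {
    "mono_32d": 0.3585,
    "mono_64d": 0.3175,
    "mono_128d": 0.3004,
    "aligned_32d": 0.3555,
    "aligned_64d": 0.3090,
    "aligned_128d": 0.3041,
}
alignment_r1 = {
    "aligned_32d": 0.0300,
    "aligned_64d": 0.0860,
    "aligned_128d": 0.1280,
}

# Assumption: the reported average is a plain mean over the six rows.
mean_density = sum(semantic_density.values()) / len(semantic_density)
best_model, best_r1 = max(alignment_r1.items(), key=lambda kv: kv[1])

print(f"{mean_density:.4f}")          # 0.3242
print(best_model, f"{best_r1:.1%}")   # aligned_128d 12.8%
```

The same calculation on the old table (0.3579, 0.3163, 0.3128, 0.3430, 0.3200, 0.2993) gives 0.3249, matching the removed line.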
 
461
  #### Productive Prefixes
462
  | Prefix | Examples |
463
  |--------|----------|
464
+ | `-ge` | geondrīcisce, gebold, gemyndgung |
465
 
466
  #### Productive Suffixes
467
  | Suffix | Examples |
468
  |--------|----------|
469
+ | `-e` | ārwurðnysse, cǣġe, farende |
470
+ | `-s` | celebrations, villages, annivs |
471
+ | `-es` | villages, ides, missiles |
472
+ | `-an` | þēodacynewīsan, hāligan, europiscan |
473
+ | `-um` | dorsætum, maniȝum, elpendum |
474
+ | `-de` | farende, ungeƿilde, bestandende |
475
+ | `-en` | ƿriten, eċġen, hyrneġen |
476
+ | `-on` | edmonton, huffington, aragon |
477
 
478
  ### 6.3 Bound Stems (Lexical Roots)
479
 
 
481
 
482
  | Stem | Cohesion | Substitutability | Examples |
483
  |------|----------|------------------|----------|
484
+ | `enne` | 2.04x | 48 contexts | fenne, etenne, cenneþ |
485
+ | `mani` | 2.03x | 43 contexts | amani, maniȝ, maniġ |
486
+ | `wear` | 1.91x | 43 contexts | wearð, wearg, weard |
487
+ | `ster` | 1.67x | 59 contexts | sister, ēaster, faster |
488
+ | `unge` | 1.77x | 46 contexts | tunge, tunges, jungen |
489
+ | `tion` | 2.19x | 19 contexts | motion, nation, action |
490
+ | `inga` | 1.72x | 34 contexts | þinga, minga, ðinga |
491
+ | `ning` | 1.64x | 35 contexts | mining, cining, cyning |
492
+ | `aste` | 1.69x | 27 contexts | taste, easte, ēaste |
493
+ | `ynin` | 2.21x | 11 contexts | cynin, cyning, cyninȝ |
494
+ | `afod` | 1.82x | 18 contexts | hēafod, heafod, ƿafode |
495
+ | `nisc` | 1.49x | 27 contexts | rūnisc, denisc, dēnisc |
496
 
497
  ### 6.4 Affix Compatibility (Co-occurrence)
498
 
 
500
 
501
  | Prefix | Suffix | Frequency | Examples |
502
  |--------|--------|-----------|----------|
503
+ | `-ge` | `-e` | 79 words | geƿorhte, geƿǣre |
504
+ | `-ge` | `-en` | 35 words | getimbroden, geferræden |
505
+ | `-ge` | `-de` | 35 words | geanede, gehiersomode |
506
+ | `-ge` | `-s` | 29 words | genus, geardas |
507
+ | `-ge` | `-an` | 20 words | gegildan, gemæccan |
508
+ | `-ge` | `-um` | 20 words | gerādum, germanicum |
509
+ | `-ge` | `-es` | 17 words | geofones, geānlǣhtes |
510
+ | `-ge` | `-on` | 9 words | gestaðoledon, gestrēon |
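A co-occurrence table of this kind can be approximated by plain string matching. The sketch below is an assumption about the general approach, not the pipeline's actual method; `cooccurrence` is a hypothetical helper, and the word list is limited to examples shown in the table above.

```python
from collections import Counter

def cooccurrence(words, prefixes, suffixes):
    """Count (prefix, suffix) pairs; one word may match several suffixes (e.g. -e and -de)."""
    counts = Counter()
    for word in words:
        for p in prefixes:
            if not word.startswith(p):
                continue
            for s in suffixes:
                # Require a non-empty remainder between prefix and suffix.
                if word.endswith(s) and len(word) > len(p) + len(s):
                    counts[(p, s)] += 1
    return counts

# Example words taken from the table above; the affix lists mirror some of its rows.
words = ["geƿorhte", "geƿǣre", "getimbroden", "geferræden", "geanede",
         "gehiersomode", "genus", "geardas", "gegildan", "gemæccan"]
prefixes = ["ge"]
suffixes = ["e", "en", "de", "s", "an"]

for (p, s), n in sorted(cooccurrence(words, prefixes, suffixes).items()):
    print(f"{p}- + -{s}: {n}")
```

Two caveats follow from matching on surface strings. Overlapping suffixes are counted more than once, so `geanede` contributes to both `-e` and `-de`. Loanwords such as `genus` and `germanicum` match the `ge` prefix even though they are unlikely to be genuine `ge-` formations.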
511
 
512
  ### 6.5 Recursive Morpheme Segmentation
513
 
 
515
 
516
  | Word | Suggested Split | Confidence | Stem |
517
  |------|-----------------|------------|------|
518
+ | gehƿilcum | **`ge-hƿilc-um`** | 6.0 | `hƿilc` |
519
+ | gefeahten | **`ge-feaht-en`** | 6.0 | `feaht` |
520
+ | underbyrigum | **`underbyrig-um`** | 4.5 | `underbyrig` |
521
+ | geþoftscipe | **`ge-þoftscipe`** | 4.5 | `þoftscipe` |
522
+ | sanghordes | **`sanghord-es`** | 4.5 | `sanghord` |
523
+ | gesweoster | **`ge-sweoster`** | 4.5 | `sweoster` |
524
+ | russlandes | **`russland-es`** | 4.5 | `russland` |
525
+ | þēodisclandes | **`þēodiscland-es`** | 4.5 | `þēodiscland` |
526
+ | gestrēonum | **`ge-strē-on-um`** | 4.5 | `strē` |
527
+ | drȳġelandes | **`drȳġeland-es`** | 4.5 | `drȳġeland` |
528
+ | drēamhordes | **`drēamhord-es`** | 4.5 | `drēamhord` |
529
+ | andweardum | **`andweard-um`** | 4.5 | `andweard` |
530
+ | engliscan | **`englisc-an`** | 4.5 | `englisc` |
531
+ | stǣrlican | **`stǣrlic-an`** | 4.5 | `stǣrlic` |
532
+ | bedæleden | **`bedæled-en`** | 4.5 | `bedæled` |
533
 
534
  ### 6.6 Linguistic Interpretation
535
 
 
763
  ---
764
  *Generated by Wikilangs Models Pipeline*
765
 
766
+ *Report Date: 2026-01-03 16:22:13*
models/embeddings/aligned/ang_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:916924f4cae8ca68c9b328270e6c8f5d750e1f5f8a934a8f14b2a9ed44bfa5e9
3
  size 1034408961
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4042fcd252c39668ffd282b54a7959049dca6a2fa2494d370faeffe4781a8f96
3
  size 1034408961
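Each binary in this commit is stored as a Git LFS pointer with three fields: `version`, `oid`, and `size`. As a minimal sketch, the pointer can be parsed and used to check a downloaded file. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of any Git LFS tooling.

```python
import hashlib

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def matches_pointer(data, pointer):
    """True if `data` (bytes) has the size and SHA-256 digest recorded in the pointer."""
    return (
        pointer["algo"] == "sha256"
        and len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )

# New-side pointer for models/embeddings/aligned/ang_128d.bin, copied from the hunk above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4042fcd252c39668ffd282b54a7959049dca6a2fa2494d370faeffe4781a8f96
size 1034408961
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["algo"], pointer["size"])
```

For a file of roughly 1 GB, such as this one, it would be more practical to hash in chunks with `hashlib.sha256().update(...)` than to load the whole file into memory; `matches_pointer` takes bytes only to keep the sketch short.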
models/embeddings/aligned/ang_128d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:25502450dc6c859927a71989b5700de37d1474c28657905e5d57935efab0ce21
3
  size 65664
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8cef629d215fae2a8d4f6d0757920e964a035b8e6ddc4ce58bc93bd2a48cd3ee
3
  size 65664
models/embeddings/aligned/ang_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bf60165a1c710ff1513a37905cbab9e4938cc02aa22cdf5c36d13a954072c8da
3
  size 258728193
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7386137c71165a0deab23014e63a9dab73ec6a8ba0a1f42c0ee53ddb3189b496
3
  size 258728193
models/embeddings/aligned/ang_32d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ddf7e9dc22e45732b340414482d0aa30bca9a7225905b66b922786d632cfb556
3
  size 4224
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0ccb3f646a1f22afcafedaa07234038b8eb0a81333e2639d062d8e005227b599
3
  size 4224
models/embeddings/aligned/ang_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ca061364183fd6d9b77c2b10561dad103ed397dde669eca49fcc861ac7e34566
3
  size 517288449
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43f6d782da9fc3ba7f04cc21fd7e473b4bf33cf249fa39d5f48d9b3fd8683cdd
3
  size 517288449
models/embeddings/aligned/ang_64d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a318bf6e939ba9d157a6589439149031e127db782751369e9a4fb9bca1368f69
3
  size 16512
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32cfa410833fe9bfa29b598a89a4320b14c48b98059aa50a02d59ad523514af2
3
  size 16512
models/embeddings/monolingual/ang_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:916924f4cae8ca68c9b328270e6c8f5d750e1f5f8a934a8f14b2a9ed44bfa5e9
3
  size 1034408961
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4042fcd252c39668ffd282b54a7959049dca6a2fa2494d370faeffe4781a8f96
3
  size 1034408961
models/embeddings/monolingual/ang_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bf60165a1c710ff1513a37905cbab9e4938cc02aa22cdf5c36d13a954072c8da
3
  size 258728193
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7386137c71165a0deab23014e63a9dab73ec6a8ba0a1f42c0ee53ddb3189b496
3
  size 258728193
models/embeddings/monolingual/ang_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ca061364183fd6d9b77c2b10561dad103ed397dde669eca49fcc861ac7e34566
3
  size 517288449
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43f6d782da9fc3ba7f04cc21fd7e473b4bf33cf249fa39d5f48d9b3fd8683cdd
3
  size 517288449
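Comparing the hunks above, each aligned `.bin` pointer carries the same oid and size as its monolingual counterpart, so the stored files appear to be byte-identical. The alignment presumably lives in the separate `.projection.npy` files, though that is an inference from the diff rather than something it states. The sketch below groups paths by oid to surface such duplicates; the digests are shortened to their first 16 hex characters for readability.

```python
from collections import defaultdict

# New-side oids from the pointer hunks above (first 16 hex characters only).
new_oids = {
    "models/embeddings/aligned/ang_128d.bin": "4042fcd252c39668",
    "models/embeddings/aligned/ang_32d.bin": "7386137c71165a0d",
    "models/embeddings/aligned/ang_64d.bin": "43f6d782da9fc3ba",
    "models/embeddings/monolingual/ang_128d.bin": "4042fcd252c39668",
    "models/embeddings/monolingual/ang_32d.bin": "7386137c71165a0d",
    "models/embeddings/monolingual/ang_64d.bin": "43f6d782da9fc3ba",
}

# Group file paths that share the same content hash.
by_oid = defaultdict(list)
for path, oid in new_oids.items():
    by_oid[oid].append(path)

duplicates = {oid: paths for oid, paths in by_oid.items() if len(paths) > 1}
for oid, paths in duplicates.items():
    print(oid, "->", ", ".join(sorted(paths)))
```

This would also explain why the `mono_*` and `aligned_*` rows of the README's embedding table report identical isotropy values.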
models/subword_markov/ang_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3c51cb623755a1753881542eeb56f63d0e3cca6b26b36201d37c4673561c4964
3
- size 66968
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a3d0e9296297f8dedcde600098131e69215f012f86ea97793a8e1080e8128c44
3
+ size 66921
models/subword_markov/ang_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:75c24d757e1f8ff20404bf43a0494cdc7025f3099ecf2c5fd0ded6196982f719
3
- size 360335
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:29bbca0b68aa28374aeeb514ff149d22cae2ccc2374dcc906d83f03a4bc5ba01
3
+ size 365734
models/subword_markov/ang_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:92be48c01deb0d03b413930a4303e8ea68fa19ad34847e3364801ebb65d661f3
3
- size 1378085
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:395c5ce0a7f4fcc034376cb9661ac1b61aa5fe4083f091797c2c5f7a19ac9da3
3
+ size 1376530
models/subword_markov/ang_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:63d5bae5a9064ea8d0b55a7323a00e10fd3cc817a992d548d92596656acd9a23
3
- size 3780418
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb6b7720d371619b051d6a22c2b70ac3c30be0259c80a7665e5d7cb3583fd0d2
3
+ size 3782214
models/subword_ngram/ang_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3de4ae1fde9e711415a305f0cff7d028d9d6dbb8e9784696204bb787c90dd202
3
- size 40740
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2c29aed8334b7b6162a0bf48ffc4670ac15eaee132268818982da5a677631114
3
+ size 40739
models/subword_ngram/ang_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3c21515a66cb038262024984e5d43dfebb139b67620b893fda4f9839a52c0dc7
3
- size 290609
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1afdcf65c813b813b9f7167b71a86e2ed39b8472d93eb94894396d88f272f0c9
3
+ size 290617
models/subword_ngram/ang_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6a666f8631a2c316853af5db1a673829c50f10adee5b2020606062eb57ed52f6
3
- size 1209850
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66b000da624816424df2f6e6a6553cfc1cc8c61a7abb8f07d8a679486bfaecdc
3
+ size 1209861
models/subword_ngram/ang_5gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7c3056ca0bf5781dd09261541479f028b29a9cc8d5a5a6f2bd27261e1805e569
3
- size 2555553
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d02d41d0aeffee68af91ecb738ccd5e09afea80746ffbd0c2b4587fe7124c75c
3
+ size 2557033
models/tokenizer/ang_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:040cdd2ba84e292b44535ad7e47be3d9d4dea1f384c04801879d77cd0895da82
3
  size 507562
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7acdfd4c40e61701b3c752556981df6787ce11c50ebb3526c7a9fc756f4ad69f
3
  size 507562
models/tokenizer/ang_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:11310c4f4880a83c7e0121967baf71b21fdb39d161532062e4d76550c057a832
3
  size 785162
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1b247988006b8533aac3130c934db31dad60edd556804ac0cb222338b7f1a996
3
  size 785162
models/tokenizer/ang_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d5e8fac427c909ddf6ed2f1d118ab76e763479cf812ec71e84f5a78d4c94f50d
3
  size 1395192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3e8a3d118894e1a975ad1fff32919ea29007260b0c52dade56cef8a1339ad50e
3
  size 1395192
models/tokenizer/ang_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:de84f48cb9fa027b88c36715acd20bb110a824676f2c89684ff9ac4bc44afca9
3
  size 372530
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8cda0d4bd94f33872e22279abe24e04c56a98809d2caef495cf50e1306614ba7
3
  size 372530
models/word_markov/ang_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f9404cf8e75c7dafaa37bb38566f58ecbfbcde523666ebe80d07fb25ebd8ff43
3
- size 2834503
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2d984506f93bc12a615d11f8673288f8b486297370aea808062a825b039b007b
3
+ size 2832246
models/word_markov/ang_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:dd85fc8f75ac79620c6acdaafd11303ab1e84fccbf3610c168b1fc640627a885
3
- size 5756661
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:44a93c0e23326da84f4adfb9922498a0150b7776c64ad13512e2c8cac1f0b09b
3
+ size 5756090
models/word_markov/ang_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fcce366c6306e35d8fc2a154588f0dc2d60b1cad2e9b8745210fd703040cf2ec
3
- size 7380283
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:58b09f9cf4546dfd1942066e7985ab66b49cd7d5de114ab4d137b46a255cb02d
3
+ size 7368188
models/word_markov/ang_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:baebcc70f7df9c5b12bac7c718db37f7325d96c1bbf341f4e669cc2d6ab223f8
3
- size 8221898
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e5e3d039cc0295413db5595736159f713ab902083e1161c02a9fbebec5add409
3
+ size 8219077
models/word_ngram/ang_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c152efe031001c5114e91ed2cef367af2fdd50d4da5544809c5ae8cbcc106504
3
- size 108294
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:53842c7854c682bf6b33d37af0123022da18d7f0a3209576dbc3578b43182725
3
+ size 107141
models/word_ngram/ang_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:01ccbf2a73755bb852ec26c495f01344430dc76928525d7aa4c2ceb4cf736bdd
3
- size 110659
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b5a29ef33847dff6003ed529dd6a0eeb6991cb931c8c1d52f463c3daeca18c95
3
+ size 110953
models/word_ngram/ang_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bcc923ecbf09b8736c1afc198a7cd786085e0b469ccc980707742b62183aa38a
3
- size 221788
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f53e17ffa9d4a1f899d7ed060df75a23e530911203da730f48b1050f37a35fc8
3
+ size 222381
models/word_ngram/ang_5gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:506caac829f6d1545248ae1d0df0166b41dbef522f6beb4b192ce2672c70d194
3
- size 165316
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c9fb9acd2ed808ea08685fbe13baaef8c3a0ebbd981d5dd368178cbd489e818e
3
+ size 165202
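The embedding and tokenizer pointers above keep their size while the oid changes, whereas the parquet files shift by a small number of bytes. A short sketch for quantifying those shifts, using a few (old, new) size pairs copied from the hunks above:

```python
# (old_size, new_size) in bytes, copied from the pointer hunks above.
sizes = {
    "subword_markov/ang_markov_ctx1_subword.parquet": (66968, 66921),
    "subword_markov/ang_markov_ctx2_subword.parquet": (360335, 365734),
    "word_markov/ang_markov_ctx3_word.parquet": (7380283, 7368188),
    "word_ngram/ang_2gram_word.parquet": (108294, 107141),
}

# Absolute and relative change per file.
deltas = {name: new - old for name, (old, new) in sizes.items()}
for name, (old, new) in sizes.items():
    delta = deltas[name]
    print(f"{name}: {delta:+d} bytes ({delta / old:+.2%})")
```

In this subset, the largest relative change is about 1.5% (the context-2 subword Markov table).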
visualizations/embedding_alignment_quality.png CHANGED
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: 7d986283c7bc60b30d45fcb719cfcb283dfd6affc58ee63f0089bce8bc3eefaf
  • Pointer size: 131 Bytes
  • Size of remote file: 147 kB

Git LFS Details

  • SHA256: 8aab09020e0ab7535fe218df86de541a9ad50788adffc49722971de3fd806a0f
  • Pointer size: 131 Bytes
  • Size of remote file: 144 kB
visualizations/embedding_tsne_multilingual.png CHANGED

Git LFS Details

  • SHA256: 30f37e07936a11fecbcd4c90f93e725f3ea3b23bf2d1b23eb4bb6bf9e707feb2
  • Pointer size: 131 Bytes
  • Size of remote file: 226 kB

Git LFS Details

  • SHA256: 6315ae531e8210b1f3ca0301d3be9843a0c172ad498d24e7e707a0d4ed74d9f0
  • Pointer size: 131 Bytes
  • Size of remote file: 232 kB
visualizations/ngram_perplexity.png CHANGED
visualizations/performance_dashboard.png CHANGED

Git LFS Details

  • SHA256: 6cec30eb22d1713e3aed5a990c30edae616d051153e8896ab72aefeac4628ba7
  • Pointer size: 131 Bytes
  • Size of remote file: 386 kB

Git LFS Details

  • SHA256: 65d792863347cdd5fe8be4e4f46e145c85056aa7b6031620d90f3c716bc6bfb6
  • Pointer size: 131 Bytes
  • Size of remote file: 377 kB
visualizations/position_encoding_comparison.png CHANGED

Git LFS Details

  • SHA256: e09d834aee3242de0e952272ef58e16d6986fd1e2aa085b7bb9fd7a5d7b16cf9
  • Pointer size: 131 Bytes
  • Size of remote file: 109 kB

Git LFS Details

  • SHA256: b31a5adf642259b9956155636126a21de081cb643377de30dbcde8f4927ae330
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
visualizations/tsne_sentences.png CHANGED

Git LFS Details

  • SHA256: 3c3c7ded9dd7b6f18fbe00c84f45f22c6dc875add60422483126ea88dcf61513
  • Pointer size: 131 Bytes
  • Size of remote file: 278 kB

Git LFS Details

  • SHA256: d007665d6911284fee9ed5628099c7faab8337e2aeefa2e56de3cd6d29bb24fb
  • Pointer size: 131 Bytes
  • Size of remote file: 277 kB
visualizations/tsne_words.png CHANGED

Git LFS Details

  • SHA256: e534e626c9c5678a842af2bef16514154ebe9e0a08a3a4a3ce060ad9552608cb
  • Pointer size: 131 Bytes
  • Size of remote file: 682 kB

Git LFS Details

  • SHA256: cd3ed773c1130793ffb6ad9975229c8f4db1ff6cbc940f27465e0775320a58d2
  • Pointer size: 131 Bytes
  • Size of remote file: 662 kB