CycleCore-Technologies committed · Commit 168bc68 · verified · 1 parent: 2a755c3

Upload Maaza-MLM-135M-JSON-v1 - v1.0.0 production release

README.md ADDED
@@ -0,0 +1,302 @@
# CycleCore Maaza MLM-135M-JSON v1.0.0

Micro Language Model (135M parameters) specialized for JSON extraction on edge devices.

## Model Details

- **Developer**: CycleCore Technologies
- **Model Name**: CycleCore Maaza MLM-135M-JSON
- **Version**: v1.0.0
- **Base Model**: SmolLM2-135M (HuggingFaceTB)
- **Training Method**: LoRA fine-tuning (r=16, alpha=32)
- **Task**: Structured JSON extraction
- **License**: Apache 2.0
- **Parameters**: 135M total, 4.88M trainable (3.5%)
- **Model Size**: ~270MB (FP16), ~70MB (Q4 quantized); see the merge sketch after this list
- **Context Length**: 2048 tokens

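The repository ships only the ~19MB LoRA adapter; the FP16 and Q4 sizes above refer to the full merged model. Below is a minimal sketch of producing a standalone merged checkpoint with PEFT's `merge_and_unload()`; the output directory name is an illustrative assumption, not shipped tooling.

```python
# Sketch: fold the LoRA adapter into the base model and save a standalone
# FP16 checkpoint (~270MB). Paths are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M", torch_dtype=torch.float16
)
merged = PeftModel.from_pretrained(base, "CycleCore/Maaza-MLM-135M-JSON-v1")
merged = merged.merge_and_unload()  # returns a plain transformers model

merged.save_pretrained("maaza-mlm-135m-json-merged")
AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M").save_pretrained(
    "maaza-mlm-135m-json-merged"
)
```

A Q4 build (~70MB) would typically be produced from this merged checkpoint with an external quantization tool; that step is not shown here.
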
## Intended Use

### Primary Use Cases
- IoT sensor data extraction and structuring
- API response parsing and validation
- Form field extraction from documents
- Database record structuring from natural language
- Log file parsing and structuring

### Target Hardware
- **Edge Devices**: Raspberry Pi 5, embedded systems
- **Laptop CPU**: x86/ARM, 16GB RAM, CPU-only
- **Browser**: WebGPU (via ONNX Runtime); see the export sketch after this list
- **Server**: Optional GPU acceleration

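For the browser/ONNX Runtime target, one possible export path runs a merged checkpoint (as in the earlier sketch) through Hugging Face Optimum. This is a sketch under that assumption; directory names are illustrative, and wiring up WebGPU in `onnxruntime-web` is not shown.

```python
# Sketch: convert the merged checkpoint to ONNX for ONNX Runtime
# (including onnxruntime-web). Requires: pip install optimum[onnxruntime]
from optimum.onnxruntime import ORTModelForCausalLM

ort_model = ORTModelForCausalLM.from_pretrained(
    "maaza-mlm-135m-json-merged",  # merged model from the previous sketch
    export=True,                   # export from the PyTorch weights on the fly
)
ort_model.save_pretrained("maaza-mlm-135m-json-onnx")
```
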
### Out of Scope
- Open-ended conversation or creative writing
- Complex reasoning or multi-hop logic
- Math problem solving
- General-purpose chat applications

## Benchmark Performance

### EdgeJSON v3 Benchmark

Evaluated on 158 test cases across 24 schema types:

| Metric | Score |
|--------|-------|
| **JSONExact** | 24.7% |
| **Field F1** | 0.520 |
| **Schema Compliance** | 41.1% |
| **Throughput (CPU)** | 18.5 tokens/sec |
| **Training Time** | 48.7 seconds |

### By Complexity Level

| Complexity | Fields | Nesting | JSONExact | Field F1 |
|------------|--------|---------|-----------|----------|
| Simple | 2-4 | Flat | 44.7% | 0.698 |
| Medium | 4-8 | 1-2 levels | 13.5% | 0.456 |
| Complex | 8+ | 2+ levels | 0.0% | 0.234 |

### Perfect Schemas (100% JSONExact)

- `product_info` (2 fields, simple)
- `sensor_reading` (4 fields, simple)

### Training Improvement

- **Base SmolLM2-135M**: 1.9% JSONExact
- **Fine-tuned (this model)**: 24.7% JSONExact
- **Training Multiplier**: 13.0× improvement

## Training Data

### Dataset: EdgeJSON v3
- **Total Examples**: 787 (100% validated)
- **Train Split**: 629 examples (80%)
- **Test Split**: 158 examples (20%)
- **Validation Rate**: 100% (all examples pass schema validation)
- **Schema Count**: 24 unique schemas
- **Complexity Distribution (test split)**: 38 simple, 74 medium, 46 complex

### Data Generation
- **Teacher Model**: Qwen2.5-7B-Instruct
- **Method**: Synthetic generation with validation
- **Quality Control**: 100% schema compliance, manual review sampling

### Prompt Template

Prompts follow this fixed template; a formatting-and-validation sketch follows it.

```
Extract the structured JSON data from the following text.

Input: {prompt}

Output:
```

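A minimal sketch of how an example could be rendered with the template above and gated on schema validity, in the spirit of the "100% schema compliance" quality control; the `jsonschema` package, the example record, and the schema are assumptions for illustration.

```python
# Sketch: format one example with the prompt template and validate its target
# JSON against a schema before keeping it. Example data is illustrative.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

TEMPLATE = (
    "Extract the structured JSON data from the following text.\n\n"
    "Input: {prompt}\n\n"
    "Output:"
)

example = {
    "input": "Temperature reading of 22.5 C from sensor A7 at 09:14.",
    "target": {"sensor_id": "A7", "value": 22.5, "unit": "C", "time": "09:14"},
}
schema = {
    "type": "object",
    "required": ["sensor_id", "value", "unit", "time"],
    "properties": {
        "sensor_id": {"type": "string"},
        "value": {"type": "number"},
        "unit": {"type": "string"},
        "time": {"type": "string"},
    },
}

try:
    validate(example["target"], schema)  # drop anything that fails the schema
    text = TEMPLATE.format(prompt=example["input"]) + " " + json.dumps(example["target"])
    print(text)
except ValidationError as err:
    print(f"Dropped example: {err.message}")
```
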
## Training Procedure

### Hardware
- **GPU**: NVIDIA RTX 4080 SUPER (16GB)
- **Training Time**: 48.7 seconds
- **Effective Batch Size**: 32 (4 per device × 8 gradient accumulation steps)

### Hyperparameters
- **Method**: LoRA (Low-Rank Adaptation); see the configuration sketch after this list
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.1
- **Target Modules**: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj
- **Learning Rate**: 2e-4
- **Optimizer**: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- **Weight Decay**: 0.01
- **LR Scheduler**: Cosine with 10% warmup
- **Epochs**: 3
- **Precision**: BF16 mixed precision
- **Max Grad Norm**: 1.0

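The hyperparameters above map roughly onto the following PEFT and `transformers` configuration. This is a sketch of an equivalent setup, not the exact training script used for this release; the output directory is illustrative.

```python
# Sketch: LoRA and Trainer settings matching the hyperparameter list above.
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = TrainingArguments(
    output_dir="maaza-mlm-135m-json",   # illustrative
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # effective batch size 32
    learning_rate=2e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    max_grad_norm=1.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```

Attaching the adapter before training would then be roughly `model = get_peft_model(base_model, lora_config)`.
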
### Training Loss
- **Final Training Loss**: 1.449

## Evaluation Methodology

### Metrics

**JSONExact Score**:
- Binary exact match (0 or 1 per example)
- Compares predicted JSON to ground truth
- Requires perfect field matching

**Field F1**:
- Per-field precision and recall
- Averaged across all fields
- Partial credit for correct fields (see the sketch below)

**Schema Compliance**:
- Validates against the JSON Schema specification
- Checks required fields, types, structure

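A minimal sketch of how JSONExact and a per-example Field F1 could be computed for flat objects; the released benchmark harness may handle nesting and normalization differently.

```python
# Sketch: JSONExact and Field F1 for flat JSON objects.
def json_exact(pred: dict, gold: dict) -> int:
    """1 only when every field matches exactly, else 0."""
    return int(pred == gold)

def field_f1(pred: dict, gold: dict) -> float:
    """Partial credit: a field counts when both key and value match."""
    correct = sum(1 for k, v in gold.items() if k in pred and pred[k] == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(json_exact({"a": 1}, {"a": 1}))                # 1
print(field_f1({"a": 1, "b": 3}, {"a": 1, "b": 2}))  # 0.5
```
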
### Inference Settings
- **Temperature**: 0.0 (deterministic)
- **Max Tokens**: 512
- **Format**: JSON mode enforced
- **Platform**: CUDA (GPU) or CPU

## Limitations and Bias

### Known Limitations

**Capacity Ceiling**: This model hits a capacity ceiling on complex schemas (8+ fields, 2+ nesting levels), achieving 0% exact-match accuracy there. For complex structured extraction, consider the larger Maaza SLM-360M model.

**Simple Schema Specialization**: Best suited for simple schemas (2-4 fields, flat structure), where it achieves 44.7% exact-match accuracy.

**Synthetic Data**: Trained exclusively on synthetically generated data from Qwen2.5-7B, which may not capture all real-world edge cases.

**Domain Specificity**: Optimized for structured data extraction, not general-purpose language understanding.

### Potential Biases
- Inherits biases from the teacher model (Qwen2.5-7B)
- Synthetic data may not reflect real-world data distributions
- Performance varies significantly by schema complexity

### Ethical Considerations
- **Privacy**: On-device deployment avoids cloud API calls, keeping data local
- **Energy**: Ultra-fast training (48.7s) and efficient inference reduce the carbon footprint
- **Transparency**: 100% open training methodology, reproducible results

## How to Use

### Installation

```bash
pip install transformers peft torch accelerate
```

### Loading the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "CycleCore/Maaza-MLM-135M-JSON-v1"
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
```

### Inference Example

```python
prompt = """Extract the structured JSON data from the following text.

Input: John Doe works at Acme Corp. His email is [email protected] and phone is 555-1234.

Output:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False  # greedy decoding (equivalent to temperature 0)
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

### Expected Output

```json
{
  "name": "John Doe",
  "company": "Acme Corp",
  "email": "[email protected]",
  "phone": "555-1234"
}
```

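Because `generate` returns the prompt tokens followed by the completion, the JSON has to be sliced out of the decoded text before use. The following is a minimal post-processing sketch; the brace-matching heuristic is an assumption, not part of the released tooling.

```python
# Sketch: pull the first JSON object after "Output:" out of the generated text.
import json

def extract_json(generated: str) -> dict | None:
    """Return the first balanced {...} block after 'Output:' as a dict, or None."""
    tail = generated.split("Output:", 1)[-1]
    start = tail.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(tail[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(tail[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None

record = extract_json(result)  # `result` from the inference example above
print(record)
```
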
## Model Comparison

For guidance on choosing between MLM-135M and SLM-360M, see our [Model Comparison Guide](https://github.com/CycleCore/SLMBench/blob/main/docs/MODEL_COMPARISON.md).

**Quick Decision**:
- **Use MLM-135M** if: ultra-low latency is required, schemas are simple (2-4 fields), and deployment size must stay under 500MB
- **Use SLM-360M** if: higher accuracy is needed on medium/complex schemas and a ~1GB deployment size is acceptable

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{cyclecore2025mlm,
  title={CycleCore Maaza MLM-135M-JSON: Micro Language Model for Edge JSON Extraction},
  author={CycleCore Technologies},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/CycleCore/Maaza-MLM-135M-JSON-v1}},
}
```

**Academic Paper** (forthcoming):

```bibtex
@article{cyclecore2025slmbench,
  title={Micro Language Models (MLMs) and SLM-Bench: A Benchmark Suite for Structured Tasks on Resource-Constrained Devices},
  author={CycleCore Technologies},
  journal={arXiv preprint},
  year={2025},
  note={Paper in preparation}
}
```

## Links

- **Model Repository**: https://huggingface.co/CycleCore/Maaza-MLM-135M-JSON-v1
- **Base Model**: https://huggingface.co/HuggingFaceTB/SmolLM2-135M
- **SLMBench Benchmark**: https://github.com/CycleCore/SLMBench
- **Documentation**: https://github.com/CycleCore/SLMBench/tree/main/docs
- **Paper**: Coming soon (arXiv)
- **Website**: slmbench.com (coming soon)

## Version History

### v1.0.0 (2025-11-20)
- Initial release
- Trained on the EdgeJSON v3 dataset (100% validated)
- 24.7% JSONExact, 0.520 Field F1
- LoRA fine-tuning (r=16, alpha=32)
- 48.7-second training time
- Apache 2.0 license

## Contact

For questions, issues, or collaboration:
- **GitHub Issues**: https://github.com/CycleCore/SLMBench/issues
- **Email**: [email protected] (coming soon)

## License

Apache License 2.0

Copyright 2025 CycleCore Technologies

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
adapter_config.json ADDED
@@ -0,0 +1,46 @@
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "HuggingFaceTB/SmolLM2-135M",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.1,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.0",
  "qalora_group_size": 16,
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "k_proj",
    "up_proj",
    "down_proj",
    "o_proj",
    "q_proj",
    "gate_proj",
    "v_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1e53b30c708a136ac086f3ccf6026424d9cfb183367e0933bdad77806b65b14d
size 19593064
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,43 @@
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>",
    "<repo_name>",
    "<reponame>",
    "<file_sep>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<jupyter_script>",
    "<empty_output>"
  ],
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|endoftext|>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,169 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<repo_name>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<file_sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<jupyter_script>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>",
    "<repo_name>",
    "<reponame>",
    "<file_sep>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<jupyter_script>",
    "<empty_output>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "extra_special_tokens": {},
  "model_max_length": 8192,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
training_metadata.json ADDED
@@ -0,0 +1,29 @@
{
  "model_name": "CycleCore-Maaza-SLM-135M-JSON",
  "base_model": "HuggingFaceTB/SmolLM2-135M",
  "training_date": "2025-11-20 10:50:49",
  "num_epochs": 3,
  "learning_rate": 0.0002,
  "batch_size": 32,
  "train_examples": 629,
  "validation_examples": 0,
  "test_examples": 158,
  "lora_config": {
    "enabled": true,
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "target_modules": [
      "q_proj",
      "v_proj",
      "k_proj",
      "o_proj",
      "gate_proj",
      "up_proj",
      "down_proj"
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM"
  },
  "validation_run": false
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff