mispeech/midashenglm-7b-1021-w4a16-gptq

Audio-Text-to-Text

audio-language-model

compressed-tensors


GrantL10 committed 18 days ago

Commit c739209 · verified · 1 Parent(s): 2631b47

Upload README.md

Files changed (1)

README.md +98 -0

README.md ADDED

	@@ -0,0 +1,98 @@

+---
+license: apache-2.0
+language:
+- en
+- zh
+- th
+- id
+- vi
+pipeline_tag: audio-text-to-text
+tags:
+- multimodal
+- audio-language-model
+- audio
+base_model:
+- mispeech/dasheng-0.6B
+- Qwen/Qwen2.5-Omni-7B
+base_model_relation: finetune
+---
+# MiDashengLM-7B-1021 (4-bit, GPTQ-quantized)
+The 4-bit (w4a16: 4-bit weights, 16-bit activations) weights for [mispeech/midashenglm-7b-1021-fp32](https://huggingface.co/mispeech/midashenglm-7b-1021-fp32), quantized with GPTQ.
+This variant suits resource-constrained environments. It offers broad GPU compatibility and a smaller memory footprint, which makes it a good fit for deployments where VRAM, system memory, or storage is limited, provided that a slight loss in quality is acceptable.
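To put the smaller-footprint claim in rough numbers, here is a back-of-the-envelope sketch. It assumes a round 7 billion parameters (suggested only by the "7B" in the model name) and that every weight is stored at the stated bit width. It ignores quantization scales and zero points, any layers left unquantized, activations, and the KV cache, so real figures will differ. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: round figure suggested by "7B" in the model name.
NUM_PARAMS = 7e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("w4 (GPTQ)", 4)]:
    print(f"{label:>10}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the space of 16-bit weights and an eighth of the space of fp32 weights.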
+## Usage
+### Load Model
+```python
+from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+
+model_id = "mispeech/midashenglm-7b-1021-w4a16-gptq"
+# trust_remote_code=True runs the custom model code shipped in the repository
+model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+```
+### Construct Prompt
+```python
+user_prompt = "Caption the audio."  # You may try any other prompt
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {"type": "text", "text": "You are a helpful language and speech assistant."}
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": user_prompt},
+            {
+                "type": "audio",
+                "path": "/path/to/example.wav",
+                # or "url": "https://example.com/example.wav"
+                # or "audio": np.random.randn(16000)
+            },
+        ],
+    },
+]
+```
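If you build many prompts, the message structure above can be wrapped in a small helper. The sketch below is not part of the model's API: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, and the rule that exactly one of `path`, `url`, or `audio` is given is an assumption based on the README listing them as alternatives.

```python
from typing import Any, Optional

SYSTEM_PROMPT = "You are a helpful language and speech assistant."


def build_messages(
    user_prompt: str,
    *,
    path: Optional[str] = None,
    url: Optional[str] = None,
    audio: Any = None,
) -> list:
    """Build the chat message list shown above for a single audio input."""
    sources = {"path": path, "url": url, "audio": audio}
    given = {key: value for key, value in sources.items() if value is not None}
    # Assumption: the three source keys are mutually exclusive alternatives.
    if len(given) != 1:
        raise ValueError("Provide exactly one of: path, url, audio")
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_PROMPT}],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "audio", **given},
            ],
        },
    ]


messages = build_messages("Caption the audio.", path="/path/to/example.wav")
```

The resulting `messages` list has the same shape as the hand-written one, so it can be passed to `processor.apply_chat_template` in the same way.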
+### Generate Output
+```python
+import torch
+with torch.no_grad():
+    model_inputs = processor.apply_chat_template(
+        messages,
+        tokenize=True,
+        add_generation_prompt=True,
+        add_special_tokens=True,
+        return_dict=True,
+    ).to(device=model.device, dtype=model.dtype)
+    generation = model.generate(**model_inputs)
+    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
+```
+## Citation
+MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.
+If you find MiDashengLM useful in your research, please consider citing our work:
+```bibtex
+@techreport{midashenglm7b,
+  title      = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
+  author     = {{Horizon Team, MiLM Plus}},
+  institution= {Xiaomi Inc.},
+  year       = {2025},
+  note       = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
+  url        = {https://arxiv.org/abs/2508.03983},
+  eprint     = {2508.03983},
+}
+```