---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
# MiDashengLM-7B-1021 (4-bit, GPTQ-quantized)

The 4-bit (w4a16) weights for [mispeech/midashenglm-7b-1021-fp32](https://huggingface.co/mispeech/midashenglm-7b-1021-fp32), quantized with GPTQ.

This variant is a good fit for resource-constrained environments: it offers broad GPU compatibility and a much smaller memory footprint, making it suitable for deployment where VRAM, system memory, or storage is limited, provided a slight trade-off in quality is acceptable.
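
For a rough sense of the savings, here is a back-of-the-envelope estimate of weight memory at different precisions. These are approximations, not measurements of this checkpoint:

```python
# Back-of-the-envelope estimate of weight memory for a ~7B-parameter model.
# Rough figures only: they ignore activations, the KV cache, and the small
# overhead of GPTQ scales and zero-points.
params = 7e9

fp32_gb = params * 4.0 / 1e9   # 32-bit floats: ~28 GB
bf16_gb = params * 2.0 / 1e9   # 16-bit floats: ~14 GB
w4_gb = params * 0.5 / 1e9     # 4-bit weights: ~3.5 GB

print(f"fp32 ~{fp32_gb:.0f} GB, bf16 ~{bf16_gb:.0f} GB, w4a16 ~{w4_gb:.1f} GB")
```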

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-w4a16-gptq"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
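
The snippet above uses the default device placement. As a sketch of a common variant (assuming the `accelerate` package is installed; GPTQ checkpoints in Transformers also typically require a GPTQ backend such as `optimum` with `gptqmodel`), you can let Transformers map the weights onto available GPUs automatically:

```python
# Optional: place weights automatically on the available GPU(s).
# Assumes `accelerate` is installed; `device_map="auto"` moves or shards
# the model without manual .to(...) calls.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)
```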

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
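
If you prefer to pass an in-memory waveform rather than a path, a minimal sketch (assuming the `soundfile` package and a 16 kHz mono recording, since resampling behavior is not documented here) looks like this:

```python
import soundfile as sf

# Read the file into a float numpy array; `sr` is the sample rate.
waveform, sr = sf.read("/path/to/example.wav")

# Replace the path-based entry with the raw audio array.
messages[1]["content"][1] = {"type": "audio", "audio": waveform}
```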

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
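
`generate` runs with the model's default generation settings. As an illustrative tweak (the values below are assumptions, not tuned for this model), you can bound the response length explicitly:

```python
# Illustrative only: cap the output length and disable sampling
# for deterministic captions.
generation = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=False,
)
```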

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```