---
library_name: gguf
base_model: Qwen/Qwen3-8B
quantized_by: Tohirju
model_name: Ameena_Qwen3-8B_e3_Quantised_gguf
model_author: Tohirju
model_type: qwen3
quantization_method: Q4_K_M
tags:
- quantized
- gguf
- qwen3
- 8b
- q4_k_m
license: apache-2.0
---

# Ameena Qwen3-8B e3 Quantized GGUF

This is a Q4_K_M quantized version of a fine-tuned Qwen3-8B model, packaged in GGUF format for efficient inference with llama.cpp-compatible runtimes.

## Model Details

- **Base Model**: Qwen/Qwen3-8B
- **Quantization**: Q4_K_M (4-bit K-quant, mixed precision)
- **Original Size**: ~15.26 GB (FP16)
- **Quantized Size**: ~4.68 GB
- **Compression Ratio**: ~3.3x
- **Format**: GGUF (llama.cpp model file format)

## Usage

### With llama-cpp-python

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="Ameena_Qwen3-8B_e3.gguf",
    n_gpu_layers=-1,  # Offload all layers to the GPU (set to 0 for CPU-only)
    n_ctx=4096,       # Context window
    verbose=False
)

# Generate text
response = llm(
    "Your prompt here",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)
print(response["choices"][0]["text"])
```

### Downloading from the Hugging Face Hub

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the GGUF file
model_path = hf_hub_download(
    repo_id="Tohirju/Ameena_Qwen3-8B_e3_Quantised_gguf",
    filename="Ameena_Qwen3-8B_e3.gguf"
)

# Load the downloaded file as shown above
llm = Llama(model_path=model_path, n_gpu_layers=-1, n_ctx=4096)
```

A chat-style example that uses the model's chat template is sketched at the end of this card.

## Quantization Details

- **Method**: Q4_K_M, a mixed-precision 4-bit K-quant scheme
- **Quality**: Good balance between model size and output quality
- **Speed**: Suitable for fast inference on both CPU and GPU
- **Memory**: Significantly reduced RAM/VRAM requirements compared to FP16

## Performance

- **Loading**: Faster model loading thanks to the ~3.3x smaller file
- **Memory Usage**: ~69% lower memory footprint than the FP16 weights
- **Quality**: Minimal quality loss compared to the FP16 version

## Hardware Requirements

- **CPU**: Any modern CPU (AVX2-capable x86_64 recommended for CPU-only inference)
- **GPU**: CUDA-compatible GPU recommended (RTX 3060 or better)
- **RAM**: 8 GB minimum, 16 GB recommended
- **Storage**: ~5 GB for the model file

## License

This model is released under the Apache 2.0 license, the same license as the base Qwen3-8B model.
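
## Chat Usage Sketch (llama-cpp-python)

The usage examples above use plain text completion. For instruction-following use, the minimal sketch below runs the model through `create_chat_completion`, which applies the chat template stored in the GGUF metadata. This is an illustrative sketch rather than part of the original release: it assumes the quantized file retains the Qwen3 chat template, and the system/user messages are placeholders to replace with your own.

```python
from llama_cpp import Llama

# Load the quantized model; the chat template embedded in the GGUF metadata
# is picked up automatically by llama-cpp-python.
llm = Llama(
    model_path="Ameena_Qwen3-8B_e3.gguf",  # or the path returned by hf_hub_download
    n_gpu_layers=-1,   # full GPU offload; set to 0 for CPU-only
    n_ctx=4096,
    verbose=False,
)

# OpenAI-style chat request (placeholder messages)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Q4_K_M quantization does."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
)

print(response["choices"][0]["message"]["content"])
```

Note that Qwen3 models may emit a `<think>...</think>` reasoning block before the final answer when the chat template enables thinking mode; strip it in post-processing if you only want the answer text.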