We're thrilled to announce that the Qwen3-VL family of vision-language models is now available on Azure AI Foundry, thanks to our collaboration with Microsoft.
This brings open-source innovation to enterprise-grade AI infrastructure, making it easier than ever for enterprises to deploy and scale the latest and greatest models from Hugging Face securely within Azure.
🚀 Highlights:
- Deploy Qwen3-VL instantly via managed endpoints
- Built-in governance, telemetry, and lifecycle management
- True multimodal reasoning: vision, language, and code understanding
- State-of-the-art performance, outperforming closed-source models like Gemini 2.5 Pro and GPT-5
- Available in both *Instruct* and *Thinking* modes, across 24 model sizes
👉 Get started today: search for Qwen3-VL in the Hugging Face Collection on Azure AI Foundry.
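Once a managed endpoint is up, calling it typically looks like any OpenAI-compatible chat completion with an image part in the message. Here is a minimal sketch of how such a request body could be assembled; the endpoint URL, model name, and exact payload schema are placeholders and assumptions, so check your deployment's own docs before use.

```python
import json

# Hypothetical endpoint URL -- replace with your deployment's URL and key.
ENDPOINT = "https://<your-endpoint>.inference.ai.azure.com/v1/chat/completions"

# Assumed OpenAI-style multimodal payload: one image part plus one text part.
payload = {
    "model": "Qwen3-VL",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text",
                 "text": "Describe what this chart shows."},
            ],
        }
    ],
    "max_tokens": 256,
}

# Serialize for an HTTP POST (e.g. with `requests`, adding your API key header).
body = json.dumps(payload)
print(body[:60])
```

From here, a single authenticated POST of `body` to the endpoint returns the completion; the managed endpoint handles scaling and telemetry behind the scenes.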
🚀 Optimum libraries keep growing, and Optimum v2 is just around the corner!
I recently added ONNX export support for a bunch of new models in the optimum-onnx library, including: DeepSeek-V3, Cohere, Nemotron, Arcee, StableLM… and more!
⚡ With ONNX export, you can run your favorite models faster and more efficiently across different hardware backends, making deployment and experimentation much smoother.
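For reference, exporting one of these architectures is usually a one-liner with the `optimum-cli export onnx` command. The sketch below just assembles that command; the model id and output directory are illustrative placeholders, not a specific recommendation.

```python
import shlex

# Illustrative model id (any newly supported architecture works the same way).
model_id = "stabilityai/stablelm-2-1_6b"
output_dir = "stablelm_onnx/"

# Assemble the optimum-cli ONNX export invocation.
cmd = ["optimum-cli", "export", "onnx", "--model", model_id, output_dir]
print(shlex.join(cmd))
```

Running the printed command (with optimum-onnx installed) writes the exported ONNX graph plus config files into the output directory, ready for ONNX Runtime or other backends.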
💡 Have a model you'd love to see supported? Contributions are super welcome: let's make Optimum even better together!
Is there a "one-size-fits-all" recipe for quantizing Large Language Models? 🤔
As part of my ongoing work on mixed-precision quantization, I've been exploring this question by measuring layer-by-layer sensitivity. The goal is to see if we can find universal rules for which layers can be quantized aggressively without impacting performance. The results are fascinating and reveal two key insights:
1️⃣ Sensitivity profiles are like architectural "fingerprints." Models from the same family share strikingly similar sensitivity patterns. As you can see in the charts below for the Gemma and SmolLM families, the ranking and relative sensitivity of the layers remain remarkably consistent. This suggests that the underlying architecture is a primary driver of a model's quantization behavior.
2️⃣ A "universal" mixed-precision quantization strategy is challenging. While models within a family are similar, these "fingerprints" change dramatically when comparing different architectures like LLaMA, Qwen, and StableLM. This highlights the difficulty of creating a generalized mixed-precision configuration that works optimally across all model families.
However, there is one near-universal truth we uncovered: the mlp.down_proj layer consistently emerges as one of the most sensitive components across all models studied. This finding strongly resonates with the work in "The Super Weight in Large Language Models" (by Mengxia Yu et al.). The paper identifies that functionally critical parameters, or "super weights," are concentrated in these down_proj layers. Our empirical results provide clear validation for this theory, showing these layers are highly intolerant to precision loss.
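To make the idea of a sensitivity scan concrete, here is a toy sketch: fake-quantize each layer's weights to a given bit width and use the reconstruction error as a (very rough) proxy for sensitivity. The layer names and synthetic weights are illustrative assumptions; real scans compare end-to-end model quality, not just weight error.

```python
import random

def fake_quantize(weights, bits):
    """Uniformly quantize to 2**bits levels over the weight range."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0
    return [round((w - lo) / scale) * scale + lo for w in weights]

def sensitivity(weights, bits=4):
    """Mean squared reconstruction error after fake quantization."""
    q = fake_quantize(weights, bits)
    return sum((w - x) ** 2 for w, x in zip(weights, q)) / len(weights)

random.seed(0)
layers = {
    "self_attn.q_proj": [random.gauss(0, 0.02) for _ in range(1000)],
    # down_proj often carries rare large-magnitude "super weights";
    # a single outlier stretches the quantization range for the whole layer.
    "mlp.down_proj": [random.gauss(0, 0.02) for _ in range(1000)] + [2.5],
}
for name in sorted(layers, key=lambda n: -sensitivity(layers[n])):
    print(f"{name}: {sensitivity(layers[name]):.2e}")
```

Even in this toy setup, the outlier-carrying `mlp.down_proj` shows a much larger quantization error than the attention projection, which mirrors why such layers tolerate precision loss so poorly.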
In short, every architecture has a unique sensitivity profile, a fingerprint shaped not only by its core design but also by its training data and optimization approach, yet some components remain universally critical! What are your thoughts?
We now have the newest OpenAI models available on the Dell Enterprise Hub!
We built the Dell Enterprise Hub to give our on-prem customers access to the latest and greatest models from the Hugging Face community. We're happy to provide secure access to this amazing contribution from OpenAI on the day of its launch!
You can now find it in the Hugging Face Collection in Azure ML or Azure AI Foundry, along with 10k other Hugging Face models 🤗🤗 Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
🚀 New in Azure Model Catalog: NVIDIA Parakeet TDT 0.6B V2
We're excited to welcome Parakeet TDT 0.6B V2, a state-of-the-art English speech-to-text model, to the Azure AI Foundry Model Catalog.
What is it?
A powerful ASR model built on the FastConformer-TDT architecture, offering:
- 🕒 Word-level timestamps
- ✍️ Automatic punctuation & capitalization
- 💪 Strong performance across noisy and real-world audio
It runs with NeMo, NVIDIA's toolkit for building and deploying speech and language models.
Want to give it a try? 🎧 You can test it with your own audio (up to 3 hours) on Hugging Face Spaces before deploying. If it fits your needs, deploy easily from the Hugging Face Hub or Azure ML Studio with secure, scalable infrastructure!
📘 Learn more by following this guide written by @alvarobartt
In case you missed it, Hugging Face expanded its collaboration with Azure a few weeks ago with a curated catalog of 10,000 models, accessible from Azure AI Foundry and Azure ML!
@alvarobartt has been cooking these last few days to prepare the one and only documentation you need to deploy Hugging Face models on Azure. It comes with an FAQ, great guides, and examples of how to deploy VLMs, LLMs, smolagents, and more to come very soon.
We need your feedback: come help us and let us know what else you want to see, which models we should add to the collection, which model tasks we should prioritize, and what else we should build tutorials for. You're just an issue away on our GitHub repo!
AMD summer hackathons are here! A chance to get hands-on with MI300X GPUs and accelerate models.
🇫🇷 Paris - Station F - July 5-6
🇮🇳 Mumbai - July 12-13
🇮🇳 Bengaluru - July 19-20
Hugging Face and GPU Mode will be on site, and on July 6 in Paris @ror will share lessons learned while building new kernels to accelerate Llama 3.1 405B on ROCm.
Hugging Face just wrapped 4 months of deep work with AMD to push kernel-level optimization on their MI300X GPUs. Now, it's time to share everything we learned.
Join us in Paris at STATION F for a hands-on weekend of workshops and a hackathon focused on making open-source LLMs faster and more efficient on AMD.
Prizes, amazing host speakers, and more... for full details, head to https://lu.ma/fmvdjmur!