Machine Learning At Your Service
Easily deploy your AI models to production on our fully managed platform. Instead of spending weeks configuring infrastructure, focus on building your AI application.
No Hugging Face account? Sign up!

One-click deployment
Import your favorite model from Hugging Face or browse our catalog of hand-picked, ready-to-deploy models! A programmatic deployment sketch follows the list below.
granite-3.3-8b-instruct-FP8
gpt-oss-safeguard-20b
olmOCR-2-7B-1025-FP8
Qwen3-VL-30B-A3B-Thinking
Qwen3-VL-8B-Instruct
LightOnOCR-1B-1025
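Prefer code to clicks? Endpoints can also be created programmatically with the huggingface_hub Python client. A minimal sketch; the endpoint name, model repository, cloud vendor, region, and instance choices below are illustrative assumptions, not recommendations.

```python
from huggingface_hub import create_inference_endpoint

# Create an endpoint from a Hub repository. The name, repository,
# vendor, region, and instance settings here are assumptions.
endpoint = create_inference_endpoint(
    "my-first-endpoint",                               # hypothetical name
    repository="ibm-granite/granite-3.3-8b-instruct",  # assumed repo id
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
)

# Block until the endpoint is running, then send a test request.
endpoint.wait()
print(endpoint.client.text_generation("Hello!", max_new_tokens=32))
```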
Trusted By
These teams are running AI models on Inference Endpoints

Features
Everything you need to deploy AI models at scale
Fully Managed Infrastructure
Don't worry about Kubernetes, CUDA versions, or configuring VPNs. Focus on deploying your model and serving customers.
Autoscaling
Automatically scales up as traffic increases and down as it decreases to save on compute costs.
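Replica bounds can be tuned from code as well. A minimal sketch, reusing the hypothetical endpoint from the deployment example above; setting min_replica=0 lets the endpoint scale to zero when idle, so no compute is billed during quiet periods.

```python
from huggingface_hub import update_inference_endpoint

# Let the (hypothetical) endpoint scale between 0 and 4 replicas.
# With min_replica=0, idle periods incur no compute charges.
update_inference_endpoint(
    "my-first-endpoint",  # hypothetical endpoint name
    min_replica=0,
    max_replica=4,
)
```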
Observability
Understand and debug your model through comprehensive logs & metrics.
Inference Engines
Deploy with vLLM, TGI, SGLang, TEI, or custom containers.
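Choosing an engine can also mean pointing the endpoint at a custom container image. A rough sketch, assuming a vLLM OpenAI-compatible image; the image URL, health route, and environment variables are illustrative assumptions, not verified defaults.

```python
from huggingface_hub import create_inference_endpoint

# Deploy with a custom container instead of a default engine.
# The image URL, health route, and env vars below are assumptions.
endpoint = create_inference_endpoint(
    "my-vllm-endpoint",                      # hypothetical name
    repository="Qwen/Qwen3-VL-8B-Instruct",  # assumed repo id
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "url": "vllm/vllm-openai:latest",    # assumed image
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository"},
    },
)
```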
Hugging Face Integration
Download model weights fast and securely with seamless Hugging Face Hub integration.
Future-proof AI Stack
Stay current with the latest frameworks and optimizations without managing complex upgrades.
Pricing
Choose a plan that fits your needs
Self-Serve
Pay as you go when using Inference Endpoints
- Pay for what you use, per minute
- Starting as low as $0.06/hour
- Billed monthly
- Email support
Enterprise
Get a custom quote and premium support
- Lower marginal costs based on volume
- Uptime guarantees
- Custom annual contracts
- Dedicated support, SLAs
Testimonials
Hear from our users
The coolest thing was how easy it was to define a complete custom interface from the model to the inference process. It took us just a couple of hours to adapt our code and have a functioning, totally custom endpoint.
It shaved off a week's worth of developer time. Thanks to Inference Endpoints, we now spend basically all of our time on R&D, not fiddling with AWS. If you haven't already built a robust, performant, fault-tolerant system for inference, it's pretty much a no-brainer.
We were able to choose an off-the-shelf model that's very common for our customers and set it up to handle over 100 requests per second with just a few button clicks. A new standard for easily building your first vector-embedding-based solution, whether it's semantic search or a question-answering system.
You're bringing the time delta between testing and production down to potentially less than a day. I've never seen anything that could do this before. I could have it on infrastructure ready to support an existing product.
Ready to Get Started?
Join thousands of developers and teams using Inference Endpoints to deploy their AI models at scale. Start building today with our simple, secure, and scalable infrastructure.