Machine Learning At Your Service

by Hugging Face

Easily deploy your AI models to production on our fully managed platform. Instead of spending weeks configuring infrastructure, focus on building your AI application.

Learn More

No Hugging Face account? Sign up!

Trusted By

These teams are running AI models on Inference Endpoints

Musixmatch · Grammarly · Shopify · Pinecone · Gorgias

Features

Everything you need to deploy AI models at scale

Fully Managed Infrastructure

Don't worry about Kubernetes, CUDA versions, or configuring VPNs. Focus on deploying your model and serving customers.

Autoscaling

Automatically scales up as traffic increases and down as it decreases to save on compute costs.

Observability

Understand and debug your model through comprehensive logs & metrics.

Inference Engines

Deploy with vLLM, TGI, SGLang, TEI, or custom containers; see the deployment sketch after this feature list.

Hugging Face Integration

Download model weights fast and securely with seamless Hugging Face Hub integration.

Future-proof AI Stack

Stay current with the latest frameworks and optimizations without managing complex upgrades.
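
Deployment can be done from the web UI or programmatically. As a rough sketch (not official sample code), the huggingface_hub Python client exposes create_inference_endpoint; every value below, from the model repository to the vendor, region, instance type, and replica counts, is an illustrative assumption, so check the Inference Endpoints documentation for the options available to your account.

    # A minimal deployment sketch using the huggingface_hub client.
    # All values below (model repo, vendor, region, instance type/size,
    # replica counts) are illustrative assumptions, not recommendations.
    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint(
        "my-first-endpoint",                 # endpoint name (hypothetical)
        repository="openai-community/gpt2",  # any model repo on the Hub
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        type="protected",                    # access level of the endpoint
        instance_size="x1",
        instance_type="nvidia-t4",
        min_replica=0,                       # autoscaling: scale to zero when idle
        max_replica=1,                       # autoscaling: upper bound
    )

    endpoint.wait()      # block until the endpoint is provisioned and running
    print(endpoint.url)  # base URL for inference requests

The min_replica and max_replica arguments map to the autoscaling behavior described above: setting min_replica=0 lets an idle endpoint scale to zero so you stop paying for compute between bursts of traffic.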

Pricing

Choose a plan that fits your needs

Self-Serve

Pay as you go when using Inference Endpoints

  • Pay for what you use, per minute
  • Starting as low as $0.06/hour
  • Billed monthly
  • Email support
See Instance Pricing

Enterprise

Get a custom quote and premium support

  • Lower marginal costs based on volume
  • Uptime guarantees
  • Custom annual contracts
  • Dedicated support, SLAs
Request a Quote

Testimonials

Hear from our users

The coolest thing was how easy it was to define a complete custom interface from the model to the inference process. It took us just a couple of hours to adapt our code and have a functioning, totally custom endpoint.
Andrea Boscarino
Data Scientist at Musixmatch
It took off a week's worth of developer time. Thanks to Inference Endpoints, we now basically spend all of our time on R&D, not fiddling with AWS. If you haven't already built a robust, performant, fault-tolerant system for inference, then it's pretty much a no-brainer.
Bryce Harlan
Senior Software Engineer at Phamily
We were able to choose an off-the-shelf model that's very common for our customers and set it up to handle over 100 requests per second with just a few button clicks. A new standard for easily building your first vector-embedding-based solution, whether it be semantic search or a question answering system.
Gareth Jones
Senior Product Manager at Pinecone
You're bringing the time delta between testing and production down to potentially less than a day. I've never seen anything that could do this before. I could have it on infrastructure ready to support an existing product.
Nathan Labenz
Founder at Waymark

Ready to Get Started?

Join thousands of developers and teams using Inference Endpoints to deploy their AI models at scale. Start building today with our simple, secure, and scalable infrastructure.

View Documentation
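
Once an endpoint is running, querying it takes only a few lines. Here's a minimal sketch, assuming a text-generation endpoint and using huggingface_hub's InferenceClient; the endpoint URL, token, and prompt are placeholders, not real values.

    # A minimal sketch of calling a deployed endpoint; the URL and token
    # below are placeholders to be replaced with your own.
    from huggingface_hub import InferenceClient

    client = InferenceClient(
        model="https://my-first-endpoint.endpoints.huggingface.cloud",
        token="hf_...",  # an access token authorized for the endpoint
    )

    # Run a text-generation request against the endpoint.
    output = client.text_generation("The future of AI deployment is", max_new_tokens=40)
    print(output)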