How to run batch inference with structured output fast?

Hello.

I have a dataset of 100k+ rows of documents, each on average 100 to 300 words, with some a few thousand words long. I’m looking to do feature extraction on the dataset, i.e. populate a SQL table using a self-hosted model.

How to do it optimally?

One way to do it is to simply run a for loop over the dataset with batch_size=64. That works, but I think we would get heavy GPU underutilization.
Many HF libraries provide a Trainer or similar, which in turn allows training with no Python overhead. However, I don’t know of such a library for inference over an offline dataset.
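Roughly what I mean by the naive loop, as a minimal sketch (the model name and load_docs() are placeholders):

```python
# Minimal sketch of the naive approach: chunk the dataset and feed each
# chunk to a transformers pipeline with batch_size=64.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    device_map="auto",
)

docs = load_docs()  # hypothetical loader returning a list of strings

results = []
for i in range(0, len(docs), 64):
    chunk = docs[i : i + 64]
    out = pipe(chunk, batch_size=64, max_new_tokens=128, return_full_text=False)
    results.extend(o[0]["generated_text"] for o in out)
# ...then parse each result and INSERT it into the SQL table
```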

After inference, we need to populate a SQL table, which requires calling Python for the schemas, and that’s slow.
What do you think? What is a good way to do it?
LangChain + vLLM async generation?

Thank you.

1 Like

Python isn’t really suited for applications that need to fully utilize hardware resources due to the GIL constraint, you know…

1 Like

Doesn’t vLLM handle this under the hood? I thought it’s written in C++/CUDA or similar low-level languages.

1 Like

it’s written in C++/CUDA or similar low-level languages.

Yeah. vLLM significantly accelerates inference, especially on GPUs. However, it isn’t specifically optimized for handling large numbers of concurrent requests alongside other tasks. TEI and TGI are designed with larger scales in mind, so they might be more advantageous in such a case.
That said, vLLM’s server mode might not differ much…

Anyway, this is less a vLLM issue and more a problem on the pure-Python side, including LangChain. For batch processing of large numbers of files, running everything through a single Python script introduces overhead.
Offloading some of that to the OS (for example by setting up a local server, so the OS handles some of the resource management) can sometimes improve efficiency. It may get messy, though…
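As a rough sketch of that local-server variant (the CLI name and flags depend on your vLLM version, and the model name is a placeholder):

```python
# Sketch: launch the OpenAI-compatible inference server as its own OS process,
# so the batch script only talks HTTP and the OS owns the server's resources.
import subprocess

server = subprocess.Popen(
    # Exact command varies by vLLM version; placeholder model name.
    ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct", "--port", "8000"],
)

# The batch job then sends requests to http://localhost:8000/v1 like any
# OpenAI-compatible endpoint; the server keeps the GPU busy via continuous
# batching regardless of how the client side is written.

# server.terminate()  # when the job is done
```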

If the dataset isn’t enormous, a single script is perfectly fine.

I’ve encountered issues like this downstream of LangChain quite often - I wonder why their tooling always tends to be messy or problematic.

1 Like

Feel free to correct me, but what I would recommend is running an LLM server through TGI, vLLM, SGLang, or a server of your choice, and if Python is really bothering you, writing a simple Rust script for loading the docs in batches and pushing them to the server’s OpenAI-compatible endpoint using (fearless) concurrency.

You can see some examples here: openai_client - Rust
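If you end up staying in Python after all, the same fan-out pattern is doable with asyncio; a sketch, assuming an OpenAI-compatible server on localhost (the model name, endpoint, and concurrency limit are placeholders):

```python
# Sketch: push documents to an OpenAI-compatible endpoint concurrently.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
sem = asyncio.Semaphore(64)  # cap in-flight requests; tune for your server

async def extract(doc: str) -> str:
    async with sem:
        resp = await client.chat.completions.create(
            model="Qwen/Qwen2.5-7B-Instruct",  # placeholder
            messages=[{"role": "user", "content": f"Extract the features:\n\n{doc}"}],
            max_tokens=256,
        )
        return resp.choices[0].message.content

async def main(docs: list[str]) -> list[str]:
    return await asyncio.gather(*(extract(d) for d in docs))

# results = asyncio.run(main(docs))
```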

1 Like

Yeah. We could use bash or PowerShell if needed, or even just subprocess.run Python scripts from Python for speed, but relying solely on scripts makes resource management tricky…

Anyway, having something server-driven makes it easier to ensure both stability and speed…

Good question. For batch inference with structured outputs, you might want to wrap your requests into a pipeline or use the datasets library’s Dataset.map() approach. That way you can process multiple inputs efficiently without hitting performance bottlenecks.
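A sketch of that route, assuming a text-generation pipeline inside a batched Dataset.map() (the model name and column names are placeholders):

```python
# Sketch: batched map over a datasets.Dataset, feeding each batch to a pipeline.
from datasets import Dataset
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    device_map="auto",
)

ds = Dataset.from_dict({"text": ["first document ...", "second document ..."]})  # stand-in data

def generate(batch):
    out = pipe(
        batch["text"],
        batch_size=len(batch["text"]),  # batch the whole chunk on the GPU
        max_new_tokens=128,
        return_full_text=False,
    )
    return {"extraction": [o[0]["generated_text"] for o in out]}

ds = ds.map(generate, batched=True, batch_size=64)
```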

1 Like

The problem I have with TGI is that it’s a Docker image which I personally couldn’t get to work (within a reasonable timeframe) with a local dataset. The dataset is 400 MB of text, I think some 17,000,000 tokens, and it will grow from there.
transformers pipelines don’t support vLLM auto-batching + concurrency. That probably isn’t an issue given that I’m running on a 5060 Ti, which doesn’t support that out of the box anyway, but…
The closest I’ve gotten is literally a) iterating over the dataset and b) sending async requests to the endpoint, and vLLM handles it.
TBH, transformers should have a pipeline that supports vLLM auto-batching; instead we only get for-loop-level performance. I’ll open an issue.

But I think sending async requests to a vLLM endpoint is the solution.
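For the structured-output half, the same endpoint can constrain generation to a JSON schema and the parsed result goes straight into the table. A minimal sketch, assuming vLLM’s guided-JSON decoding (the schema, table, and the extra_body parameter name are assumptions and may differ across versions); the concurrency wrapper would be the same asyncio fan-out as above.

```python
# Sketch: one structured-output request against a vLLM OpenAI-compatible server,
# then a plain sqlite3 insert. Schema, table, and columns are illustrative only.
import json
import sqlite3
from openai import OpenAI

SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "sentiment": {"type": "string"}},
    "required": ["title", "sentiment"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def extract_row(doc: str) -> dict:
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder
        messages=[{"role": "user", "content": f"Extract title and sentiment:\n\n{doc}"}],
        extra_body={"guided_json": SCHEMA},  # vLLM guided decoding; name may vary by version
        max_tokens=256,
    )
    return json.loads(resp.choices[0].message.content)

con = sqlite3.connect("features.db")
con.execute("CREATE TABLE IF NOT EXISTS features (title TEXT, sentiment TEXT)")
row = extract_row("some document text ...")
con.execute("INSERT INTO features VALUES (?, ?)", (row["title"], row["sentiment"]))
con.commit()
```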

1 Like