Kernels Tests


Recent Activity

danieldk updated a model about 2 months ago: kernels-test/state-test
danieldk published a model about 2 months ago: kernels-test/state-test

danieldk posted an update 18 days ago
sayakpaul posted an update 3 months ago
Fast LoRA inference for Flux with Diffusers and PEFT 🚨

There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their adoption.

In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:

1. torch.compile
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping to avoid recompilation when swapping in new LoRAs 🤯

We have tested our recipe with Flux.1-Dev on both H100 and RTX 4090 GPUs, achieving at least a *2x speedup* on each. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served, so we hope it will be beneficial to the community 🤗

Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.

Learn the details and the full code here:
https://huggingface.co/blog/lora-fast
danieldk updated a model 4 months ago
danieldk published a model 4 months ago
danieldk posted an update 4 months ago
kernels 0.8.0 is out: https://github.com/huggingface/kernels/releases/tag/v0.8.0

This release refines kernel selection in the kernelize function:

• You can now register kernels for specific CUDA capability ranges.
• Rather than requiring an exact match on modes, kernelize now falls back to other compatible modes. If you are kernelizing for inference but only registered a training + torch.compile kernel, that kernel will be used, since it is compatible with inference as well (see the sketch below).
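A minimal sketch of the fallback behavior (the mapping shape follows the kernels docs and kernels-community/activation is a real Hub kernel, but treat the details as illustrative):

```python
import torch
from torch import nn
from kernels import (
    LayerRepository,
    Mode,
    kernelize,
    register_kernel_mapping,
    use_kernel_forward_from_hub,
)

@use_kernel_forward_from_hub("SiluAndMul")
class SiluAndMul(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = x.shape[-1] // 2
        return nn.functional.silu(x[..., :d]) * x[..., d:]

# Only a training + torch.compile kernel is registered. With the 0.8.0
# fallback rules, kernelizing for inference can still select it, since a
# kernel that is safe for training and compilation also works for inference.
register_kernel_mapping(
    {
        "SiluAndMul": {
            "cuda": {
                Mode.TRAINING | Mode.TORCH_COMPILE: LayerRepository(
                    repo_id="kernels-community/activation",
                    layer_name="SiluAndMul",
                )
            }
        }
    }
)

model = nn.Sequential(SiluAndMul()).to("cuda", dtype=torch.float16)
model = kernelize(model, mode=Mode.INFERENCE)
```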
danieldk posted an update 4 months ago
danieldk posted an update 4 months ago
Kernels 0.7.0 is out: https://github.com/huggingface/kernels/releases/tag/v0.7.0 🚀

This release makes it possible to register multiple kernels for a layer. Do you have a super-fast kernel for inference and another kernel for training? Register them both, and kernelize will pick the right kernel depending on whether you are doing training or inference, as sketched below.
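A rough sketch of registering two kernels for one layer (the second repo ID is a hypothetical placeholder for a training-friendly kernel):

```python
from kernels import LayerRepository, Mode, register_kernel_mapping

# Two kernels for the same layer: kernelize picks whichever matches the
# mode the model will be used in.
register_kernel_mapping(
    {
        "SiluAndMul": {
            "cuda": {
                Mode.INFERENCE: LayerRepository(
                    repo_id="kernels-community/activation",
                    layer_name="SiluAndMul",
                ),
                # Hypothetical placeholder repo for a training-friendly kernel.
                Mode.TRAINING | Mode.TORCH_COMPILE: LayerRepository(
                    repo_id="kernels-community/activation-training",
                    layer_name="SiluAndMul",
                ),
            }
        }
    }
)
```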
danieldk posted an update 5 months ago
We have been working on a project called kernels. kernels makes it possible to load compute kernels directly from the Hub! 🚀

We plan to give kernels a more proper introduction soon. But for those who have been following along, we are happy to announce a new release:

- New layer API with torch.compile support.
- Experimental support for loading Apple Silicon Metal 🤘 kernels.
- Generate wheels from Hub kernels for legacy deployments.

Full release notes here: https://github.com/huggingface/kernels/releases/tag/v0.6.0
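For those new to the project, the basic flow looks like this (adapted from the kernels README; kernels-community/activation is a real Hub kernel):

```python
import torch
from kernels import get_kernel

# Download optimized activation kernels from the Hugging Face Hub.
activation = get_kernel("kernels-community/activation")

x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)

# Hub kernels are called like regular functions; this one writes into `y`.
activation.gelu_fast(y, x)
print(y)
```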
danieldk published a model 6 months ago
sayakpaul posted an update 6 months ago
Diffusers supports a good variety of quantization backends. Navigating them can be challenging, given the complex nature of diffusion pipelines in general.

So, @derekl35 set out to write a comprehensive guide that puts users in the front seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.

Give it a go here:
https://lnkd.in/gf8Pi4-2
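As a small taste of one backend, here is a sketch along the lines of the Diffusers bitsandbytes docs, quantizing a Flux transformer to 4-bit NF4 (illustrative, not the guide's exact code):

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize only the transformer, the main memory consumer, to 4-bit NF4.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("a photo of a lighthouse", num_inference_steps=28).images[0]
```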
sayakpaul posted an update 6 months ago
Although combining LLM and DiT architectures for T2I synthesis is on the rise, the design of such combinations remains severely understudied.

This work was completed a while ago and was accepted to CVPR 2025 -- super excited to finally share it now, along with the data and code ♥️

We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.

Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.

Despite its compelling results and other performance virtues, this design remains underexplored, which is what we set out to improve in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and explore what makes a "good deep fusion" between the two for T2I.

We explore several key questions in the work, such as:

Q1: How should we do attention? We considered several alternatives; PixArt-Alpha-style attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?

Based on the findings of our experiments, we arrive at FuseDiT, which adds the following components on top of the base architecture:

* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly

We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While it is not the best model out there, it is encouraging to see what can be developed in a guided manner using open datasets.

To learn more (code, models, everything is available), please check out the paper:
https://lnkd.in/gg6qyqZX.