DIFFA: Large Language Diffusion Models Can Listen and Understand
DIFFA is the first diffusion-based large audio-language model for spoken language understanding. It combines a frozen diffusion LLM with two adapters, one semantic and one acoustic, to enhance audio perception and reasoning.
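
The frozen-backbone-plus-adapters arrangement described above can be illustrated with a minimal, library-free sketch. Everything in it is an assumption made for illustration and is not taken from the DIFFA code: the names `Adapter`, `FrozenBackbone`, `build_input_sequence` and `trainable_modules` are hypothetical, each adapter is reduced to a single linear projection, and the adapter outputs are simply concatenated ahead of the text embeddings. The only ideas it is meant to convey are that two separate adapters map audio features into the LLM's embedding space and that only the adapters, not the diffusion LLM, are marked trainable.

```python
import random
from dataclasses import dataclass


def make_matrix(rows, cols, rng):
    """Small random weight matrix (rows x cols) as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


@dataclass
class Adapter:
    """Hypothetical adapter: one linear map from audio-feature space
    to the LLM embedding space. An assumption, not DIFFA's actual design."""
    weight: list
    trainable: bool = True  # adapters are the only parts that would be updated

    def __call__(self, frames):
        return [matvec(self.weight, frame) for frame in frames]


@dataclass
class FrozenBackbone:
    """Stand-in for the diffusion LLM; it is marked as not trainable."""
    embed_dim: int
    trainable: bool = False


def build_input_sequence(semantic_frames, acoustic_frames, text_embeddings,
                         semantic_adapter, acoustic_adapter):
    """Project both audio streams and prepend them to the text embeddings.
    Plain concatenation is an assumed fusion rule, chosen for simplicity."""
    semantic_tokens = semantic_adapter(semantic_frames)
    acoustic_tokens = acoustic_adapter(acoustic_frames)
    return semantic_tokens + acoustic_tokens + text_embeddings


def trainable_modules(modules):
    """Return the names of the modules that would receive gradient updates."""
    return [name for name, module in modules.items() if module.trainable]


if __name__ == "__main__":
    rng = random.Random(0)
    audio_dim, embed_dim = 4, 6  # toy sizes, chosen arbitrarily

    semantic_adapter = Adapter(make_matrix(embed_dim, audio_dim, rng))
    acoustic_adapter = Adapter(make_matrix(embed_dim, audio_dim, rng))
    diffusion_llm = FrozenBackbone(embed_dim=embed_dim)

    sequence = build_input_sequence(
        semantic_frames=[[1.0] * audio_dim] * 3,   # 3 toy semantic frames
        acoustic_frames=[[0.5] * audio_dim] * 2,   # 2 toy acoustic frames
        text_embeddings=[[0.0] * embed_dim] * 5,   # 5 toy prompt tokens
        semantic_adapter=semantic_adapter,
        acoustic_adapter=acoustic_adapter,
    )
    print(len(sequence), "tokens of width", len(sequence[0]))
    print(trainable_modules({
        "semantic_adapter": semantic_adapter,
        "acoustic_adapter": acoustic_adapter,
        "diffusion_llm": diffusion_llm,
    }))
```

In a real system the adapters would typically be small neural modules fed by pretrained speech encoders and the backbone would perform masked-diffusion decoding; the sketch leaves all of that out and only shows where the trainable and frozen parts sit relative to each other.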