Analyze audio to identify speakers
Generate text from audio with instructions
Generate text from audio and instructions
Transcribe audio into text
Ask the model a question, let it reason, and have it describe the semantics