Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025)
Joya Chen
chenjoya
AI & ML interests
Video LLM
Recent Activity
Upvoted a paper about 9 hours ago:
Revisiting Multimodal Positional Encoding in Vision-Language Models
Upvoted a paper about 23 hours ago:
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation