The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization
Abstract
AGL1K, an audio geo-localization benchmark, is introduced to advance audio language models' geospatial reasoning through curated audio clips and an evaluation across multiple models.
Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose an Audio Localizability metric that quantifies how location-informative each recording is, yielding 1,444 curated audio clips. Evaluations of 16 ALMs show that an audio geo-localization capability has begun to emerge in these models. We find that closed-source models substantially outperform open-source ones, and that linguistic cues often dominate as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may help advance ALMs toward stronger geospatial reasoning.
Community
We found the sonar moment in audio language models: we propose the task of audio geo-localization, and, remarkably, Gemini 3 Pro reaches a distance error of less than 55 km on 25% of samples.
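To make that headline number concrete, here is a minimal sketch of how a 25th-percentile distance error could be computed, assuming each prediction and ground-truth label is a (latitude, longitude) pair. The function names and sample coordinates are illustrative and not taken from the paper's released code.

```python
# Hypothetical sketch: great-circle (haversine) distance errors in km between
# predicted and ground-truth (lat, lon) pairs, reported at the 25th percentile.
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def percentile_error_km(preds, truths, q=0.25):
    """q-th quantile of per-sample distance errors (preds/truths: lists of (lat, lon))."""
    errs = sorted(haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths))
    idx = min(int(q * len(errs)), len(errs) - 1)
    return errs[idx]


# Illustrative (made-up) predictions and ground truths: a 25th-percentile error
# below 55 km means a quarter of predictions land within ~55 km of the true site.
preds = [(48.86, 2.35), (35.68, 139.69), (40.71, -74.01), (51.51, -0.13)]
truths = [(48.85, 2.29), (34.69, 135.50), (40.73, -73.99), (51.48, 0.00)]
print(f"25th-percentile error: {percentile_error_km(preds, truths):.1f} km")
```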
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach (2026)
- GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models (2025)
- Enhancing Geo-localization for Crowdsourced Flood Imagery via LLM-Guided Attention (2025)
- GEO-Detective: Unveiling Location Privacy Risks in Images with LLM Agents (2025)
- GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization (2025)
- MapTrace: Scalable Data Generation for Route Tracing on Maps (2025)
- ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning (2025)
Here are a few concerns I had after reading the paper.
First, the motivation feels a bit thin: it’s not obvious how often we truly need to infer location from audio alone in real-world settings, especially when many plausible use cases would typically rely on richer signals (video/images, timestamps, device metadata, or surrounding context).
Second, the dataset construction may introduce strong sampling bias. Since the benchmark is built from user-uploaded clips on Aporee, it likely over-represents travel/landmark-style recordings and soundscapes with strong linguistic cues, rather than a distribution that resembles everyday environments.
Third, the scale is quite small for a “global” claim (about 1.4K clips across 72 countries/regions, with clear geographic imbalance). At this size, it is hard to conclude that models have a generally reliable audio geo-localization capability; the results could mostly reflect success on a limited set of highly localizable or otherwise “representative” locations.
Finally, since the audio originates from a public online source, it’s plausible that parts of this corpus (or close variants) were already present in some models’ pretraining data. If so, strong performance might reflect memorization or retrieval of seen content rather than genuine audio-based reasoning.