Multi-modal, multi-platform 3D grounding from 3EED. Given a scene and a structured natural language expression, the task is to localize the referred object in 3D space. Our dataset captures diverse embodied viewpoints from three robot platforms: Vehicle, Drone, and Quadruped, presenting unique challenges in spatial reasoning, scene analysis, and cross-platform 3D generalization.
Visual grounding in 3D is key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited by their indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from Vehicle, Drone, and Quadruped platforms. We provide over 134,000 objects and 25,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline that combines vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for both in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
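The platform-aware normalization is only summarized in the abstract. As a minimal, hedged sketch of the general idea (the function name, offset table, and values below are illustrative assumptions, not the released implementation), one could align each platform's point cloud to a shared ground-level frame before cross-platform training:

```python
import numpy as np

# Hypothetical per-platform sensor-height offsets in metres (illustrative values only).
PLATFORM_Z_OFFSET = {"vehicle": 1.8, "drone": 30.0, "quadruped": 0.5}

def normalize_points(points: np.ndarray, platform: str) -> np.ndarray:
    """Shift an (N, 3) point cloud so z = 0 sits near the ground plane.

    A shared, roughly ground-aligned frame lets one model consume data from
    vehicle, drone, and quadruped viewpoints without per-platform retraining.
    """
    out = points.copy()
    out[:, 2] -= PLATFORM_Z_OFFSET[platform]
    return out

if __name__ == "__main__":
    pts = np.random.rand(1000, 3) * [80.0, 80.0, 5.0]  # synthetic points for a quick check
    print(normalize_points(pts, "drone")[:, 2].mean())
```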
| Dataset | Sensor | Vehicle | Drone | Quadruped | Scene Coverage | #Scenes | #Objects | #Expr. | Elevation |
|---|---|---|---|---|---|---|---|---|---|
| Mono3DRefer | C | ✓ | ✗ | ✗ | 140m x 140m | 2,025 | 8,228 | 41,140 | 42.8m |
| KITTI360Pose | L | ✓ | ✗ | ✗ | 140m x 140m | - | 14,934 | 43,381 | 42.8m |
| CityRefer | L | ✗ | ✓ | ✗ | - | - | 5,866 | 35,196 | - |
| STRefer | L + C | ✓ | ✗ | ✗ | 60m x 60m | 662 | 3,581 | 5,458 | - |
| LifeRefer | L + C | ✓ | ✗ | ✗ | 60m x 60m | 3,172 | 11,864 | 25,380 | - |
| Talk2LiDAR | L + C | ✓ | ✗ | ✗ | 140m x 140m | 6,419 | - | 59,207 | 48.6m |
| Talk2Car-3D | L + C | ✓ | ✗ | ✗ | 140m x 140m | 5,534 | - | 10,169 | 48.6m |
| Ours (3EED) | L + C | ✓ | ✓ | ✓ | 280m x 240m | 23,618 | 134,143 | 25,551 | 80m |

Sensor abbreviations: L = LiDAR, C = Camera.
Figure: Overview of the data annotation workflow. Left: We collect 3D boxes using multi-detector fusion, tracking, filtering, and manual verification across platforms. Middle: Referring expressions are produced by prompting a VLM with structured cues (class, status, position, relations), followed by rule-based rewriting and human refinement. Right: Platform-specific word clouds highlight distinct linguistic patterns in descriptions across vehicle, drone, and quadruped agents.
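To make the middle stage of the workflow concrete, below is a minimal sketch of how structured cues might be assembled into a VLM prompt (the field names and prompt wording are assumptions for illustration, not the exact prompts used in the pipeline):

```python
from dataclasses import dataclass

@dataclass
class ObjectCue:
    # Structured cues for one annotated object (illustrative field names).
    category: str   # e.g. "pedestrian"
    status: str     # e.g. "walking"
    position: str   # e.g. "about 12 m ahead, slightly left"
    relations: str  # e.g. "next to a parked white van"

def build_prompt(cue: ObjectCue) -> str:
    """Turn structured cues into a natural-language prompt for a VLM."""
    return (
        "Write one unambiguous referring expression for the target object.\n"
        f"- class: {cue.category}\n"
        f"- status: {cue.status}\n"
        f"- position: {cue.position}\n"
        f"- relations: {cue.relations}\n"
        "Mention only attributes that distinguish it from nearby objects."
    )

if __name__ == "__main__":
    cue = ObjectCue("pedestrian", "walking",
                    "about 12 m ahead, slightly left",
                    "next to a parked white van")
    print(build_prompt(cue))
```

The generated expressions would then pass through rule-based rewriting and human refinement, as described in the caption above.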
Figure: Left: Target bounding box distributions in polar coordinates; color intensity indicates the frequency of targets in each (ρ, θ) bin. Middle: Scene distribution for train/val splits on each platform, along with per-scene object count histograms. Right: Elevation (p_z) distributions of the input point clouds, reflecting view-dependent elevation biases.
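For reference, the (ρ, θ) statistics in the left panel can be computed from target box centers in the ego frame roughly as follows (a sketch only; the bin counts and maximum range below are assumed values):

```python
import numpy as np

def polar_histogram(centers: np.ndarray, n_rho: int = 20, n_theta: int = 36,
                    max_range: float = 80.0) -> np.ndarray:
    """Bin (N, 3) target box centers into a (rho, theta) polar histogram.

    rho is the horizontal distance from the ego sensor and theta the azimuth,
    matching the polar-coordinate view of target distributions.
    """
    rho = np.hypot(centers[:, 0], centers[:, 1])
    theta = np.arctan2(centers[:, 1], centers[:, 0])  # in [-pi, pi]
    hist, _, _ = np.histogram2d(
        rho, theta,
        bins=[n_rho, n_theta],
        range=[[0.0, max_range], [-np.pi, np.pi]],
    )
    return hist

if __name__ == "__main__":
    centers = np.random.randn(500, 3) * 20.0  # synthetic centers for a quick check
    print(polar_histogram(centers).shape)     # (20, 36)
```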
Figure: Data examples from the Vehicle platform in our dataset.
Figure: Data examples from the Drone platform in our dataset.
Figure: Data examples from the Quadruped platform in our dataset.
Rong Li* (HKUST(GZ)) · Yuhao Dong* (NTU) · Tianshuai Hu* (HKUST) · Ao Liang* (NUS) · Youquan Liu* (FDU) · Dongyue Lu* (NUS) · Liang Pan (Shanghai AI Lab) · Lingdong Kong† (NUS) · Junwei Liang✉ (HKUST(GZ)) · Ziwei Liu✉ (NTU)

* Equal contributions · † Project Lead · ✉ Corresponding authors
@article{li20253eed,
title = {3EED: Ground Everything Everywhere in 3D},
author = {Rong Li and Yuhao Dong and Tianshuai Hu and Ao Liang and Youquan Liu and Dongyue Lu and Liang Pan and Lingdong Kong and Junwei Liang and Ziwei Liu},
year = {2025},
}