
Awesome vla for ad

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

From worldbench · Updated June 14, 2026 · View on GitHub

The project is written primarily in HTML, is distributed under the MIT License, and was first published in 2025. Key topics include: 3d, autonomous-driving, awesome-list, embodied-ai, large-language-models.


:sunglasses: Awesome VLA for Autonomous Driving

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, whose hand-crafted interfaces and rule-based components often struggle in complex, dynamic, or long-tailed scenarios. Their cascaded structure also amplifies upstream perception errors, undermining downstream planning and control.

This survey reviews vision-action (VA) models and vision-language-action (VLA) models for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks, and organize existing methods into two principal paradigms:

  • End-to-End VLA, which integrates perception, reasoning, and planning within a single model.
  • Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners).
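
The Dual-System split described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the method of any paper listed here: `Guidance`, `slow_system`, `fast_system`, and `run_dual_system` are hypothetical names, a keyword rule stands in for the VLM, and a straight-line rollout stands in for the planner.

```python
from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x forward, y left) in metres, ego frame


@dataclass
class Guidance:
    """High-level advice from the slow system (hypothetical schema)."""
    maneuver: str        # e.g. "keep_lane" or "slow_down"
    target_speed: float  # metres per second


def slow_system(scene_description: str) -> Guidance:
    """Stand-in for a VLM: a keyword rule replaces real language reasoning."""
    if "pedestrian" in scene_description:
        return Guidance(maneuver="slow_down", target_speed=2.0)
    return Guidance(maneuver="keep_lane", target_speed=10.0)


def fast_system(guidance: Guidance, horizon: int = 4, dt: float = 0.5) -> List[Waypoint]:
    """Stand-in for a planner: straight-line waypoints at the advised speed."""
    return [(guidance.target_speed * dt * (k + 1), 0.0) for k in range(horizon)]


def run_dual_system(scenes: List[str], slow_every: int = 3) -> List[List[Waypoint]]:
    """Plan on every tick; refresh the guidance only every `slow_every` ticks."""
    plans = []
    guidance = Guidance(maneuver="keep_lane", target_speed=10.0)  # initial default
    for tick, scene in enumerate(scenes):
        if tick % slow_every == 0:
            guidance = slow_system(scene)    # slow deliberation, low rate
        plans.append(fast_system(guidance))  # fast execution, every tick
    return plans


if __name__ == "__main__":
    scenes = ["clear road", "pedestrian ahead", "pedestrian ahead", "pedestrian ahead"]
    for tick, plan in enumerate(run_dual_system(scenes)):
        print(tick, plan[0])
```

In this toy loop the fast path keeps using stale guidance between slow updates, so the pedestrian that appears at tick 1 only affects the plan at tick 3. That latency is one reason the fast system is described as the safety-critical part.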
<img width="100%" src="docs/figures/teaser.png">

For more details, kindly refer to our :books: Paper, :globe_with_meridians: Project Page, and :hugs: HuggingFace Leaderboard.

:books: Citation

If you find this work helpful for your research, please kindly consider citing our paper:

@article{survey_vla4ad,
  title   = {Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future},
  author  = {Tianshuai Hu and Xiaolu Liu and Song Wang and Yiyao Zhu and Ao Liang and Lingdong Kong and Guoyang Zhao and Zeying Gong and Jun Cen and Zhiyu Huang and Xiaoshuai Hao and Linfeng Li and Hang Song and Xiangtai Li and Jun Ma and Shaojie Shen and Jianke Zhu and Dacheng Tao and Ziwei Liu and Junwei Liang},
  journal = {arXiv preprint arXiv:2512.16760},
  year    = {2025},
}
@article{survey_3d_4d_world_models,
  title   = {{3D} and {4D} World Modeling: A Survey},
  author  = {Lingdong Kong and Wesley Yang and Jianbiao Mei and Youquan Liu and Ao Liang and Dekai Zhu and Dongyue Lu and Wei Yin and Xiaotao Hu and Mingkai Jia and Junyuan Deng and Kaiwen Zhang and Yang Wu and Tianyi Yan and Shenyuan Gao and Song Wang and Linfeng Li and Liang Pan and Yong Liu and Jianke Zhu and Wei Tsang Ooi and Steven C. H. Hoi and Ziwei Liu},
  journal = {arXiv preprint arXiv:2509.07996},
  year    = {2025}
}


1. Vision-Action Models

:one: Action-Only Models

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| LBC | arXiv<br>Learning by Cheating | CoRL 2020 | - | GitHub |
| Latent-DRL | arXiv<br>End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances | CVPR 2020 | - | - |
| NEAT | arXiv<br>NEAT: Neural Attention Fields for End-to-End Autonomous Driving | ICCV 2021 | - | GitHub |
| Roach | arXiv<br>End-to-End Urban Driving by Imitating a Reinforcement Learning Coach | ICCV 2021 | Website | GitHub |
| WoR | arXiv<br>Learning to Drive from A World on Rails | ICCV 2021 | Website | GitHub |
| TCP | arXiv<br>Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline | NeurIPS 2022 | - | GitHub |
| Urban-Driver | arXiv<br>Urban Driver: Learning to Drive from Real-world Demonstrations Using Policy Gradients | CoRL 2022 | Website | GitHub |
| LAV | arXiv<br>Learning from All Vehicles | CVPR 2022 | Website | GitHub |
| TransFuser | arXiv<br>TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving | TPAMI 2023 | - | GitHub |
| GRI | arXiv<br>GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving | Robotics 2023 | - | - |
| BEVPlanner | arXiv<br>Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? | CVPR 2024 | - | GitHub |
| Raw2Drive | arXiv<br>Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2) | NeurIPS 2025 | - | - |
| RAD | arXiv<br>RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning | NeurIPS 2025 | Website | - |
| TrajDiff | arXiv<br>TrajDiff: End-to-End Autonomous Driving without Perception Annotation | arXiv 2025 | - | GitHub |
| SimScale | arXiv<br>SimScale: Learning to Drive via Real-World Simulation at Scale | arXiv 2025 | Website | GitHub |
| - | arXiv<br>Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models | arXiv 2026 | - | - |

:two: Perception-Action Models

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| ST-P3 | arXiv<br>ST-P3: End-to-End Vision-Based Autonomous Driving via Spatial-Temporal Feature Learning | ECCV 2022 | - | GitHub |
| UniAD | arXiv<br>Planning-Oriented Autonomous Driving | CVPR 2023 | - | GitHub |
| VAD | arXiv<br>VAD: Vectorized Scene Representation for Efficient Autonomous Driving | ICCV 2023 | - | GitHub |
| OccNet | arXiv<br>Scene as Occupancy | ICCV 2023 | - | GitHub |
| GenAD | arXiv<br>GenAD: Generative End-to-End Autonomous Driving | ECCV 2024 | - | GitHub |
| PARA-Drive | CVPR<br>PARA-Drive: Parallelized Architecture for Real-Time Autonomous Driving | CVPR 2024 | Website | - |
| Hydra-MDP | CVPRW<br>Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation | CVPRW 2024 | Website | GitHub |
| SparseAD | arXiv<br>SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving | arXiv 2024 | - | - |
| GaussianAD | arXiv<br>GaussianAD: Gaussian-Centric End-to-End Autonomous Driving | arXiv 2024 | - | - |
| DiFSD | arXiv<br>DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving | arXiv 2024 | - | GitHub |
| DriveTransformer | arXiv<br>DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving | ICLR 2025 | - | GitHub |
| SparseDrive | arXiv<br>SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation | ICRA 2025 | - | GitHub |
| DiffusionDrive | arXiv<br>DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving | CVPR 2025 | - | GitHub |
| GoalFlow | arXiv<br>GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving | CVPR 2025 | Website | GitHub |
| GuideFlow | arXiv<br>GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving | arXiv 2025 | - | GitHub |
| ETA | arXiv<br>ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models | arXiv 2025 | - | GitHub |
| Geo | arXiv<br>Spatial Retrieval Augmented Autonomous Driving | arXiv 2025 | - | - |
| DiffusionDriveV2 | arXiv<br>DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving | arXiv 2025 | - | GitHub |
| NaviHydra | arXiv<br>NaviHydra: Controllable Navigation-Guided End-to-End Autonomous Driving with Hydra Distillation | arXiv 2025 | - | - |
| Mimir | arXiv<br>Mimir: Hierarchical Goal-Driven Diffusion with Uncertainty Propagation for End-to-End Autonomous Driving | arXiv 2025 | - | GitHub |
| FROST-Drive | arXiv<br>FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder | arXiv 2026 | - | - |
| DrivoR | arXiv<br>Driving on Registers | arXiv 2026 | Website | GitHub |
| SPS | arXiv<br>See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection | arXiv 2026 | - | - |
| BevAD | arXiv<br>What Matters for Scalable and Robust Learning in End-to-End Driving Planners? | CVPR 2026 | Website | GitHub |

:three: Image-Based World Models

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DriveDreamer | arXiv<br>DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving | ECCV 2024 | Website | GitHub |
| GenAD | arXiv<br>GenAD: Generalized Predictive Model for Autonomous Driving | CVPR 2024 | - | GitHub |
| Drive-WM | arXiv<br>Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving | CVPR 2024 | Website | GitHub |
| DrivingWorld | arXiv<br>DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT | arXiv 2024 | Website | GitHub |
| Imagine-2-Drive | arXiv<br>Imagine-2-Drive: Leveraging High-Fidelity World Models via Multi-Modal Diffusion Policies | IROS 2025 | Website | - |
| DrivingGPT | arXiv<br>DrivingGPT: Unifying Driving World Modeling and Planning with Multi-Modal Autoregressive Transformers | ICCV 2025 | Website | - |
| Epona | arXiv<br>Epona: Autoregressive Diffusion World Model for Autonomous Driving | ICCV 2025 | Website | GitHub |
| VaViM | arXiv<br>VaViM and VaVAM: Autonomous Driving through Video Generative Modeling | arXiv 2025 | Website | GitHub |
| UniDrive-WM | arXiv<br>UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving | arXiv 2026 | Website | - |
| DwD | arXiv<br>Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving | arXiv 2026 | - | - |
| WorldDrive | arXiv<br>Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation | arXiv 2026 | - | GitHub |

:four: Occupancy-Based World Models

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| OccWorld | arXiv<br>OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving | ECCV 2024 | Website | GitHub |
| NeMo | ECCV<br>Neural Volumetric World Models for Autonomous Driving | ECCV 2024 | - | - |
| OccVAR | OpenReview<br>OCCVAR: Scalable 4D Occupancy Prediction via Next-Scale Prediction | OpenReview 2024 | - | - |
| RenderWorld | arXiv<br>RenderWorld: World Model with Self-Supervised 3D Label | arXiv 2024 | - | - |
| DFIT-OccWorld | arXiv<br>An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training | arXiv 2024 | - | - |
| Drive-OccWorld | arXiv<br>Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving | AAAI 2025 | Website | GitHub |
| T³Former | arXiv<br>Temporal Triplane Transformers as Occupancy World Models | arXiv 2025 | - | - |
| OmniNWM | arXiv<br>OmniNWM: Omniscient Driving Navigation World Models | arXiv 2025 | - | GitHub |
| AD-R1 | arXiv<br>AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models | arXiv 2025 | - | - |
| SparseOccVLA | arXiv<br>SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning | arXiv 2026 | - | GitHub |

:five: Latent-Based World Models

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| Covariate-Shift | arXiv<br>Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models | arXiv 2024 | - | - |
| World4Drive | arXiv<br>World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model | ICCV 2025 | - | - |
| WoTE | arXiv<br>End-to-End Driving with Online Trajectory Evaluation via BEV World Model | ICCV 2025 | - | GitHub |
| LAW | arXiv<br>Enhancing End-to-End Autonomous Driving with Latent World Model | ICLR 2025 | - | GitHub |
| SSR | arXiv<br>Navigation-Guided Sparse Scene Representation for End-to-End Autonomous Driving | ICLR 2025 | - | GitHub |
| Echo-Planning | arXiv<br>Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back | arXiv 2025 | - | - |
| SeerDrive | arXiv<br>Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution | NeurIPS 2025 | - | GitHub |
| Drive-JEPA | arXiv<br>Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving | arXiv 2026 | - | GitHub |

2. Vision-Language-Action Models

:one: Textual Action Generator

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DriveMLM | arXiv<br>DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving | arXiv 2023 | - | GitHub |
| RAG-Driver | arXiv<br>RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model | RSS 2024 | Website | GitHub |
| RDA-Driver | arXiv<br>Making Large Language Models Better Planners with Reasoning-Decision Alignment | ECCV 2024 | - | - |
| DriveLM | arXiv<br>DriveLM: Driving with Graph Visual Question Answering | ECCV 2024 | Website | GitHub |
| DriveGPT4 | arXiv<br>DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | RA-L 2024 | Website | - |
| DriVLMe | arXiv<br>DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experience | IROS 2024 | Website | GitHub |
| LLaDA | arXiv<br>Driving Everywhere with Large Language Model Policy Adaptation | CVPR 2024 | Website | GitHub |
| VLAAD | WACVW<br>VLAAD: Vision and Language Assistant for Autonomous Driving | WACVW 2024 | - | GitHub |
| OccLLaMA | arXiv<br>OccLLaMA: A Unified Occupancy-Language-Action World Model for Understanding and Generation Tasks in Autonomous Driving | arXiv 2024 | Website | - |
| Doe-1 | arXiv<br>Doe-1: Closed-Loop Autonomous Driving with Large World Model | arXiv 2024 | Website | GitHub |
| LINGO-2 | arXiv<br>LINGO-2: Driving with Natural Language | - | Website | - |
| SafeAuto | arXiv<br>SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models | ICML 2025 | - | GitHub |
| OpenEMMA | arXiv<br>OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving | WACV 2025 | - | GitHub |
| ReasonPlan | arXiv<br>ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving | CoRL 2025 | - | GitHub |
| WKER | arXiv<br>World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving | AAAI 2025 | - | - |
| OmniDrive | arXiv<br>OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning | CVPR 2025 | - | GitHub |
| S4-Driver | arXiv<br>S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | CVPR 2025 | Website | - |
| Occ-LLM | arXiv<br>Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models | ICRA 2025 | - | - |
| DriveBench | arXiv<br>Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | ICCV 2025 | Website | GitHub |
| FutureSightDrive | arXiv<br>FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving | NeurIPS 2025 | Website | GitHub |
| ImpromptuVLA | arXiv<br>Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | NeurIPS 2025 | Website | GitHub |
| Sce2DriveX | arXiv<br>Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | RA-L 2025 | - | - |
| EMMA | arXiv<br>EMMA: End-to-End Multimodal Model for Autonomous Driving | TMLR 2025 | Website | - |
| DriveAgent-R1 | arXiv<br>DriveAgent-R1: Advancing VLM-Based Autonomous Driving with Hybrid Thinking and Active Perception | arXiv 2025 | - | - |
| Drive-R1 | arXiv<br>Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning | arXiv 2025 | - | - |
| FastDriveVLA | arXiv<br>FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-Based Token Pruning | arXiv 2025 | - | - |
| WiseAD | arXiv<br>WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model | arXiv 2025 | Website | GitHub |
| AutoDrive-R² | arXiv<br>AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving | arXiv 2025 | - | - |
| OmniReason | arXiv<br>OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving | arXiv 2025 | - | - |
| OpenREAD | arXiv<br>OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic | arXiv 2025 | - | GitHub |
| dVLM-AD | arXiv<br>dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning | arXiv 2025 | - | - |
| PLA | arXiv<br>A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving | arXiv 2025 | - | - |
| AlphaDrive | arXiv<br>AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | arXiv 2025 | - | GitHub |
| CoReVLA | arXiv<br>CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine | arXiv 2025 | Website | GitHub |
| WAM-Diff | arXiv<br>WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving | arXiv 2025 | - | GitHub |

:two: Numerical Action Generator

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| LMDrive | arXiv<br>LMDrive: Closed-Loop End-to-End Driving with Large Language Models | CVPR 2024 | Website | GitHub |
| BEVDriver | arXiv<br>BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving | IROS 2025 | - | - |
| CoVLA-Agent | arXiv<br>CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | WACV 2025 | Website | - |
| ORION | arXiv<br>ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation | ICCV 2025 | Website | GitHub |
| SimLingo | arXiv<br>SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment | CVPR 2025 | Website | GitHub |
| DriveGPT4-V2 | CVPR<br>DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving | CVPR 2025 | - | - |
| AutoVLA | arXiv<br>AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning | NeurIPS 2025 | Website | GitHub |
| DriveMoE | arXiv<br>DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving | arXiv 2025 | Website | GitHub |
| DSDrive | arXiv<br>DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning | arXiv 2025 | - | - |
| OccVLA | arXiv<br>OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision. | arXiv 2025 | - | - |
| VDRive | arXiv<br>VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-End Autonomous Driving | arXiv 2025 | - | - |
| ReflectDrive | arXiv<br>Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving | arXiv 2025 | - | GitHub |
| E3AD | arXiv<br>E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving | arXiv 2025 | - | - |
| LCDrive | arXiv<br>Latent Chain-of-Thought World Modeling for End-to-End Driving | arXiv 2025 | - | - |
| Alpamayo-R1 | arXiv<br>Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail | arXiv 2025 | - | - |
| UniUGP | arXiv<br>UniUGP: Unifying understanding, generation, and planing for end-to-end autonomous driving. | arXiv 2025 | - | - |
| MindDrive | arXiv<br>MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving | arXiv 2025 | - | - |
| AdaThinkDrive | arXiv<br>AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving | arXiv 2025 | - | - |
| Percept-WAM | arXiv<br>Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving | arXiv 2025 | - | - |
| Reasoning-VLA | arXiv<br>Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving | arXiv 2025 | - | - |
| SpaceDrive | arXiv<br>SpaceDrive: Infusing Spatial Awareness into VLM-Based Autonomous Driving | arXiv 2025 | - | - |
| OpenDriveVLA | arXiv<br>OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | AAAI 2026 | Website | GitHub |
| WAM-Flow | arXiv<br>WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving | CVPR 2026 |  | GitHub |
| ColaVLA | arXiv<br>ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving | CVPR 2026 | Website | GitHub |
:three: Explicit Action Guidance

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| DriveVLM | arXiv<br>DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | CoRL 2024 | Website | - |
| LeapAD | arXiv<br>Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving | NeurIPS 2024 | Website | GitHub |
| FasionAD | arXiv<br>FASIONAD: Fast and Slow Fusion Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback | arXiv 2024 | - | - |
| Senna | arXiv<br>Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | arXiv 2024 | - | GitHub |
| DualAD | arXiv<br>DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving | IROS 2025 | Website | GitHub |
| DME-Driver | arXiv<br>DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving | AAAI 2025 | - | - |
| SOLVE | arXiv<br>SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving | CVPR 2025 | - | - |
| ReAL-AD | arXiv<br>ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving | ICCV 2025 | Website | - |
| LeapVAD | arXiv<br>LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking | TNNLS 2025 | - | - |
| DiffVLA | arXiv<br>DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving | arXiv 2025 | - | - |
| FasionAD++ | arXiv<br>FASIONAD++: Integrating High-Level Instruction and Information Bottleneck in Fast-Slow fusion Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback | arXiv 2025 | - | - |
| HiST-VLA | arXiv<br>HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving | arXiv 2026 | - | - |

:four: Implicit Representations Transfer

:timer_clock: In chronological order, from the earliest to the latest.

| Model | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| VLP | arXiv<br>VLP: Vision Language Planning for Autonomous Driving | CVPR 2024 | - | - |
| VLM-AD | arXiv<br>VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision | CoRL 2025 | - | - |
| DiMA | arXiv<br>Distilling Multi-modal Large Language Models for Autonomous Driving | CVPR 2025 | - | - |
| DINO-Foresight | arXiv<br>DINO-Foresight: Looking into the Future with DINO | NeurIPS 2025 | Website | GitHub |
| ALN-P3 | arXiv<br>ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving | arXiv 2025 | - | - |
| VERDI | arXiv<br>VERDI: VLM-Embedded Reasoning for Autonomous Driving | arXiv 2025 | - | - |
| VLM-E2E | arXiv<br>VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion | arXiv 2025 | - | - |
| ReCogDrive | arXiv<br>ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving | arXiv 2025 | Website | GitHub |
| InsightDrive | arXiv<br>InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving | arXiv 2025 | - | GitHub |
| NetRoller | arXiv<br>NetRoller: Interfacing General and Specialized Models for End-to-End Autonomous Driving | arXiv 2025 | - | GitHub |
| ViLaD | arXiv<br>ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving | arXiv 2025 | - | - |
| OmniScene | arXiv<br>OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving | arXiv 2025 | - | - |
| LMAD | arXiv<br>LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving | arXiv 2025 | - | - |
| BEVLM | arXiv<br>BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations | arXiv 2026 | - | - |

3. Datasets & Benchmarks

:timer_clock: In chronological order, from the earliest to the latest.

:one: Vision-Action Datasets

| Dataset | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| BDD100K | arXiv<br>BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning | CVPR 2020 | Website | GitHub |
| nuScenes | arXiv<br>nuScenes: A Multimodal Dataset for Autonomous Driving | CVPR 2020 | Website | - |
| Waymo | arXiv<br>Scalability in Perception for Autonomous Driving: Waymo Open Dataset | CVPR 2020 | Website | GitHub |
| nuPlan | arXiv<br>nuPlan: A Closed-Loop ML-Based Planning Benchmark for Autonomous Vehicles | arXiv 2021 | Website | GitHub |
| Argoverse 2 | arXiv<br>Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting | NeurIPS 2021 | Website | GitHub |
| Bench2Drive | arXiv<br>Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-to-End Autonomous Driving | NeurIPS 2024 | - | GitHub |
| RoboBEV | arXiv<br>Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | TPAMI 2025 | - | GitHub |
| WOD-E2E | arXiv<br>WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-Tail Scenarios | arXiv 2025 | Website | GitHub |
| navdream | arXiv<br>The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving | arXiv 2026 | - | - |

:two: Vision-Language-Action Datasets

| Dataset | Paper | Venue | Website | GitHub |
|---|---|---|---|---|
| BDD-X | arXiv<br>Textual Explanations for Self-Driving Vehicles | ECCV 2018 | - | GitHub |
| Talk2Car | IEEE<br>Talk2Car: Predicting Physical Trajectories for Natural Language Commands | IEEE Access 2022 | - | GitHub |
| SDN | arXiv<br>DOROTHIE: Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents | EMNLP 2022 | - | GitHub |
| DriveMLM | arXiv<br>DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving | arXiv 2023 | - | GitHub |
| LMDrive | arXiv<br>LMDrive: Closed-Loop End-to-End Driving with Large Language Models | CVPR 2024 | Website | GitHub |
| DriveLM-nuScenes | arXiv<br>DriveLM: Driving with Graph Visual Question Answering | ECCV 2024 | Website | GitHub |
| HBD | arXiv<br>DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving | AAAI 2025 | - | - |
| VLAAD | WACVW<br>VLAAD: Vision and Language Assistant for Autonomous Driving | WACVW 2024 | - | GitHub |
| SUP-AD | arXiv<br>DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | CoRL 2024 | Website | - |
| NuInstruct | arXiv<br>Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models | CVPR 2024 | - | GitHub |
| WOMD-Reasoning | arXiv<br>WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving | ICML 2025 | Website | GitHub |
| DriveCoT | arXiv<br>DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving | arXiv 2024 | Website | - |
| Reason2Drive | arXiv<br>Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving | ECCV 2024 | - | GitHub |
| DriveBench | arXiv<br>Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | ICCV 2025 | Website | GitHub |
| MetaAD | arXiv<br>AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning | arXiv 2025 | Website | GitHub |
| OmniDrive | arXiv<br>OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning | CVPR 2025 | - | GitHub |
| NuInteract | arXiv<br>Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | arXiv 2025 | - | - |
| DriveAction | arXiv<br>DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models | arXiv 2025 | - | - |
| ImpromptuVLA | arXiv<br>Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | arXiv 2025 | Website | GitHub |
| CoVLA | arXiv<br>CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | WACV 2025 | Website | - |
| OmniReason-nuScenes | arXiv<br>OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving | arXiv 2025 | - | - |
| OmniReason-B2D | arXiv<br>OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving | arXiv 2025 | - | - |

4. Applications

5. Other Resources


This article is auto-generated from worldbench/awesome-vla-for-ad via the GitHub API. Last fetched: 6/14/2026