The Future of Autonomous Vehicles: The Role of Vision-Language Models
Apex.AI's Xingjian "XJ" Zhang, who leads Growth at the company, drops some knowledge on how Vision-Language Models (VLMs) are reshaping the autonomous driving game.
You know how the AV industry has been lagging, particularly with that pesky "long tail problem" – the ocean of rare, unexpected scenarios that vehicles need to handle? Even the most everyday road trip can throw curveballs: sudden weather shifts, unexpected roadworks, or erratic pedestrians.
So, what gives? Current AVs rely heavily on high-definition maps, meticulously labeled datasets, and rule-based logic. They're like stage actors who can nail a script but crumble when it's time to improvise. Expanding into a new operational design domain (ODD) means keeping map data current, labeling more data, and re-engineering the system – a costly and time-consuming process.
But there's a new player in town: VLMs. These models integrate computer vision and natural language processing, allowing AVs to interpret multimodal data by linking visual inputs with textual descriptions.
Take the DriveVLM project by Li Auto and Tsinghua University. It pairs a vision transformer encoder with a large language model (LLM): the vision encoder turns images into tokens, an attention-based extractor aligns those tokens with the LLM, and the LLM generates a detailed linguistic description of the environment – even in those tricky long-tail scenarios.
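To make that concrete, here's a minimal PyTorch-style sketch of the idea – a vision encoder producing image tokens, a small set of learned queries cross-attending to them, and a projection into the LLM's embedding space. The module sizes and names are illustrative assumptions, not DriveVLM's actual code.

```python
# Illustrative sketch of a DriveVLM-like vision-to-LLM bridge (not the real implementation).
import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    def __init__(self, img_token_dim=1024, llm_dim=4096, num_queries=64):
        super().__init__()
        # Stand-in for a vision transformer encoder: turns an image into a grid of tokens.
        self.vision_encoder = nn.Conv2d(3, img_token_dim, kernel_size=16, stride=16)
        # Attention-based extractor: learned queries cross-attend to the image tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, img_token_dim))
        self.cross_attn = nn.MultiheadAttention(img_token_dim, num_heads=8, batch_first=True)
        # Projection into the LLM's embedding space.
        self.proj = nn.Linear(img_token_dim, llm_dim)

    def forward(self, images):
        # images: (B, 3, H, W) camera frames
        feats = self.vision_encoder(images)              # (B, C, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)        # (B, N, C) image tokens
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        aligned, _ = self.cross_attn(q, tokens, tokens)  # (B, num_queries, C)
        return self.proj(aligned)                        # visual tokens for the LLM
```

The resulting visual tokens would be prepended to the text prompt's embeddings, and the LLM would then autoregressively generate a description of the scene.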
Here's why VLMs are genuine game-changers: pre-training on large-scale internet data gives them broad world knowledge, which improves scene understanding and planning and helps AVs navigate complex environments.
Meanwhile, AV architectures are evolving from modular systems to end-to-end (E2E) designs, where perception, prediction, and planning are unified and raw sensor inputs are processed directly into driving actions. Combining VLMs with E2E systems could be a colossal breakthrough. Waymo's End-to-End Multimodal Model for Autonomous Driving (EMMA) is a prime example of a VLM in action: it integrates perception and planning into a single framework and achieves state-of-the-art performance on benchmarks like nuScenes and the Waymo Open Motion Dataset (WOMD).
Unlike modular architectures, EMMA directly processes raw camera images and high-level driving commands to generate driving outputs such as planned trajectories, detected objects, and road graph estimates. This unified approach reduces error accumulation across independent modules while leveraging the extensive world knowledge embedded in pre-trained LLMs.
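To see what "everything as text" looks like in an E2E setup, here's a rough sketch of how raw inputs and driving outputs might be packed into a prompt and parsed back out. The request fields, prompt wording, and waypoint format are assumptions for illustration, not Waymo's actual schema.

```python
# Illustrative EMMA-style interface: driving outputs are represented as plain text
# so a single multimodal LLM can produce them. Formats here are assumptions.
import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DrivingRequest:
    camera_frames: list                       # raw images, fed to the model as visual tokens
    ego_history: List[Tuple[float, float]]    # past (x, y) positions in meters
    command: str                              # high-level intent, e.g. "turn right at the intersection"

def build_prompt(req: DrivingRequest) -> str:
    history = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in req.ego_history)
    return (
        "You are the planner of an autonomous vehicle.\n"
        f"Ego trajectory so far: {history}\n"
        f"Driving command: {req.command}\n"
        "Output the next 5 waypoints as (x, y) in meters, plus any critical objects."
    )

def parse_waypoints(response: str) -> List[Tuple[float, float]]:
    # Naive parser for "(x, y)" pairs emitted by the model as text.
    return [(float(a), float(b)) for a, b in re.findall(r"\(([-\d.]+),\s*([-\d.]+)\)", response)]
```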
EMMA also employs self-supervised learning, similar to next-token prediction in LLMs, to anticipate traffic patterns. By rolling out multiple future motion scenarios, it can capture nuanced driving behaviors that traditional models might overlook. On top of that, EMMA improves decision-making through chain-of-thought (CoT) reasoning, allowing the model to explain its driving choices and leveling up both safety and interpretability.
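Here's a toy illustration of what a chain-of-thought driving prompt could look like. The template and the `llm_generate` callable are hypothetical, not EMMA's actual rationale format; the point is that the model states its reasoning before the final plan, which makes the plan auditable.

```python
# Hypothetical chain-of-thought template for a driving decision.
COT_DRIVING_PROMPT = """\
1. Describe the scene in one sentence.
2. List the critical road users and their likely motion.
3. State the maneuver you will take and why.
4. Output the planned trajectory as waypoints.
"""

def plan_with_reasoning(scene_tokens, llm_generate):
    # llm_generate is any text-generation callable (hypothetical); the returned text
    # contains the rationale alongside the waypoints, so the choice can be inspected.
    return llm_generate(prompt=COT_DRIVING_PROMPT, visual_context=scene_tokens)
```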
VLMs are promising, but not without their challenges. For one, they need to process continuous, high-dimensional video streams in real time – not easy. Advanced 3D scene understanding is also still scarce: Waymo's EMMA, for example, is limited to camera-only inputs, with no fusion of 3D sensing modalities like LiDAR.
Another key issue is inference latency. Every millisecond counts in AVs, and current VLMs struggle here. DriveVLM, running a four-billion-parameter Qwen model on an NVIDIA Orin X, has a prefill latency of 0.57 seconds and a decode latency of 1.33 seconds. That's 1.9 seconds to process a single scene – at roughly 50 mph (80 km/h), the vehicle travels approximately 139 feet (42 meters) before it can react in a critical situation.
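A quick back-of-the-envelope check on those numbers (the 80 km/h cruising speed is an assumption, used only to make the arithmetic concrete):

```python
# Distance traveled while the model is still "thinking".
prefill_s, decode_s = 0.57, 1.33
total_s = prefill_s + decode_s                 # 1.9 s end-to-end per scene

speed_kmh = 80                                 # ~50 mph, an assumed cruising speed
speed_ms = speed_kmh / 3.6                     # ~22.2 m/s
distance_m = total_s * speed_ms                # ~42 m traveled before any reaction
print(f"{total_s:.2f} s latency -> {distance_m:.0f} m ({distance_m * 3.28084:.0f} ft) at {speed_kmh} km/h")
```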
Overall, VLMs hold enormous potential for AVs, but these hurdles need to be cleared first. Advances in model distillation and edge computing should make VLMs efficient enough for real-time decision-making. So grab some popcorn and stay tuned – the future of autonomous driving is about to get really interesting.
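For a sense of what model distillation means in practice, here's the standard knowledge-distillation loss in a few lines – a generic sketch, not any particular vendor's recipe: a small student VLM learns to mimic a large teacher's output distribution while still fitting the ground-truth labels, trading a little accuracy for much lower on-vehicle latency.

```python
# Standard knowledge-distillation loss (Hinton-style), shown generically.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still learn from the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```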
- As Apex.AI's Xingjian "XJ" Zhang explains, Vision-Language Models (VLMs) have the potential to transform how the autonomous driving industry handles the long-tail problem, by processing multimodal data to generate detailed environmental descriptions.
- Autonomous driving architectures are evolving from modular systems to end-to-end designs; combining VLMs with this approach could be a significant breakthrough, reducing error accumulation across independent modules and leveraging the extensive world knowledge embedded in pre-trained language models.
- One hurdle Vision-Language Models must still clear is processing continuous, high-dimensional video streams in real time, along with achieving much lower inference latency so that decisions arrive in time for critical situations.