Promptable Closed-loop Traffic Simulation

Conference on Robot Learning (CoRL) 2024

Shuhan Tan¹, Boris Ivanovic², Yuxiao Chen², Boyi Li²,
Xinshuo Weng², Yulong Cao², Philipp Krähenbühl¹, Marco Pavone²

¹UT Austin, ²NVIDIA

Example of Promptable Closed-loop Traffic Simulation.
All agents are controlled by ProSim: green ones are unconditioned, others are prompted with multimodal prompts.

ProSim

In this work, we propose a new task: promptable closed-loop traffic simulation. Besides simulating realistic traffic agent interactions in closed loop, traffic models should also generate agent motions that satisfy a complex set of user-specified prompts, including multimodal prompts such as:

Goal Point (one point): a 3D point of the agent destination
Route Sketch (many points): a noisy sketch of the agent's route
Action Tag (categorical): e.g. Accelerate, LeftTurn
Text Instruction (textual): e.g. "Instruct <A0> to decelerate before turning right."
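To make the four prompt types concrete, here is a minimal Python sketch of how they could be held in memory. The class and field names (`AgentPrompt`, `ScenePrompt`, `goal_point`, and so on) are hypothetical and are not taken from the ProSim code; the dimensionality of route-sketch points and the choice to store text instructions per scene are assumptions as well.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, ...]  # goal points are 3D per the text; sketch dimension is assumed


@dataclass
class AgentPrompt:
    """Hypothetical container for the non-text prompts of a single agent."""
    agent_id: str
    goal_point: Optional[Point] = None                     # Goal Point: one point
    route_sketch: List[Point] = field(default_factory=list)  # Route Sketch: many points
    action_tags: List[str] = field(default_factory=list)     # Action Tag: categorical

    def is_unconditioned(self) -> bool:
        # An agent with no prompt of any kind is simulated without conditioning.
        return (self.goal_point is None
                and not self.route_sketch
                and not self.action_tags)


@dataclass
class ScenePrompt:
    """Hypothetical container: per-agent prompts plus free-form text instructions."""
    agents: List[AgentPrompt]
    text_instructions: List[str] = field(default_factory=list)  # Text Instruction


scene = ScenePrompt(
    agents=[
        AgentPrompt("A0"),  # only addressed through the text instruction below
        AgentPrompt("A1", goal_point=(12.0, -3.5, 0.0)),
        AgentPrompt("A2", action_tags=["Accelerate", "LeftTurn"]),
    ],
    text_instructions=["Instruct <A0> to decelerate before turning right."],
)
```

Keeping text instructions at the scene level rather than on each agent reflects that one sentence may refer to several agents at once.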

To address this problem, we propose ProSim, a multimodal promptable closed-loop traffic simulation framework. ProSim allows the user to give a complex set of numerical, categorical, or textual prompts to instruct each agent's behavior and intention. ProSim then rolls out a traffic scenario in a closed-loop manner, modeling each agent's interaction with other traffic participants. The architecture of ProSim is shown below.

Overview of ProSim architecture.

ProSim consists of three main modules:

Encoder:

Efficiently encodes a scene initialization (map and agent states) to a shared set of scene tokens.

Generator:

Takes the scene tokens and agent prompts to generate policy tokens for all agents.

Policy:

For each agent, takes the agent's policy token, current state, and observation to generate the next state. Note that we run this module for all the agents separately in parallel to simulate agent interactions.

During rollout, ProSim runs the heavy Encoder and Generator only once, while the lightweight Policy runs for all agents in parallel at every step. This design lets us train and run inference with ProSim in closed loop efficiently.
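The rollout pattern described above can be sketched in a few lines of Python. This is only an illustration of the control flow (Encoder and Generator called once, Policy called per agent per step), not ProSim's implementation: `rollout` and the `toy_*` stand-ins are hypothetical, and the real modules are learned networks operating on tokens rather than the scalars used here.

```python
from typing import Callable, Dict, List

State = List[float]  # hypothetical agent state, here simply [x, y]

calls = {"encoder": 0, "generator": 0, "policy": 0}  # call counters for illustration


def toy_encoder(init_states: List[State]) -> Dict[str, int]:
    """Stand-in for the Encoder: summarizes the scene initialization."""
    calls["encoder"] += 1
    return {"num_agents": len(init_states)}


def toy_generator(scene_tokens: Dict[str, int], prompts: Dict[int, float]) -> List[float]:
    """Stand-in for the Generator: one 'policy token' (a target speed) per agent."""
    calls["generator"] += 1
    return [prompts.get(i, 0.0) for i in range(scene_tokens["num_agents"])]


def toy_policy(token: float, state: State, observation: List[State]) -> State:
    """Stand-in for the Policy: next state from token, own state, and observation."""
    calls["policy"] += 1
    return [state[0] + token, state[1]]


def rollout(init_states: List[State], prompts: Dict[int, float],
            encoder: Callable, generator: Callable, policy: Callable,
            num_steps: int) -> List[List[State]]:
    scene_tokens = encoder(init_states)               # heavy module, run once
    policy_tokens = generator(scene_tokens, prompts)  # heavy module, run once
    states = [list(s) for s in init_states]
    trajectory = [states]
    for _ in range(num_steps):
        # All agents observe the same current states, then advance together,
        # so each agent reacts to the others' latest simulated positions.
        states = [policy(tok, s, states) for tok, s in zip(policy_tokens, states)]
        trajectory.append(states)
    return trajectory


traj = rollout([[0.0, 0.0], [5.0, 1.0]], {0: 1.0},
               toy_encoder, toy_generator, toy_policy, num_steps=3)
```

In this toy run the encoder and generator counters each end at 1, while the policy counter equals the number of agents times the number of steps.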

To enable complex textual prompts, the policy token Generator uses an LLM to comprehend the natural language prompt and policy queries, and to generate language-conditioned policy queries for all agents. In particular, we use a Llama3-8B model finetuned with LoRA as the backbone, with two MLP adaptors:

Language condition encoder in ProSim.

ProSim prioritizes three key properties:

Promptable:

Allows users to give a flexible combination of multimodal prompts for each agent.

Closed-loop:

Allows agents to interact with the map and other agents in real-time, simulating real-world reactive behaviors.

Efficient:

Simulates an 8-second traffic scenario with 64 agents and a text prompt within 50 ms on a single GPU.

ProSim-Instruct-520k

To provide realistic and diverse agent motion data with multimodal prompts for our task, we propose ProSim-Instruct-520k: a high-quality paired prompt-scenario dataset with 520K real-world driving scenarios from the Waymo Open Motion Dataset (WOMD).
ProSim-Instruct-520k includes multimodal prompts for more than 10M unique agents, representing over 575 hours of driving data. For each scenario, it includes Goal Points and Route Sketches for every agent. For most agents, it contains Action Tags that describe agent movement behaviors, which are automatically labeled and quality-assured by humans. Finally, ProSim-Instruct-520k includes 20 Text Instructions for each scenario, describing both scenario-level and agent-level behaviors and interactions, labeled by Llama3-70B given the action tags and metadata. In total, ProSim-Instruct-520k contains over 10M text prompts.

Here we show examples of different types of Text Instructions from ProSim-Instruct-520k. Note that all agents are referred to as <id>, where "id" is the abbreviated agent id in WOMD.

Simple agent behaviors:

"Have <71f1c> maintain a steady acceleration from start to finish"
"Keep <df6a1> moving straight after 65 seconds."
"Command <dad99> and <a261a> to stop after 40 seconds."

Temporal transitions of agent behaviors:

"Let <ego> maintain a steady speed after decelerating"
"After slowing, instruct the bicycle <d3ddc> to continue on a direct trajectory."
"Initially, <8cc93> accelerates, but then slows down and comes to a stop."

Scenario-level properties:

"Most vehicles, except for a few, are parked and stationary at the start of the simulation."
"Emphasize the limited activity within the scene, agents either stopping or staying within their lanes."
"Keep all parked vehicles stationary to represent a low-activity scene."
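Because agents appear in the instructions as `<id>` placeholders, the agents addressed by a sentence can be recovered with a simple pattern match. The sketch below is one possible way to do this in Python; the helper name `referenced_agents` and the exact regular expression are assumptions for illustration and are not part of the released data tooling.

```python
import re
from typing import List

# Matches "<...>" with no whitespace or nested angle brackets inside,
# e.g. "<71f1c>" or "<ego>". The pattern is an assumption based on the examples.
AGENT_REF = re.compile(r"<([^<>\s]+)>")


def referenced_agents(instruction: str) -> List[str]:
    """Return the agent ids mentioned in a text instruction, in order of appearance."""
    return AGENT_REF.findall(instruction)


agent_level = referenced_agents("Command <dad99> and <a261a> to stop after 40 seconds.")
ego_level = referenced_agents("Let <ego> maintain a steady speed after decelerating")
scene_level = referenced_agents(
    "Keep all parked vehicles stationary to represent a low-activity scene."
)
```

An instruction with no placeholders, such as the scenario-level examples, yields an empty list, which gives a simple way to separate scenario-level from agent-level instructions.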

We have released the annotated prompts here!
Check out the Data Colab Demo to see how to load and preview the data.

Results

Let's take a closer look at one example of ProSim in the promptable closed-loop traffic simulation task.

All agents are controlled by ProSim: green ones are unconditioned, others are prompted with multimodal prompts.

We can observe ProSim's two core properties from some examples above:

Promptable:

In Figure (d), four agents are prompted by a single sentence carrying complex high-level information, while two other agents are simultaneously prompted by low-level goal points.

Closed-loop:

Note that the unconditioned right-turning agent A7 in Figure (a) changes its behavior to yield to A24 in Figure (c). Also, the two agents following A7 brake as A7 yields in Figure (c). These behaviors show dynamic agent interactions through closed-loop rollout.

We quantitatively show that ProSim achieves high controllability when given prompts from different modalities:

Controllability evaluation of ProSim.

When no prompt is given, ProSim also achieves competitive performance on the Waymo Sim Agents Challenge:

WOMD Sim Agents Challenge 2024.

Demo Video

Acknowledgement

We thank Yue Zhao, Vincent Cho, Jerry Ouyang-Zhang, and Brady Zhou for their insightful discussions.

Reference

@inproceedings{
    tan2024promptable,
    title={Promptable Closed-loop Traffic Simulation},
    author={Tan, Shuhan and Ivanovic, Boris and Chen, Yuxiao and Li, Boyi and Weng, Xinshuo and Cao, Yulong and Kr{\"a}henb{\"u}hl, Philipp and Pavone, Marco},
    booktitle={8th Annual Conference on Robot Learning},
    year={2024},
}