Promptable Closed-loop Traffic Simulation

Conference on Robot Learning (CoRL) 2024

Shuhan Tan¹, Boris Ivanovic², Yuxiao Chen², Boyi Li²,
Xinshuo Weng², Yulong Cao², Philipp Krähenbühl¹, Marco Pavone²
¹UT Austin, ²NVIDIA

Example of Promptable Closed-loop Traffic Simulation.
All agents are controlled by ProSim: green ones are unconditioned, others are prompted with multimodal prompts.
ProSim

In this work, we propose a new task: promptable closed-loop traffic simulation. Besides simulating realistic traffic agent interactions in closed loop, traffic models should also generate agent motions that satisfy a complex set of user-specified prompts. These prompts can be multimodal, for example:

  • Goal Point (one point): a 3D point of the agent destination
  • Route Sketch (many points): a noisy sketch of the agent's route
  • Action Tag (categorical): e.g., Accelerate, LeftTurn
  • Text Instruction (textual): e.g., "Instruct <A0> to decelerate before turning right."
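
The four prompt types listed above could be held in a small container like the one below. This is only an illustrative sketch: the class names `AgentPrompt` and `ScenePrompt`, the field names, the (x, y, z) point layout, and the choice to attach the text instruction to the whole scene are all assumptions made here, not ProSim's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # assumed (x, y, z) layout


@dataclass
class AgentPrompt:
    """Hypothetical per-agent prompt; every field is optional."""
    agent_id: str
    goal_point: Optional[Point] = None                        # one point
    route_sketch: List[Point] = field(default_factory=list)   # many points
    action_tags: List[str] = field(default_factory=list)      # categorical

    def is_unconditioned(self) -> bool:
        # An agent with no prompt of any kind is simulated unconditioned.
        return (self.goal_point is None
                and not self.route_sketch
                and not self.action_tags)


@dataclass
class ScenePrompt:
    """Hypothetical scene-level bundle: agent prompts plus free-form text."""
    agents: List[AgentPrompt]
    text_instruction: Optional[str] = None                    # textual


a0 = AgentPrompt("A0", goal_point=(10.0, 2.0, 0.0), action_tags=["LeftTurn"])
a1 = AgentPrompt("A1")  # no prompts given
scene = ScenePrompt(
    agents=[a0, a1],
    text_instruction="Instruct <A0> to decelerate before turning right.",
)
```

Keeping every field optional mirrors the idea that a user may supply any combination of prompt types for each agent, including none at all.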

To address this problem, we propose ProSim, a multimodal promptable closed-loop traffic simulation framework. ProSim lets the user give a complex set of numerical, categorical, or textual prompts to instruct each agent's behavior and intention. ProSim then rolls out a traffic scenario in a closed-loop manner, modeling each agent's interaction with other traffic participants. The architecture of ProSim is shown below.


Overview of ProSim architecture.

ProSim consists of three main modules:

    Encoder:
    Efficiently encodes a scene initialization (map and agent states) to a shared set of scene tokens.
    Generator:
    Takes the scene tokens and agent prompts to generate policy tokens for all agents.
    Policy:
    For each agent, takes the agent's policy token, current state, and observation to generate the next state. Note that we run this module for all the agents separately in parallel to simulate agent interactions.
During rollout, ProSim runs the heavy Encoder and Generator only once, while the Policy runs for all agents in parallel. This design lets us train and run inference with ProSim in closed loop with high efficiency.
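
The control flow described above can be sketched with toy stand-ins for the three modules. Everything here is hypothetical: the function `rollout`, the `toy_*` callables, and the two-number state are placeholders chosen to show that the Encoder and Generator are called once while the Policy is called once per agent per step. They are not ProSim's real interfaces.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[float, float]  # toy (x, y) state; real agent states are richer


def rollout(init_states: Dict[str, State],
            prompts: Dict[str, str],
            encoder: Callable,
            generator: Callable,
            policy: Callable,
            num_steps: int) -> List[Dict[str, State]]:
    scene_tokens = encoder(init_states)                # run once
    policy_tokens = generator(scene_tokens, prompts)   # run once
    states = dict(init_states)
    trajectory = [states]
    for _ in range(num_steps):
        observation = states  # every agent sees the same snapshot of the scene
        # Each agent is stepped independently from its own policy token.
        states = {aid: policy(policy_tokens[aid], states[aid], observation)
                  for aid in states}
        trajectory.append(states)
    return trajectory


calls = {"encoder": 0, "generator": 0, "policy": 0}


def toy_encoder(init_states):
    calls["encoder"] += 1
    return sorted(init_states)  # toy "scene tokens": just the agent ids


def toy_generator(scene_tokens, prompts):
    calls["generator"] += 1
    # Toy "policy token": a per-step speed, higher if prompted to accelerate.
    return {aid: 2.0 if prompts.get(aid) == "Accelerate" else 1.0
            for aid in scene_tokens}


def toy_policy(token, state, observation):
    calls["policy"] += 1
    x, y = state
    return (x + token, y)


traj = rollout({"A0": (0.0, 0.0), "A1": (5.0, 1.0)},
               {"A0": "Accelerate"},
               toy_encoder, toy_generator, toy_policy, num_steps=4)
```

In this toy run the per-agent policy calls are made sequentially in a dictionary comprehension; a batched implementation would evaluate them in parallel, as the text describes.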

To enable complex textual prompts, the policy token Generator uses an LLM to comprehend the natural language prompt and the policy queries, and to generate language-conditioned policy queries for all agents. In particular, we use a Llama3-8B model fine-tuned with LoRA as the backbone, with two MLP adaptors:

Language condition encoder in ProSim.
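
One plausible reading of the two MLP adaptors is that the first projects policy queries into the LLM's embedding space and the second projects the LLM's outputs back to the query space. The sketch below illustrates only that data flow, under that assumption. The names `make_mlp` and `language_condition`, the tiny dimensions, and the identity function standing in for the LLM are all hypothetical, and the weights are random rather than trained.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_mlp(in_dim: int, hidden_dim: int, out_dim: int,
             rng: random.Random) -> Callable[[Vector], Vector]:
    """Two-layer MLP with ReLU and random weights (illustration only)."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
          for _ in range(out_dim)]

    def forward(x: Vector) -> Vector:
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * v for w, v in zip(row, hidden)) for row in w2]

    return forward


def language_condition(policy_queries: List[Vector],
                       text_embeddings: List[Vector],
                       in_adaptor: Callable[[Vector], Vector],
                       llm: Callable[[List[Vector]], List[Vector]],
                       out_adaptor: Callable[[Vector], Vector]) -> List[Vector]:
    # Assumed flow: text tokens first, then one projected token per query.
    llm_inputs = text_embeddings + [in_adaptor(q) for q in policy_queries]
    llm_outputs = llm(llm_inputs)
    # Keep only the positions that correspond to policy queries.
    query_outputs = llm_outputs[len(text_embeddings):]
    return [out_adaptor(h) for h in query_outputs]


rng = random.Random(0)
QUERY_DIM, LLM_DIM = 4, 8                  # toy sizes, not the real ones
in_adaptor = make_mlp(QUERY_DIM, 16, LLM_DIM, rng)
out_adaptor = make_mlp(LLM_DIM, 16, QUERY_DIM, rng)


def identity_llm(tokens: List[Vector]) -> List[Vector]:
    return tokens                          # placeholder for the LLM backbone


text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]   # 3 text tokens
policy_queries = [[1.0] * QUERY_DIM for _ in range(2)]  # 2 agents
conditioned = language_condition(policy_queries, text_embeddings,
                                 in_adaptor, identity_llm, out_adaptor)
```

The result has one vector per agent in the original query dimension, which is the shape a downstream per-agent policy token would need.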

ProSim prioritizes three key properties:

    Promptable:
    Allows users to give a flexible combination of multimodal prompts for each agent.
    Closed-loop:
    Allows agents to interact with the map and other agents in real-time, simulating real-world reactive behaviors.
    Efficient:
    Simulates an 8-second traffic scenario with 64 agents and a text prompt within 50ms on a single GPU.
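
As a rough reading of the efficiency figure above, and assuming the 50ms bound covers the full 8-second rollout, the implied speed-up over real time is at least:

```latex
\[
\frac{8\ \text{s simulated}}{0.05\ \text{s compute}} = 160
\]
```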
ProSim-Instruct-520k

To provide realistic and diverse agent motion data with multimodal prompts for our task, we propose ProSim-Instruct-520k: a high-quality paired prompt-scenario dataset with 520K real-world driving scenarios from the Waymo Open Motion Dataset (WOMD).
ProSim-Instruct-520k includes multimodal prompts for more than 10M unique agents, representing over 575 hours of driving data. For each scenario, it includes Goal Points and Route Sketches for every agent. For most agents, it also contains Action Tags that describe agent movement behaviors; these are labeled automatically and quality-checked by humans. Finally, ProSim-Instruct-520k includes 20 Text Instructions for each scenario, describing both scenario-level and agent-level behaviors and interactions, labeled by Llama3-70B given the action tags and metadata. In total, ProSim-Instruct-520k contains over 10M text prompts.
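
The headline counts above are mutually consistent, as a quick check shows. This treats "520K" as exactly 520,000 scenarios; since the agent count is given as "more than 10M", the second figure is a lower bound on the average.

```latex
\[
520{,}000\ \text{scenarios} \times 20\ \tfrac{\text{text instructions}}{\text{scenario}}
  = 10{,}400{,}000\ \text{text prompts}
\]
\[
\frac{10{,}000{,}000\ \text{agents}}{520{,}000\ \text{scenarios}}
  \approx 19.2\ \text{agents per scenario}
\]
```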

Here we show examples of different types of Text Instructions from ProSim-Instruct-520k. Note that all agents are referred to as <id>, where "id" is the abbreviated agent ID in WOMD.
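
Because agents appear in the text as <id> placeholders, a consumer of the data may want to recover which agents an instruction mentions. The snippet below is a minimal sketch of one way to do that. The helper names `referenced_agents` and `rename_agents` are hypothetical, and the regular expression assumes an id is any run of non-space characters between angle brackets, which may be looser than the dataset's real id format.

```python
import re
from typing import Dict, List

# Assumption: an agent reference is "<" + id + ">" with no spaces inside.
AGENT_REF = re.compile(r"<([^<>\s]+)>")


def referenced_agents(instruction: str) -> List[str]:
    """Return the agent ids mentioned, in first-seen order, without repeats."""
    seen: List[str] = []
    for agent_id in AGENT_REF.findall(instruction):
        if agent_id not in seen:
            seen.append(agent_id)
    return seen


def rename_agents(instruction: str, mapping: Dict[str, str]) -> str:
    """Swap placeholder ids using `mapping`; unknown ids are left unchanged."""
    return AGENT_REF.sub(
        lambda m: "<" + mapping.get(m.group(1), m.group(1)) + ">",
        instruction,
    )


example = "Instruct <A0> to decelerate before turning right."
agents = referenced_agents(example)
```

The same pattern can be used to map the abbreviated ids back to full agent ids, given a lookup table for the scenario.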

Simple agent behaviors:

Temporal transitions of agent behaviors:

Scenario-level properties:

We have released the annotated prompts here!
Check out the Data Colab Demo to see how to load and preview the data.
Results

Let's take a closer look at one example of ProSim in the promptable closed-loop traffic simulation task.


We can observe ProSim's two core properties from some examples above:

We quantitatively show that ProSim achieves high controllability when given prompts from different modalities:

Controllability evaluation of ProSim.
When no prompt is given, ProSim also achieves competitive performance on the Waymo Sim Agents Challenge:

WOMD Sim Agents Challenge 2024.
Demo Video
Acknowledgement

We thank Yue Zhao, Vincent Cho, Jerry Ouyang-Zhang, and Brady Zhou for their insightful discussions.

Reference
@inproceedings{
    tan2024promptable,
    title={Promptable Closed-loop Traffic Simulation},
    author={Tan, Shuhan and Ivanovic, Boris and Chen, Yuxiao and Li, Boyi and Weng, Xinshuo and Cao, Yulong and Kr{\"a}henb{\"u}hl, Philipp and Pavone, Marco},
    booktitle={8th Annual Conference on Robot Learning},
    year={2024},
}