Language Conditioned Traffic Generation

Conference on Robot Learning (CoRL) 2023

Shuhan Tan1,   Boris Ivanovic2,   Xinshuo Weng2,   Marco Pavone2,   Philipp Krähenbühl1
1UT Austin, 2NVIDIA
Webpage | Video | Paper | Code | Demo (Colab)
Overview

This work presents LCTGen, a language-conditioned traffic generation model. Our model takes a natural language description of a traffic scenario as input and outputs traffic actors' initial states and motions on a compatible map.


Fig. 1: Overview of the proposed method.

LCTGen has two main modules: an Interpreter and a Generator. Given a user-specified natural language query, the LLM-powered Interpreter converts the query into a compact, structured representation. The Interpreter also retrieves a map that matches the described scenario from a real-world map library. The Generator then takes the structured representation and the map and produces realistic traffic scenarios that accurately follow the user's specifications.


Fig. 2: Example Interpreter input and output.

The Interpreter takes a natural language text description as input and produces a structured representation with an LLM (GPT-4). The structured representation encodes agent- and map-specific information as integer vectors. We formulate this step as a text-to-text transformation: through in-context learning, we ask GPT-4 to translate the textual description of a traffic scene into a YAML-like description. An example input-output pair is shown above.
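The sketch below illustrates this step under some assumptions: the prompt wording, the YAML schema, and the interpret helper are hypothetical placeholders rather than the released prompt, and it relies on the legacy openai Python client plus PyYAML.

import openai
import yaml

# One in-context example; the schema and integer encodings here are
# illustrative placeholders, not the paper's actual representation.
FEW_SHOT = """\
Query: V1 drives straight; V2 approaches from the left and turns right.
Output:
  map: [2, 2, 1, 0, 1]          # lane/intersection attributes (illustrative)
  actors:
    - [0, 0, 4, 2, 0, 0, 0, 0]  # per-actor integer vector (illustrative)
    - [2, 6, 4, 1, 0, 0, 0, 0]
"""

def interpret(query: str, model: str = "gpt-4") -> dict:
    """Ask the LLM for the structured representation, then parse the YAML."""
    prompt = (
        "Translate the traffic-scene description into the YAML format "
        "shown in the example.\n\n" + FEW_SHOT + "\nQuery: " + query + "\nOutput:\n"
    )
    resp = openai.ChatCompletion.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return yaml.safe_load(resp["choices"][0]["message"]["content"])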


Fig. 3: The architecture of Generator.

Given a structured representation and a map, the Generator produces a traffic scenario (actor initializations and their motions). We design the Generator as a query-based transformer model to efficiently capture the interactions between agents and between agents and the map. It places all agents in a single forward pass and supports end-to-end training. The Generator has four modules: 1) a map encoder that extracts per-lane map features; 2) an agent query generator that converts the structured representation into agent queries; 3) a generative transformer that models agent-agent and agent-map interactions; 4) a scene decoder that outputs the scenario.
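A minimal PyTorch skeleton of these four modules is sketched below; the layer choices, feature dimensions, and decoder outputs are assumptions reconstructed from the description above, not the released architecture.

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Query-based transformer sketch: four modules, one forward pass."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 lane_feat_dim=32, struct_dim=8, motion_steps=50):
        super().__init__()
        # 1) Map encoder: per-lane polyline features -> lane tokens.
        self.map_encoder = nn.Sequential(
            nn.Linear(lane_feat_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # 2) Agent query generator: structured integer vectors -> agent queries.
        self.query_gen = nn.Linear(struct_dim, d_model)
        # 3) Generative transformer: agent-agent self-attention plus
        #    agent-map cross-attention in every layer.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerDecoder(layer, n_layers)
        # 4) Scene decoder: per-agent initial state and future motion.
        self.state_head = nn.Linear(d_model, 5)                  # x, y, heading, speed, size
        self.motion_head = nn.Linear(d_model, motion_steps * 2)  # future (x, y) waypoints

    def forward(self, lane_feats, struct_rep):
        lanes = self.map_encoder(lane_feats)             # (B, L, d_model)
        queries = self.query_gen(struct_rep.float())     # (B, N, d_model)
        h = self.transformer(tgt=queries, memory=lanes)  # all agents at once
        return self.state_head(h), self.motion_head(h)

In this sketch, cross-attention to the lane tokens grounds each agent query in the map, while self-attention among the queries models agent-agent interaction; a single forward pass decodes every agent jointly.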

Qualitative results

Below we show examples of LCTGen's output given texts from the Crash Report (first row) and Attribute Description (second row) datasets. Each example pairs the input text with the generated scenario. Because Crash Report texts are excessively long, we show only the Interpreter module's output summary.


Fig. 4: Results of text-conditioned traffic generation.

We also apply LCTGen to instructional traffic scenario editing. Below we show an example of consecutive instructional edits to a real-world scenario. LCTGen supports high-level editing instructions (vehicle removal, addition, and action change) and produces realistic output that follows each instruction.


Fig. 5: Instructional editing on a real-world scenario.
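One plausible way to implement such editing, sketched below as an assumption rather than the released interface, is to feed the current structured representation and the instruction back through the LLM and re-run the Generator on the same map. It reuses the openai and yaml imports from the Interpreter sketch above, and edit_scenario is a hypothetical helper.

def edit_scenario(current_struct: dict, instruction: str,
                  model: str = "gpt-4") -> dict:
    """Ask the LLM for an updated structured representation after one edit."""
    prompt = (
        "Here is a traffic scenario in the structured YAML format:\n"
        + yaml.dump(current_struct)
        + "\nApply the following edit and return the full updated YAML:\n"
        + instruction + "\n"
    )
    resp = openai.ChatCompletion.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return yaml.safe_load(resp["choices"][0]["message"]["content"])

Each round of editing would then re-run the Generator on the updated representation while keeping the retrieved map fixed, so consecutive instructions compose as in Fig. 5.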
Demo Video
This video shows animated scenarios generated by LCTGen. We also show the application of LCTGen to controllable self-driving policy evaluation.
Acknowledgement

We thank Yuxiao Chen, Yulong Cao, and Danfei Xu for their insightful discussions. This material is supported by the National Science Foundation under Grant No. IIS-1845485.

Thanks to MetaDrive for the template.

Reference
@inproceedings{tan2023language,
    title={Language Conditioned Traffic Generation},
    author={Shuhan Tan and Boris Ivanovic and Xinshuo Weng and Marco Pavone and Philipp Kraehenbuehl},
    booktitle={7th Annual Conference on Robot Learning},
    year={2023},
    url={https://openreview.net/forum?id=PK2debCKaG}
}