Imitation learning (IL) is a simple and powerful way to use high-quality human driving data, which can be collected at scale, to produce human-like behavior. However, policies based on imitation learning alone often fail to sufficiently account for safety and reliability concerns. This paper presents a method that combines imitation learning with reinforcement learning using simple rewards to substantially improve the safety and reliability of driving policies over those learned from imitation alone. In particular, we train a policy on over 100k miles of urban driving data, and measure its effectiveness in test scenarios grouped by different levels of collision likelihood. Our analysis shows that while imitation can perform well in low-difficulty scenarios that are well-covered by the demonstration data, our proposed approach significantly improves robustness on the most challenging scenarios (over 38% reduction in failures). To our knowledge, this is the first application of a combined imitation and reinforcement learning approach in autonomous driving that utilizes large amounts of real-world human driving data.
We conduct the first large-scale application of a combined imitation and reinforcement learning (RL) approach in autonomous driving utilizing large amounts of real-world urban human driving data (over 100k miles). RL and imitation learning offer complementary strengths. RL is able to use reward signals to optimize objectives directly, but reward design is a difficult problem. On the other hand, imitation learning avoids the need for a hand-designed reward, but it lacks explicit knowledge of what constitutes good driving, such as collision avoidance, and suffers from covariate shift. By combining both approaches, we develop an agent that only requires specifying a simple reward function and yet is able to perform well in rare and challenging driving scenarios.
We systematically evaluate our approach and the baselines by slicing the dataset by difficulty, demonstrating that combining IL and RL improves the safety and reliability of policies over those learned from imitation alone (over 38% reduction in safety events on the most difficult bucket).
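The difficulty-sliced evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the per-scenario difficulty score, the bucket boundaries, and the function names `failure_rate_by_bucket` and `relative_reduction` are all assumptions made for the example, and the arrays are toy values rather than results from the paper.

```python
import numpy as np


def failure_rate_by_bucket(difficulty, failed, edges):
    """Return the failure rate within each difficulty bucket.

    difficulty: per-scenario difficulty score (hypothetical; higher = harder).
    failed:     per-scenario flag, 1 if the policy failed the scenario, else 0.
    edges:      increasing interior bucket boundaries (hypothetical values).
    """
    difficulty = np.asarray(difficulty, dtype=float)
    failed = np.asarray(failed, dtype=float)
    # np.digitize maps each score to a bucket index in 0..len(edges).
    bucket = np.digitize(difficulty, edges)
    rates = []
    for b in range(len(edges) + 1):
        mask = bucket == b
        # An empty bucket has no defined failure rate.
        rates.append(failed[mask].mean() if mask.any() else float("nan"))
    return np.array(rates)


def relative_reduction(baseline_rate, new_rate):
    """Fractional reduction in failure rate relative to a baseline."""
    return (baseline_rate - new_rate) / baseline_rate


# Toy data only: four scenarios split into an "easy" and a "hard" bucket.
toy_difficulty = [0.1, 0.2, 0.7, 0.9]
toy_failed = [0, 1, 1, 1]
toy_rates = failure_rate_by_bucket(toy_difficulty, toy_failed, edges=[0.5])
```

Comparing two policies then amounts to computing the per-bucket rates for each and applying `relative_reduction` to the hardest bucket.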
We propose BC-SAC, an algorithm for learning autonomous driving agents that combines Soft Actor-Critic (SAC) with a behavior cloning (BC) term added to the actor objective. The critic update remains the same as in SAC. We use a simple reward function that linearly combines collision and off-road rewards.
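As a rough sketch of how such an objective might look, the snippet below adds a BC negative log-likelihood term to the standard SAC actor loss and forms a reward as a weighted sum of two terms. The weights `bc_weight`, `w_collision`, and `w_offroad`, the entropy temperature `alpha`, and both function names are assumptions for illustration; the text above does not specify them, nor how the individual collision and off-road terms are computed.

```python
import numpy as np


def simple_reward(r_collision, r_offroad, w_collision=1.0, w_offroad=1.0):
    """Linear combination of a collision term and an off-road term.

    The weights are hypothetical; the per-term rewards are taken as inputs
    because their exact definitions are not given here.
    """
    return w_collision * np.asarray(r_collision) + w_offroad * np.asarray(r_offroad)


def bc_sac_actor_loss(q_pi, logp_pi, logp_demo, alpha=0.2, bc_weight=1.0):
    """Actor loss (to minimize): SAC term plus a weighted BC term.

    q_pi:      critic values Q(s, a) for actions a sampled from the policy.
    logp_pi:   log pi(a | s) for those same sampled actions.
    logp_demo: log pi(a_demo | s_demo) for demonstration state-action pairs.
    alpha:     entropy temperature (assumed value).
    bc_weight: weight on the BC term (assumed; 0 recovers the plain SAC loss).
    """
    # Standard SAC actor loss: E[alpha * log pi(a|s) - Q(s, a)].
    sac_term = np.mean(alpha * np.asarray(logp_pi) - np.asarray(q_pi))
    # Behavior cloning: negative log-likelihood of demonstrated actions.
    bc_term = -np.mean(np.asarray(logp_demo))
    return sac_term + bc_weight * bc_term
```

In a real training loop these quantities would come from differentiable policy and critic networks; NumPy arrays are used here only to show how the two terms combine. The critic loss is omitted because it is unchanged from SAC.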
The advantage of BC-SAC can be intuitively visualized in the figure above. For in-distribution states, both imitation and RL terms contribute to learning. However, in out-of-distribution states where no demonstration data is available, the agent can fall back to learning from reward signals.
We find that BC-SAC outperforms the competing imitation learning baselines, open-loop BC and closed-loop MGAIL (Bronstein et al., 2022), especially on the most challenging scenarios.
Bronstein, Eli, et al. "Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving." 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022.