Development blog for TORCS project.
by Jimi
Following the successful development of our custom TorcsClient and TorcsEnv, this week we set out to get a driver up and running to verify that code end to end.
This model will serve as our baseline for future improvements.
Before diving into any technical details, here is our first RL model completing a lap of the corkscrew track in 1:41.66.
First, a quick note on our approach:
We wanted to really stress-test IBM Granite this week, so all code was generated with IBM Granite through a back-and-forth prompting workflow.
This turned out to be an effective way to work at this early stage, being able to describe what we wanted and get working code back quickly.
Granite was especially helpful for rapidly prototyping ideas, suggesting approaches we hadn’t considered, and debugging issues when things weren’t behaving as expected.
We started with a rule-based PID driver that follows a fixed set of instructions to control the car, determining throttle, steering, and braking from sensor input.
Lap time: 2:14.67 (134.67s)
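To make the rule-based approach concrete, here is a minimal sketch of a PID-style controller mapping TORCS-like sensor readings to car commands. The sensor names (`angle`, `track_pos`, `speed_x`), gains, and target speed are illustrative assumptions, not our actual tuned implementation.

```python
class PID:
    """Minimal PID controller. Gains here are placeholders, not tuned values."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=0.02):
        # Accumulate the integral term and estimate the derivative
        # from the previous error (zero on the first call).
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def drive(angle, track_pos, speed_x, steer_pid, target_speed=120.0):
    """Map sensors (angle to track axis, lateral offset, longitudinal speed,
    all hypothetical names) to (steer, throttle, brake) commands."""
    # Steer toward the track axis, correcting for lateral offset.
    steer = steer_pid.step(angle - 0.5 * track_pos)
    steer = max(-1.0, min(1.0, steer))
    # Simple proportional speed control: throttle below target, brake above.
    speed_error = target_speed - speed_x
    throttle = max(0.0, min(1.0, 0.1 * speed_error))
    brake = max(0.0, min(1.0, -0.1 * speed_error))
    return steer, throttle, brake
```

A controller like this drives respectably but never improves, which is exactly the limitation the RL work addresses.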
With the PID driver working, we recorded it completing a number of laps so we could use a technique called Behaviour Cloning. The idea is to give the RL agent a head start by showing it what good driving looks like, allowing us to skip the initial process of learning basic car control and move straight on to learning more complex behaviour.
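At its core, behaviour cloning is just supervised learning: fit a policy that maps recorded states to the expert's actions. Our actual agent is a neural network, but the idea can be sketched with a linear policy fit by ridge regression; the function names and the regularisation strength are illustrative assumptions.

```python
import numpy as np


def behaviour_clone(states, actions, l2=1e-3):
    """Fit a linear policy action = [state, 1] @ W by ridge regression
    on (state, action) pairs recorded from the expert (PID) driver."""
    X = np.hstack([states, np.ones((len(states), 1))])  # append bias column
    # Closed-form ridge solution: (X^T X + l2 I)^-1 X^T A
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ actions)
    return W


def policy(W, state):
    """Predict an action vector (e.g. steer/throttle/brake) for one state."""
    return np.append(state, 1.0) @ W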
Combining behaviour cloning with a PPO model, we trained the agent for 1 million steps. Because behaviour cloning meant we weren't starting from scratch, training progressed considerably faster, with the model learning the track through trial and error, earning rewards for clean driving and penalties for going off track. By the end of training, the model had significantly outperformed the PID driver, discovering new racing lines to shave off as many seconds as possible.
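The "rewards for clean driving, penalties for going off track" idea can be sketched as a simple per-step reward function. The sensor names and the exact shaping terms here are assumptions for illustration, not our actual reward.

```python
import math


def lap_reward(speed_x, angle, track_pos):
    """Per-step reward: forward progress along the track axis, penalised
    for misalignment and lateral offset, with a flat penalty off track
    (|track_pos| > 1 taken as off-track, following TORCS conventions)."""
    if abs(track_pos) > 1.0:
        return -1.0  # off track: fixed penalty
    # Project speed onto the track axis, subtract sideways drift and offset.
    progress = speed_x * (math.cos(angle) - abs(math.sin(angle)) - abs(track_pos))
    return progress / 100.0  # scale factor is illustrative
```

PPO then maximises the discounted sum of this signal, which is what pushes the agent toward faster, cleaner lines than the PID rules encode.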
The PPO model completed the corkscrew in 1:41.66 (101.66s), around 33 seconds faster. This highlights the limitations of rule-based control: the PID driver cannot adapt beyond its predefined logic, whereas PPO continuously improves through exploration.
With upcoming robotics assessments, we’ll be laying off the gas next week and focusing on our requirements document and risk assessment rather than development.