Development blog for TORCS project.
by Jimi
Following the successful development of our custom TorcsClient and TorcsEnv, this week we set out to get a driver up and running to verify that code end to end.
This model will serve as our baseline for future improvements.
Before diving into any technical details, here is our first RL model completing a lap of the corkscrew track in 1:41.66.
First, a quick note on our approach:
We wanted to really stress-test IBM Granite this week, so all code was generated with IBM Granite through a back-and-forth prompting workflow.
This turned out to be an effective way to work at this early stage, being able to describe what we wanted and get working code back quickly.
Granite was especially helpful for rapidly prototyping ideas, suggesting approaches we hadn’t considered, and debugging issues when things weren’t behaving as expected.
We started with a rule-based PID driver that follows a fixed set of instructions to control the car, determining throttle, steering, and braking from sensor input.
Lap time: 2:14.67 (134.67s)
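To make the rule-based approach concrete, here is a minimal sketch of a PID-style controller mapping TORCS-like sensor readings to car commands. The sensor names (`angle`, `track_pos`, `speed_x`), gains, and target speed are illustrative assumptions, not our actual tuned implementation.

```python
class PID:
    """Minimal PID controller. Gains here are placeholders, not tuned values."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=0.02):
        # Accumulate the integral term and estimate the derivative
        # from the previous error (zero on the first call).
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def drive(angle, track_pos, speed_x, steer_pid, target_speed=120.0):
    """Map sensors (angle to track axis, lateral offset, longitudinal speed,
    all hypothetical names) to (steer, throttle, brake) commands."""
    # Steer toward the track axis, correcting for lateral offset.
    steer = steer_pid.step(angle - 0.5 * track_pos)
    steer = max(-1.0, min(1.0, steer))
    # Simple proportional speed control: throttle below target, brake above.
    speed_error = target_speed - speed_x
    throttle = max(0.0, min(1.0, 0.1 * speed_error))
    brake = max(0.0, min(1.0, -0.1 * speed_error))
    return steer, throttle, brake
```

A controller like this drives respectably but never improves, which is exactly the limitation the RL work addresses.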
With the PID driver working, we recorded it completing a number of laps so we could use a technique called Behaviour Cloning. The idea is to give the RL agent a head start by showing it what good driving looks like, allowing us to skip the initial process of learning basic car control and move straight on to learning more complex behaviour.
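At its core, behaviour cloning is just supervised learning: fit a policy that maps recorded states to the expert's actions. Our actual agent is a neural network, but the idea can be sketched with a linear policy fit by ridge regression; the function names and the regularisation strength are illustrative assumptions.

```python
import numpy as np


def behaviour_clone(states, actions, l2=1e-3):
    """Fit a linear policy action = [state, 1] @ W by ridge regression
    on (state, action) pairs recorded from the expert (PID) driver."""
    X = np.hstack([states, np.ones((len(states), 1))])  # append bias column
    # Closed-form ridge solution: (X^T X + l2 I)^-1 X^T A
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ actions)
    return W


def policy(W, state):
    """Predict an action vector (e.g. steer/throttle/brake) for one state."""
    return np.append(state, 1.0) @ W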
Combining behaviour cloning with a PPO model, we trained the agent for 1 million steps. Because behaviour cloning meant we weren't starting from scratch, training progressed considerably faster, with the model learning the track through trial and error, earning rewards for clean driving and penalties for going off track. By the end of training, the model had significantly outperformed the PID driver, discovering new racing lines to shave off as many seconds as possible.
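The "rewards for clean driving, penalties for going off track" idea can be sketched as a simple per-step reward function. The sensor names and the exact shaping terms here are assumptions for illustration, not our actual reward.

```python
import math


def lap_reward(speed_x, angle, track_pos):
    """Per-step reward: forward progress along the track axis, penalised
    for misalignment and lateral offset, with a flat penalty off track
    (|track_pos| > 1 taken as off-track, following TORCS conventions)."""
    if abs(track_pos) > 1.0:
        return -1.0  # off track: fixed penalty
    # Project speed onto the track axis, subtract sideways drift and offset.
    progress = speed_x * (math.cos(angle) - abs(math.sin(angle)) - abs(track_pos))
    return progress / 100.0  # scale factor is illustrative
```

PPO then maximises the discounted sum of this signal, which is what pushes the agent toward faster, cleaner lines than the PID rules encode.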
The PPO model completed the corkscrew in 1:41.66 (101.66s), around 33 seconds faster. This highlights the limitations of rule-based control: the PID driver cannot adapt beyond its predefined logic, whereas PPO continuously improves through exploration.
With upcoming robotics assessments, we’ll be laying off the gas next week and focusing on our requirements document and risk assessment rather than development.