RL Crash Course: From Bellman to Bots¶
Part 2: Building a Custom Environment with Gymnasium¶
Objective: In this section, we transition from theory to practice. We will learn the industry-standard API for RL environments and build a custom simulation of a server room thermostat from scratch.
The Gymnasium API¶
Gymnasium vs. OpenAI Gym
Gymnasium is the maintained, modern fork of the classic OpenAI Gym library. All new RL projects should use gymnasium, as the original gym library is no longer maintained!
Before an agent can learn, it needs a world to interact with. In Python, the standard framework for creating these worlds is Gymnasium (formerly OpenAI Gym).
Gymnasium provides a standardized interface. As long as your environment follows this specific set of rules, any modern RL algorithm can interact with it.
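For example, the interaction loop looks the same for every Gymnasium environment. Here is a minimal sketch using the built-in `CartPole-v1`:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")                  # any env that follows the API works here
obs, info = env.reset(seed=42)

for _ in range(100):
    action = env.action_space.sample()         # stand-in for a real policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```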
Spaces: Defining the Boundaries¶
How does an agent know what it can see or do? We define the boundaries using gymnasium.spaces. These objects describe the exact shape, bounds, and data types of inputs and outputs.
Why do we need to define Spaces?
The RL algorithm needs to know the shape and type of the data it will receive (observations) and the data it can send (actions). By defining these spaces, we ensure that our environment and agent can communicate effectively without type errors or mismatched dimensions. Remember, these machine learning algorithms are mathematical functions that expect inputs of a specific shape and type. The spaces act as a contract between the environment and the agent, ensuring they speak the same language.
- `Discrete(n)`: A single integer from `0` to `n-1`. Used when there are distinct, separate options (e.g., `Discrete(3)` for moving Left, Right, or Jumping).
- `Box(low, high, shape)`: An n-dimensional array of continuous numbers (floats). Used for continuous readings (e.g., a camera frame, joint angles, or a speedometer).
- `MultiDiscrete([n, m])`: Multiple discrete actions at once (e.g., pressing 'A' and 'Up' simultaneously on a controller).
- `Dict(...)`: A dictionary of simpler spaces. Great for composite observations where variables have different types (e.g., combining a camera frame `Box` with a scalar health value `Discrete`).
- `Tuple(...)`: Similar to a `Dict`, but structures the spaces as a tuple instead of named keys.
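A few of these spaces in code, with illustrative bounds and shapes:

```python
import numpy as np
from gymnasium import spaces

move = spaces.Discrete(3)                      # 0 = Left, 1 = Right, 2 = Jump
frame = spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8)   # camera image
buttons = spaces.MultiDiscrete([2, 2])         # ('A' pressed?, 'Up' pressed?)
composite = spaces.Dict({"frame": frame, "health": spaces.Discrete(101)})

print(move.sample())                           # e.g. 2
print(composite.sample()["health"])            # e.g. 87
```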
Start with Discrete Actions
When building your first environments, try to frame your problem using discrete action spaces if possible. Algorithms generally learn faster and more reliably in finite discrete spaces than in continuous Box spaces where the options are practically infinite!
The Core Methods¶
Every Gymnasium environment is a Python class that inherits from gym.Env and implements four main methods:
sequenceDiagram
autonumber
participant A as Agent
participant E as Environment (gym.Env)
A->>E: reset()
E-->>A: observation, info
loop Every Step
A->>E: step(action)
Note over E: Physics Engine & Rules Applied
E-->>A: observation, reward, terminated, truncated, info
end
- `__init__(self)`: Sets up the environment, defining what the agent can see (Observation Space) and what it can do (Action Space).
- `reset(self)`: Restarts the environment to an initial state and returns the first observation. (Think of this as hitting the reset button on a console.) It can also accept a `seed` for reproducibility and an `options` dictionary for custom configurations. Returns a tuple: `(initial observation, info)`.
- `step(self, action)`: The core physics engine. It takes the agent's action, updates the world, and returns:
    - `observation`: The new state (\(S_{t+1}\)).
    - `reward`: The feedback (\(R_{t+1}\)).
    - `terminated`: A boolean indicating if the episode ended due to the rules (e.g., the agent crashed).
    - `truncated`: A boolean indicating if the episode ended due to a time limit.
    - `info`: A dictionary for debugging.
- `render(self)`: (Optional) Draws the environment to the screen for human viewing.
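Put together, the skeleton of a custom environment looks roughly like this (a minimal sketch with placeholder dynamics, not our thermostat yet):

```python
import gymnasium as gym
from gymnasium import spaces


class MyEnv(gym.Env):
    def __init__(self):
        super().__init__()
        # What the agent can see and what it can do.
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,))
        self.action_space = spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)               # seeds self.np_random for reproducibility
        self._state = self.observation_space.sample()
        return self._state, {}                 # (initial observation, info)

    def step(self, action):
        # Apply the action, update the world state, compute feedback...
        reward = 0.0
        terminated = False                     # episode ended by the rules (failure/goal)
        truncated = False                      # episode ended by a time limit
        return self._state, reward, terminated, truncated, {}

    def render(self):
        pass                                   # optional: draw the world for humans
```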
Hands-On: The Server Room Thermostat (Advanced)¶
Let's build a custom environment.
Real World Case Study: Server Room Thermostat
This is a toy model of a real-world problem. Data centers consume massive amounts of energy, and cooling is a huge part of that cost. An intelligent thermostat could learn to pre-cool the room just before a big job starts, saving energy while keeping the servers safe.
The Scenario: You manage a server room. Instead of constant random heat, heat is generated by Jobs. Jobs are submitted randomly, one at a time, and require a certain number of CPUs for a few timesteps. More CPUs generate more heat. Your agent controls a thermostat and must keep the room temperature as close to 20°C as possible.
Why are we adding jobs?
Because the agent can see the active processes, it can learn to pre-emptively cool the room when a massive job arrives, rather than just reacting after the temperature has already spiked. This teaches the foundation of predictive action!
Step 1: Defining the MDP (State, Action, Reward, Termination)¶
Before writing code, we must translate our real-world problem into the mathematical framework we learned in Part 1.
flowchart TD
S[State: Temp & CPUs] --> A{Action: Cool / None / Heat}
A --> |Apply Action| E[Environment: Job Processing & Heat]
E --> R["Reward: -abs(Temp - 20)"]
E --> S2[New State]
E --> D{Done?}
D --> |Yes: Temp > 90 or < 5| F[Terminate: Failure]
D --> |Yes: Step >= 60| T[Truncate: Success]
D --> |No| S2
- Action Space (\(A\)): The agent can make 3 choices: Cool down (-1), Do nothing (0), Heat up (+1). We will use a `Discrete(3)` space.
- Observation Space (\(S\)): The agent reads two things of completely different types: the current temperature (a continuous float from 0.0°C to 100.0°C) and the number of active CPUs currently running a job (an integer from 0 to 100). We will use a composite `Dict` space containing a `Box` and a `Discrete` space.
- Reward (\(R\)): The agent gets punished the further the temperature strays from the target of 20°C: `Reward = -abs(Current_Temp - 20)`. It also receives a large bonus of `+100` for surviving the entire episode without melting or freezing the servers. (See the code fragment just after this list.)
Termination vs. Truncation
The difference between these two is massive to an RL algorithm!
- Termination tells the agent: "This was a failure state, avoid this behavior entirely."
- Truncation tells the agent: "Time ran out for this simulation, but the state itself wasn't necessarily bad."
- Termination & Truncation (Done Conditions): We need to know when the episode ends.
- Termination (Failure): If the temperature hits 90.0°C (servers melt) or drops to 5.0°C (servers freeze), the episode immediately ends in failure.
- Truncation (Success): If the agent manages the room successfully for 60 timesteps without a failure, the episode ends successfully (time limit reached).
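In Gymnasium terms, that MDP translates to something like the fragment below (the full environment comes in Step 2; the example temperature and step count are just placeholder values):

```python
import numpy as np
from gymnasium import spaces

# The agent's interface, straight from the MDP above.
action_space = spaces.Discrete(3)              # 0 = Cool, 1 = Do nothing, 2 = Heat
observation_space = spaces.Dict({
    "temperature": spaces.Box(low=0.0, high=100.0, shape=(1,), dtype=np.float32),
    "active_cpus": spaces.Discrete(101),       # 0-100 CPUs currently in use
})

# Reward and done logic for one timestep (placeholder values for illustration).
current_temp, current_step = 37.5, 12
reward = -abs(current_temp - 20.0)                           # -17.5
terminated = current_temp >= 90.0 or current_temp <= 5.0     # False: servers still fine
truncated = current_step >= 60                               # False: time limit not hit
```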
Step 2: Writing the Code¶
Let's write the actual Python code for our environment in a file called `custom_env.py`.
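Below is a condensed sketch of the environment. The heat-per-CPU, cooling power, and job-arrival constants are illustrative assumptions; the structure (the composite `Dict` observation space, the reward, and the termination/truncation logic) follows the MDP defined in Step 1.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ServerRoomEnv(gym.Env):
    """Toy server-room thermostat: keep the temperature near 20°C while
    randomly submitted jobs generate heat proportional to their CPU usage.
    All physical constants here are illustrative, not tuned values."""

    def __init__(self, target_temp=20.0, max_steps=60):
        super().__init__()
        self.target_temp = target_temp
        self.max_steps = max_steps

        # Action: 0 = Cool (-1), 1 = Do nothing (0), 2 = Heat (+1)
        self.action_space = spaces.Discrete(3)

        # Observation: current temperature (float) and active CPUs (integer 0-100)
        self.observation_space = spaces.Dict({
            "temperature": spaces.Box(low=0.0, high=100.0, shape=(1,), dtype=np.float32),
            "active_cpus": spaces.Discrete(101),
        })

    def _get_obs(self):
        return {
            "temperature": np.array([self.temp], dtype=np.float32),
            "active_cpus": self.active_cpus,
        }

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)   # seeds self.np_random for reproducibility
        self.temp = 20.0 + float(self.np_random.uniform(-2.0, 2.0))
        self.active_cpus = 0
        self.job_timer = 0         # timesteps remaining on the current job
        self.current_step = 0
        return self._get_obs(), {}

    def step(self, action):
        self.current_step += 1

        # Job scheduling: jobs arrive randomly, one at a time, and hold
        # some number of CPUs for a few timesteps.
        if self.job_timer > 0:
            self.job_timer -= 1
            if self.job_timer == 0:
                self.active_cpus = 0
        elif self.np_random.random() < 0.3:
            self.active_cpus = int(self.np_random.integers(1, 101))
            self.job_timer = int(self.np_random.integers(3, 8))

        # Thermostat: action 0/1/2 maps to -1/0/+1 units of cooling/heating.
        self.temp += (action - 1) * 5.0
        # Active CPUs generate heat; the room slowly drifts back toward 20°C.
        self.temp += 0.1 * self.active_cpus
        self.temp -= 0.05 * (self.temp - 20.0)
        self.temp = float(np.clip(self.temp, 0.0, 100.0))

        # Reward: penalize distance from the 20°C target.
        reward = -abs(self.temp - self.target_temp)

        # Termination = failure (melted/frozen); truncation = time limit reached.
        terminated = self.temp >= 90.0 or self.temp <= 5.0
        truncated = (not terminated) and self.current_step >= self.max_steps
        if truncated:
            reward += 100.0        # bonus for surviving the whole episode

        return self._get_obs(), reward, terminated, truncated, {}
```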
Step 3: Testing the Environment with a Random Agent¶
Before we apply advanced AI to solve this, we must ensure the physics engine works. We can test it by letting an agent take completely random actions.
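A driver script along these lines will do, assuming the `ServerRoomEnv` sketch above is saved as `custom_env.py`:

```python
from custom_env import ServerRoomEnv   # the sketch above, saved as custom_env.py

env = ServerRoomEnv()
obs, info = env.reset(seed=42)
print(f"Starting Temperature: {obs['temperature'][0]:.2f}°C | Starting CPUs: {obs['active_cpus']}")

total_reward = 0.0
step, terminated, truncated = 0, False, False
while not (terminated or truncated):
    action = env.action_space.sample()  # random agent: just guess every timestep
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    step += 1
    print(f"Step: {step:02d} | Temp: {obs['temperature'][0]:.2f}°C | Active CPUs: {obs['active_cpus']}")

print(f"Episode 1 finished with Total Reward: {total_reward:.2f}")
```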
Example output:
Starting Temperature: 21.14°C | Starting CPUs: 0
Step: 01 | Temp: 26.61°C | Active CPUs: 58
Step: 02 | Temp: 32.63°C | Active CPUs: 58
Step: 03 | Temp: 32.98°C | Active CPUs: 58
Step: 04 | Temp: 32.85°C | Active CPUs: 58
Step: 05 | Temp: 36.06°C | Active CPUs: 58
Step: 06 | Temp: 38.90°C | Active CPUs: 0
...
🔥 BOOM! Servers melted at 93.1°C! You exploded! 🔥
Step: 59 | Temp: 93.07°C | Active CPUs: 57
Episode 1 finished with Total Reward: -2398.05
If you run this code, you will notice the Active CPUs periodically spike up. A random agent will just guess, leading to huge temperature swings. Eventually, you'll see one of our fatal print statements trigger when the random agent inevitably freezes or melts the servers!
Next Steps: In the final section, we will train a Deep Reinforcement Learning algorithm (PPO). The neural network will learn the correlation between Active CPUs and rising temperature, allowing it to pre-cool the room exactly when a massive 80-CPU job is submitted!