Deep Learning and AI

Reinforcement Learning - Teaching AI with Rewards

September 12, 2024 • 14 min read

SPC-Blog-Reinforcement-learning-teaching-AI-with-rewards.jpg

Introduction

Artificial Intelligence (AI) offers a novel approach to problem-solving. In traditional algorithms, the programmer explicitly defines the variables and logic, which are then used to generate an output based on predetermined rules. With AI, however, data is fed into the model, which learns to identify patterns and relationships on its own—without the need for rules specified by the programmer.

But how can we guide an AI model when it’s making decisions autonomously? AI often detects patterns in data, but translating those patterns into real-world actions requires more. This is where Reinforcement Learning (RL) comes in. RL is a branch of AI that focuses on systems that learn from experience, making decisions through trial and error while receiving feedback in the form of rewards.

What is Reinforcement Learning?

Reinforcement Learning (RL) is a branch of AI where an agent learns to make intelligent decisions by interacting with its environment and receiving feedback in the form of rewards or penalties.

A helpful analogy is training a dog. When a dog performs a desirable action, like walking attentively beside its trainer, it receives a reward, such as a treat and praise. Conversely, when the dog does something undesirable, like sniffing something it shouldn't, it may face a correction, such as a verbal "No" and a slight leash pull, both of which are unpleasant for the dog. Over time, the dog learns to maximize positive outcomes (more treats) and avoid negative ones. The type of reward matters as well; kibble might be a low-value reward, while a piece of steak is a much higher-value reward.

Similarly, in RL, an agent takes actions that result in rewards or penalties, gradually refining its behavior to maximize rewards. These rewards can vary in value and are adjusted over time to guide the agent toward performing the desired task more effectively. However, these rewards need to be carefully tuned or exploitation can be an issue – more on that later.

Elements of an RL Algorithm

The first step to understanding the basics of RL is introducing the key elements of an RL algorithm. These elements are illustrated in the diagram below.

  • Agent: The learner or decision-maker (e.g., a robot, algorithm, etc.).
  • Environment: The space or system the agent operates within.
  • State: A snapshot of the environment at a particular time.
  • Action: A decision or move the agent makes in response to the environment's state.
  • Reward: Feedback from the Environment that tells the Agent how good or bad its Action was in a given State. The reward is a numerical value (e.g., +1 point or -2 points) that guides the agent in making future decisions.
SPC-Blog-Reinforcement-learning-teaching-AI-with-rewards-1.jpg
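
Putting these elements together, the core of almost any RL setup is a simple interaction loop: the Agent observes a State, chooses an Action, and receives a Reward along with the next State. The sketch below is purely illustrative; the Agent and the env object here are placeholders rather than any specific library.

```python
# Illustrative agent-environment loop; the environment object is a hypothetical
# placeholder with reset() and step() methods, not a specific library.

class Agent:
    def choose_action(self, state):
        # A real agent would consult its policy here.
        raise NotImplementedError

    def learn(self, state, action, reward, next_state):
        # A real agent would update its policy/value estimates here.
        raise NotImplementedError


def run_episode(env, agent):
    state = env.reset()                  # initial State: a snapshot of the Environment
    done, total_reward = False, 0.0
    while not done:
        action = agent.choose_action(state)           # Agent picks an Action
        next_state, reward, done = env.step(action)   # Environment returns Reward and next State
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward
```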

How does the Agent learn in Reinforcement Learning?

Similar to training a dog, we reward our model for performing desirable actions. But before the model can seek rewards, we need to define how it can map actions to those rewards. The goal is for the model to continually earn more rewards. This is where the concepts of policy and value come into play in decision-making.

  • The Policy is the strategy the agent uses to decide which action to take in a given state. It can be thought of as a mapping from states to possible actions or inputs, often expressed as simple logical conditions (e.g., ‘IF’ statements). Whatever the policy prescribes for the current state determines the agent's action.
  • The Value represents the expected reward or punishment for being in a particular state or taking a specific action in that state. As the model interacts with the environment, the value function is updated and refined to better predict the outcomes of future actions.
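
As a rough sketch (with a small, made-up set of discrete states), a tabular policy can be stored as a mapping from states to actions, and a value function as a mapping from states to expected rewards:

```python
# Hypothetical tabular policy and value function; the state names and numbers are invented.

policy = {                       # policy: which action to take in each state
    "safe_open_space": "move_right",
    "enemy_ahead":     "move_left",
    "gap_below":       "stay",
}

value = {                        # value: expected future reward of being in each state
    "safe_open_space":  2.0,
    "enemy_ahead":     -3.0,
    "gap_below":       -1.0,
}

def act(state):
    # Follow the policy if the state is known; otherwise fall back to a default action.
    return policy.get(state, "stay")
```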

For example, imagine we're training an AI to play a vertical platformer like Doodle Jump. The objective is to guide the Doodler (the character) higher by bouncing on green (stationary) and blue (moving) platforms. As the game progresses, the AI must also avoid hazards like cracked red platforms, dodge enemies, and collect power-ups. The Doodler is controlled by tilting the device, and as it climbs higher, the game score increases. The AI's policy would dictate how it responds to each situation, and the value function would predict how beneficial certain moves are for achieving a higher score.

SPC-Blog-Reinforcement-learning-teaching-AI-with-rewards-2.jpg
  • The Agent is the Reinforcement Learning algorithm that controls the character, Doodler.
  • The Environment consists of platforms, springs, power-ups, enemies, holes, and cracked platforms.
  • The State is a snapshot of the game at a specific moment, captured either frequently or at intervals as defined by the model.
    • The State includes all of Doodler’s attributes at that moment, such as left/right velocity, proximity to danger, and proximity to safety.
    • We can define as many or as few variables in the state as needed.
    • Each state is assigned a value related to safety. If Doodler is approaching danger, the value will be negative, prompting the Agent to take an action that leads to a positive value, such as targeting power-ups or avoiding enemies.
  • The Policy represents the possible actions Doodler can take and the corresponding inputs the Agent can execute.
    • For Doodler: Move left, move right, or stay still.
    • For the Agent: Set tilt to -20 degrees (go left), Set tilt to 0 degrees (stay straight), Set tilt to 15 degrees (go right).
    • Each policy also has an associated value function that predicts the expected reward for a given action.
      • For example: Expected positive reward for moving left to land on a higher platform; Expected negative reward for moving too far left and approaching an enemy.
    • Policies can be based on past actions or simulations.
  • The Action is the move the chosen policy prescribes and the Agent executes, determining what Doodler does. For instance, "Move left by setting tilt to -20 degrees."
  • The Reward is the numerical feedback the Agent receives for taking an action in a given state.
    • Each state variable can result in positive or negative rewards.
    • By evaluating policy values, the combination of positive and negative rewards helps the model learn that certain locally negative actions may be acceptable if they optimize the overall outcome.
      • For example, the AI might control Doodler to jump near an enemy (incurring a small negative reward) as long as it avoids touching the enemy (which would incur a large negative reward) if it allows Doodler to reach the only available platform (resulting in a large positive reward).

This process is repeated an enormous number of times; most reinforcement learning models require tens or hundreds of thousands of iterations, and sometimes millions, before they can perform well in the game.
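
To make the Doodle Jump reward structure above concrete, here is a small, hypothetical per-step reward function; the state fields and numeric values are invented for illustration, not taken from a real implementation.

```python
# Hypothetical per-step reward for the Doodle Jump example; values are illustrative only.

def compute_reward(state):
    reward = 0.0
    reward += 1.0 * state["height_gained"]             # climbing higher: main positive signal
    if state["collected_powerup"]:
        reward += 5.0                                   # bonus for grabbing a power-up
    if state["near_enemy"]:
        reward -= 1.0                                   # small penalty for risky positioning
    if state["touched_enemy"] or state["fell_off_screen"]:
        reward -= 50.0                                  # large penalty for losing
    return reward

# A risky move near an enemy can still be worthwhile if it reaches a higher platform.
step = {"height_gained": 3.0, "collected_powerup": False,
        "near_enemy": True, "touched_enemy": False, "fell_off_screen": False}
print(compute_reward(step))  # 3.0 - 1.0 = 2.0 -> net positive
```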

SPC-Blog-Reinforcement-learning-teaching-AI-with-rewards-3.jpg

Model-based vs Model-free RL Approaches

In reinforcement learning (RL), there are two main approaches for training an agent: model-based and model-free. These approaches differ in how the agent interacts with the environment, how it learns to make decisions, and how its policies and reward/value functions are constructed.

Model-Based Reinforcement Learning

In model-based RL, the agent builds and uses a model of the environment to plan its actions. This model predicts the next state and reward based on the current state and action, allowing the agent to simulate different outcomes before taking actions in the real environment.

Advantages of Model-Based RL

  • Efficient learning: The agent can plan ahead by simulating many potential outcomes without having to physically try every action in the environment.
  • Faster convergence: Since the agent uses a model to explore different options internally, it can often find optimal solutions faster, especially when the environment is complex.

Challenges of Model-Based RL

  • Model accuracy: If the model of the environment is inaccurate, the agent may make poor decisions based on faulty predictions.
  • Complexity: Learning or constructing an accurate model can be computationally expensive, especially for large or complex environments.
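
As a minimal sketch of the model-based idea, assume we already have a (possibly learned) model(state, action) that predicts the next state and reward, plus a rough value_estimate(state); the agent can then evaluate each candidate action inside the model before acting in the real environment.

```python
# One-step lookahead planning with an assumed model of the environment (illustrative).

def plan_action(state, actions, model, value_estimate):
    best_action, best_score = None, float("-inf")
    for action in actions:
        next_state, predicted_reward = model(state, action)    # simulate; don't act yet
        score = predicted_reward + value_estimate(next_state)  # immediate + estimated future value
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```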

Model-Free Reinforcement Learning

In model-free RL, the agent does not build a model of the environment. Instead, it learns directly from interactions with the environment through trial and error, focusing on learning a policy or value function from observed states, actions, and rewards without trying to predict future states in advance.

Advantages of Model-Free RL

  • Simplicity: Model-free RL doesn't require the agent to learn or maintain a model of the environment, reducing complexity.
  • Versatility: It can be used in environments where building a model is difficult, such as high-dimensional or dynamic environments like games.

Challenges of Model-Free RL

  • Sample inefficiency: The agent often needs to interact with the environment many times (sometimes millions of steps) to learn effective strategies, making it slow to converge.
  • Less foresight: Without a model, the agent has no way to "plan" ahead; it must rely entirely on its past experiences to decide on actions, which can lead to suboptimal decisions early on.

Comparison of Model-Based RL and Model-Free RL

Model-based RL uses a model of the environment (often learned from data via machine learning or deep learning) to predict the outcomes of candidate actions, whereas model-free RL learns a policy or value function directly through environmental interaction, relying on pure "trial and error" rather than predictions or estimates from a model.

| Aspect | Model-Based RL | Model-Free RL |
| --- | --- | --- |
| Model | Learns or is given a model of the environment | No model of the environment is used |
| Learning Process | Simulates future states/actions to plan | Learns directly from trial-and-error experience |
| Efficiency | More sample-efficient, as it can simulate actions | Sample-inefficient; requires lots of exploration |
| Foresight | Can plan ahead and evaluate different actions before executing | Relies on learning from past experience with no lookahead |
| Use Case | Effective when the environment can be modeled accurately (e.g., physics simulations) | Suitable when modeling the environment is complex or infeasible (e.g., video games, complex systems) |
| Convergence Speed | Can converge faster if the model is accurate | Tends to take longer to learn effective strategies |
| Example Algorithms | Dyna-Q, AlphaZero (uses planning with simulations) | Q-learning, Deep Q-Networks (DQN), Policy Gradient methods |

Imagine a robot navigating a maze.

In model-based RL:

  • The robot builds a model of the maze that predicts where it will end up after moving (next state) and how close it gets to its goal (reward).
  • Before physically moving, the robot simulates different paths using the model and chooses the one that leads to the goal fastest.

In the same example, a model-free RL robot would:

  • Try different movements randomly (exploration), learning which ones get closer to the goal by receiving rewards (e.g., positive reward for reaching the goal).
  • Over time, it updates its policy to take actions that maximize rewards based on experiences, without needing to simulate or plan ahead.
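
A minimal sketch of one model-free method for the maze robot is tabular Q-learning; the environment interface (env.reset(), env.step()) and the hyperparameter values below are assumptions for illustration, not a specific library.

```python
import random
from collections import defaultdict

# Tabular Q-learning for the maze robot; the env interface is assumed for illustration.
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)  # Q[(state, action)] -> estimated future reward

def choose_action(state):
    # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def train(env, episodes=10_000):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)
            # Nudge the estimate toward reward + discounted value of the best next action.
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = next_state
```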

If we take into account our Doodle Jump example, we have to use a model-free approach when training a model to play the game with randomized map generation. Early training will take a long time, but in the end the agent will generalize better to different, randomly generated layouts.

Alternatively, we could start training our Agent with a model-based approach on multiple predefined Doodle Jump maps with fixed obstacle locations. The Agent can simulate policies and actions, train faster, and then apply those learned rules and tendencies to a randomly generated game. However, training on a single map too many times can cause overfitting, leaving the Agent unable to generalize to new environments.

Challenges and Limitations

One of the major challenges of RL is its high computational and data cost: the agent needs an enormous number of interactions with the environment to learn effectively, which makes RL harder to apply in certain real-world scenarios.

Exploration vs. Exploitation

Striking the right balance between exploring new strategies (to find better actions) and exploiting known good strategies (to maximize rewards) is difficult.
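
One common way to manage this trade-off (though by no means the only one) is an epsilon-greedy strategy whose exploration rate decays over training; the schedule and parameter values below are a hypothetical illustration.

```python
import random

# Epsilon-greedy action selection with a decaying exploration rate (illustrative values).
def select_action(state, q_values, actions, episode,
                  eps_start=1.0, eps_end=0.05, decay=0.999):
    # Start almost fully exploratory, then gradually shift toward exploitation.
    epsilon = max(eps_end, eps_start * (decay ** episode))
    if random.random() < epsilon:
        return random.choice(actions)                        # explore: try something new
    return max(actions, key=lambda a: q_values[(state, a)])  # exploit: best known action
```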

There’s an instance where an AI was taught to play Tetris using Reinforcement Learning. Because Tetris is highly complex and the reward was only given when lines were cleared, the AI had a hard time understanding how to fit pieces together. To avoid total failure, the AI paused the game right before game over, an exploit that never increased its score but also never let it lose, halting training indefinitely. This Tetris AI would never learn the game, exploiting the pause button on every iteration.

Other instances showcase how exploitation can break a game's intended functionality. For example, in a 3v1 game of hide-and-seek played by reinforcement learning agents, the AI in charge of hiding took advantage of a bug, glitching itself off of the map where it could never be caught by the seekers. Eventually, the AI was able to replicate this bug flawlessly, playing the game in an unintended manner.

Other times, training performance can hit a plateau. Too much exploration leads to wasted time trying actions that don't benefit the model's performance, while too much exploitation can cause the agent to miss better strategies or make unintended choices to farm the reward system. Mitigating exploitation is a case-by-case problem that we won't go in-depth on here, but it is something to be vigilant about when using Reinforcement Learning.

Computational Cost

Training RL agents, especially in complex environments, requires massive computational resources and time. This is particularly true for model-free RL, where millions of interactions may be needed for the agent to learn effectively.

Here, you can use sample-efficient algorithms like DQN with experience replay, or other off-policy methods that reuse previously collected data, to reduce the need for new interactions and build on prior learned behavior.
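
As a rough sketch of the experience replay idea, past transitions are stored in a buffer and re-sampled for training instead of always requiring fresh interactions; the capacity and batch size here are arbitrary choices.

```python
import random
from collections import deque

# Simple experience replay buffer (illustrative sizes).
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Reuse stored experience for training instead of collecting new interactions.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```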

However, leaning heavily on experience replay to cut computational costs can limit how well the agent adapts to new environments. If you want the agent to learn largely from fresh interactions, increasing your computational resources will not only reduce the time to completion but also allow for even greater complexity. A dedicated GPU system can parallelize multiple AI iterations for training, dramatically speeding up the workload compared to an enthusiast laptop.
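
As a simple illustration of that kind of parallelism, several environment copies can collect experience side by side; make_env and make_agent below are hypothetical factory functions standing in for whatever game and agent you are training.

```python
from concurrent.futures import ProcessPoolExecutor

def rollout(seed):
    # Run one full episode in its own process and return the collected transitions.
    env = make_env(seed=seed)        # hypothetical environment factory
    agent = make_agent()             # hypothetical agent factory (read-only copy of the policy)
    state, done, transitions = env.reset(), False, []
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    return transitions

if __name__ == "__main__":
    # Collect experience from 8 environments in parallel, then train on the combined data.
    with ProcessPoolExecutor(max_workers=8) as pool:
        experience = [t for batch in pool.map(rollout, range(8)) for t in batch]
```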


Tags

ai

reinforcement learning

rl


