Game AI using Reinforcement Learning


Box Jump Game

Agent: The AI-controlled box that we want to train.

Observation Space (Environment): X-coordinates of the approaching enemies (represented by the triangles/spikes).

Action Space: Jump, Don’t Jump

Objective (Reward): The longer the agent survives, the higher the score. The goal is to get the highest score possible by jumping over approaching enemies without dying.

Figure 3. The player needs to control the box and jump when an enemy approaches in order to survive. Each successful jump over a spike earns the player one point. Game code from:

Model: Deep Q-Learning Network

We use a neural network to approximate a Q-table. It learns to output Q-values for the states given to it. When a reward is observed in a given state, we update the Q-value of the action taken using that reward. Then we train the model on that state, with the updated Q-values as the target.
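To make the update step concrete, here is a minimal sketch of how a training target might be computed for one observed transition. It assumes the standard Q-learning update with a blending factor alpha and a discount gamma (both 0.5, matching the values reported later); the function name `q_target` and the example numbers are illustrative and not taken from the original code.

```python
def q_target(old_q, reward, next_q_values, done, alpha=0.5, gamma=0.5):
    """Blend the old Q-value toward reward + discounted best future Q-value."""
    # Once the episode has ended there is no future reward to add.
    best_next = 0.0 if done else max(next_q_values)
    td_target = reward + gamma * best_next
    return (1 - alpha) * old_q + alpha * td_target


# Example: the agent jumped (action 0), cleared a spike, and earned +1.
q_values = [0.2, 0.1]        # network predictions for [jump, don't jump]
next_q_values = [0.4, 0.3]   # network predictions for the next state
q_values[0] = q_target(q_values[0], reward=1.0,
                       next_q_values=next_q_values, done=False)
# q_values is now the target the network would be fitted to for this state.
```

Only the entry for the action actually taken changes; the other entry keeps the network's own prediction, so the fit does not push it anywhere.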


Let’s take a look at the code that will train and run the AI for the box jump game. We took this code from Sani Khurana, who trained the bot using a basic neural net. You can see the results in their Medium article. Our goal here is a little different: we want to see whether we can use DQNs to train our bot.

We first import the game libraries as well as any functions that we will be using.

Code 1: Set up the environment and import libraries.
Figure 4: Update reward function after each observed outcome. (Source:

We use a pre-built wrapper class, which we modified to pass our deep Q-learning network to the game and update the game state. It rewards our DQN model by looking at the state of the game and the actions it took. For example, on line 31 we modified the code to give a reward of +1 for jumping over an enemy. If the agent loses, we go back to the last move and assign a reward of -1. Whether or not the bot loses, we run our DQN remember and experience replay functions. These functions store the past states, actions, and rewards used to train the model. In this case, we randomly sample a batch of 64 training points from our existing “memory” and iteratively train the model. This method performs well; however, we do see that weighting actions with highly positive or negative rewards more heavily allows for faster training and higher scores.
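As a rough illustration of the remember and experience replay idea, the sketch below stores transitions in a bounded buffer and draws a uniform random batch of 64. The class name `ReplayMemory`, its capacity, and the tuple layout are assumptions made for this example; the original wrapper may organize its memory differently.

```python
import random
from collections import deque


class ReplayMemory:
    """Bounded store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10_000):
        # The oldest transitions are dropped once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement; shrink the batch if the
        # memory does not yet hold batch_size transitions.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)


memory = ReplayMemory()
for frame in range(100):
    memory.remember(state=[float(frame)], action=frame % 2, reward=0.0,
                    next_state=[float(frame + 1)], done=False)
batch = memory.sample()  # 64 distinct transitions to train on
```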

Code 2: The weighted game code was inspired by and modified from: Please support them by reading their article!

We then define the parameters for the neural network by creating a DQN class. It has functions that predict new actions and update the network at every frame. Note that the parameter epsilon represents the probability of taking a random action. We multiply this value by 0.95 at every experience replay, so the model explores heavily at first and then relies increasingly on what it learned from its past random actions. This lets the model experiment with new, potentially beneficial strategies instead of getting locked into one strategy at the beginning. Alpha and gamma are our discounts on the new Q-values and on the future predicted reward, respectively.
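The role of epsilon can be sketched as follows. This is a generic epsilon-greedy policy with multiplicative decay, not the original DQN class; the helper names and the `minimum` floor on epsilon are assumptions added for illustration.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, decay=0.95, minimum=0.01):
    """Shrink epsilon after an experience replay (the floor is an assumption)."""
    return max(minimum, epsilon * decay)


epsilon = 0.99
for _ in range(15):          # fifteen experience replays
    epsilon = decay_epsilon(epsilon)
# By now fewer than half of the actions are random.
```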

Our sequential neural network model (lines 20–30) is simple to enable faster training. We used trial and error to determine different parameters such as layer size. It takes as an input the observation space (X-coordinates) and outputs an action (Jump or not).

We used LeakyReLU, which enables faster training and convergence by having a non-zero derivative for negative inputs. (Source:

Figure 5: Leaky ReLU (Source:
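For reference, the activation and its derivative can be written in a few lines. The negative slope of 0.01 is a common default and an assumption here; the article does not state which value was used.

```python
def leaky_relu(x, negative_slope=0.01):
    """Identity for positive inputs, a small linear slope otherwise."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative: never exactly zero, so negative units keep learning."""
    return 1.0 if x > 0 else negative_slope


activations = [leaky_relu(v) for v in (-2.0, 0.5, 3.0)]
gradients = [leaky_relu_grad(v) for v in (-2.0, 0.5, 3.0)]
```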

We also used both Dropout and Batch Normalization layers, which stabilize the learning process. As with the layer sizes, we used trial and error to find the best architecture and parameters for our model.

Code 3: The DQN implementation was modified and inspired by:

We run the game and plot the results that we will present below.

Code 4: Run game and plot results

You can find the full game code in the repository below. This includes box game classes that were not in the scope of this article. It also includes a weighted reward version and energy version of the RL algorithm.


We tried multiple experiments in order to improve our bot. These include:

  • Retraining at every frame vs. Retraining every death.
  • Retraining on a randomly sampled batch every death vs. Iterative training.
  • Retraining on weighted sampled batch every death.
  • Reward +1 for jumping over an enemy, 0 for surviving, or -1 for dying to an enemy.
  • Including energy as an input (jumping expends one point of energy, while not jumping gives you one point of energy. You can’t jump if energy is 0).
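The energy rule from the last experiment can be sketched as a small state update. The function name is hypothetical, and two details are assumptions: a blocked jump (energy 0) is treated like not jumping and restores one point, and energy has no upper cap.

```python
def step_energy(energy, wants_jump):
    """Return (new_energy, jumped) after one decision."""
    if wants_jump and energy > 0:
        # A successful jump costs one point of energy.
        return energy - 1, True
    # Not jumping (or being unable to jump) restores one point.
    return energy + 1, False


energy = 1
energy, jumped_first = step_energy(energy, wants_jump=True)   # jump succeeds
energy, jumped_second = step_energy(energy, wants_jump=True)  # blocked at 0
```

The resulting energy value would then be appended to the observation so the network can see how many jumps it has left.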

Best Performance: Iterative training on every frame with weighted sampling. After approximately 15 games, it reached a record high of 600 points.

Figure 6: A game where high and low rewards are heavily weighted
Figure 7: A game where high and low rewards are moderately weighted

With a gamma and alpha of 0.5, a learning rate of 0.05, an epsilon of 0.99, and a decay of 0.95, the bot learns a decent strategy by round 15. Weighting the high and low rewards heavily leads to a more proficient bot by round 15 (average of 71 after 30 rounds) than weighting the rewards moderately (average of 40 after 30 rounds), although the bot still tends to make mistakes in both cases.

The agent did best when we gave it weighted samples, learning extremely fast. We had to cut off the number of training rounds since it consistently hit 200+. Meanwhile, the energy bot could not pass a score of even 20, as it became reluctant to jump.
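One way to realize weighted sampling is to make a transition's chance of being drawn grow with the magnitude of its reward. The sketch below is an assumption about how this could look, not the weighting used in the experiments; `base_weight` and `reward_scale` are hypothetical knobs, and `random.choices` draws with replacement.

```python
import random


def weighted_sample(memory, batch_size=64, base_weight=1.0,
                    reward_scale=5.0, rng=random):
    """Draw a batch in which high-magnitude rewards are over-represented."""
    # Each transition is (state, action, reward, next_state, done).
    weights = [base_weight + reward_scale * abs(t[2]) for t in memory]
    # Sampling is with replacement, so rare transitions may repeat.
    return rng.choices(memory, weights=weights, k=batch_size)


# 90 uneventful frames, 5 cleared spikes (+1), and 5 deaths (-1).
memory = ([([0.0], 0, 0.0, [0.0], False)] * 90
          + [([0.0], 1, 1.0, [0.0], False)] * 5
          + [([0.0], 0, -1.0, [0.0], True)] * 5)
batch = weighted_sample(memory, rng=random.Random(0))
```

With these weights, the 10% of transitions that carry a reward make up roughly 40% of each batch on average; raising `reward_scale` corresponds to weighting the high and low rewards more heavily.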


  • Sampling for training data could be weighted better for the most important rewards.
  • The game is extremely slow. This inhibits our ability to train the bot on more games. A solution would be to train on a faster machine. However, we could also find a better balance between re-training the model and running the game. We found the best performance when training at every frame, but there is a significant speed tradeoff.
  • The box jumps excessively. We could improve this by penalizing it with a jump counter relative to its existing score. We tried implementing an energy system, but this needs further exploration since, as mentioned, it caused the agent to avoid jumping. We could also code the bot not to jump when there is no enemy nearby. However, this somewhat defeats the purpose of using reinforcement learning: if we went that far, we could simply tell it to jump whenever an enemy is close enough.

Why excessive jumping?

We see that the box tends to double jump before a spike. It may be that, during the random phase, it cleared a number of spikes by jumping twice before reaching them. Since the Q-learning training mechanism takes into account a discounted future state reward, jumping prematurely is rewarded, i.e., the predicted future state is positive, so the action just before it is associated with a positive reward. This strategy also reduces the number of training inputs the model requires, since no input is given to the model while it is in the air. On the other hand, this may be beneficial behavior because the frame rate of the game is not very consistent: after the box jumps, it always lands a fixed distance from where it started, whereas when it does nothing the distance traveled is inconsistent. To fix this, we likely need more epochs of training, a more stable version of the game, or heavier discounting of future rewards.
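A small worked example shows why an early jump still looks attractive and how a smaller gamma changes that. It is an idealized calculation that assumes the +1 reward propagates back one decision at a time, scaled by gamma at each step; the function name is illustrative.

```python
def discounted_credit(reward, steps_before, gamma=0.5):
    """Credit an action receives for a reward that arrives steps_before later."""
    return reward * gamma ** steps_before


# Credit for the jump that clears the spike (0) and for earlier jumps (1-3).
credit_gamma_05 = [discounted_credit(1.0, s, gamma=0.5) for s in range(4)]
credit_gamma_09 = [discounted_credit(1.0, s, gamma=0.9) for s in range(4)]
# credit_gamma_05 is [1.0, 0.5, 0.25, 0.125]
```

With gamma at 0.9, a jump two decisions too early still receives about 0.81 of the reward, while with gamma at 0.5 it receives 0.25, so discounting future rewards more heavily shrinks the incentive to jump prematurely.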


