Environments

The environments in which the algorithms were tested are offered by the OpenAI Gym toolkit for reinforcement learning research. Specifically, the following were used:

  • CartPole-v0: a pole is attached to a cart by an un-actuated joint and starts upright. A reward is given at every timestep for as long as the pole remains upright (within 12 degrees of the vertical). An episode terminates if the pole falls (exceeds 12 degrees from the vertical) or if the cart leaves the screen.

  • MountainCarContinuous-v0: a car starts in a valley between two hills in every episode. A reward is given if it reaches the top of the right hill. A small penalty is incurred based on the torque applied by the motor, which is not strong enough to push the car directly up the hill. Therefore the agent must learn to build momentum by swinging the car back and forth between the hills.

  • LunarLanderContinuous-v2: the agent controls a spaceship by regulating the thrust of the main and lateral motors. A small penalty is incurred based on the main engine's throttle, and a large one if the spaceship crashes onto the ground. Successful landings are rewarded, with a larger reward for landing on the designated pad in the center of the screen.

  • BipedalWalker-v2: the agent controls a biped robot (in two dimensions) and is rewarded by moving forward, up to the far end of the track. A small penalty is incurred based on the torque applied to the joints of the robot and a large one if it crashes onto the ground.

  • RoboschoolHalfCheetah-v2: the agent controls a two-dimensional cheetah-like robot and is rewarded for moving forward. Because of the time limit, better agents must move faster in order to achieve a higher total reward. Small penalties are incurred based on the joint torques and impact forces, and a large one if the robot falls.

  • RoboschoolAnt-v2: the agent controls a three-dimensional robot with four legs. Rewards and penalties are given based on the same criteria as in the previous environment; however, the added dimensionality makes this a more challenging task.
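All of these environments expose the same interaction loop: the agent resets the environment, then repeatedly submits an action and receives an observation, a reward, and a termination flag. The sketch below illustrates this classic Gym-style interface with a hypothetical `DummyEnv` standing in for a real environment (a real run would call `gym.make("CartPole-v0")` instead); `run_episode` and the random policy are illustrative helpers, not part of Gym itself.

```python
import random

class DummyEnv:
    """Hypothetical stand-in that mimics the classic Gym interface.
    A real experiment would use gym.make("CartPole-v0") or similar."""
    def __init__(self, horizon=10):
        self.horizon = horizon  # episode length of the dummy task
        self.t = 0
    def reset(self):
        self.t = 0
        return [0.0]  # initial observation
    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # observation, reward, done flag, info dict
        return [0.0], 1.0, done, {}

def run_episode(env, policy):
    """Roll out one episode and return the total (undiscounted) reward."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total

# A random policy over two discrete actions, as in CartPole.
total = run_episode(DummyEnv(), policy=lambda obs: random.choice([0, 1]))
print(total)  # 10.0: one unit of reward per timestep over the dummy horizon
```

The same loop works unchanged for the continuous-control environments; only the action produced by the policy changes from a discrete index to a vector of motor torques or throttles.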

Additional criteria govern how rewards and penalties are assigned at each timestep in the last two environments; however, these are not documented in OpenAI Gym. Videos of the best policies obtained, executed in these environments, are available in the "Videos" section. Moreover, the last video shows simulations of the policies obtained in the quadruped robot environment over the course of training.