To pose a problem well, one must first illustrate it well. RL generalization, as the survey indicates, is a class of problems, and here we present two benchmark environments together with their common experimental settings.

Procgen

Following CoinRun, OpenAI's team proposed a new testing suite called Procgen. Consisting of 16 games, Procgen provides a convenient way to procedurally generate environments that share the same underlying logic and rewards but differ in layout and rendering. All 16 games share a discrete action space of size 15 and 64x64x3 RGB observations.
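As a minimal sketch (assuming the `procgen` package and classic OpenAI Gym are installed, with the Gym id and keyword names taken from the package's README), a single game can be instantiated like this:

```python
# Minimal sketch: instantiating one Procgen game through Gym.
# Assumes the `procgen` package and (classic) OpenAI Gym are installed.
import gym

env = gym.make("procgen:procgen-coinrun-v0", distribution_mode="easy")

print(env.action_space)       # Discrete(15)  -- shared by all 16 games
print(env.observation_space)  # Box(0, 255, (64, 64, 3), uint8) -- 64x64x3 RGB

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```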

To master any one of these environments, agents must learn a policy that is robust across all axes of variation.

Section 2.2 of the paper introduces the experimental protocol. The authors chose PPO as the default algorithm. In hard mode, PPO is trained for 200M timesteps, whereas the 25M timesteps of easy mode are much less expensive in GPU hours. Procgen can thus be used to test both sample efficiency and generalization ability.
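As a rough stand-in for that protocol (an assumption, not the authors' code: the paper uses its own PPO implementation), an off-the-shelf PPO such as stable-baselines3 could be pointed at a Procgen game with the easy-mode budget, version compatibility between the Gym API and the PPO library permitting:

```python
# Rough sketch of the easy-mode training budget with an off-the-shelf PPO.
# stable-baselines3 is an assumption here, not the paper's implementation,
# and the installed version must accept the classic Gym API used by procgen.
import gym
from stable_baselines3 import PPO

env = gym.make("procgen:procgen-coinrun-v0",
               num_levels=200, distribution_mode="easy")

model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=25_000_000)  # easy mode; hard mode uses 200M
```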

When evaluating generalization, we train on a finite set of levels, and we test on the full distribution of levels. Unless otherwise specified, we use a training set of 500 levels to evaluate generalization in each environment. For easy difficulty environments, we recommend using training sets of 200 levels.

The authors further proposed an additional setting for testing generalization, 500 Level Generalization: even the easy-mode setting remains costly, and 500 is roughly the number of training levels at which generalization starts to take effect.
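Concretely, the train/test split can be sketched as below (under the assumption that, as documented in the procgen README, `num_levels=0` requests the unbounded level distribution):

```python
# Sketch of the generalization split: train on a fixed set of 500 levels,
# evaluate on the full distribution (num_levels=0 means unlimited levels).
# Assumes the `procgen` package and classic Gym.
import gym

train_env = gym.make("procgen:procgen-starpilot-v0",
                     num_levels=500, start_level=0,
                     distribution_mode="hard")

test_env = gym.make("procgen:procgen-starpilot-v0",
                    num_levels=0,           # full distribution of levels
                    distribution_mode="hard")
```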

Jumping Task

Compared with Procgen, the jumping task is much lighter and less complicated. It is less popular but very cheap to run. The agent can only choose to 'jump' or 'move right', and the goal is to jump over an obstacle. The obstacle varies in height and horizontal position, so solving unseen configurations requires generalization ability.

A training set is chosen (commonly of size 18), and generalization is tested across the full distribution of environments. Since the obstacle position fully determines the environment, the performance, measured as $\frac{\text{tasks solved}}{\text{number of tasks}}$, varies as the training set varies.
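As a sketch of that metric (the environment factory and agent interface below, `make_env` and `agent.act`, are hypothetical placeholders rather than any particular library's API):

```python
# Hedged sketch of jumping-task evaluation: performance is the fraction of
# obstacle configurations the agent solves across the full grid.

def evaluate(agent, obstacle_positions, obstacle_heights, make_env):
    """Return (tasks solved) / (number of tasks) over the full obstacle grid."""
    solved, total = 0, 0
    for x in obstacle_positions:
        for h in obstacle_heights:
            env = make_env(obstacle_x=x, obstacle_height=h)  # hypothetical factory
            obs, done, success = env.reset(), False, False
            while not done:
                action = agent.act(obs)                # 0 = move right, 1 = jump
                obs, _, done, info = env.step(action)  # classic Gym-style step
                success = success or info.get("solved", False)
            solved += int(success)
            total += 1
    return solved / total
```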

Question

There are other environments for testing generalization, since RL generalization is a class of problems. The question remains: how do we know, or define precisely, how much generalization ability a given environment demands? After all, the same environment with different seeds may require only a little generalization, the same environment with different colors (black-and-white versus colorful) may require more, while the DMControl Suite and Procgen require even more generalization ability.