A few weeks ago, OpenAI attempted a new major milestone in AI development: a (nearly) full game of Dota2 against some of the best human players. Although OpenAI Five was defeated by both of its professional opponents, the level of play was high and at times the matches looked fairly even. This is amazing, as the full game of Dota2 is very complex. Even more incredibly, the agent was trained using a relatively simple and very general reinforcement learning algorithm, PPO.
While the network structure has many bells and whistles to incorporate the complexities of the game, the algorithm itself is general enough to be applied to robotics, image recognition, and many more tasks. Congratulations to OpenAI and a huge win for RL!
The algorithm was used in conjunction with what is now a pretty popular trick: self-play. In environments that are competitive and easy to simulate, self-play refers to the agent learning purely by playing against itself. This way, simulations can be run very fast on thousands of CPUs/GPUs and years of experience collected every hour (256 GPUs and 128,000 CPUs for OpenAI Five). Self-play was also used in the training of AlphaGo Zero.
Here, I have jotted down a few quick thoughts on why this extremely useful simulation trick of self-play limits the performance of the agent. I will argue that while some of the deficiencies of the bots can probably be quickly “trained away”, there are fundamental weaknesses in self-play that will limit performance even after a lot of training. OpenAI Five may eventually beat the best human players, but as we make environments more complex, these issues will become more and more apparent.
My comments are only about the learning strategy on a very high level and not game commentary as I am not a Dota player myself. For in-depth commentary, I found this to be great. Also, there is a ton of material on this topic by OpenAI themselves and you can watch the games yourself here and here.
Playing against oneself.
OpenAI Five is really good at a single playing style. That is why it is often so hard to beat the first time. But in competitive games, once you figure out the opponent’s strategy, you can make it less effective. Human players can do this by observing bot behavior during the matches and planning counters.
OpenAI Five, on the other hand, simply rolls out its learnt policy during human matches. The policy takes into account the parts of the opponent’s state that are visible to it and hence can react to developments in the game, but there is no attempt to update model weights based on human play. So it is unable to react to a human opponent’s meta-strategy.
In gaming, the meta-strategy (or just meta) is how the game or specific heroes are generally played at a higher level than reactions to in-game developments. OpenAI’s original 1v1 bot was initially more powerful than a pro-gamer, but human players quickly identified its playing style and developed many counter metas within the same event.
Self-play could overcome this if, during training time, the agent was pushed to encounter all different metas and hence forced to develop a single optimal policy which counters them all.
A standard way to do this in self-play is to sometimes play against an earlier version of oneself. The idea is that this will provide enough variation in opponents to avoid overfitting to itself. OpenAI Five plays 80% of its games against itself and 20% against a former version of itself.
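As a rough illustration of that 80/20 split, here is a minimal sketch of how an opponent could be chosen for each training game. The function name, the string stand-ins for policies, and the uniform choice among past versions are all assumptions made for the example; the source only gives the 80%/20% ratio, not how OpenAI Five picks among its earlier selves.

```python
import random


def sample_opponent(current, past_versions, rng, p_current=0.8):
    """Pick the opponent for one training game.

    With probability p_current the agent plays its current self;
    otherwise it plays a past version chosen uniformly at random
    (the uniform choice is an assumption for this sketch).
    """
    if not past_versions or rng.random() < p_current:
        return current
    return rng.choice(past_versions)


# Hypothetical stand-ins for policy snapshots.
rng = random.Random(0)
past = ["v1", "v2", "v3"]
picks = [sample_opponent("current", past, rng) for _ in range(10_000)]
share_current = picks.count("current") / len(picks)
print(f"share of games against current self: {share_current:.2f}")
```

Note that every opponent in this pool is either the current policy or something it has already surpassed, which is exactly the limitation discussed next.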
Mixing in past versions works well for stabilizing training, but it is not a complete solution. The only opponent the agent has seen that is on par with or better than itself is its own (future) policy; previous versions of the agent will be weaker than the current version. Additionally, it is unlikely to see all the good metas that exist in Dota2: the game is very complex, and gradient descent training progresses along a single path that depends on the random seed of the network and the environment. Moreover, metas change over time as players discover better ways to play the game and as the game itself is updated (unlike Go, which has remained unchanged for centuries).
In short, self-play does not provide the agent enough variance in advanced metas to counter all the strategies human players can form against it in a sufficiently complex environment.
This shows up in the value estimates of the trained agent. OpenAI Five’s value estimates were remarkably good for the benchmark team.
After the game 1 draft, OpenAI Five predicted a 95% win probability, even though the matchup seemed about even to the human observers. It won the first game in 21 minutes and 37 seconds. After the game 2 draft, OpenAI Five predicted a 76.2% win probability, and won the second in 24 minutes and 53 seconds. - OpenAI
This is perhaps because the benchmark team played at a level below OpenAI Five and with a meta it has seen.
But for the pro team, the initial win estimates were optimistic despite the end result, perhaps because the pros employed a play style the agent never encountered during training, leaving it with an inaccurate value estimate.
… [OpenAI Five] maintaining a good chance of winning for the first 20-35 minutes of both games. - OpenAI
OpenAI themselves have employed a partial solution to this kind of overfitting in an earlier project by training ensembles of agents in parallel and playing against all of them. Read more about it here. It is not mentioned whether this was done for OpenAI Five as well.
Examples of the randomizations used by OpenAI Five include increasing or decreasing a hero’s speed or starting health, and assigning lanes randomly by providing shaping rewards. These randomizations make training more robust by presenting unseen play styles, forcing the agent to explore more of its state space. But this is not the same as a directed meta that is perhaps designed to counteract its own. Human play can exhibit a mode that is very far from the uniform sampling that domain randomization provides.
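To make the contrast concrete, here is a minimal sketch of what per-episode domain randomization of this kind might look like. Everything in it is hypothetical: the `EpisodeConfig` fields, the ±20% range, and the hero and lane names are illustrative assumptions, not OpenAI Five's actual settings.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    speed_scale: float   # multiplier on hero movement speed
    health_scale: float  # multiplier on starting health
    lanes: dict          # hero name -> lane rewarded by shaping


def randomize_episode(rng, heroes, lanes=("top", "mid", "bottom"), max_shift=0.2):
    """Draw one episode's perturbations uniformly at random.

    The +/-20% range and the lane names are assumptions for this sketch.
    """
    speed = rng.uniform(1.0 - max_shift, 1.0 + max_shift)
    health = rng.uniform(1.0 - max_shift, 1.0 + max_shift)
    assignment = {hero: rng.choice(lanes) for hero in heroes}
    return EpisodeConfig(speed, health, assignment)


heroes = ["hero_a", "hero_b", "hero_c", "hero_d", "hero_e"]
cfg = randomize_episode(random.Random(0), heroes)
print(cfg)
```

Each draw here is independent and uniform, so nothing steers the perturbations toward the agent's own weaknesses. A human opponent, by contrast, picks a counter deliberately.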
A solution could be a fast moving model of the meta play that updates according to opponent strategy. Data from professional human matches can be used to learn this fast meta layer and allow the agent to predict and quickly adapt to the style of play being used by humans. This could also be used to construct a domain randomization model that goes beyond just perturbing the physics or graphics and randomizes between entirely different human developed metas.
Pure self-play as applied to OpenAI Five is blind to the problem of having to learn high-level strategies from just a few samples, such as a single game of Dota2. OpenAI Five plays centuries of games against itself every day, so a single game against humans will hardly make a difference to network parameters. But humans are really good at learning from a few samples, which is why they are able to counter OpenAI Five after observing its playing style. It is an essential skill for AI that wants to compete against, or hopefully work with, us. Self-play combined with such a strategy could be very powerful in learning competitive games.