DeepMind’s MuZero teaches itself how to win at Atari, chess, shogi, and Go
In a paper published in the journal Science late last year, Google parent company Alphabet’s DeepMind detailed AlphaZero,
an AI system that could teach itself how to master the game of chess, a
Japanese variant of chess called shogi, and the Chinese board game Go.
In each case, it beat a world-champion program, demonstrating a knack for
learning two-person games with perfect information, that is to say,
games where every decision is informed by all the events that have
previously occurred.
But AlphaZero had the advantage of knowing the rules of the games it was tasked with playing. In pursuit of a performant machine learning model capable of teaching itself the rules, a team at DeepMind devised MuZero, which combines a tree-based search (where a tree is a data structure used for locating information from within a set) with a learned model. MuZero predicts the quantities most relevant to game planning, which lets it achieve state-of-the-art performance on 57 different Atari games and match the performance of AlphaZero in Go, chess, and shogi.
“Planning algorithms … have achieved remarkable successes in artificial intelligence … However, these planning algorithms all rely on knowledge of the environment’s dynamics, such as the rules of the game or an accurate simulator,” wrote the scientists in a preprint paper describing their work. “Model-based … learning aims to address this issue by first learning a model of the environment’s dynamics, and then planning with respect to the learned model.”
Model-based reinforcement learning
Fundamentally, MuZero receives observations (i.e., images of a Go board or Atari screen) and transforms them into a hidden state. This hidden state is updated iteratively by a process that receives the previous state and a hypothetical next action, and at every step the model predicts the policy (e.g., the move to play), value function (e.g., the predicted winner), and immediate reward (e.g., the points scored by playing a move).
Above: Evaluation of MuZero throughout training in chess, shogi, Go, and Atari. The y-axis shows Elo rating.
Image Credit: DeepMind
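The hidden-state loop described above can be sketched in a few lines of Python. This is a toy illustration only, not DeepMind's implementation: the function names `representation`, `dynamics`, and `prediction`, the layer sizes, and the random untrained weights are all assumptions made for the example, and real systems use deep neural networks trained end to end.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible

OBS_SIZE = 6     # hypothetical observation length
HIDDEN_SIZE = 4  # hypothetical hidden-state length
NUM_ACTIONS = 3  # hypothetical number of actions


def random_matrix(rows, cols):
    """Untrained stand-in for learned network weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


W_REPR = random_matrix(HIDDEN_SIZE, OBS_SIZE)
W_DYN = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE + NUM_ACTIONS)
W_REWARD = random_matrix(1, HIDDEN_SIZE + NUM_ACTIONS)
W_POLICY = random_matrix(NUM_ACTIONS, HIDDEN_SIZE)
W_VALUE = random_matrix(1, HIDDEN_SIZE)


def representation(observation):
    """Observation -> initial hidden state."""
    return [math.tanh(v) for v in matvec(W_REPR, observation)]


def dynamics(hidden, action):
    """(previous hidden state, hypothetical action) -> (next hidden state, reward)."""
    one_hot = [1.0 if i == action else 0.0 for i in range(NUM_ACTIONS)]
    joint = hidden + one_hot
    next_hidden = [math.tanh(v) for v in matvec(W_DYN, joint)]
    reward = matvec(W_REWARD, joint)[0]
    return next_hidden, reward


def prediction(hidden):
    """Hidden state -> (policy over actions, value estimate in [-1, 1])."""
    policy = softmax(matvec(W_POLICY, hidden))
    value = math.tanh(matvec(W_VALUE, hidden)[0])
    return policy, value


def unroll(observation, actions):
    """Predict policy, value, and reward at the root and after each hypothetical action."""
    hidden = representation(observation)
    policy, value = prediction(hidden)
    steps = [{"policy": policy, "value": value, "reward": 0.0}]  # no reward at the root
    for action in actions:
        hidden, reward = dynamics(hidden, action)
        policy, value = prediction(hidden)
        steps.append({"policy": policy, "value": value, "reward": reward})
    return steps


observation = [0.1, -0.3, 0.7, 0.0, 0.5, -0.2]  # made-up input
steps = unroll(observation, actions=[0, 2, 1, 1, 0])  # five hypothetical steps
```

Note that after the first call to `representation`, the loop never looks at a real observation again: every later prediction comes from the hidden state alone.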
As the DeepMind researchers explain, one form of reinforcement learning (the technique at the heart of MuZero and AlphaZero, in which rewards drive an AI agent toward goals) involves models. This form models a given environment as an intermediate step, using a state transition model that predicts the next state and a reward model that anticipates the reward.
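As a rough illustration of that idea, the sketch below learns a tabular state transition model and reward model from experience in a tiny made-up environment, then plans using only those learned tables. The five-state corridor, the function names, and the depth-limited search are assumptions chosen for brevity; they are not taken from the paper.

```python
STATES = range(5)   # hypothetical corridor with positions 0..4
ACTIONS = (-1, +1)  # step left or step right
GOAL = 4


def real_step(state, action):
    """The environment's true rules. Only used to gather experience, never to plan."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL and state != GOAL else 0.0
    return next_state, reward


# Step 1: learn the model by recording what each action does in each state.
transition_model = {}  # (state, action) -> predicted next state
reward_model = {}      # (state, action) -> predicted reward
for state in STATES:
    for action in ACTIONS:
        next_state, reward = real_step(state, action)
        transition_model[(state, action)] = next_state
        reward_model[(state, action)] = reward


# Step 2: plan with respect to the learned model only.
def action_score(state, action, depth, discount):
    """Discounted return of taking `action`, then acting greedily for depth - 1 steps."""
    next_state = transition_model[(state, action)]
    return reward_model[(state, action)] + discount * plan_value(next_state, depth - 1, discount)


def plan_value(state, depth, discount):
    if depth == 0:
        return 0.0
    return max(action_score(state, a, depth, discount) for a in ACTIONS)


def best_action(state, depth=5, discount=0.9):
    return max(ACTIONS, key=lambda a: action_score(state, a, depth, discount))


first_move = best_action(0)  # the planner heads right, toward the goal
```

A lookup table works here because the toy environment has only ten state-action pairs; in domains such as Atari the same two roles are played by learned function approximators.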
Commonly, model-based reinforcement learning focuses on directly modeling the observation stream at the pixel level, but this level of granularity is computationally expensive in large-scale environments. In fact, no prior method has constructed a model that facilitates planning in visually complex domains such as Atari; the results lag behind well-tuned model-free methods, even in terms of data efficiency.
Above: Comparison of MuZero against previous agents in Atari.
Image Credit: DeepMind
Training and experimentation
The DeepMind team applied MuZero to the classic board games Go, chess, and shogi as benchmarks for challenging planning problems, and to all 57 games in the open source Arcade Learning Environment as benchmarks for visually complex reinforcement learning domains. They trained the system for five hypothetical steps and a million mini-batches (i.e., small batches of training data) of size 2,048 in board games and size 1,024 in Atari, using 800 simulations for each search in Go, chess, and shogi and 50 simulations for each search in Atari.
With respect to Go, MuZero slightly exceeded the performance of AlphaZero despite using less overall computation, which the researchers say is evidence it might have gained a deeper understanding of its position. As for Atari, MuZero achieved a new state of the art for both mean and median normalized score across the 57 games, outperforming the previous state-of-the-art method (R2D2) in 42 out of 57 games and outperforming the previous best model-based approach in all games.
Above: Evaluations of MuZero on Go (A), all 57 Atari Games (B), and Ms. Pac-Man (C-D).
Image Credit: DeepMind
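For readers unfamiliar with mean and median normalized scores, the snippet below shows how such a summary is typically computed. It assumes the common Atari convention of scaling each raw score so that random play maps to 0 and a human reference maps to 100; the game names and all numbers are invented for illustration and are not results from the paper.

```python
from statistics import mean, median


def normalized_score(agent, random_play, human):
    """Scale a raw game score so that random play = 0 and the human reference = 100."""
    return 100.0 * (agent - random_play) / (human - random_play)


# Hypothetical raw scores per game: (agent, random play, human reference).
raw_scores = {
    "game_a": (500.0, 100.0, 300.0),
    "game_b": (50.0, 0.0, 100.0),
    "game_c": (1200.0, 200.0, 400.0),
}

normalized = {game: normalized_score(*scores) for game, scores in raw_scores.items()}
mean_score = mean(normalized.values())      # 250.0
median_score = median(normalized.values())  # 200.0
```

The two summaries answer different questions: a single runaway game (here `game_c`) can inflate the mean, while the median reflects performance on a typical game, which is why both are usually reported together.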
Lastly, in an attempt to better understand the role the model played in MuZero, the team focused on Go and Ms. Pac-Man. They compared search in AlphaZero using a perfect model to the performance of search in MuZero using a learned model, and they found that MuZero matched the performance of the perfect model even when undertaking larger searches than those for which it was trained. In fact, with only six simulations per move, too few to try each of the eight possible actions in Ms. Pac-Man even once, MuZero learned an effective policy and “improved rapidly.”
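One way to see how a search can be useful with fewer simulations than actions is that a policy prior steers the budget toward promising moves. The sketch below is a simplified, one-level version of the PUCT-style selection rule used in the AlphaZero family, not MuZero's actual search: the priors, the leaf values, the exploration constant, and the `+ 1` inside the square root (added so the prior matters on the very first simulation) are all assumptions for illustration.

```python
import math

NUM_ACTIONS = 8       # as in the Ms. Pac-Man example above
NUM_SIMULATIONS = 6   # fewer simulations than actions

# Hypothetical policy prior over the eight actions (sums to 1).
priors = [0.02, 0.03, 0.40, 0.05, 0.30, 0.10, 0.05, 0.05]
# Hypothetical value the learned model would report after trying each action.
leaf_values = [0.0, 0.1, 0.3, 0.2, 0.8, 0.1, 0.0, 0.3]


def run_search(priors, leaf_values, num_simulations, c_puct=1.5):
    """Spend a fixed simulation budget at the root; return visit counts per action."""
    visits = [0] * len(priors)
    q_values = [0.0] * len(priors)  # running mean value of each action
    for _ in range(num_simulations):
        total = sum(visits)

        def score(action):
            # Exploit the current value estimate, explore in proportion to the prior.
            bonus = c_puct * priors[action] * math.sqrt(total + 1) / (1 + visits[action])
            return q_values[action] + bonus

        action = max(range(len(priors)), key=score)
        visits[action] += 1
        q_values[action] += (leaf_values[action] - q_values[action]) / visits[action]
    return visits


visits = run_search(priors, leaf_values, NUM_SIMULATIONS)
best_action = max(range(NUM_ACTIONS), key=lambda a: visits[a])
actions_tried = sum(1 for count in visits if count > 0)
```

With these made-up numbers the search only ever tries the two actions the prior favors, and still settles on the one with the highest value, which is the intuition behind an effective policy emerging from a very small search.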
“Many of the breakthroughs in artificial intelligence have been based on either high-performance planning or model-free reinforcement learning methods,” wrote the researchers. “In this paper we have introduced a method that combines the benefits of both approaches. Our algorithm, MuZero, has both matched the superhuman performance of high-performance planning algorithms in their favored domains — logically complex board games such as chess and Go — and outperformed state-of-the-art model-free [reinforcement learning] algorithms in their favored domains — visually complex Atari games.”