DeepMind’s MuZero teaches itself how to win at Atari, chess, shogi, and Go
In a paper published in the journal Science late last year, Google parent company Alphabet’s DeepMind detailed AlphaZero, an AI system that could teach itself how to master the game of chess, a Japanese chess variant called shogi, and the Chinese board game Go. In each case it beat world-champion programs, demonstrating a knack for learning two-person games with perfect information, that is, games where every decision can be made with full knowledge of all the events that have previously occurred.
But AlphaZero had the advantage of knowing the rules of the games it was tasked with playing. In pursuit of a performant machine learning model capable of teaching itself the rules, a team at DeepMind devised MuZero, which combines a tree-based search (where the tree maps out game states and the moves that branch from them) with a learned model. MuZero predicts the quantities most relevant to game planning, such that it achieves industry-leading performance on 57 different Atari games and matches the performance of AlphaZero in Go, chess, and shogi.
“Planning algorithms … have achieved remarkable successes in artificial intelligence … However, these planning algorithms all rely on knowledge of the environment’s dynamics, such as the rules of the game or an accurate simulator,” wrote the scientists in a preprint paper describing their work. “Model-based … learning aims to address this issue by first learning a model of the environment’s dynamics, and then planning with respect to the learned model.”
Model-based reinforcement learning
Fundamentally, MuZero receives observations — i.e., images of a Go board or Atari screen — and transforms them into a hidden state. This hidden state is updated iteratively by a process that receives the previous state and a hypothetical next action, and at every step the model predicts the policy (e.g., the move to play), value function (e.g., the predicted winner), and immediate reward (e.g., the points scored by playing a move). Intuitively, MuZero internally invents game rules or dynamics that lead to accurate planning.
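Concretely, that description implies three learned functions: one that encodes an observation into a hidden state, one that advances the hidden state given a hypothetical action and predicts the immediate reward, and one that reads a policy and value out of a hidden state. The sketch below is a toy illustration of that interface; the tiny random linear “networks” and the sizes are placeholders for illustration, not the architecture DeepMind used.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_ACTIONS, OBS_SIZE = 8, 4, 16   # hypothetical sizes, for illustration only

# representation: raw observation (pixels / board planes) -> initial hidden state
W_h = rng.normal(size=(HIDDEN, OBS_SIZE))
def representation(observation):
    return np.tanh(W_h @ observation)

# dynamics: (previous hidden state, hypothetical action) -> (next hidden state, reward)
W_g = rng.normal(size=(HIDDEN, HIDDEN + N_ACTIONS))
w_r = rng.normal(size=HIDDEN)
def dynamics(state, action):
    one_hot = np.eye(N_ACTIONS)[action]
    next_state = np.tanh(W_g @ np.concatenate([state, one_hot]))
    return next_state, float(w_r @ next_state)   # predicted immediate reward

# prediction: hidden state -> (policy over moves, value of the position)
W_p = rng.normal(size=(N_ACTIONS, HIDDEN))
w_v = rng.normal(size=HIDDEN)
def prediction(state):
    logits = W_p @ state
    policy = np.exp(logits) / np.exp(logits).sum()
    return policy, float(w_v @ state)

# Unroll the learned model over a few hypothetical actions, as described above.
state = representation(rng.normal(size=OBS_SIZE))
for action in (0, 2, 1):
    policy, value = prediction(state)
    state, reward = dynamics(state, action)
    print(f"action {action}: predicted reward {reward:+.2f}, value {value:+.2f}")
```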
As the DeepMind researchers explain, one form of reinforcement learning (the technique at the heart of MuZero and AlphaZero, in which rewards drive an AI agent toward goals) is model-based: it models a given environment as an intermediate step, using a state transition model that predicts the next state and a reward model that anticipates the reward.
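As a toy illustration of that general recipe (not MuZero itself), the sketch below learns a transition model and a reward model for a made-up three-state environment from sampled experience, then plans against the learned model with a few sweeps of value iteration.

```python
from collections import defaultdict

N_STATES, ACTIONS, GAMMA = 3, (0, 1), 0.9

def true_env_step(state, action):
    """Ground-truth dynamics of a made-up 3-state chain; the planner never reads this directly."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

# 1) Learn the model from experience: empirical next-state counts and average rewards.
transition_counts = defaultdict(lambda: defaultdict(int))
reward_sums, visit_counts = defaultdict(float), defaultdict(int)
for _ in range(100):                       # gather experience from every state-action pair
    for s in range(N_STATES):
        for a in ACTIONS:
            s_next, r = true_env_step(s, a)
            transition_counts[(s, a)][s_next] += 1
            reward_sums[(s, a)] += r
            visit_counts[(s, a)] += 1

def learned_model(s, a):
    """State transition model (next-state distribution) plus reward model (expected reward)."""
    total = visit_counts[(s, a)]
    probs = {s2: c / total for s2, c in transition_counts[(s, a)].items()}
    return probs, reward_sums[(s, a)] / total

# 2) Plan with respect to the learned model: a few sweeps of value iteration.
values = [0.0] * N_STATES
for _ in range(50):
    for s in range(N_STATES):
        values[s] = max(
            learned_model(s, a)[1]
            + GAMMA * sum(p * values[s2] for s2, p in learned_model(s, a)[0].items())
            for a in ACTIONS
        )
print(values)   # all planning above happens inside the learned model, never the real environment
```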
Commonly, model-based reinforcement learning focuses on directly modeling the observation stream at the pixel level, but this level of granularity is computationally expensive in large-scale environments. In fact, the researchers note, no prior method had constructed a model that facilitates planning in visually complex domains such as Atari; results have lagged behind well-tuned model-free methods, even in terms of data efficiency.
For MuZero, DeepMind instead pursued an approach focused on end-to-end prediction of a value function, where an algorithm is trained so that the expected sum of rewards matches the expected value with respect to real-world actions. The system has no explicit model of the environment’s semantics; it simply outputs policy, value, and reward predictions, which an algorithm similar to AlphaZero’s search (albeit generalized to allow for single-agent domains and intermediate rewards) uses to produce a recommended policy and estimated value. These in turn are used to choose an action and estimate the final outcomes of played games.
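One simple way to make “the expected sum of rewards matches the expected value” concrete is an n-step bootstrapped return, which mixes rewards actually observed from played moves with the model’s own value estimate of the state reached further down the line. The sketch below is a generic illustration of that idea; the discount factor and reward numbers are invented, not taken from the paper.

```python
def n_step_value_target(observed_rewards, bootstrap_value, discount=0.99):
    """Target value = r_1 + g*r_2 + ... + g^(n-1)*r_n + g^n * v(state reached after n steps)."""
    target = 0.0
    for k, reward in enumerate(observed_rewards):
        target += (discount ** k) * reward
    target += (discount ** len(observed_rewards)) * bootstrap_value
    return target

# Five rewards observed from real-world actions, plus the network's value estimate
# of the state reached afterwards (all numbers invented for illustration).
print(n_step_value_target([0.0, 1.0, 0.0, 0.0, 2.0], bootstrap_value=3.5))
```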
Training and experimentation
The DeepMind team applied MuZero to the classic board games Go, chess, and shogi as benchmarks for challenging planning problems, and to all 57 games in the open source Atari Learning Environment as benchmarks for visually complex reinforcement learning domains. They trained the system for five hypothetical steps and a million mini-batches (i.e., small batches of training data) of size 2,048 in board games and size 1,024 in Atari, using 800 simulations per move for each search in Go, chess, and shogi and 50 simulations for each search in Atari.
With respect to Go, MuZero slightly exceeded the performance of AlphaZero despite using less overall computation, which the researchers say is evidence it might have gained a deeper understanding of its position. As for Atari, MuZero achieved a new state of the art for both mean and median normalized score across the 57 games, outperforming the previous state-of-the-art method (R2D2) in 42 out of 57 games and outperforming the previous best model-based approach in all games.
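For reference, the training setup described above can be collected in one place. The dictionary structure and key names below are hypothetical; only the numbers come from the reported setup.

```python
# Hypothetical summary of the reported training setup; key names are invented.
MUZERO_TRAINING_SETUP = {
    "hypothetical_unroll_steps": 5,
    "training_mini_batches": 1_000_000,
    "mini_batch_size": {"board_games": 2048, "atari": 1024},
    "simulations_per_search": {"go_chess_shogi": 800, "atari": 50},
}

for domain, sims in MUZERO_TRAINING_SETUP["simulations_per_search"].items():
    print(f"{domain}: {sims} simulations per move")
```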
The researchers next evaluated a version of MuZero — MuZero Reanalyze — that was optimized for greater sample efficiency, which they applied to 75 Atari games using 200 million frames of experience per game in total. They report that it managed a 731% median normalized score compared to 192%, 231%, and 431% for previous state-of-the-art model-free approaches IMPALA, Rainbow, and LASER, respectively, while requiring substantially less training time (12 hours versus Rainbow’s 10 days).
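Normalized Atari scores of this kind are conventionally expressed relative to a random policy (0%) and a human baseline (100%). A quick sketch of that convention, with invented per-game numbers:

```python
def human_normalized_score(agent_score, random_score, human_score):
    """0% corresponds to random play, 100% to the human baseline for that game."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Invented numbers for one hypothetical game; a 731% median means that on half
# the games the agent's margin over random play is at least ~7.3x the human margin.
print(human_normalized_score(agent_score=12_000, random_score=250, human_score=7_000))
```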
Lastly, in an attempt to better understand the role the model played in MuZero, the team focused on Go and Ms. Pac-Man. They compared search in AlphaZero using a perfect model to search in MuZero using a learned model, and found that MuZero matched the performance of the perfect model even when undertaking larger searches than those for which it was trained. In fact, with only six simulations per move (fewer than are needed to cover all eight possible actions in Ms. Pac-Man), MuZero learned an effective policy and “improved rapidly.”
“Many of the breakthroughs in artificial intelligence have been based on either high-performance planning … or model-free [reinforcement learning] algorithms,” wrote the researchers. “In this paper we have introduced a method that combines the benefits of both approaches. Our algorithm, MuZero, has both matched the superhuman performance of high-performance planning algorithms in their favored domains — logically complex board games such as chess and Go — and outperformed state-of-the-art model-free [reinforcement learning] algorithms in their favored domains — visually complex Atari games.”