PyTorch implementation of "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", based on the paper's pseudocode. This implementation is intended to stay as close as possible to that pseudocode.
The main difference is that this version samples data from the replay buffer uniformly, instead of using prioritized experience replay.
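To make the difference concrete, here is a minimal sketch of uniform sampling from a replay buffer. It is not this repository's code: the class name `UniformReplayBuffer`, its methods, and the dictionary used to stand in for a game are all hypothetical and chosen for illustration only.

```python
import random


class UniformReplayBuffer:
    """Hypothetical fixed-capacity buffer that samples stored games uniformly."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.games = []
        self.rng = random.Random(seed)

    def save_game(self, game):
        # Drop the oldest game once the buffer is full.
        if len(self.games) >= self.capacity:
            self.games.pop(0)
        self.games.append(game)

    def sample_batch(self, batch_size):
        # Every stored game is equally likely; sampling is with replacement.
        return [self.rng.choice(self.games) for _ in range(batch_size)]


buffer = UniformReplayBuffer(capacity=3, seed=0)
for game_id in range(5):
    buffer.save_game({"id": game_id})  # placeholder for a real game record

batch = buffer.sample_batch(batch_size=4)
```

With uniform sampling, a game's chance of being picked depends only on how many games are stored, not on how informative that game is.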
To train your own MuZero agent to play CartPole, you just have to launch
To evaluate the average sum of rewards it gets (for CartPole, the number of moves it performs before failing or finishing the game), you can call the function.
Some metrics you can keep track of while training (using TensorBoard):
mean_reward: mean reward over the last 50 games
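A mean over the last 50 games can be computed with a bounded queue. The sketch below is an illustration under assumptions, not the code used here: the `RollingMean` class is hypothetical, and the episode rewards are made-up numbers.

```python
from collections import deque


class RollingMean:
    """Hypothetical helper: mean of the most recent `window` values."""

    def __init__(self, window=50):
        # A deque with maxlen discards the oldest value automatically.
        self.values = deque(maxlen=window)

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


tracker = RollingMean(window=50)
# Made-up episode rewards 1, 2, ..., 100 stand in for real game results.
for episode_reward in range(1, 101):
    mean_reward = tracker.update(episode_reward)
```

The resulting value could then be logged at each step, for example with `SummaryWriter.add_scalar` from `torch.utils.tensorboard`.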
Getting a score of 200-250+ is very feasible without tweaking parameters.
The problem with CartPole is that, as training goes on, the replay buffer contains fewer and fewer failed games. Using prioritized experience replay could be a solution to this problem.
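For reference, prioritized sampling can be sketched as follows. This is not implemented in this repository. The function `sample_prioritized`, its parameters, and the example priorities are hypothetical; the scheme shown (probability proportional to priority raised to `alpha`, with importance weights raised to `beta`) is the usual prioritized experience replay formulation.

```python
import random


def sample_prioritized(priorities, batch_size, alpha=1.0, beta=1.0,
                       rng=random, eps=1e-6):
    """Hypothetical sketch: sample indices with probability ~ priority**alpha."""
    # eps keeps every item samplable even if its priority is zero.
    scaled = [(p + eps) ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]

    indices = rng.choices(range(len(priorities)), weights=probs, k=batch_size)

    # Importance weights correct the bias introduced by non-uniform sampling.
    n = len(priorities)
    weights = [(1.0 / (n * probs[i])) ** beta for i in indices]
    return indices, weights, probs


# Made-up priorities: the last entry stands for a rare, poorly predicted game.
priorities = [0.1, 0.1, 0.1, 2.0]
indices, weights, probs = sample_prioritized(
    priorities, batch_size=8, rng=random.Random(0)
)
```

Here the high-priority game is drawn far more often than the others, which is the effect that could keep rare failed games in the training batches.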