Since the creation of AlphaZero, a majority of Deep Learning research and engineering for computer chess has been centered around the “Zero” doctrine; that is, focusing on creating a chess engine utilizing zero human knowledge. While AlphaZero (and Leela Zero) are grand milestones for AI, a common critique is the computational costs required to execute these reinforcement learning algorithms. Motivated to create an efficient, yet scalable learning algorithm, I propose an elementary, yet novel solution: 𝝐-Explore. 𝝐-Explore is a handcrafted adaptation of greedy-epsilon exploration*, Go-Explore*, and supervised learning that frames exploration tasks as continual learning and utilizes significantly less computational resources when compared to state-of-the-art reinforcement learning algorithms. All experimentation uses only a single GPU (RTX Titan) and a single CPU (Threadripper 16-core). The results of 𝝐-Explore are not state-of-the-art with our experimental setup, but provide a foundation for creating more efficient handcrafted algorithms in other large search spaces given an available expert policy.
The algorithm leverages the existence of an expert-policy (and expert-evaluation in the case of chess) to generate labeled data for the supervised learning step of the algorithm. The existence of an expert in the specified exploration domain is required for the execution of 𝝐-Explore. While AlphaZero uses Monte-Carlo Tree Search as the expert-policy (and evaluation)*, 𝝐-Explore uses Stockfish.
Before the first exploration phase, 𝝐-Explore creates an agent to keep track of the iteration count and two NN’s: the “traversing network” and “averaging network”. The intuition for these two networks is very similar to Stochastic Weight Averaging*; the traversing network is continually optimized by SGD with nesterov momentum after each training iteration, and the averaging network is the simple average of the traversing network’s weights over each iteration.
ε-Explore utilizes a combination of dirichlet noise applied to the NN’s policy and playouts by Stockfish to generate ~600,000 labeled chess positions (one-hot policy and evaluation in centipawns) per hour on a single 16-core Threadripper CPU. The exploration phase is split into two main processes that utilize GPU and CPU independently, allowing for complete parallelization of this phase!
Dirichlet Noise (𝛼 = 0.3, 𝝐 = 0.25) and a temperature (𝜏 = 5.0) are applied to the policy of the traversing network to force the labeling of new, but potentially relevant positions. (Notice that Dirichlet Noise and temperature act as a random sampling mechanism that maintains the “idea” of the original policy distribution; thus, only a simple maximum is selected from our modified policy rather than a weighted sampling.) After a new position is achieved, it is sent off to the playout process for labeling and the greedy-epsilon process is repeated (if our new position is a terminal state, we simply reset and begin the process on a new game)! Because of the master-slave relationship between the two processes in the exploration phase, we control our exploration via the hyperparameter N, where N is the number of times we want to repeat the greedy-epsilon process.
The playout process simply takes a position from the greedy-epsilon phase and uses stockfish to simulate the rest of a game starting from the given position. The advantages of simulating the entire game from Stockfish’s perspective are: (1) the “technique” of closing out a winning position and best defense of a losing position are demonstrated, (2) complicated endgames, tactical solutions, perpetual draws, and other delayed reward signals are easily solved by stockfish’s optimized AB search (making our algorithm stable when compared to traditional RL), and (3) stockfish utilizes only a CPU, meaning this process can be run in parallel to the greedy-epsilon process and, more importantly, itself, since CPU’s are optimized for multithreading. Because of the simplicity of this process, only a single hyperparameter that controls either the depth or thinking time per position for our expert (Stockfish) needs to be tuned. Further experimentation is required to measure how the quality of the expert’s labeling affects the success of the algorithm; for now, Stockfish will think for 100ms for each position. At this thinking time, we are able to label ~600,000 positions per hour with a 16-core CPU.
Before explaining the training phase, I will present the logic for our “traversing” and “averaging” networks.
𝝐-Explore attempts to provide an approximation to a solution for a problem we will refer to as: 𝞱-Optimum. 𝞱-Optimum interprets the weights of our traversing network at iteration i, 𝞱i , as samples from a connected, low-loss region in the weight space. Our assumption that there exists such a region that is both low-loss and connected, comes from the intuition that over-parameterization and l2-regularization create an error-surface that satisfy these conditions because (1) the millions of dimensions in the weight-space (resulting from over-parameterization) empirically create connected regions of low-loss *(SWA paper) and (2) the spring-force from l2-regularization compels SGD to stay in the same low-loss region, despite the introduction of brand-new training data, after each iteration. The problem part of 𝞱-Optimum is: how do we arithmetically manipulate 𝞱1 , 𝞱2 , …, 𝞱n (where n is our iteration number), so that we achieve a 𝞱-Optimum, the point of lowest loss in the connected, low-loss region we are sampling from? 𝝐-Explore argues that the simple mean of the “traversing” network’s weights will be a good estimator of 𝞱-Optimum since the mid-point of points of low-loss (resulting from SGD) tend to be better at generalizing, hence the use of an “averaging” network. (This post does not explore better solutions to 𝞱-Optimum, though there clearly are. For example, elastic weight consolidation between iterations would sample from a smaller, low-loss region that isn’t arbitrarily constrained by l2-regularization, but rather the Fisher Information Matrix.)
The training phase is split into two sections: a simple supervised learning step where we handle the “traversing” network (TN), and a stochastic averaging step where we handle the “averaging” network (AN). This phase has a lot of room for improvement, for there are several hyperparameters I did not experiment with out of time constraint. Those hyperparameters being: batch-size, learning rate, l2-regularization constant, resnet architecture (depth and width), and epochs (before SWA).
Supervised Learning Step
After gathering our data from the exploration phase, we simply interpret the entirety of the data as a test set. (In our first training phase, we create a validation set that is the last 2% of the data. This validation set remains the same for the entirety of the algorithm so that we can monitor catastrophic forgetting.) The learning rate for each supervised learning step follows a cosine-annealing schedule*. After SGD with nesterov momentum, the traversing network is optimized using SWA with 25, 1-epoch training cycles with a constant learning rate. (A constant learning rate is used over a cyclical one to keep the assumption that the TN will stay in the same low-loss region.) We test the TN loss after fixing the batch normalization layers post-SWA.
Just like SWA, the averaging step takes a stochastic average of the TN’s weights over each iteration. We test the AN loss after fixing the batch normalization layers post averaging. Note that the validation loss in this step will give us information on how the AN generalizes as more data is fed to the TN.
This final phase tests the progress of the TN by forcing the TN to play a 100-game set (with dirichlet noise in the first 8-ply to create diverse opening positions) with its previous version. These sets are saved and analyzed by myself (with the help of Stockfish) to understand how the TN progressively learns and forgets strategies after its exposure to new data. Moreover, the winrate of the new TN is used to calculate and monitor elo after each iteration. The AN is not tested during this phase for the sake of time (this phase takes upwards of an hour to complete), but is tested after the algorithm is complete. After testing, our new TN will be used in the exploration phase and the rest of ε-Explore is repeated for our specified amount of iterations.
How to Execute 𝝐-Explore
WARNING: This code IS NOT optimized by any means. It takes LOTS of RAM (32+ Gbs) and the python multiprocessing is not stable (my computer crashes very often). If you want to properly replicate my experiment, I recommend coding you own version of 𝝐-Explore. Feel free to cite any of this code!
There are only two modifications you must change to this repository to replicate my experiment! 1 - create a bin/ folder in your repository 2 - add more iteration folders into the games/ folder (Make sure you add 1 more than the iterations you plan on running for safe measure)
Again, I recommend taking a look at the num-processes & cycles in main.py and modifying the code accordingly.
My name is Philip Felizarta and I am an undergraduate student at the University of California, Merced. I am the sole creator/contributor of this project. If you have any questions for me contact me at: firstname.lastname@example.org