I originally familiarized myself with OpenAI's Gym platform while working on my Udacity Capstone project. The Gym platform is a collection of environments for reinforcement learning benchmarking and development. One of those environments is Cartpole, where the agent you program must learn to balance a pole hinged on a moving cart. I had chosen reinforcement learning as my Capstone subject because I found it fascinating, but also because I didn't feel it was covered in as much depth as supervised and unsupervised learning during the Machine Learning Engineer program. I'm revisiting this now for the following reasons:
This will be a multi-part project. In the first part we'll explore integrating an LSTM architecture into a modified version of my original Cartpole program. Using a recurrent neural network architecture such as an LSTM is an important part of training agents to take advantage of transfer learning or meta-learning because these architectures have a form of built-in memory. This memory allows them to learn patterns across training batches or time steps; these are the architectures used for machine translation, audio generation, and a variety of other sequence prediction applications. We'll use TensorFlow to incorporate the LSTM and throw in some TensorBoard graphs along the way.
We'll start our agent off with a simpler, but more powerful, model than the one from my Capstone. The agent will still use an ε-greedy exploration strategy and a policy gradient implementation, but we'll be able to get rid of the "long-term memory" of the Capstone agent, which was a cache of up to 10,000 state-action pairs from episodes where the agent performed well. That cache had to be sampled randomly during batch training so the neural net didn't overfit on the most recent episodes and then perform poorly in situations it had previously handled well but since forgotten. This kind of randomly sampled training on past experience is also called experience replay. The LSTM's memory seems to compensate well, and the kicker is that we batch train only on the most recent episode where the agent performed well, yet we greatly outperform the previous model. As mentioned, we'll start out simple, gradually make the model more robust, and in posts to come try to use these models for transfer and/or meta-learning.
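For context, the Capstone agent's long-term memory was essentially an experience-replay cache along these lines (the structure and names below are my assumptions for illustration, not the original code):

```python
# Sketch of the kind of experience-replay cache the Capstone agent used,
# shown only for contrast with the LSTM approach described above.
import random
from collections import deque

replay_memory = deque(maxlen=10000)          # up to 10,000 state-action pairs

def remember(state, action):
    """Store a state-action pair from an episode where the agent did well."""
    replay_memory.append((state, action))

def sample_batch(batch_size=64):
    """Randomly sample past experience so training isn't dominated by recent episodes."""
    return random.sample(replay_memory, min(batch_size, len(replay_memory)))
```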
The code for the LSTM is below. I won't try to give a TensorFlow tutorial or go in depth on the architecture here, but I do want to highlight a few points for anyone wanting to replicate this program in a similar environment. Excellent TensorFlow tutorials can be found in Aurélien Géron's Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. The first thing to notice below is that our TensorFlow session is an InteractiveSession; this matters because our model is continually feeding data to the TensorFlow graph and requesting action guidance from it based on the current state. This version of the LSTM has 3 hidden layers with 30 cells, or "neurons", each. In general more neurons and more layers help, and extra layers matter more than huge numbers of neurons for learning higher abstractions, but the number of layers and neurons needed depends on the complexity of the function you're trying to approximate; ours isn't that complex, so a few layers with 20 or more neurons each seem to suffice. The learning rate determines how quickly our LSTM learns, but speed needs to be balanced: if the learning rate is too low we might be waiting around forever, and if it is too high we might bounce around a good solution instead of actually landing on it. The logits are the outputs we evaluate to choose our actions, and as you can see from the n_outputs parameter there are two possible actions. Ultimately we will evaluate them at run time for each state and pick our action based on which logit output has the highest value.
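To make the description concrete, here's a minimal sketch of a network matching it, in TensorFlow 1.x style; the variable names, placeholder shapes, and learning rate are assumptions rather than the program's exact values:

```python
import tensorflow as tf

n_inputs = 4           # Cartpole observation: cart position, cart velocity, pole angle, pole velocity
n_neurons = 30         # cells ("neurons") per hidden layer
n_layers = 3           # stacked LSTM layers
n_outputs = 2          # two possible actions: push cart left or push cart right
learning_rate = 0.001  # assumed value

# [batch, time, features] for states; [batch, time] for the actions taken at each step
X = tf.placeholder(tf.float32, shape=[None, None, n_inputs])
y = tf.placeholder(tf.int32, shape=[None, None])

cells = [tf.nn.rnn_cell.LSTMCell(num_units=n_neurons) for _ in range(n_layers)]
multi_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
rnn_outputs, final_state = tf.nn.dynamic_rnn(multi_cell, X, dtype=tf.float32)

# One pair of logits (left/right scores) per time step; at run time the agent
# picks the action whose logit is highest for the current state.
logits = tf.layers.dense(rnn_outputs, n_outputs)

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
training_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)

init = tf.global_variables_initializer()
sess = tf.InteractiveSession()  # InteractiveSession: the agent keeps feeding states to the
init.run()                      # graph and reading back action scores throughout training
```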
On to the agent! There are two main aspects of our agent: how we train it and how it selects actions. The question of how to train the agent is very broad. Looking at the values of states or state-action pairs is one approach, and there are dozens of variations on that approach alone. They include look-ahead methods that plan actions based on states several moves in the future; one can use a network to learn to predict the next state and act based on its learned value; one can learn state values in different ways, including using simulation modeling to look ahead and estimate the values of states never visited before. Suffice it to say there are many variations because there are many different types of environments, and some methods work better in certain environments than others. Our approach is very intuitive and is a variation of what is known as a policy gradient method. A policy is how an agent decides to act, and policy gradient methods attempt to learn a policy directly. The simplest version, used here, takes episodes where the agent performed well and trains its network to act as it did in those episodes. As the network gets more training it performs better, and as it performs better it gets trained on better data from better-performing episodes, creating a positive feedback loop.
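To make that concrete, here's a rough sketch of how one good episode could be turned into a supervised training batch for the network sketched above (the helper name and shapes are assumptions):

```python
import numpy as np

def episode_to_batch(states, actions):
    """states: observations seen during one episode; actions: the moves taken (0 or 1).
    Returns arrays shaped for the X and y placeholders in the network sketch above."""
    X_batch = np.array(states, dtype=np.float32)[np.newaxis, :, :]   # [1, steps, n_inputs]
    y_batch = np.array(actions, dtype=np.int32)[np.newaxis, :]       # [1, steps]
    return X_batch, y_batch

# Training the network to imitate its own good episodes closes the feedback loop:
# X_batch, y_batch = episode_to_batch(states, actions)
# sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
```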
The main parameters of our specific policy gradient feedback loop are high_score, did_well_threshold, and last_good_batch. As mentioned, we train on data from episodes where we did well, but we need a way to evaluate what that means. Here we judge an episode against the highest score we've seen so far, tracked by the high_score parameter; as in life, our evaluations of our own performance are often based on how well we've previously performed. Another psychological analog is a cutoff point for deciding whether we feel we performed well. This role is filled by the did_well_threshold parameter, which ranges in value between 0 and 1. Any episode where we scored at least our high score multiplied by the threshold is defined as an episode where we did well. This parameter needs to be tuned: a very high threshold is like having high standards, but it also means we're less likely to get much data for training, and the data may not be as varied, which is also important for training. On the other hand, a low threshold leads to continued training on episodes where we haven't done very well, perpetuating bad action choices. Finally, the last_good_batch parameter is a memory of our last good episode and is used for training after every episode regardless of how that episode went; this is mainly to ensure we're getting plenty of training on good batches.
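Here's a rough sketch of that bookkeeping, reusing the training op from the network sketch above; the parameter names come from the program, but the threshold value and the surrounding function are assumptions:

```python
high_score = 0
did_well_threshold = 0.75     # assumed value: fraction of the best score that counts as "doing well"
last_good_batch = None        # (X_batch, y_batch) from the most recent good episode

def end_of_episode(score, X_batch, y_batch):
    """Called after every episode with that episode's score and training batch."""
    global high_score, last_good_batch
    if score >= did_well_threshold * high_score:
        last_good_batch = (X_batch, y_batch)      # this episode counts as one where we did well
        high_score = max(high_score, score)
    if last_good_batch is not None:
        # Train after every episode, but only ever on the most recent good batch.
        sess.run(training_op, feed_dict={X: last_good_batch[0], y: last_good_batch[1]})
```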
One of the quintessential problems facing a reinforcement learning agent is the trade-off between using the knowledge it has already gained, aka exploiting, and trying new things to see if it can gain more knowledge, aka exploring. The exploration vs. exploitation dilemma is a very active area of research, and it's one of the things I focus on most in this post. Explore too much and your agent won't perform very well, since it keeps selecting actions that aren't as good as the ones it knows to be good; explore too little and it might never find out that the actions it thinks are good aren't really that great.
There are three different functions in this program for different exploration strategies: decay_epsilon, decay_epsilon_custom, and epsilon_from_uncertainty. All of them are forms of a strategy called ε-greedy, where we choose to explore ε (epsilon) percent of the time and choose an action based on our knowledge the remainder of the time. The decay_epsilon function is a classic version of an ε-greedy exploration strategy; the logic is to start with a large epsilon, close to 1.0, and decay it over time until it becomes quite small, hoping that by the time it becomes small we've done most of the exploring we needed to do to behave optimally. The epsilon_decay_rate parameter is used to decrease, aka decay, our epsilon under this strategy; it is implemented by multiplying the current epsilon by the decay rate after each episode where we performed well. The decay_epsilon_custom function is exactly that, a bespoke function I came up with for the Cartpole agent; it decreases epsilon based on how much "good" experience the agent has had, which is tracked by the experience parameter. Although this function performs better on average than the others, I'm not a big fan; it is highly tuned based on observing dozens of runs of the program, which is another way of saying that it's something I had to learn, not the agent (so much for machine learning).

Frankly, I'm not a big fan of either of the two previous strategies, which is why I've spent a lot of time thinking about new approaches that aren't as rigid or as hand-tuned. My most recent attempt is the epsilon_from_uncertainty function. Unlike the other two, this strategy can not only decrease the exploration rate as time goes on but also increase it when the agent finds itself in a state where it isn't so sure what to do. I'm very excited about this kind of approach, and my current implementation works, but it is usually much slower to solve the environment than the custom function. The epsilon_from_uncertainty function works by looking at the distance between the logit outputs of the LSTM, i.e. the absolute value of their difference, and exploring more when they're closer; as the LSTM is trained more, the logit outputs diverge more for inputs it was trained on. The beauty here is that this distance acts as a proxy for how much training we've had for our current state, and hence how confident we can be acting on that training. When the logit outputs are close we explore a lot, and when they are relatively distant we explore less. There is a bit of tuning involved in this function too, but it is based on the observation that the logit outputs usually don't differ by more than 3.0 and that differences of around 1.5 reflect a good deal of training for the current state. As mentioned, this method does cause the agent to learn more slowly than the others, but that's because it gives the agent more agency in choosing actions and relies less on human tuning, which makes it more of a pure machine learning implementation.
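Below is a rough sketch of what these strategies could look like, reusing the logits from the network sketch above. The function and parameter names come from the program, but the formulas and constants are my assumptions for illustration; I've omitted decay_epsilon_custom since it's hand-tuned to observed runs:

```python
import random
import numpy as np

def decay_epsilon(epsilon, epsilon_decay_rate=0.99):
    """Classic decay: shrink epsilon after each episode where the agent did well."""
    return epsilon * epsilon_decay_rate

def epsilon_from_uncertainty(logit_values, confident_gap=1.5, min_epsilon=0.05):
    """Explore more when the two action logits are close (little training for this state),
    less when they diverge; a gap of roughly 1.5 is treated as a well-trained state."""
    gap = abs(float(logit_values[0]) - float(logit_values[1]))
    return max(min_epsilon, 1.0 - gap / confident_gap)

def choose_action(state_sequence, epsilon):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise take the
    action whose logit is highest for the current state."""
    if random.random() < epsilon:
        return random.randrange(2)                             # explore: random push left/right
    scores = sess.run(logits, feed_dict={X: state_sequence})   # [1, time, n_outputs]
    return int(np.argmax(scores[0, -1]))                       # exploit: best score at the last step

# Usage sketch: feed the states seen so far in the episode, shaped [1, steps, n_inputs],
# then set epsilon from the same logits before choosing the next action:
# logit_values = sess.run(logits, feed_dict={X: state_sequence})[0, -1]
# epsilon = epsilon_from_uncertainty(logit_values)
```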
The Cartpole environment is part of a class of environments called MDPs, or Markov decision processes. One of the defining aspects of an MDP is that you do not need knowledge of past states in order to make an optimal decision; knowing your current state should be all you need. So why do we need an architecture with memory at all, be it external for experience replay or built in like an LSTM? The answer is regularization of the neural network; this is related to the problem of overfitting in supervised learning, except in reinforcement learning we usually have to worry more about overfitting on recent training data due to the overwriting effects of backprop on older but still relevant approximations. Thus there is a difference between using history to enhance a function approximator for an MDP and using it to make decisions. There is a good deal of nuance here, however, because things that happened in the past can be incorporated into the current state of an MDP without violating its defining properties. The point to remember is that making the correct decision in an MDP need only depend on the current state.
Another aspect that Cartpole shares with most MDP environments is a finite action set. Thus, training a neural net to take the correct action based on the current state equates to training a classifier that maps states to actions. As mentioned above, however, the weight adjustments made by recent or more frequent training states can overwrite the approximations learned from older or rarer states. Regularization methods are a way of trying to maintain the fidelity of past training, or training on rarer states, while still being open to training on new states. This is why memory structures that can maintain long-term memories with fidelity are such powerful regularization methods, and this is what we'll be exploring in Part 2.
... Updates to come...
The whole program: