Caveat: This project is very much in progress, and will likely take a year or more. The motivation behind it, building memory models for machine learning, is itself a very long-term research project. If I ever pursue a PhD, this is the area I would choose.
In this post I will bring to bear years of study in machine learning, probability, graph theory, and psychology/neuroscience to explore the future of memory structures for machine intelligence. In particular, the memory structures of interest are those that enable humanlike flexibility in the face of long-term and novel tasks: memory with attributes such as temporal and sequential addressing, categorical structure, long-term fidelity coupled with updatability, associative linking, an analog to consolidation between short-term and long-term memory, cue-based recall, free recall, and... the list goes on. As mentioned, this is a large project. I'm currently studying the work of Josh Tenenbaum and others on the subjects of probabilistic programming and program induction. Related to these are probabilistic graphical models and model building. Some readings of interest are:
Building Machines That Learn and Think Like People
Probabilistic Graphical Modeling - Stanford CS 228
I originally familiarized myself with OpenAI's Gym platform while working on my Udacity Capstone project. The Gym platform is a collection of environments for reinforcement learning benchmarking and development. One of those environments is Cartpole, where the agent you program must learn to balance a pole hinged on a moving cart. I had chosen reinforcement learning as my Capstone subject because I found it fascinating, but also because I didn't feel it was covered in as much depth as supervised and unsupervised learning during the Machine Learning Engineer program. I'm revisiting this now for the following reasons:
This will be a multi-part project. In the first part we'll explore integrating an LSTM architecture into a modified version of my original Cartpole program. Using a recurrent neural network architecture such as an LSTM is an important part of training agents to take advantage of transfer learning or meta-learning because these architectures have a form of built-in memory. This memory allows them to learn patterns across training batches or time steps; these are the architectures used for machine translation, audio generation, and a variety of other sequence prediction applications. We'll use Tensorflow to incorporate the LSTM and throw in some Tensorboard graphs along the way.
We'll start our agent off with a simpler, but more powerful, model than the one from my Capstone. The agent will still use an ε-greedy exploration strategy and a policy gradient implementation, but we'll be able to get rid of the "long-term memory" of the Capstone agent, which was a cache of up to 10,000 state-action pairs from episodes where the agent performed well. This cache had to be sampled randomly during batch training so that our neural net didn't over-fit on the most recent episodes and thus perform poorly on episodes similar to ones it had once done well on but since forgotten. This kind of randomly sampled training on past experience is also called experience replay. The LSTM's memory seems to compensate well, and the kicker is that we're only batch training on the most recent episode where we performed well, while greatly outperforming the previous model. As mentioned, we'll start out simple, but gradually make the model more robust, and in posts to come try to use the models explored for transfer and/or meta-learning.
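To make the contrast concrete, here's a minimal sketch of the kind of experience replay cache the Capstone agent used; the class name, capacity, and batch size are illustrative, not the original code.

```python
import random
from collections import deque

# Hypothetical sketch of the Capstone agent's "long-term memory": store up
# to 10,000 (state, action) pairs from good episodes and sample random
# minibatches so training doesn't over-fit on the most recent episodes.
class ReplayCache:
    def __init__(self, capacity=10000):
        self.cache = deque(maxlen=capacity)  # oldest pairs fall off the front

    def add_episode(self, state_action_pairs):
        self.cache.extend(state_action_pairs)

    def sample_batch(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive training examples.
        return random.sample(self.cache, min(batch_size, len(self.cache)))

cache = ReplayCache()
cache.add_episode([((0.1, 0.2, 0.0, -0.1), 1), ((0.2, 0.1, 0.1, 0.0), 0)])
batch = cache.sample_batch(batch_size=2)
```

The LSTM approach replaces this whole structure: the recurrent state carries the useful history instead of an external cache.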
The code for the LSTM is below. I won't try to give a Tensorflow tutorial or go in depth on the architecture here, but I do want to highlight a few things for anyone wanting to replicate this program in a similar environment. Excellent Tensorflow tutorials can be found in Aurélien Géron's Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. The first thing to notice below is that our Tensorflow session is an InteractiveSession; this is important since our model is continually feeding data to the Tensorflow graph and requesting action guidance from it based on the current state. This version of the LSTM has 3 hidden layers with 30 cells, "neurons", each. In general more neurons and more layers are better; extra layers matter more than huge layer widths for learning higher abstractions, but how many of each you need depends on the complexity of the function you're trying to approximate. Ours isn't that complex, so a few layers with 20 or more neurons per layer seems to suffice. The learning rate determines how quickly our LSTM learns, but speed needs to be balanced: if the learning rate is too low we might be waiting around forever, and if it is too high we might bounce around a good solution instead of actually landing on it. The logits are the outputs we evaluate to choose our actions, and as you can see from the n_outputs parameter there are two possible actions. At run time we evaluate the logits for each state and pick the action whose logit output has the highest value.
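Since the original listing isn't reproduced here, below is a minimal sketch of the graph just described: 3 LSTM layers of 30 cells, two action logits, and an InteractiveSession. It's written in TF1-style code via tf.compat.v1 so it also runs under TensorFlow 2; the placeholder shapes and learning rate are assumptions, not the original program.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

# Assumed dimensions: Cartpole has a 4-value state and 2 actions.
n_inputs, n_neurons, n_layers, n_outputs = 4, 30, 3, 2
learning_rate = 0.001

# States arrive one per time step: [batch, time, state].
X = tf1.placeholder(tf.float32, [None, None, n_inputs])
y = tf1.placeholder(tf.int32, [None])

# 3 hidden LSTM layers with 30 cells each, as described above.
cells = [tf1.nn.rnn_cell.LSTMCell(n_neurons) for _ in range(n_layers)]
multi_cell = tf1.nn.rnn_cell.MultiRNNCell(cells)
rnn_outputs, _ = tf1.nn.dynamic_rnn(multi_cell, X, dtype=tf.float32)

# One logit per action; at run time we act on the larger logit.
logits = tf1.layers.dense(rnn_outputs[:, -1, :], n_outputs)

loss = tf1.reduce_mean(
    tf1.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf1.train.AdamOptimizer(learning_rate).minimize(loss)

# An InteractiveSession lets the agent feed states and fetch action
# guidance from the graph continually as the episode unfolds.
sess = tf1.InteractiveSession()
sess.run(tf1.global_variables_initializer())
state = np.random.randn(1, 1, n_inputs).astype(np.float32)
action_logits = sess.run(logits, feed_dict={X: state})
```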
On to the agent! There are two main aspects of our agent: how we train it and how it selects actions. The question of how to train the agent is very broad. Looking at the values of states or state-action pairs is one way, and there are dozens of variations on that way alone. They include look-ahead methods that plan actions based on states several moves in the future; one can use a network to learn to predict the next state and act based on its learned value; one can learn state values in different ways, including using simulation modeling to look ahead and estimate the values of states never visited before. Suffice it to say there are many variations because there are many different types of environments, and some methods work better in certain environments. Our approach is very intuitive and is a variation of what is known as a policy gradient method. Basically a policy is how an agent decides to act, and policy gradient methods attempt to learn a policy. The simplest version, used here, just takes episodes where the agent performed well and trains its network to act like it did in those episodes. As the network gets more training it performs better, and as it performs better it gets trained on data from better performing episodes, creating a positive feedback loop.
The main parameters of our specific policy gradient feedback loop are high_score, did_well_threshold, and last_good_batch. As mentioned, we train on data from episodes where we did well, but we need a way to evaluate what that means. Here we judge how good an episode was against the highest score we've seen so far, tracked by the high_score parameter; as in life, our evaluations of our own performance are often based on how well we've previously performed. Another psychological analog is a cutoff point for deciding whether we feel we performed well. This role is filled by the did_well_threshold parameter, which ranges in value between 0 and 1. Any episode where we scored at least our high score multiplied by the threshold parameter is defined as an episode where we did well. This parameter needs to be tuned: a very high threshold is like having high standards, but it also means we're less likely to get much data for training, and the data may not be as varied, which is also important for training. On the other hand, a low threshold leads to continued training on episodes where we haven't done very well, perpetuating bad action choices. Finally, the last_good_batch parameter is a memory of our last good episode and is used for training after every episode regardless of how that episode went; this is mainly to ensure we're getting plenty of training on good batches.
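A toy sketch of this bookkeeping, with the parameter names from above but illustrative values and a hypothetical record_episode helper:

```python
# Illustrative values; the real thresholds are tuned per run.
did_well_threshold = 0.8
high_score = 0
last_good_batch = None

def record_episode(score, batch):
    """Update high_score and keep the batch if the episode counts as good."""
    global high_score, last_good_batch
    # "Did well" means scoring at least high_score * did_well_threshold.
    if score >= high_score * did_well_threshold:
        last_good_batch = batch          # remember the last good episode
    high_score = max(high_score, score)  # evaluations are relative to our best
    return last_good_batch

record_episode(100, "batch_a")          # first episode sets the bar
result = record_episode(75, "batch_b")  # 75 < 0.8 * 100, not good enough
```

After the second call, last_good_batch still holds the first episode, so every subsequent training step continues to see good data.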
One of the quintessential problems facing a reinforcement learning agent is the trade-off between using the knowledge that it has gained, aka exploiting, and trying new things to see if it can gain more knowledge, aka exploring. The exploration vs exploitation dilemma is a very active area of research and it's one of the things I focus on most in this post. Explore too much and your agent isn't going to be performing very well as it continues to select actions that aren't as good as the ones it knows to be good; explore too little and it might never find out that the actions it thinks are good aren't really that great.
There are three different functions in this program for different exploration strategies: the decay_epsilon function, the decay_epsilon_custom function, and the epsilon_from_uncertainty function. All of these are forms of ε-greedy exploration, where we choose to explore ε, aka epsilon, percent of the time and choose an action based on our knowledge the remainder of the time. The decay_epsilon function is a classic version of an ε-greedy exploration strategy; the logic is to start with a large epsilon, close to 1.0, and decay it over time until it becomes quite small, hoping that by the time it becomes small we've done most of the exploring we needed to do to behave optimally. The epsilon_decay_rate parameter is used to decrease, aka decay, our epsilon under this strategy, and it is implemented by multiplying the current epsilon by the decay rate after each episode where we performed well. The decay_epsilon_custom function is exactly that, a bespoke function that I came up with for the Cartpole agent that decreases epsilon based on how much "good" experience the agent has, which is tracked by the experience parameter. Although this function performs better on average than the others, I'm not a big fan; it is highly tuned based on observing dozens of runs of the program, which is another way of saying that it's something I had to learn, not the agent (so much for machine learning). Frankly I'm not a big fan of either of the two previous strategies, which is why I've spent a lot of time thinking about new approaches that aren't as rigid or as hand-tuned. My most recent attempt is the epsilon_from_uncertainty function. Unlike the other two, this strategy can not only decrease the exploration rate as time goes on, but also increase it when the agent finds itself in a state where it isn't so sure what to do.
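A minimal sketch of the classic decay strategy; the starting epsilon, decay rate, and choose_action helper are illustrative:

```python
import random

# Classic epsilon-greedy with multiplicative decay, as in decay_epsilon.
epsilon = 1.0                # start near 1.0: explore almost always
epsilon_decay_rate = 0.99    # illustrative value

def choose_action(best_action, n_actions=2):
    # Explore with probability epsilon, otherwise exploit what we know.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return best_action

def decay_epsilon():
    # Called after each episode where the agent performed well.
    global epsilon
    epsilon *= epsilon_decay_rate

for _ in range(100):         # 100 good episodes later...
    decay_epsilon()          # ...epsilon has shrunk to about 0.37
```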
I'm very excited about this kind of approach, and my current implementation works, but it is usually much slower to solve the environment than the custom function. The epsilon_from_uncertainty function works by looking at the distance between the logit outputs of the LSTM, the absolute value of their difference, and exploring more when they're closer; as the LSTM gets more training, the logit outputs diverge more for inputs on which it was trained. The beauty here is that this distance acts as a proxy for how much training we've had for our current state, and hence how confident we can be acting on that training. When the logit outputs are close we explore a lot, and when they are relatively distant we explore less. There is a bit of tuning involved in this function, but it is based on the observation that the logit outputs usually don't differ by more than 3.0, and that differences of around 1.5 reflect a good deal of training for the current state. As mentioned, this method does cause the agent to learn more slowly than the others, but that is because it gives the agent more agency in choosing actions and relies less on human tuning, which makes it more of a pure machine learning implementation.
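A hypothetical sketch of the idea, using the 3.0 and 1.5 anchor points mentioned above; the exact linear mapping is my illustration, not the original function:

```python
# Map the gap between the two logits to an exploration rate: a small gap
# means little training for this state, so explore more.
def epsilon_from_uncertainty(logit_a, logit_b, max_gap=3.0):
    gap = abs(logit_a - logit_b)   # small gap = high uncertainty
    gap = min(gap, max_gap)        # gaps rarely exceed ~3.0 in practice
    return 1.0 - gap / max_gap     # close logits -> explore a lot

eps_untrained = epsilon_from_uncertainty(0.1, 0.2)  # nearly equal logits
eps_trained = epsilon_from_uncertainty(2.0, 0.5)    # gap of 1.5, well trained
```

With this mapping a gap of 1.5 lands at ε = 0.5, and unlike a fixed decay schedule, ε climbs back up whenever the agent wanders into a state where the logits are close again.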
The Cartpole environment is part of a class of environments called MDPs, or Markov decision processes. One of the defining aspects of an MDP is that you do not need knowledge of past states in order to make an optimal decision; knowing your current state should be all you need. So why do we need an architecture with memory at all, be it external for experience replay or built-in like an LSTM? The answer is regularization of the neural network; this is related to the problem of overfitting in supervised learning, except in reinforcement learning we usually have to worry more about overfitting on recent training data due to the overwriting effects of backprop on older but still relevant approximations. Thus there is a difference between using history to enhance a function approximator for an MDP and using it to make decisions. There is a good deal of nuance here, however, because things that happened in the past can be incorporated into the current state of an MDP without violating its defining properties. The point to remember is that making the correct decision in an MDP need only depend on the current state.
Another aspect that Cartpole shares with most MDP environments is a finite action set. Thus, training a neural net to take the correct action based on the current state equates to training a classifier that maps states to actions. As mentioned above, however, the approximation adjustments made to the network weights by recent or more frequent training states can overwrite those from past or rarer states. So regularization methods are a way of trying to maintain the fidelity of past training, or training on rarer states, while still being open to training on new states. This is why memory structures that can maintain long-term memories with fidelity are such powerful regularization methods, and this is what we'll be exploring in Part 2.
... Updates to come...
The whole program:
I'll be using this post to give some insight into my methodology as I work through the TalkingData Kaggle challenge. This is a supervised learning challenge and I'll be using Python with its various packages to tackle it (packages include: pandas, numpy, keras, sklearn, and seaborn). Below is a description of the challenge and a link to the Kaggle page:
Kaggle page: www.kaggle.com/c/talkingdata-adtracking-fraud-detection
Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.
TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to prevent click fraud for app developers is to measure the journey of a user’s click across their portfolio, and flag IP addresses who produce lots of clicks, but never end up installing apps. With this information, they've built an IP blacklist and device blacklist.
While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!
The competition is judged on how well your model predicts the likelihood that a datapoint from the provided test set belongs to one of two classes: did an ad click for an app end in an app download or not? The performance metric used is AUC, or Area Under the Curve. The curve in this case comes from a graph, displayed to the right, that is useful for weighting performance in a way that decreases false positives while increasing true positives.
In our example the blue line represents a model generating a high true positive rate and a low false positive rate. It is not obvious from the graph, but the AUC always ranges from 0 to 1: the ticks on the graph are percentages and actually represent 0 to 1 on each axis, so the maximum possible area, the y-axis length multiplied by the x-axis length, is 1 times 1, which equals 1.
This is a very popular metric to use when the classification problem involves a target class of sparse hits. For example, if your target is spam, but only 1% of your emails are spam, you could be right 99% of the time by just classifying everything as not spam. BUT, that's a terrible model for filtering spam: it never actually catches a single spam email, and a constant classifier like that scores an AUC of only 0.5, no better than chance.
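To make this concrete, here's a tiny dependency-free illustration of AUC as the probability that a randomly chosen positive is ranked above a randomly chosen negative; the 1%-spam numbers follow the example above.

```python
# AUC without any libraries: the fraction of (positive, negative) pairs
# where the positive gets the higher score (ties count as half).
def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    pairs = [(p, n) for p in pos for n in neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

labels = [0] * 99 + [1]          # 1% spam
constant_model = [0.0] * 100     # "everything is not spam": can't rank at all
good_model = [0.1] * 99 + [0.9]  # ranks the one spam email highest
```

The constant model scores every email identically, so its AUC is 0.5 despite being "right" 99% of the time, while the good model reaches 1.0.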
The data provided includes a training set for model training and a test set used for making the predictions to be submitted for ranking. One conspicuous fact about the training data is that it comes in the form of a csv almost 7GB in size. Suffice it to say that files this size are much too unwieldy to load in their entirety on most computers. This is where random sampling from the file is very useful. Here's one example of how to accomplish this using pandas:
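One way to do it, sketched here against a tiny in-memory csv standing in for train.csv: draw the row numbers you want to keep and hand the rest to read_csv's skiprows parameter, so the skipped rows are never loaded.

```python
import io
import random
import pandas as pd

# Tiny in-memory csv standing in for the ~7 GB train.csv.
csv_data = "ip,app,is_attributed\n" + "\n".join(
    f"{i},{i % 5},0" for i in range(100))

n_rows = 100        # total data rows in the file (illustrative)
sample_size = 10
# Skip the data rows (numbered 1..n_rows; row 0 is the header)
# that we did NOT draw in our random sample.
skip = sorted(random.sample(range(1, n_rows + 1), n_rows - sample_size))
train_sample = pd.read_csv(io.StringIO(csv_data), skiprows=skip)
```

With the real file you'd pass the path instead of the StringIO object; the skipped rows never touch memory, which is the whole point.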
The columns in the training set are ip, app, os, device, channel, click_time, attributed_time, and is_attributed. While the test set omits the attribution columns, since attribution is the variable of predictive interest, it adds a click_id column.
Cleaning and understanding data are the first steps to creating a useful model. As an example the click_id column in the test set has several values out of order. If we don't make sure that we sort the data set based on this column before submitting our predictions we will take a hit, albeit a small one.
Formatting data for your algorithm of choice is also important, and this is somewhat related to feature engineering. For example, the click_time column of the training set is a string object. How can we expect most out-of-the-box algorithms like k-nearest neighbors or decision trees to use this string, much less understand the underlying temporal implications of its date-time formatting? What in this string is even useful? The string includes the year and the month of the click time, but since all of the values for both the training and test sets are from November 2017, these aspects of click_time are not going to contribute to the predictive power of our model and are better discarded for simplification purposes. On the other hand, the training data spans several days, so it may be the case that there are cyclical trends in the hours when click bots are more active. Thus making sure that our algorithm can access the hours as an explicit feature may be important. This brings us to our last point: making sure our data works with the algorithm of interest.
The data and features used for a decision tree algorithm aren't likely to match those for a neural net. For example, even if you've transformed your time column so that the variables of interest are explicit enough to be utilized by either a decision tree or a recurrent neural net (RNN), the RNN may be able to implicitly engineer features of time, such as clicks per hour, because it has a kind of memory built into the model; this would not be the case for the decision tree, so features such as summary statistics about each ip's click times may be needed to compensate.
Much of the work here will be utilizing AWS Sagemaker. The training dataset is too big, and the compute power needed for experimenting with algorithms too much, for my home setup. That said, I'll only be diving in enough to get a taste for the competition. I could throw $1000 at AWS tweaking and experimenting with algorithms, but my goals for the competition are of an introductory nature; in particular, I'm new to Kaggle and have never been in one of their competitions.
Start simple, that's what they say. I started out running a variety of algorithms on small subsets of the data to get a feel for what might be better suited. After tweaking some decision trees, svm's, and logistic regressors, I decided to focus on trees, specifically random forests. At this point I formatted my data to make the day, hour, min, and sec values for click times into their own columns, making each an explicitly available, splittable feature. Trees work by splitting the data on features that have discriminatory power. Said another way, they utilize features that skew toward the target classes of interest.
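A sketch of that formatting step with pandas; the two-row frame stands in for the real training set:

```python
import pandas as pd

# Toy frame standing in for the training set's click_time column.
df = pd.DataFrame({"click_time": ["2017-11-07 09:30:38",
                                  "2017-11-09 13:40:27"]})
ts = pd.to_datetime(df["click_time"])  # parse the raw string once

# Year and month are constant (November 2017), so keep only what varies:
# each part becomes its own explicitly available, splittable column.
df["day"] = ts.dt.day
df["hour"] = ts.dt.hour
df["min"] = ts.dt.minute
df["sec"] = ts.dt.second
df = df.drop(columns=["click_time"])
```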
Below is an example of a Random Forest model that I used for one of my Kaggle submissions. The tweaked parameters for this model include a max of 6 features for splitting any given tree, a max depth of 10 nodes, and 20 trees. As you can see, I used the top 100,000,000 training samples to fit the model; there are roughly 190,000,000 samples in total. I used an ml.m4.4xlarge instance for the Sagemaker notebook, and the model took about 6 hours to train*. The AUC generated by this submission was 0.9433, which while not terrible was not close to the 0.96 I was getting from my cross-validation experiments. That puts my rank at 2541 out of 3347, which means it's time to roll up my sleeves. It's worth pointing out that this competition is very ... competitive. Several hundred positions can span only a 0.001 change in value.
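Since the listing isn't shown here, a sketch of a comparable scikit-learn model with the parameters described above; the random toy data stands in for the 100M-row training slice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: 8 feature columns, a separable-ish target.
rng = np.random.RandomState(0)
X = rng.rand(200, 8)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Parameters from the submission described above: 20 trees, depth capped
# at 10, at most 6 features considered per split.
model = RandomForestClassifier(
    n_estimators=20, max_depth=10, max_features=6, random_state=0)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]  # class-1 likelihoods for the AUC metric
```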
*This was unnecessary in hindsight; scikit-learn's models use only one CPU core by default, so unless more RAM is what you need, scaling up usually won't get a model trained faster. I used mostly ml.m4.2xlarge instances for training and the base instances for cleaning and analysis.
Exploratory data analysis had revealed that the app, os, and channel columns were the most predictive columns, so I tried simplifying my model. Using the standard Random Forest classifier with a maximum depth of 10 and 50 trees I was actually able to increase my score to 0.9509. But the treadmill keeps turning and this did not improve my ranking by much.
Having been encouraged by the results of my model simplification, I decided to revisit logistic regression. I had listened to a talk where someone who'd won several data competitions said that all of their winning models used logistic regression classifiers with high-dimensional data. Well, logistic regression doesn't work out of the box for categorical data; you need to coax your data into being compatible by using indicator variable columns for each category of interest. Unfortunately, for this dataset the category sets aren't small: there are more than 500 values in each of the os, app, and channel categories; high-dimensional data indeed. What makes this really unfortunate is that adding indicator columns for each categorical variable of interest can increase memory use drastically. For example, when I investigated using dummy (indicator) variables on a single day of the roughly three available days of the dataset, the memory needed to hold the pandas dataframe increased from around 1 GB to around 1 TB, a thousandfold increase. And that's with only three variables: os, app, and channel. Needless to say, I did not find this path very promising given my computing constraints.
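A tiny demonstration of why the indicator columns blow up: pd.get_dummies gives every distinct category value its own column, and even six distinct values across three columns already doubles the frame's width. Scale that to 500+ values per category and the frame becomes enormous.

```python
import pandas as pd

# Three categorical columns with a handful of distinct values each,
# standing in for the real os / app / channel columns.
df = pd.DataFrame({"os": [1, 2, 1], "app": [3, 3, 4], "channel": [5, 6, 5]})

# One indicator column per distinct (column, value) pair: 3 columns in,
# 6 columns out here; 500+ values per category in the real data.
dummies = pd.get_dummies(df, columns=["os", "app", "channel"])
```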
Next I tried something I was very curious about: what would happen if I threw out all of the fancy scikit-learn models (many of which I've not covered) and just did some probability estimation? Again using my three favorite categories, os, app, and channel, I created a set of all of the permutations of those categories represented in the whole 8 GB training dataset. Then I just calculated the odds that those permutations generated an attribution by dividing their total attributions by their total occurrences in the dataset. To generate a prediction for the test dataset I used the odds generated by the training set for the permutations the two sets had in common, and used the average attribution likelihood for the training set as a whole where a test set permutation was not represented in the training set. How well did this perform? It was my best score thus far, 0.9549, and a big improvement by the standards of this competition. Unfortunately, the competition is fierce, and hundreds of new participants have entered since I began; my current standing is 2712 out of 3869. That said, there's a lot of potential improvement for this approach, and I'll have a new desktop in a few days that I'll need to break in.
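A sketch of this estimation scheme on toy frames; the column names match the dataset, everything else is illustrative:

```python
import pandas as pd

# Toy training set standing in for the real one.
train = pd.DataFrame({
    "os":            [1, 1, 2, 2, 2],
    "app":           [3, 3, 3, 4, 4],
    "channel":       [5, 5, 5, 6, 6],
    "is_attributed": [1, 0, 0, 1, 1],
})

# Attribution rate for each (os, app, channel) combination seen in training:
# total attributions divided by total occurrences.
odds = train.groupby(["os", "app", "channel"])["is_attributed"].mean()

# Overall average attribution likelihood, used for unseen combinations.
fallback = train["is_attributed"].mean()

# Predict: look up each test row's combination, fall back when it's unseen.
test = pd.DataFrame({"os": [1, 9], "app": [3, 9], "channel": [5, 9]})
keys = list(zip(test["os"], test["app"], test["channel"]))
preds = [odds.get(k, fallback) for k in keys]
```

Here the first test row matches a training combination (rate 0.5) and the second is unseen, so it gets the overall rate of 0.6.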
Unfortunately this competition ended on the 7th of May, and I got my new setup a few days later. The timeline wasn't actually that bad; I just entered the competition a month after it had already begun. Great learning experience overall. Thanks for reading.
This was my capstone machine learning project with Udacity. I've embedded the document below as a Scribd frame. Tags: neural network, OpenAI, Tensorflow, parameter search, machine learning