I'll be using this post to give some insight into my methodology as I work through the TalkingData Kaggle challenge. This is a supervised learning challenge and I'll be using Python with its various packages to tackle it (packages include: pandas, numpy, keras, sklearn, and seaborn). Below is a description of the challenge and a link to the Kaggle page:
Kaggle page: www.kaggle.com/c/talkingdata-adtracking-fraud-detection
Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.
TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to prevent click fraud for app developers is to measure the journey of a user’s click across their portfolio, and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they've built an IP blacklist and device blacklist.
While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!
The competition is judged on how well your model predicts the likelihood that each datapoint from the provided test set falls into one of two classes: did an ad click for an app end in an app download or not? The performance metric is AUC, the Area Under the Curve. The curve in question is the ROC (receiver operating characteristic) curve, which plots a model's true positive rate against its false positive rate as the classification threshold varies; it is useful for weighting performance in a way that rewards true positives while penalizing false positives.

A good model traces a curve that bows toward the upper left of the plot, reflecting a high true positive rate at a low false positive rate. The AUC always ranges from 0 to 1: both axes are rates running from 0 to 1, so the whole plot has area 1 times 1 = 1, and the area under any curve within it must fall between 0 and 1.
This is a very popular metric to use when the classification problem involves a sparse target class. For example, if your target is spam, but only 1% of your emails are spam, you could achieve 99% accuracy by simply classifying everything as not spam. BUT that's a terrible model for filtering spam: it never catches a single spam email, so its true positive rate is zero and its AUC would be no better than chance.
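To make the metric concrete, here's a minimal sketch using scikit-learn's roc_auc_score, with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Imbalanced toy labels: only two positives among many negatives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A model that scores the true positives highest gets a perfect AUC...
good_scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.9, 0.8])
print(roc_auc_score(y_true, good_scores))  # 1.0

# ...while a constant "everything is negative" model is no better than chance.
constant_scores = np.zeros(10)
print(roc_auc_score(y_true, constant_scores))  # 0.5
```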
The data provided includes a training set for model training, and a test set which is used for making the predictions to be submitted for ranking. One conspicuous fact about the training data is that it comes in the form of a csv almost 7GB in size. Suffice it to say that files this size are much too unwieldy for loading in their entirety on most computers. This is were random sampling from the file is very useful. Heres one example of how to accomplish this using pandas:
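A minimal sketch of the approach (the file name and sampling fraction are placeholders): pandas accepts a callable for skiprows, which lets us randomly drop rows as the file streams in, without ever holding the whole csv in memory.

```python
import random
import pandas as pd

TRAIN_PATH = "train.csv"   # assumed file name for the ~7GB training csv
SAMPLE_FRAC = 0.01         # keep roughly 1% of the ~190 million rows

# skiprows is evaluated per row index; returning True skips that row.
# Row 0 is the header, so it is always kept.
sample = pd.read_csv(
    TRAIN_PATH,
    skiprows=lambda i: i > 0 and random.random() > SAMPLE_FRAC,
)
print(len(sample))  # roughly 1.9 million rows, easily workable in memory
```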
The columns in the training set are ip, app, os, device, channel, click_time, attributed_time, and is_attributed. The test set omits the two attribution columns, since attribution is the variable of predictive interest, and adds a click_id column.
Cleaning and understanding data are the first steps to creating a useful model. As an example, the click_id column in the test set has several values out of order. If we don't make sure to sort the dataset on this column before submitting our predictions, we will take a hit, albeit a small one.
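A sketch of what that looks like, assuming the standard file and column names for this competition:

```python
import pandas as pd

# Sort the test set by click_id before predicting, so prediction rows
# line up with the ids Kaggle expects (file name is an assumption).
test = pd.read_csv("test.csv").sort_values("click_id").reset_index(drop=True)

# After predicting, the submission pairs each click_id with its probability:
# submission = pd.DataFrame({"click_id": test["click_id"],
#                            "is_attributed": predictions})
# submission.to_csv("submission.csv", index=False)
```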
Formatting data for your algorithm of choice is also important, and this is somewhat related to feature engineering. For example, the click_time column of the training set is a string object. How can we expect most out-of-the-box algorithms like k-nearest neighbors or decision trees to use this string, much less understand the underlying temporal implications of its date-time formatting? What in this string is even useful? The string includes the year and month of the click, but since all of the values for both the training and test sets are from November 2017, these aspects of click_time are not going to contribute to the predictive power of our model and are better discarded for simplification purposes. On the other hand, the training data spans several days, so it may be the case that there are cyclical trends in the hours when click bots are most active. Making sure that our algorithm can access the hour as an explicit feature may therefore be important, as in the sketch below. This brings us to our last point: making sure our data works with the algorithm of interest.
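A minimal sketch of that extraction with pandas, continuing from the sampled dataframe above:

```python
import pandas as pd

# Parse the click_time strings (e.g. "2017-11-07 09:30:38") into datetimes,
# then expose the useful parts as their own numeric columns.
sample["click_time"] = pd.to_datetime(sample["click_time"])
sample["day"] = sample["click_time"].dt.day
sample["hour"] = sample["click_time"].dt.hour
sample["minute"] = sample["click_time"].dt.minute
sample["second"] = sample["click_time"].dt.second
```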
The data and features used for a decision tree algorithm aren't likely to match those used for a neural net. For example, even if you've transformed your time column so that the variables of interest are explicit enough to be utilized by either a decision tree or a recurrent neural network (RNN), the RNN may be able to implicitly engineer temporal features such as clicks per hour because it has a kind of memory built into the model. This would not be the case for the decision tree algorithm, so features such as summary statistics about each ip's click times may be needed to compensate.
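As a hedged example of one such compensating feature, here's a per-ip clicks-per-hour count, again built on the sampled dataframe:

```python
# One compensating feature for tree models: how many clicks each ip
# generated within each hour of the data. Heavy clickers stand out.
sample["ip_hourly_clicks"] = (
    sample.groupby(["ip", "day", "hour"])["click_time"]
    .transform("count")
)
```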
Much of the work here will utilize AWS Sagemaker. The training dataset is too big, and the compute power needed for experimenting with algorithms too much, for my home setup. That said, I'll only be diving in enough to get a taste for the competition. I could throw $1000 at AWS tweaking and experimenting with algorithms, but my goals for the competition are of an introductory nature; in particular, I'm new to Kaggle and have never entered one of their competitions.
Start simple, that's what they say. I started out running a variety of algorithms on small subsets of the data to get a feel for what might be better suited. After tweaking some decision trees, SVMs, and logistic regressors, I decided to focus on trees, specifically random forests. At this point I formatted my data to split the day, hour, minute, and second values of the click times into their own columns, making each an explicitly available, splittable feature. Trees work by splitting the data on features that have discriminatory power. Said another way, they utilize features whose values skew toward the target classes of interest.
Below is an example of the Random Forest model that I used for one of my Kaggle submissions. The tweaked parameters for this model include a max of 6 features considered at any split, a max depth of 10, and 20 trees. I used the top 100,000,000 training samples to fit the model; there are roughly 190,000,000 samples in total. I used a ml.m4.4xlarge instance for the Sagemaker notebook, and the model took about 6 hours to train*. The AUC generated by this submission was 0.9433, which while not terrible was not close to the 0.96 I was getting from my cross-validation experiments. That puts my rank at 2541 out of 3347, which means it's time to roll up my sleeves. It's worth pointing out that this competition is very ... competitive. Several hundred positions can span only a 0.001 change in score.
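Here's a sketch of an equivalent model in scikit-learn; the exact feature list is an assumption, and the sampled dataframe stands in for the 100-million-row slice used in the actual run:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed feature list: the raw categorical ids plus the engineered time columns.
FEATURES = ["ip", "app", "os", "device", "channel",
            "day", "hour", "minute", "second"]

# Parameters described above: 20 trees, depth capped at 10,
# and at most 6 features considered at each split.
model = RandomForestClassifier(n_estimators=20, max_depth=10, max_features=6)
model.fit(sample[FEATURES], sample["is_attributed"])

# AUC is scored on probabilities, so after applying the same time-column
# engineering to the test set, submit the positive-class probability:
# probabilities = model.predict_proba(test[FEATURES])[:, 1]
```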
*This was unnecessary in hindsight: by default scikit-learn trains on a single CPU core (RandomForestClassifier does accept an n_jobs parameter to parallelize across cores), so unless RAM is the bottleneck, scaling up usually won't get a model trained faster. I used mostly ml.m4.2xlarge instances for training and the base instances for cleaning and analysis.
Exploratory data analysis had revealed that the app, os, and channel columns were the most predictive columns, so I tried simplifying my model. Using the standard Random Forest classifier with a maximum depth of 10 and 50 trees I was actually able to increase my score to 0.9509. But the treadmill keeps turning and this did not improve my ranking by much.
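A sketch of both steps, reusing the model and feature list from the previous snippet: feature_importances_ gives a quick read on which columns the forest leaned on, and refitting on the strongest three is short work.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Rank columns by their contribution to the earlier forest's splits;
# per the EDA described above, app, os, and channel rank highest.
importances = pd.Series(model.feature_importances_, index=FEATURES)
print(importances.sort_values(ascending=False))

# The simplified model: a standard forest, with only depth and tree count set.
simple_model = RandomForestClassifier(n_estimators=50, max_depth=10)
simple_model.fit(sample[["app", "os", "channel"]], sample["is_attributed"])
```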
Having been encouraged by the results of my model simplification, I decided to revisit logistic regression. I had listened to a talk where someone who'd won several data competitions said that all of their winning models used logistic regression classifiers with high-dimensional data. Well, logistic regression doesn't work out of the box for categorical data; you need to coax your data into being compatible by adding indicator variable columns for each category of interest. Unfortunately, for this dataset the category sets aren't small: there are more than 500 values in each of the os, app, and channel columns; high-dimensional data indeed. What makes this really unfortunate is that when you add indicator columns for each categorical variable of interest, the increase in memory used can be drastic. For example, when I investigated using dummy (indicator) variables on a single day of the roughly three available days of the dataset, the memory needed to hold the pandas dataframe increased from around 1 GB to around 1 TB, a thousandfold increase. And that's with only three variables: os, app, and channel. Needless to say, I did not find this path very promising given my computing constraints.
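For illustration, here's roughly what that experiment looks like with pandas, where one_day is assumed to hold a single day's rows; get_dummies also has a sparse option that would blunt the memory blowup, though I didn't pursue it:

```python
import pandas as pd

# One-hot encode the three categorical columns. Each distinct value in
# os, app, and channel becomes its own 0/1 indicator column.
encoded = pd.get_dummies(one_day, columns=["os", "app", "channel"])

print(encoded.shape)  # width jumps from a handful of columns to 1500+
print(encoded.memory_usage(deep=True).sum() / 1e9, "GB")

# pd.get_dummies(..., sparse=True) stores the indicators sparsely and
# avoids most of the blowup, at the cost of some friction downstream.
```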
Next I tried something I was very curious about. What would happen if I threw out all of the fancy scikit-learn models (many of which I've not covered) and just did some probability estimation? Again using my three favorite categories, os, app, and channel, I built the set of all (os, app, channel) combinations represented in the whole ~7 GB training dataset. Then I calculated the odds that each combination generated an attribution by dividing its total attributions by its total occurrences in the dataset. To generate a prediction for the test dataset I used the training-set odds for the combinations the two sets had in common, and fell back on the average attribution likelihood of the training set as a whole where a test-set combination was not represented in the training data. How well did this perform? It was my best score thus far, 0.9549, and a big improvement by the standards of this competition. Unfortunately, the competition is fierce, and hundreds of new participants have entered since I began; my current standing is 2712 out of 3869. That said, there's a lot of potential improvement left in this approach, and I'll have a new desktop in a few days that I'll need to break in.
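In pandas the whole scheme reduces to a groupby and a merge; a sketch, assuming train and test dataframes with the columns described earlier:

```python
KEYS = ["os", "app", "channel"]

# Attribution rate for every (os, app, channel) combination in the training
# set: since is_attributed is 0/1, the mean is attributions / occurrences.
rates = (
    train.groupby(KEYS)["is_attributed"]
    .mean()
    .rename("is_attributed_pred")
    .reset_index()
)

# Attach those rates to the test rows; combinations never seen in training
# fall back to the global average attribution rate.
predictions = test.merge(rates, on=KEYS, how="left")
predictions["is_attributed_pred"] = predictions["is_attributed_pred"].fillna(
    train["is_attributed"].mean()
)
```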
Unfortunately this competition ended on the 7th of May, and my new setup arrived some days later. The timeline wasn't actually that bad; I just entered the competition a month after it had already begun. Great learning experience overall. Thanks for reading.