This week I have focused my energy on learning machine learning tools using the Python package SciPy. I purchased the book Building Machine Learning Systems with Python and have completed about half of the exercises. The takeaway message is that there are two kinds of statistical models: heuristic models and machine learning models.
Here is an example I made up demonstrating the difference:
We own an ice cream shop and want to predict the best days, locations, and times to sell ice cream. So we build a heuristic statistical model from our past sales, using the location of each sale, the temperature, and the time of sale as inputs. We model these parameters against the number of ice cream cones sold, which is our y or dependent variable. At the end of our modeling we find that higher temperatures, afternoon hours, and areas of high foot traffic are all positively correlated with cones sold. On the plus side, we now have a deeper understanding of how each variable affects the outcome (cones sold). On the negative side, we trained our statistical model on all of our past sales, so we may have over-fit it. That means our model is not very robust: if we are selling ice cream on an especially hot and busy day, the model will not properly estimate how many cones we will sell, because most of the days it was trained on were moderately warm with moderate foot traffic. This is bad, because if we know ahead of time that the day will be warmer and busier than usual, we want to bring enough gallons of ice cream to meet demand.
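To make that concrete, here is a rough sketch of the "fit on everything" approach. I'm using scikit-learn here rather than plain SciPy, and the data (temperature, foot traffic, hour of sale) is completely made up for illustration. The point is only that the model sees every past sale, and the score we print is computed on that very same data, so it tends to look better than the model really is.

```python
# A rough sketch of the heuristic "fit on everything" approach.
# scikit-learn and the fake data below are my own additions for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up historical data: temperature (F), foot traffic (people/hr), hour of sale
X = np.column_stack([
    rng.normal(75, 8, 200),
    rng.normal(120, 30, 200),
    rng.integers(10, 21, 200),
])
# Made-up cones sold, driven by those same variables plus noise
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(0, 5, 200)

model = LinearRegression().fit(X, y)       # trained on every past sale
print("coefficients:", model.coef_)        # interpretable: each variable's effect on cones sold
print("R^2 on training data:", model.score(X, y))  # optimistic -- scored on the data we fit on
```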
This is where machine learning comes in handy. Machine learning uses the hold-out method, which sets aside part of our past ice cream transactions and does not train the model on them. We then create a plethora of ice cream sale models and test each model's accuracy on the creamy transactions we left out of training. You see, most models predict accurately on the data points they were trained on; it is when they face new data that they are really put to the test. By holding out ice cream sales and testing the model on them, we simulate introducing data the model has never seen before, and we can choose the model that performs best on this untrained data. So if we have an unusually warm and busy day, we can use our machine learning algorithm to more accurately predict that day's ice cream sales and come prepared with X gallons of ice cream. Delicious, I know!
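And here is a rough sketch of the hold-out idea, again with scikit-learn (my own choice, not something the book is quoted on here) and the same made-up data: set some transactions aside with train_test_split, fit a couple of candidate models on the rest, and keep whichever one predicts the held-out sales best.

```python
# A rough sketch of the hold-out method. The candidate models and all data
# are illustrative assumptions, not anything from my actual shop records.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Fake historical sales again: temperature, foot traffic, hour of sale -> cones sold
X = np.column_stack([
    rng.normal(75, 8, 200),
    rng.normal(120, 30, 200),
    rng.integers(10, 21, 200),
])
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(0, 5, 200)

# Hold out 25% of past transactions; the models never train on these
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each candidate on the training slice, then score it on the held-out sales
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(f"{name}: held-out R^2 = {model.score(X_test, y_test):.3f}")
```

Whichever model scores best on the held-out transactions is the one I would trust to forecast cone demand on that unusually hot, busy day.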