Back to Basics: Data Science Applied and Theoretical

One of the awesome data prediction projects I had the privilege of working on as of late is helping the Department of Energy and Lawrence Berkeley National Laboratory improve their Home Energy Score (HES). This single value metric which ranges from 1 – 10 judges the performance of the energy efficiency of a given home regardless of user behavior. In other words this scoring system allows new home buyers to look at household energy performance based on this score without having to take into account the prior users’ behavior. HES is a relatively simple scoring system compared to the California Home Energy Rating System (HERS). I needed to conduct predictive analysis based on the HES features to predict HERS outcomes and find out if the simpler HES model captures the variability of the more complex HERS model.

Click below to see the Home Energy Score Viability Report
Home Energy Score Predictive Analysis

I ended up using Multi-variate Linear Regression, Random Forest, and Support Vector Machines using a repeated 10-fold cross-validation as my resampling method. I read up on the book “Applied Predictive Modeling” and gained more appreciation for the importance of pre-processing. Why the skew of the distribution matters for model A, why multicolinearity can hamper multi-variate regression but not decision trees. I also expanded my modeling toolbox to include more complex methods including Support Vector Machines (SVM) and single layer neural networks.

I highly recommend Applied Predictive Modeling and see the link below:

Applied Predictive Modeling – by Max Kuhn and Kjell Johnson

Other interesting books I’ve been reading lately:

Doing Data Science – Straight Talk From The Frontline – by Cathy O’Neil and Rachel Schutt

Programming – Collective Intelligence – by Toby Seagran

Data Analysis Using Regression and Multilevel/Hierarchical Models – by Andrew Gelman and Jennifer Hill