JOHN M. HORN
University of California, Irvine 2013
M.S. in Physical Science – Big Data Spatial Analysis Focus
Stanford University, Stanford CA 2010 – 2012
M.S. in Civil and Environmental Engineering – Focus in Statistical and Spatial Modeling
University of California, San Diego 2006 – 2010
B.S. in Chemistry – Air Pollution Research and Modeling
Data Science Certificates
Johns Hopkins Bloomberg School
Conducted Big Data Analysis on Millions of Homes to Create a Valuation Engine
In collaboration with the White House and third-party groups, developed an algorithm to assess the impact of PACE assessments on home valuations. Results were published in a White House briefing. Used AWS and Python to complete the analysis.
Conducted Million-Dollar Data-Driven Mailing Campaign Based on Regression and Experiment Results
Applied multivariate regression to prior mailing campaigns, using T-SQL and R, to optimize future campaigns for the target demographic based on housing data, demographics, financial data, and other features.
Experience leading and teaching internal teams the latest methods for reproducible research in both R and Python using knitr and Python notebooks
Shifted business practice away from one-off Excel analysis toward reproducible research that can be passed from one analyst to another and interpreted using Markdown, Python notebooks, and R knitr. This practice raised the quality of work and served as better documentation of analyses, assumptions, and findings.
Experience working Collaboratively within a Data Team using Git and Jira to validate and build models
Worked within Business Intelligence to serve multiple stakeholders in delivering models across multiple environments from development to production. Used Jira to create features and manage stories to ensure timely delivery.
Experience using Amazon EC2 and S3 to scale large data analysis for mining and predictive models within a Linux framework
Used the combined power of multiple EC2 instances to analyze millions of potential customers and identify the optimal target customer. This shift to cloud-based computing, together with graph computation, reduced computation time from years to hours.
Experience conducting Regressions and Machine Learning across multiple verticals and product lines in an Agile Workflow
On a two-week sprint schedule at a high-growth startup, used Random Forest, Support Vector Machines, and standard multivariate linear regression to create robust predictions and explain key features to multiple teams, including Marketing, Business Development, Engineering, Compliance, and Product Management.
Experience Using and Building APIs as well as Handling Data in Multiple Structures including JSON, XML, Relational SQL Databases, and Transaction Data
Used third-party APIs and helped build an internal API to deliver data from back end to front end and serve customers customized experiences in real time.
Recent Work Projects
Created The World’s Most Accurate Solar Payback Analyzer Using Green Button
I combined Green Button data, PVWatts, and OpenEI’s Utility Rate Database into a single application that lets prospective solar customers type in their address, upload their Green Button data, size their solar system, and see their new utility bill after solar.
Try the Application
Green Button Data Analysis: Calculating the Economic Benefit of Switching to Time-of-Use Rates Using Smart Grid Data – 2014
Analyzed the hourly energy use profiles of hundreds of electric vehicle owners across three major California utilities (PG&E, SCE, and SDG&E). By analyzing the smart meter data in Python, found that about 60% of residential EV customers are on a higher-cost Domestic Rate tariff instead of a time-of-use tariff, costing EV customers on average $800 more a year.
See Poster Here
Applied Machine Learning in Python and R (Random Forest and Support Vector Machines) to Analyze Hundreds of Buildings in San Diego – 2014 – 2015
The goal of this project was to mine XML file parameters, automate analysis, and use machine learning to produce the most accurate predictions possible. I created both Python and R scripts, using pandas, NumPy, and scikit-learn as well as the R caret package, to run statistical regressions and machine learning algorithms. Presented results to the Dept. of Energy and the California Energy Commission.
See Report Here
Large Data Mining of Thousands of Solar Panels Using the Python pandas Package in Support of Vanderbilt University Machine Learning Group – 2014
Was presented with thousands of individual Excel files, each in its own folder, containing time series data on energy use. Created a script to crawl through each subfolder, read the Excel documents, extract the time series of interest, and store the data in a pandas DataFrame. Analyzed this time series data using the Python data analysis library to show that customers purchasing solar use more energy than non-solar customers. The data was used to support the Vanderbilt University Machine Learning Research Group in creating an Agent-Based Model.
Used Feature Selection, Multivariate Regression, and Cross Validation to Optimize Target Customer Selection in R – 2014
By mining County Assessor data, SDG&E residential energy use data, NOAA weather station data, and other parameters, created a multivariate regression model that maps which homes in San Diego are most likely to participate in energy efficiency upgrades. Presented results at the San Diego Regional Energy Partnership Meeting.
Bay Area Rapid Transit Solar Planning – 2014
In partnership with Arup, used GIS, NREL PVWatts solar modeling, SQL, and spatial analysis to identify key regions along BART-owned property to place solar panels to power local transport.
Developed Commercial Solar Rate Calculator Web Application – 2014
Designed an interactive Commercial Solar Rate Calculator that allows a user to enter energy usage data and select a utility tariff and solar system size to receive customized estimates of the costs and benefits of switching to solar in SDG&E territory.
Modeled, Predicted, and Mapped Optimal Locations for Electric Vehicle Chargers across California – 2014
Was presented with the challenge of deciding where to place electric vehicle chargers in the Central Valley. Created a Python script to gather data from Google on the locations of all conveniences throughout the valley in order to optimize placement of chargers in dense regions.
Created a front-end Access database input form for users to enter Property Assessed Clean Energy (PACE) programs. The data is then automatically extracted using SQL, uploaded through the Google Fusion Tables API, and visualized.
Created Spatial/Statistical Model for targeted Home Energy Efficiency Upgrades – 2014
Using County Assessor data, SDG&E residential energy use data, weather station data, and other parameters, created a statistical model that maps which homes in San Diego are most likely to participate in energy efficiency upgrades.
Used R and Python for Imagery Data Extraction and Analysis, Model Regularization, and Multivariate Regression in Collaboration with Orange County Municipal Water District and UC Irvine – 2013
Automated the extraction, analysis, and modeling of satellite imagery of household lawns, in combination with ground truth data on actual household water consumption, in order to predict an individual's water consumption. Used R to select statistically significant features and predict the individual consumption of hundreds of thousands of customers.
Led a Team in Using R to Conduct Data Cleanup, Model Regularization, Model Selection, Sensitivity Analysis, and Multivariate Regression on Wind Farms for Stanford University – 2012
Using the R statistical software package in combination with ArcMap GIS, assessed the optimal wind farm locations in the continental United States using transmission line, road, wind speed, state renewable incentive, endangered species, and topographic data. The most important model features were selected using the Bayesian Information Criterion. Presented results to a Stanford panel of modeling experts.
Skills and Qualifications
Technical expertise in Multivariate Regression, Machine Learning, R Statistical Modeling, SQL, Data Mining, Python, Tableau, Hadoop, MapReduce, MATLAB, Git, Excel VBA, Linux, Remote Servers, Spatial Modeling, Data Visualization, RESTful APIs, and Python Statistical Modeling using NumPy, pandas, SciPy, scikit-learn, and Matplotlib. Active member of the Meetup organizations SD Big Data and SD Python.
Technical expertise in Machine Learning, Regression, R, Python, SQL, AWS, Linux, Statistical Modeling, Data Visualization, Tableau, and Git
Analyst: March 2015 – Present
Analyst: Center for Sustainable Energy, 2013 – 2015
Analysis Fellowship: United States Geological Survey, 2010 – 2013