Coursera Data Science Certificate: Update 1

For the past four months I have been working on the Verified Data Science Certificate offered by Johns Hopkins University in conjunction with Coursera. So far I have completed The Data Scientist’s Toolbox, R Programming, and Getting and Cleaning Data. Here is my opinion so far on the ease, applicability, and relevance of these three courses (there are nine in total, plus a capstone project).

The Data Scientist’s Toolbox is very informative and can bring someone up to speed on using Git, citing data, and so on. For my part, I was already accustomed to Git and GitHub, so this first course was relatively easy.

R Programming I found very informative, and the homework was definitely rigorous. Even though I have been using R since my first grad school class in Advanced Statistics at Stanford, I found that I could be doing things to make my life easier, and that I could be using apply, lapply, and tapply a lot more in my code rather than for loops. I felt this was a great module even for someone who has been using R for three-plus years.

Getting and Cleaning Data was instructive in connecting to a website and downloading data programmatically or through an API; in how to clean, sort, and pre-process that data; and in how to take advantage of R’s data tables. I had often used data frames and was not aware of the compute-time advantages of data tables, especially for large data sets.

For just $50 per course, I think this is one of the best cost-to-value certificates one can do for data science. Of course, there is the University of Washington’s online Data Science Certificate, which may carry more prestige, but I think this is a great investment nonetheless for someone looking for value in their data science training.

Here are links to my certificates so far:

The Data Scientist’s Toolbox
R Programming
Getting and Cleaning Data


Back to Basics: Data Science Applied and Theoretical

One of the awesome data prediction projects I have had the privilege of working on as of late is helping the Department of Energy and Lawrence Berkeley National Laboratory improve their Home Energy Score (HES). This single-value metric, which ranges from 1 to 10, rates the energy efficiency of a given home regardless of occupant behavior. In other words, this scoring system allows new home buyers to evaluate a household’s energy performance without having to take into account the prior occupants’ behavior. HES is a relatively simple scoring system compared to the California Home Energy Rating System (HERS). My task was to conduct predictive analysis on the HES features to predict HERS outcomes and find out whether the simpler HES model captures the variability of the more complex HERS model.

Click below to see the Home Energy Score Viability Report
Home Energy Score Predictive Analysis

I ended up using multivariate linear regression, random forests, and support vector machines, with repeated 10-fold cross-validation as my resampling method. I read the book “Applied Predictive Modeling” and gained more appreciation for the importance of pre-processing: why the skew of a distribution matters for one model, and why multicollinearity can hamper multivariate regression but not decision trees. I also expanded my modeling toolbox to include more complex methods, including support vector machines (SVMs) and single-layer neural networks.
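To make the resampling method concrete, here is a minimal, illustrative sketch of repeated 10-fold cross-validation for a plain linear model, using only NumPy. The data are synthetic stand-ins for the HES features and HERS scores; in practice a package like caret in R (or scikit-learn in Python) would handle this for all three model types.

```python
import numpy as np

# Synthetic stand-in data: 200 homes, 5 hypothetical HES-style features.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = rng.normal(size=p)
y = X @ true_beta + rng.normal(scale=0.5, size=n)  # noisy "HERS" target

def fit_ols(X, y):
    # Ordinary least squares with an intercept column.
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def repeated_kfold_rmse(X, y, k=10, repeats=3, seed=1):
    # Repeat the k-fold split several times with fresh shuffles,
    # and average the held-out RMSE across all folds and repeats.
    rng = np.random.default_rng(seed)
    rmses = []
    for _ in range(repeats):
        idx = rng.permutation(len(y))
        for fold in np.array_split(idx, k):
            train = np.setdiff1d(idx, fold)
            beta = fit_ols(X[train], y[train])
            resid = y[fold] - predict(beta, X[fold])
            rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))

print(round(repeated_kfold_rmse(X, y), 3))
```

Repeating the splits smooths out the luck of any single partition, which is why repeated k-fold is a common default when comparing several model families on the same data.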

I highly recommend Applied Predictive Modeling; see the link below:

Applied Predictive Modeling – by Max Kuhn and Kjell Johnson

Other interesting books I’ve been reading lately:

Doing Data Science – Straight Talk From The Frontline – by Cathy O’Neil and Rachel Schutt

Programming Collective Intelligence – by Toby Segaran

Data Analysis Using Regression and Multilevel/Hierarchical Models – by Andrew Gelman and Jennifer Hill

Great Books on agile production, work culture, leadership, and entrepreneurial management

For the past two months I have continued to learn, focusing mainly on JavaScript to produce applications quickly. However, all this time spent on production and fine details made me want to take a few steps back and improve myself from a higher frame of reference, namely best practices for managing high-risk projects where the customers, means, and goals can change rapidly. The reason I felt the urge to better understand agile project management and development is that I am very self-motivated, and I wanted to understand how self-motivated technical workers fit into a world where project management techniques are changing. After reading these books, my opinion of the role of project management is changing.

I wanted to know the qualities that make a great leader and how those leaders arrived there. The Lean Startup by Eric Ries taught me that small, dynamic projects should use data to validate their direction and progress as they develop. In other words, don’t plan steps 1 through 10 and use the delivery date as the metric of success; one can deliver a poor product that no one uses, on time and according to spec, all the while feeling incredibly productive. Instead, launch the product after step 3, gain feedback from the audience, see if there is traction, and don’t worry about negative opinion or reputation. You might find yourself changing direction based on the feedback and save yourself the time of working through steps 4 through 10.

Another great book that I took a lot from is Good to Great. It was published in 2001 and studied companies from the 1970s to the 2000s, but it still applies today, now more than ever. The point I took from this book is that before you strategize or rethink the direction of the company, get the right people in the right positions. In other words, before deciding where to steer the bus, get the right people on the bus. The right people always come first. I took my time selecting these books, and I would recommend all of them.

My opinion is that every male manager should read Lean In. I’m not necessarily saying that Sheryl Sandberg offers sage advice to all women, because not all women had the early opportunities that Sheryl Sandberg had, including Harvard. It is my opinion, however, that managers have the creative power to engineer their company’s work culture, and before digging the foundation and laying the scaffold of that culture, one must first ask who will be living in that culture and whether I would want to live there. I think Delivering Happiness by Tony Hsieh complements Lean In because it focuses on culture and how that culture helped Zappos go from a nearly bankrupt online company to the number one online shoe sales company and, more importantly, customer service company.

I must admit that when I first saw the title How to Win Friends and Influence People, I thought the book sounded shallow in intent and against every grain of my morality. Why would anyone study how to interact with people in order to get what they want? I think this book deserves a new edition that simply changes the title to “How to Communicate Humbly and Effectively”; it would attract more introverts who might be misled by the current title. Nonetheless, this book opened my eyes to how our country’s greatest leaders, from Abraham Lincoln to Benjamin Franklin, would fail and fail at debating until one day they decided to empathize with their constituents. To say instead: “You know, in your shoes I understand how you would come to that position, especially since …”; “I think that you are right, given the assumption that …”; “My only question is this, and forgive me, I may be wrong.” I think it is important to posit an idea in a way that allows and invites others to still add in their thoughts, without shutting the doors, by using the right words.

As for the last man in the pictures above, I bet you don’t know who that is. That man is Darwin E. Smith, one of the world’s greatest entrepreneurs, who transitioned Kimberly-Clark from a good company to a great company. Ask any MBA who their favorite entrepreneur is and they will mention Steve Jobs, Bill Gates, or Elon Musk; ask them what they think of Darwin E. Smith and they’ll stare at you blinking. Darwin E. Smith took the reins of a company that was in the business of paper mills. After battling cancer, being given a few months to live, and then working another 20 years as CEO, he came to the idea that one must chop one’s arm off if that is what it takes to live. So, as the paper milling business was making marginal income, he decided to sell all of the mills, to the criticism of the board, and went on to reinvest that capital into consumer paper products such as Huggies, Kleenex, Kotex, and Depend. The point being that cutting off non-productive business is just as important as, if not more important than, creating new business.

The biggest takeaway for me from Good to Great is that some of the world’s greatest leaders are those that no one ever recalls hearing about in the media: leaders who don’t boast, brag, or take credit for their achievements. In fact, Good to Great studied over 1,400 companies and found a negative correlation between a CEO bragging and boasting about him- or herself and the outcome of that company. Great leaders at great companies build and recruit a team so strong that when that great leader eventually leaves the company, someone equally or more capable is there to continue where they left off.

Rework by Jason Fried and David Heinemeier Hansson
Delivering Happiness by Tony Hsieh (Founder of Zappos)
Good to Great: Why Some Companies Make the Leap… and Others Don’t by Jim Collins
The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses by Eric Ries
How to Win Friends and Influence People by Dale Carnegie
Lean In for Graduates by Sheryl Sandberg

Google Fusion Table Maps and JavaScript: The Power


The challenge of displaying maps online is the how: how are we going to place a map online that lets viewers interact with the data quickly and snappily, and do it at low cost? Boxed solutions such as Tableau and ESRI ArcServer exist, but both have their respective pros and cons.

Tableau: Tableau costs money; however, it is a great piece of software for easily displaying spatial data from a source such as Access, Excel, or a .dbf file. I have noticed that Tableau maps are not as snappy as Google Fusion maps. For example, the ability to zoom in and click on a ZIP code within a state map is nowhere near as easy or natural as with a Google Fusion Table. So Tableau is preferable if you have the budget and don’t really need to zoom in or interact within the map itself.

ESRI ArcServer: Esri’s ArcServer has many strengths if you are already working in the ArcMap environment and have lots of .shp files in your library. One of the nice features is that you can easily set what the map should look like at different zoom levels. This is possible with Google Fusion Tables/Maps as well; however, you need to know some JavaScript for that functionality. The potential downside of ArcServer is that it also costs a fair amount of money, and you will need to update it occasionally, which can be tedious if you already have live maps up on your site using older versions of ArcServer. Not to mention, I have noticed that maps displayed with ArcServer don’t feel anywhere near as snappy or alive as Google Fusion Tables.

Google Fusion Tables: The benefit of Fusion Tables is that it is a free service provided by Google; you don’t need to pay to use or display maps on your website. This makes it a key candidate for non-profits, academia, and hobbyists alike. It is incredibly snappy: when you zoom in on a map you are running on top of Google Maps, so everything feels native and looks natural. The real challenge with Fusion Tables, if you want to go beyond displaying a simple map, is learning some JavaScript. If you simply want to display a map where a user can click on a polygon and see a pop-up with information, then you don’t need to learn JavaScript; you can simply do one Google Fusion Tables tutorial and start working. If you want to build a more interactive map, with a timeline bar you can scroll from 1990 to the present or the ability to change the polygons from ZIP codes to counties, then you will need to learn some JavaScript and HTML5. I prefer the Google Fusion Tables approach because the map products look incredibly professional and interactive, and the learning curve is an obstacle worth overcoming in this case.

IPython: Exploratory Data Analysis


If you are like me, you typically use R to handle all your in-house data analysis (you may also use other programs like Stata, SPSS, etc.). I will attempt to make the case that Python and IPython, which stands for Interactive Python, can fill those roles of data exploration, charting, and analysis. What the heck is IPython? Let me demonstrate.

With a plain Python script we usually follow a write, run, debug pattern: if we wrote a long script, we would have to run the entire script each time, making it difficult to debug and laborious to explore our data line by line. IPython, short for Interactive Python, allows you to run Python code line by line. If you are doing data analysis and are familiar with software like R, I highly recommend IPython, because it allows you to quickly debug something as simple as writing the correct file path for your .csv file.
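As a taste of that line-by-line workflow, here is a small, self-contained example of the kind of exploration I mean: loading a tiny CSV and summarizing a numeric column. The file contents and column names are made up for illustration; in a real IPython session you would type each line and inspect the result as you go.

```python
import csv
import io
import statistics

# A string stands in for a file on disk so the example is self-contained;
# in practice you would use open("your_data.csv") instead.
raw = io.StringIO("home_id,kwh\n1,320\n2,410\n3,290\n4,505\n")
rows = list(csv.DictReader(raw))
kwh = [float(r["kwh"]) for r in rows]

print(len(kwh))              # number of records → 4
print(statistics.mean(kwh))  # average consumption → 381.25
print(min(kwh), max(kwh))    # range → 290.0 505.0
```

In IPython each of those `print` lines would just be an expression you type at the prompt, with the result echoed back immediately.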

HTML5 + CSS + Javascript


We all have, or think we have, really great ideas that hit us in the shower: that revolutionary customer-facing web application that could change the industry. So what do we do next? We go out and find someone to build it for us. Err. Wrong. We outsource our project to someone who is well trained. Technically, yes, this might work very well and affordably; however, what happens when you want to make an addition to the web app? What happens when you are waiting days to weeks for this person to e-mail you back? This is the conundrum that would-be entrepreneurs and problem solvers face when trying to go from idea to fully functioning online solution.

In my experience, if you want it done right you have to do it yourself, and do it myself I will. That means understanding, appreciating, and learning the building blocks of the web. Why learn these scary languages? What value does a web application add to a company? Don’t I need to hire a computer science major to do this stuff? The answers, respectively, are “it’s fun,” “a ton,” and “you don’t need one.” Take a look at the class requirements for a computer science major: odds are you will see classes like “Intro to Java,” “Object-Oriented Programming,” and “Memory Management.” The truth is that present-day web development languages and best practices often aren’t taught to CS majors (academia is slow to change its curriculum; who wants to update their PowerPoint anyhow?). So if you aren’t a CS major, remember that you are starting out at roughly the same level as a new graduate. There are great free resources for learning HTML5, CSS, and JavaScript where you don’t need to download any special programs or read any books; you can build the elements of a webpage within the site itself.

This past week I attended a two-hour Meetup learning session in San Francisco called Startup Saturdays – IT Training and Entrepreneurship Academy, where we went over the basics of HTML5, CSS, and JavaScript. I highly recommend going to your local Meetup, as every major city will have an HTML group as well as a group dedicated to learning JavaScript. I signed up for the San Diego local groups, and there are quite a few offerings. The reason to learn this set of languages is that HTML tells the browser what is being displayed, CSS makes it pretty, and JavaScript makes it interactive and alive; jQuery offers a community-built library for common JavaScript tasks. So the next time you have a great idea, don’t go running to find someone to build it for you. At the very least, understanding the basics of how to build your own site will allow you to speak intelligently with experienced web developers, in the same way that knowing what an alternator is helps when speaking with a car mechanic. Chances are it is something you can do yourself, or it takes a lot less time than originally estimated.

PostGIS: A Robust GIS Solution for PostgreSQL Data

This week my development improvement focused on PostGIS, which is a way to conduct spatial analysis using SQL on data stored in a PostgreSQL database. The advantages of PostGIS over desktop GIS applications, whether proprietary software such as Esri ArcMap or the open-source Quantum GIS, are the ability to work at scale and native relational database support. The normal order of operations for spatial analysis is to first retrieve the data using SQL, export it as a CSV, import it into third-party software, geocode the addresses, and then start the spatial analysis. With PostGIS, every step from start to finish, even the query itself, happens inside the database. Desktop GIS software stores spatial data in a flat file called a shapefile. This is great for a single user, but it does not support multiple users or applications calling the spatial data at the same time. This is where PostgreSQL and PostGIS shine, as they allow for concurrent user access using an open-source solution. The key differences between this second-generation GIS solution and the first generation are explained well in this PostGIS introduction. In summary, PostGIS offers tremendous automation, speed, and consistency for spatial analysis on large data sets in PostgreSQL.

To learn how to begin, I went where all the forums pointed: Boundless’ Introduction to PostGIS. I completed the first few examples using the OpenGeo Suite, and there is a lot to learn. However, if you are already familiar with SQL and with spatial analysis concepts such as projections, spatial joins, and buffers, then the real learning curve is a change in approach. Rather than seeing spatial analysis as a single step in a chain of data preparation steps, one integrates all those steps into a SQL-based approach that can be automated using Python functions. A standard database has data types such as string, numeric, and date/timestamp; PostgreSQL, a community-driven relational database, is also extensible, so additional data types such as spatial entries are allowed. I believe that PostGIS is well suited to dynamic spatial data stored in a relational database, while desktop GIS software is better suited to one-off projects where the data sources are already in shapefile format and users are not accessing the spatial data concurrently.
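To make the “everything in one query” idea concrete, here is a hedged sketch of the kind of PostGIS SQL I mean: a filter, a spatial join, and a distance test all expressed in a single statement. The table and column names (homes, stations, geom) are hypothetical, and executing it requires a live PostgreSQL database with the PostGIS extension installed (for example through a driver such as psycopg2); here the query is simply built and printed.

```python
# Hypothetical query: find every home within 1 km of a station,
# with the distance computed by PostGIS rather than desktop software.
# ST_DWithin and ST_Distance are standard PostGIS functions; the
# distance unit depends on the geometry's spatial reference system.
query = """
SELECT h.home_id,
       s.station_id,
       ST_Distance(h.geom, s.geom) AS dist_m
FROM   homes h
JOIN   stations s
  ON   ST_DWithin(h.geom, s.geom, 1000)  -- within 1,000 units (e.g. meters)
ORDER  BY dist_m;
"""
print(query.strip())
```

Because the filter, join, and measurement all live in SQL, the whole pipeline can be re-run or scheduled without ever exporting a CSV, which is exactly the change in approach described above.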