The challenge for displaying maps online is the how. How are we going to place a map online for viewers to interact with the data quickly, snapily, and do it at low cost. While boxed solutions exist such as Tableau and ESRI Arcserver both have their respective Pros and Cons.
Tableau: Tableau costs money, however it is a great piece of software to easily display spatial data from a database such as Access, Excel, or a .dbf. I have noticed that tableau maps are not as snappy as Google Fusion Maps. For example the ability to zoom in and click on a zip code within a state map is nowhere near as easy or natural as a Google Fusion table. So Tableau is preferable if you have the budget and don’t really need to zoom in or interact within the map itself.
If you are like me you typically use R to handle all your in house data analysis (you may also use other programs like STATA, SPSS, etc..). I will attempt to make the case that Python and IPython, which stands for interactive Python, can fill those roles of data exploration, charting, and analysis. What the heck is IPython? Let me demonstrate
In Python we usually have this pattern of write, compile, run. So if we wrote a long script we would have to run the entire script each time, making it difficult to debug and laborious to explore line by line our data. IPython, short for Interactive Python, allows you to run Python code line by line. If you are doing data analysis and are familiar with software like R I highly recommend IPython because it allows you to quickly debug something as simple as writing the correct file path for your .csv file.
We all have, or think we have, really great ideas that hit us in the shower. That revolutionary customer facing web application that could change the industry. So what do we do next? We go out and find someone to build it for us. Err. Wrong. We outsource our project to someone who is well trained. Well technically yes this might work very well and affordably, however what happens when you want to make an addition to the web app. What happens when you are waiting days to weeks for this person to e-mail you back. This is the conundrum that would be entrepreneurs and problem solvers face when trying to go from idea to fully functioning online solution.
This week my development improvement focused on PostGIS which is a way to conduct spatial analysis using SQL on data stored in a PostgreSQL database. The advantage of using PostGIS over desktop GIS applications including proprietary software from ESRI Arcmap or open source Quantum GIS is the ability to work at scale and native relational database support. The normal order of operations for spatial analysis is to first retrieve the data using SQL, export as a csv, import into third party software, geocode addresses, and start spatial analysis. With PostGIS every step from start to finish, even the query itself, is built into PostGIS. Desktop GIS software stores spatial data as a flat file called a shapefile. This is great for a single user to access however it does no support multiple users or applications calling this spatial data at the same time. This is where PostgreSQL and PostGIS shine as they allow for concurrent user access using a commercial opern source solution. The key differences between this second and first generation GIS solution is explained well in this PostGIS introduction. In summary PostGIS offers tremendous automation, speed, and consistency in spatial analysis on a large data set in PostgreSQL.
For learning how to begin I went where all the forums pointed to, Boundless: Introduction to PostGIS. I completed the first few examples using the Open Geo Suite and there is a lot to learn. However if you are already familiar with SQL and spatial analysis concepts such as projections, spatial joins and buffers then the real learning curve is a change in approach. Rather than seeing spatial analysis as a single step in a chain of data preparation steps one is now integrating all those steps into a SQL based approach and can be automated using Python functions. A standard database has data types of string, numeric, and Date/Time Stamp. PostgreSQL is a community driven relational database and is also extendable so that additional datatypes such as spatial based entries are allowed. I believe that PostGIS is well suited for dealing with dynamic spatial data stored in a relational database while desktop GIS software is better suited for one off projects where the data sources are already in shapefile format and users are not concurrently accessing this spatial data.
SQL or structured query language is a useful relation-based structure that allows one to go query the raw data and then work analysis magic on that data. I decided to brush up on my SQL skills and completed the exercises on Learn SQL The Hard Way. While I was brushing up I decided to also learn a new text editing style / editor called “VI” which shares much in common with the text editor “VIM“. Vi is built into all unix-based machines so that alone makes it very useful. Also compared to other editors such as Nano that are built into the console Vi integrates commands that allows one to write code more efficiently. Specifically Vi allows ones finger to never leave the keyboard when writing code. The advantage may not seem apparent to those used to using a keyboard and mouse together to write code, but to those who code more frequently Vi/Vim offer a lot of benefit in time and ease.
For those who have heard of SQL pronounced (“Sequel”) but aren’t sure of the core of this language. The core four functions are rather basic and if you have used excel you can use this analogy. Everything consists of tables. So a sequel database is much like your excel workbook. Each tab in the workbook is a table in the database. Rows and Columns in excel are Rows and Columns in SQL. There are four core methods in SQL called “CRUD”:
Create – add data into tables
Read – read data from tables
Update – update data in tables
Delete – delete data from tables
This week I have focused my energy on learning machine learning tools using the Python package Scipy. I purchased the book Building Machine Learning Systems with Python and have complete about half of the exercises. The take away message is that there are two kinds of statistical models. Heuristic models and Machine Learning models.
Here is an example I made up demonstrating the difference:
We own an ice cream shop and want to predict the best days, locations, and time to sell ice cream. So we make a heuristic statistical model that takes into account past sales using locations we sold, temperature, and time of sale. We model these parameters against # Ice Cream Cones sold which is our y or dependent variable. At the end of our model we find out that higher temperatures, afternoon hours, and areas of high traffic are all positively correlated with # Ice Cream Cones sold. On the plus side we have a deeper understanding of how the variables affect the outcome (ice cones sold), however on the negative side we trained our statistical model on all our past sales and we may have over-fit our model. This means our model is not very robust and if we are selling ice cream on an especially hot and busy day our model will not properly estimate the # ice creams sold because the majority of days that our model was trained on fell on moderately temperate days with moderate foot traffic. This is bad because if we know ahead of time the warmer than usual conditions and foot-traffic we want to bring enough gallons of ice cream to meet demand properly.
This is where Machine Learning comes in handy. Machine Learning uses the hold-out method which sets aside part of our past ice cream transactions and does not train our model on these ice cream sales. We then create a plethora of ice cream sale models and test each model’s accuracy on the creamy transactions that we left out of our model training. You see most models predict accurately on data points that the model was trained off of. But when it comes time to test on new data these models are really being put to the test. By holding out ice cream sales and testing it on the model we are simulating the act of introducing new data that hasn’t been trained on before. We can also choose the model that performs best on this untrained data. So if we have an unusually warm and busy day we can implement our machine learning algorithm to more accurately predict ice cream sales that day and come prepared with X amount of gallons of ice cream. Delicious I know!
I have been curious if their exists data analysis packages within python that mimic the functions one encounters and uses everyday in excel such as pivot tables, vlookups, index, graphing, etc… The reason I am interested in the possibility of automating these functions is that with larger data sets with tens of thousands of fields these functions begin to slow down. I was also curious if tasks that are performed everyday in excel can be done faster in python. To further my data automation skills I am reading the book “Python for Data Analysis” by Wes McKinney.
In my previous post I mentioned that I am creating a solar energy bill calculator using Green Button data. Green Button data consists of 15 minute or hourly energy use data of a residence. In order to upload this data into a form that one can manipulate I used both the Pandas and Numpy extension (No animals were harmed in this data analysis). Numpy stands for Numerical Python and Pandas is the mnemonic for Python Data Analysis Library. Once the user uploads their energy use data I transform their Green Button XML data into a python data frame using Pandas. I then re-sample or rather up-sample the data into monthly kWh data so that I can then calculate their monthly electricity bill.
What excites me about Numpy and Pandas is that I can take the fundamentals of data analysis that I have learned so far using excel and automate everything. Often there is a trade off between investing time to automate a feature and just implementing the feature in excel. I think this is the real skill set that I am developing knowing when it is more efficient to automate in Python or simply implement in excel. I think a good rule of thumb is that as the data set becomes larger (hundreds of thousands of rows) then Python becomes a more powerful tool to manipulate the data set.
This week I have been busy making the skeleton of my Green Button app. The idea is that a user can enter in their Green Button data which contains their 15 minute energy usage profile along with their address, heating source, and desired solar panel specs to see how much they could be saving by switching to solar. Once the user submits their Green Button data NRELs PVWatts application calculates the predicted energy produced onsite using the address and this is subtracted from the energy logged in the Green Button data. Depending on the utility provider of the user’s energy savings is estimated based on the tariff structure of that utility.
I created the framework in Django along with the file submission and reading functionality. The front-end is a chart using the Highcharts API. I threw in some mock data to make sure that the charts were working and I also have the file submission working properly now. The majority of the work left is to create forms to enter a user’s address and heating source. There is more work needed on the calculations and PVWatts integration. In the past I have worked on projects like creating video games where I started with code (back-end) and delayed the animations and visuals (front-end) until later. I now know better and am working on both the front-end and back-end simultaneously as they often have to conform with one another, rather than one of these concepts being an afterthought.
This week I attempted to make my own application centered around energy and green button data. However before even working on that app I ran into a speed bump. The default database that comes with django is sqlite3 and while this suffices for tutorials and pre-production apps it does not fair as well for production applications, especially for hosting on popular virtual servers like Heroku. For my purposes I wanted to start with the right database even during test production because migrating databases is a challenge I would rather actively prevent. The popular database for web applications is postgresql because of the many advanced features that come included. I decided to go with postgresql as my SQL database of choice because it is recommended as a best practice from the book Two Scoops of Django.
I had many options for hosting my database and hosting this test production app, I could use a one year free trial of Amazon EC2 to host the application and scale appropriately for more users later on. I could push my application to Heroku and use their 5 mb of free storage. Or I could self-host the application and database through my own home-built server. Because I had a raspberry pi that I was playing with I decided to make this miniature $35 computer my home test server. This small computer might not be able to handle tens of thousands of users an hour but it has been known to be able to handle a few thousand users an hour. Because I have an 8gb sd card in this little computer and the energy cost to run it all year comes out to about $7 this actually makes economic sense for test production. It took a bit of time to sort out which database to use for django, setup my raspberry pi server, and understand the appropriate services to host one’s application but I am now ready to get back to coding my app!
Here are the tutorials I used on setting up a postgresql database using Ubuntu, and settting up my raspberry pi server!
This week I have learned some of the best practices for coding including using a virtual environment for my python packages and using Git as a way to keep a log of my codes throughout projects. The benefit of running on a virtual environment is that the many packages that Python offers are constantly being updated and one may update their packages and not realize that a prior program they made is no longer compatible with the new package. This is where virtual environments come in handy, they store older versions of packages so that your work environment can be frozen in time for that given project while you can start a new environment for your next project that requires up to date packages. The VirtualEnv package can be found here and it is a must for running any applications on a server.
The second tool that I am learning is Git and the tool called GitHub. I am still in the beginning phases of realizing the power of git but so far from playing around with it it offers the potential to take snapshots of your project as you move along so that if something breaks you can tell git to move backwards to any point in time you desire. This is one example of the uses of git. GitHub is a repository for code and collaboration and many of the Python tutorials that I have done have code stored on GitHub.
I completed a very simple machine learning task on Kaggle using the Random Forest Algorithm from SciPy. This expired competition is called Predicting a Biological Response and I scored in at about #358 out of the 700 submissions. I also completed most of the videos as well as various exercises for the Andrew Ng led Machine Learning course.
To further my Django application development skills I finished the book Instant Django 1.5 Application Development and all of its exercises. This was a great applied way of building web frameworks quickly. Solely reading about web development from a book is not effective for me with regards to retaining material and this books excels in being a feet first applied approach.