I have been curious whether there are data analysis packages in Python that mimic the functions one uses every day in Excel, such as pivot tables, VLOOKUP, INDEX, graphing, etc. I am interested in automating these functions because Excel begins to slow down on larger data sets with tens of thousands of fields. I was also curious whether tasks that are performed every day in Excel can be done faster in Python. To further my data automation skills I am reading the book “Python for Data Analysis” by Wes McKinney.
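To give a flavor of what this looks like, here is a minimal sketch of a pivot table and a VLOOKUP-style lookup in pandas. The table contents and column names (`meter_id`, `month`, `kwh`, `rate_plan`) are made up for illustration and are not from my actual project.

```python
import pandas as pd

# Hypothetical monthly usage records, one row per meter per month.
usage = pd.DataFrame({
    "meter_id": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "kwh": [310.0, 280.0, 450.0, 420.0],
})

# Hypothetical lookup table, like the range you would point VLOOKUP at.
rates = pd.DataFrame({
    "meter_id": ["A", "B"],
    "rate_plan": ["flat", "tiered"],
})

# Pivot table: meters down the rows, months across the columns, summed kWh.
pivot = usage.pivot_table(index="meter_id", columns="month",
                          values="kwh", aggfunc="sum")

# VLOOKUP equivalent: a left merge pulls the matching rate_plan onto each row.
looked_up = usage.merge(rates, on="meter_id", how="left")

print(pivot)
print(looked_up)
```

A left merge keeps every row of `usage` even when there is no match in `rates`, which is roughly how VLOOKUP behaves when it returns an error for a missing key, except that pandas fills in a missing value instead.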
In my previous post I mentioned that I am creating a solar energy bill calculator using Green Button data. Green Button data consists of a residence's energy use recorded at 15-minute or hourly intervals. To load this data into a form that I can manipulate, I used both the pandas and NumPy libraries (no animals were harmed in this data analysis). NumPy stands for Numerical Python, and pandas is short for Python Data Analysis Library (the name also derives from "panel data"). Once the user uploads their energy use data, I transform their Green Button XML data into a pandas DataFrame. I then resample, or rather down-sample, the interval data into monthly kWh totals so that I can calculate their monthly electricity bill.
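The resampling step can be sketched roughly as follows. This is not my actual calculator code: it assumes the Green Button XML has already been parsed into a series of kWh readings indexed by timestamp, it uses synthetic readings in place of real data, and the flat rate of $0.15 per kWh is a made-up number.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for parsed Green Button data:
# one reading every 15 minutes for January and February 2023.
index = pd.date_range("2023-01-01", "2023-02-28 23:45", freq="15min")
interval_kwh = pd.Series(np.full(len(index), 0.25), index=index, name="kwh")

# Down-sample to monthly totals; "MS" labels each bin by the month's first day.
monthly_kwh = interval_kwh.resample("MS").sum()

# Hypothetical flat rate in dollars per kWh; real tariffs are usually tiered.
flat_rate = 0.15
monthly_bill = monthly_kwh * flat_rate

print(monthly_kwh)
print(monthly_bill)
```

With a constant 0.25 kWh per interval, January works out to 31 days × 96 intervals × 0.25 = 744 kWh, which is a handy sanity check that the resampling sums rather than averages.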
What excites me about NumPy and pandas is that I can take the fundamentals of data analysis that I have learned so far in Excel and automate everything. Often there is a trade-off between investing time to automate a feature and just implementing the feature in Excel. I think this is the real skill I am developing: knowing when it is more efficient to automate in Python and when to simply implement in Excel. A good rule of thumb, I think, is that as the data set grows larger (hundreds of thousands of rows), Python becomes the more powerful tool for manipulating it.