<!DOCTYPE html> <html xmlns:epub="http://www.idpf.org/2007/ops" > <head> <title>Next-Gen Coders: Lists</title> <link rel="stylesheet" type="text/css" href="epub.css"/> <script src="thebe_js/jquery-2.1.4.min.js" type="text/javascript"></script> <script src="thebe_js/main-built.js" type="text/javascript"></script> <script type="text/javascript"> jQuery.noConflict(); (function( $ ) { $(function(){ var thebe = new Thebe({ url: "http://104.130.0.7/" }); }); })(jQuery); </script> </head> <body data-type="book"> <section data-type="chapter"> <p>A <em>Pivot Table</em> is a related operation which is commonly seen in spreadsheets and other programs which operate on tabular data. The Pivot Table takes simple column-wise data as input, and groups the entries into a two-dimensional table which provides a multi-dimensional summarization of the data. The difference between Pivot Tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a <strong>multi-dimensional</strong> version of GroupBy aggregation. 
That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.</p> <section data-type="sect1"> <h2>Motivating Pivot Tables</h2> <p>For the examples in this section, we’ll use the database of passengers on the Titanic, available through the <code>seaborn</code> library:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
</pre> <pre data-code-language="python" data-executable="true" data-type="programlisting">
titanic.head()
</pre> <p>This contains a wealth of information on each passenger of that ill-fated voyage, including their gender, age, class, fare paid, and much more.</p> </section> <section data-type="sect1"> <h2>Pivot Tables By Hand</h2> <p>To start learning more about this data, we might want to group it by gender, survival, or some combination thereof. If you have read the previous section, you might be tempted to apply a GroupBy operation to this data. For example, let’s look at survival rate by gender:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
titanic.groupby('sex')[['survived']].mean()
</pre> <p>This immediately gives us some insight: overall, three of every four females on board survived, while only one in five males survived!</p> <p>This is an interesting insight, but we might like to go one step deeper and look at survival by both sex and, say, class. Using the vocabulary of GroupBy, we might proceed something like this: We <em>group by</em> class and gender, <em>select</em> survival, <em>apply</em> a mean aggregate, <em>combine</em> the resulting groups, and then <em>unstack</em> the hierarchical index to reveal the hidden multidimensionality.
In code:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting"> titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack() </pre> <p>This gives us a better idea of how both gender and class affected survival, but the code is starting to look a bit garbled. While each step of this pipeline makes sense in light of the tools we’ve previously discussed, the long string of code is not particularly easy to read or use. This type of operation is common enough that Pandas includes a convenience routine, <code>pivot_table</code>, which succinctly handles this type of multi-dimensional aggregation.</p> </section> <section data-type="sect1"> <h2>Pivot Table Syntax</h2> <p>Here is the equivalent to the above operation using the <code>pivot_table</code> method of dataframes:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting"> titanic.pivot_table('survived', index='sex', columns='class') </pre> <p>This is eminently more readable than the equivalent GroupBy operation, and produces the same result. As you might expect of an early 20th century transatlantic cruise, the survival gradient favors both women and higher classes. First-class women survived with near certainty (hi Kate!), while only one in ten third-class men survived (sorry Leo!).</p> <section data-type="sect2"> <h2>Multi-level Pivot Tables</h2> <p>Just as in the GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we might be interested in looking at age as a third dimension. 
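</p> <p>Before applying this to the Titanic data, it may help to see <code>pd.cut</code> on its own; a minimal sketch with made-up ages:</p>

```python
import pandas as pd

# pd.cut assigns each value to a half-open interval bin (lower, upper]
ages = pd.Series([6, 25, 70])
binned = pd.cut(ages, [0, 18, 80])
print(binned)  # each entry is labeled (0, 18] or (18, 80]
```

<p>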
We’ll bin the age using the <code>pd.cut</code> function:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
</pre> <p>We can play the same game with the columns; let’s add info on the fare paid using <code>pd.qcut</code> to automatically compute quantiles:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
</pre> <p>The result is a four-dimensional aggregation, shown in a grid which demonstrates the relationship between the values.</p> </section> <section data-type="sect2"> <h2>Additional Pivot Table Options</h2> <p>The full call signature of the <code>pivot_table</code> method of DataFrames is as follows:</p> <pre data-code-language="python" data-type="programlisting">DataFrame.pivot_table(values=None, index=None, columns=None,
                      aggfunc='mean', fill_value=None, margins=False, dropna=True)
</pre> <p>Above we’ve seen examples of the first three arguments; here we’ll take a quick look at the remaining arguments. Two of the options, <code>fill_value</code> and <code>dropna</code>, have to do with missing data and are fairly straightforward; we will not show examples of them here.</p> <p>The <code>aggfunc</code> keyword controls what type of aggregation is applied, which is a mean by default. As in the GroupBy, the aggregation specification can be a string representing one of several common choices (e.g. <code>'sum'</code>, <code>'mean'</code>, <code>'count'</code>, <code>'min'</code>, <code>'max'</code>, etc.) or a function which implements an aggregation (e.g. <code>np.sum</code>, <code>min</code>, <code>sum</code>, etc.).
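</p> <p>For instance, on a small made-up table (the data here is illustrative, not the Titanic set), a string alias and the equivalent function give the same result:</p>

```python
import numpy as np
import pandas as pd

# a tiny illustrative table (hypothetical data)
df = pd.DataFrame({'sex': ['f', 'f', 'm', 'm'],
                   'class': ['First', 'Second', 'First', 'Second'],
                   'fare': [80.0, 20.0, 60.0, 10.0]})

# the string alias 'sum' and the function np.sum produce the same table
by_string = df.pivot_table('fare', index='sex', columns='class', aggfunc='sum')
by_func = df.pivot_table('fare', index='sex', columns='class', aggfunc=np.sum)
print(by_string.equals(by_func))
```

<p>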
Additionally, it can be specified as a dictionary mapping a column to any of the desired options above:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': sum, 'fare': 'mean'})
</pre> <p>Notice also here that we’ve omitted the <code>values</code> keyword; when specifying a mapping for <code>aggfunc</code>, this is determined automatically.</p> <p>At times it’s useful to compute totals along each grouping. This can be done via the <code>margins</code> keyword:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
</pre> <p>Here this automatically gives us information about the class-agnostic survival rate by gender, the gender-agnostic survival rate by class, and the overall survival rate of 38%.</p> </section> </section> <section data-type="sect1"> <h2>Example: Birthrate Data</h2> <p>As a more interesting example, let’s take a look at the freely-available data on births in the USA, provided by the Centers for Disease Control (CDC).
This data can be found at <a href="https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv">https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv</a>. This dataset has been analyzed rather extensively by Andrew Gelman and his group; see for example <a href="http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/">this blog post</a>.</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
# shell command to download the data:
!curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('births.csv')
</pre> <p>Taking a look at the data, we see that it’s relatively simple: it contains the number of births grouped by date and gender:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
births.head()
</pre> <p>We can start to understand this data a bit more by using a pivot table. Let’s add a decade column, and take a look at male and female births as a function of decade:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
</pre> <p>We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use Pandas’ built-in plotting tools to visualize the total number of births by year (see Chapter X.X for a discussion of plotting with matplotlib):</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
%matplotlib inline
import matplotlib.pyplot as plt
sns.set()  # use seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');
</pre> <p>With a simple pivot table and <code>plot()</code> method, we can immediately see the annual trend in births by gender.
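</p> <p>The size of that gap can also be computed directly from the pivot table rather than read off the plot; here is a sketch of the computation on a tiny made-up frame (the real figures require the downloaded CSV):</p>

```python
import pandas as pd

# hypothetical birth counts standing in for the CDC data
births = pd.DataFrame({'year': [1970, 1970, 1971, 1971],
                       'gender': ['F', 'M', 'F', 'M'],
                       'births': [1000, 1050, 2000, 2100]})

pt = births.pivot_table('births', index='year', columns='gender', aggfunc='sum')
ratio = pt['M'] / pt['F']   # male-to-female ratio for each year
print(ratio.mean())         # 1.05 for this made-up data
```

<p>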
By eye, we find that over the past 50 years male births have outnumbered female births by around 5%.</p> <section data-type="sect2"> <h2>Further Data Exploration</h2> <p>Though this doesn’t necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g. June 31st) or missing values (e.g. June 99th). One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
# Some data is mis-reported; e.g. June 31st, etc.
# remove these outliers via robust sigma-clipping
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
# 0.7413 times the interquartile range approximates the
# standard deviation of a Gaussian distribution
sig = 0.7413 * (quartiles[2] - quartiles[0])
births = births.query('(births &gt; @mu - 5 * @sig) &amp; (births &lt; @mu + 5 * @sig)')
</pre> <p>Next we set the <code>day</code> column to integers; previously it had been a string because some entries in the column contained the value <code>'null'</code>:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
# set 'day' column to integer; it originally was a string due to nulls
births['day'] = births['day'].astype(int)
</pre> <p>Finally, we can combine the day, month, and year to create a Date index (see section X.X).
This allows us to quickly compute the weekday corresponding to each row:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
# create a datetime index from the year, month, day
births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')
births['dayofweek'] = births.index.dayofweek
</pre> <p>Using this we can plot births by weekday for several decades:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
import matplotlib.pyplot as plt
import matplotlib as mpl

births.pivot_table('births', index='dayofweek',
                   columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
</pre> <p>Apparently births are slightly less common on weekends than on weekdays! Note that the 1990s and 2000s are missing because, starting in 1989, the CDC reports only the month of birth.</p> <p>Another interesting view is to plot the mean number of births by the day of the <em>year</em>. We can do this by constructing a datetime array for a particular year, making sure to choose a leap year so as to account for February 29th.</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
# Choose a leap year to display births by date
from datetime import datetime
dates = [datetime(2012, month, day)
         for (month, day) in zip(births['month'], births['day'])]
</pre> <p>We can now group the data by day of year and plot the results.
We’ll additionally annotate the plot with the location of several US holidays:</p> <pre data-code-language="python" data-executable="true" data-type="programlisting">
# Plot the results
fig, ax = plt.subplots(figsize=(8, 6))
births.pivot_table('births', dates).plot(ax=ax)

# Label the plot
ax.text('2012-1-1', 3950, "New Year's Day")
ax.text('2012-7-4', 4250, "Independence Day", ha='center')
ax.text('2012-9-4', 4850, "Labor Day", ha='center')
ax.text('2012-10-31', 4600, "Halloween", ha='right')
ax.text('2012-11-25', 4450, "Thanksgiving", ha='center')
ax.text('2012-12-25', 3800, "Christmas", ha='right')
ax.set(title='USA births by day of year (1969-1988)',
       ylabel='average daily births',
       xlim=('2011-12-20', '2013-1-10'),
       ylim=(3700, 5400));

# Format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));
</pre> <p>The lower birthrate on holidays is striking, but is likely the result of selection for scheduled/induced births rather than any deep psychosomatic causes. For more discussion on this trend, see the discussion and links in <a href="http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/">Andrew Gelman’s blog posts</a> on the subject.</p> <p>This short example should give you a good idea of how many of the Pandas tools we’ve seen to this point can be put together and used to gain insight from a variety of datasets. We will see some more sophisticated analysis of this data, and other datasets like it, in future sections!</p> </section> </section> </section> </body></html>