<html xmlns:epub="http://www.idpf.org/2007/ops" xmlns="http://www.w3.org/1999/xhtml"><head><title>Visualizing Data</title><link rel="stylesheet" type="text/css" href="epub.css"/></head><body data-type="book"><section data-type="chapter" epub:type="chapter" data-pdf-bookmark="Chapter 2. Visualizing Data"><div class="chapter" id="visualizing_data">
<h1><span class="label">Chapter 2. </span>Visualizing Data</h1>
<blockquote data-type="epigraph" epub:type="epigraph">
<p>I believe that visualization is one of the most powerful means of achieving personal goals.</p>
<p data-type="attribution">Harvey Mackay</p>
</blockquote>
<p>A fundamental part of the data scientist’s toolkit is data visualization. Although it is very easy to create visualizations, it’s much harder to produce <em>good</em> ones.<a data-type="indexterm" data-primary="visualizing data" data-see="data visualization" id="idp8985440"/><a data-type="indexterm" data-primary="data visualization" id="ix_dataviz"/></p>
<p>There are two primary uses for data visualization:</p>
<ul>
<li>
<p>To <em>explore</em> data</p>
</li>
<li>
<p>To <em>communicate</em> data</p>
</li>
</ul>
<p>In this chapter, we will concentrate on building the skills that you’ll need to start exploring your own data and to produce the visualizations we’ll be using throughout the rest of the book. Like most of our chapter topics, data visualization is a rich field of study that deserves its own book. Nonetheless, we’ll try to give you a sense of what makes for a good visualization and what doesn’t.</p>
<section data-type="sect1" data-pdf-bookmark="matplotlib"><div class="sect1" id="idp8991872">
<h1>matplotlib</h1>
<p>A wide variety of tools exists for visualizing data.<a data-type="indexterm" data-primary="matplotlib" id="idp8993440"/><a data-type="indexterm" data-primary="data visualization" data-secondary="matplotlib" id="idp8994144"/> We will be using the <a href="http://matplotlib.org/"><code>matplotlib</code> library</a>, which is widely used (although sort of showing its age). If you are interested in producing elaborate interactive visualizations for the Web, it is likely not the right choice, but for simple bar charts, line charts, and scatterplots, it works pretty well.</p>
<p>In particular, we will be using the <code>matplotlib.pyplot</code> module. In its simplest use, <code>pyplot</code> maintains an internal state in which you build up a visualization step by step. Once you’re done, you can save it (with <code>savefig()</code>) or display it (with <code>show()</code>).<a data-type="indexterm" data-primary="line charts" data-secondary="creating with matplotlib" id="idp8998720"/></p>
<p>For example, making a basic plot (like <a data-type="xref" href="#simple_line_chart">Figure 2-1</a>) takes only a few lines:</p>
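<pre data-type="programlisting" data-code-language="py" class="data-executable-true"><code class="c"># notebook-only magic; omit the next line when running as a plain Python script</code>
<code class="o">%</code><code class="n">matplotlib</code> <code class="n">inline</code>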
<code class="kn">from</code> <code class="nn">matplotlib</code> <code class="kn">import</code> <code class="n">pyplot</code> <code class="k">as</code> <code class="n">plt</code>
<code class="n">years</code> <code class="o">=</code> <code class="p">[</code><code class="mi">1950</code><code class="p">,</code> <code class="mi">1960</code><code class="p">,</code> <code class="mi">1970</code><code class="p">,</code> <code class="mi">1980</code><code class="p">,</code> <code class="mi">1990</code><code class="p">,</code> <code class="mi">2000</code><code class="p">,</code> <code class="mi">2010</code><code class="p">]</code>
<code class="n">gdp</code> <code class="o">=</code> <code class="p">[</code><code class="mf">300.2</code><code class="p">,</code> <code class="mf">543.3</code><code class="p">,</code> <code class="mf">1075.9</code><code class="p">,</code> <code class="mf">2862.5</code><code class="p">,</code> <code class="mf">5979.6</code><code class="p">,</code> <code class="mf">10289.7</code><code class="p">,</code> <code class="mf">14958.3</code><code class="p">]</code>
<code class="c"># create a line chart, years on x-axis, gdp on y-axis</code>
<code class="n">plt</code><code class="o">.</code><code class="n">plot</code><code class="p">(</code><code class="n">years</code><code class="p">,</code> <code class="n">gdp</code><code class="p">,</code> <code class="n">color</code><code class="o">=</code><code class="s">'green'</code><code class="p">,</code> <code class="n">marker</code><code class="o">=</code><code class="s">'o'</code><code class="p">,</code> <code class="n">linestyle</code><code class="o">=</code><code class="s">'solid'</code><code class="p">)</code>
<code class="c"># add a title</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"Nominal GDP"</code><code class="p">)</code>
<code class="c"># add a label to the y-axis</code>
<code class="n">plt</code><code class="o">.</code><code class="n">ylabel</code><code class="p">(</code><code class="s">"Billions of $"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="simple_line_chart" class="figure">
<img src="assets/dsfs_0301.png" alt="A simple line chart."/>
<h6><span class="label">Figure 2-1. </span>A simple line chart</h6>
</div></figure>
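<p>The listing above ends with <code>show()</code>. As a minimal sketch of the other option, <code>savefig()</code>, the same chart can be written to an image file instead. The <code>Agg</code> backend, the temporary directory, and the file name <code>nominal_gdp.png</code> are our own assumptions for illustration, not part of the original example.</p>

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: a non-interactive backend, so no window is needed
from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# build the chart step by step in pyplot's internal state
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of $")

# hypothetical output location; any writable path would do
out_path = os.path.join(tempfile.gettempdir(), "nominal_gdp.png")
plt.savefig(out_path)  # write the current figure to disk instead of displaying it
plt.close()            # discard the current figure so the next chart starts clean
```

<p>Calling <code>plt.close()</code> afterward matters in a script, because otherwise the next <code>plt</code> calls would keep drawing onto the same figure.</p>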
<p>Making publication-quality plots is more complicated and beyond the scope of this chapter. There are many ways you can customize your charts with (for example) axis labels, line styles, and point markers. Rather than attempt a comprehensive treatment of these options, we’ll just use (and call attention to) some of them in our examples.</p>
<div data-type="note" epub:type="note">
<p>Although we won’t be using much of this functionality,
<code>matplotlib</code> is capable of producing complicated plots within plots,
sophisticated formatting, and interactive visualizations.
Check out its documentation if you want to go deeper than we do in this book.</p>
</div>
</div></section>
<section data-type="sect1" data-pdf-bookmark="Bar Charts"><div class="sect1" id="idp9222080">
<h1>Bar Charts</h1>
<p>A bar chart is<a data-type="indexterm" data-primary="bar charts" id="ix_barchart"/> a good choice when you want to show how some quantity varies among some <em>discrete</em> set of items.<a data-type="indexterm" data-primary="data visualization" data-secondary="bar charts" id="ix_datavisbarchart"/> For instance, <a data-type="xref" href="#simple_bar_chart">Figure 2-2</a> shows how many Academy Awards were won by each of a variety of movies:</p>
<pre data-type="programlisting" data-code-language="py" class="data-executable-true"><code class="n">movies</code> <code class="o">=</code> <code class="p">[</code><code class="s">"Annie Hall"</code><code class="p">,</code> <code class="s">"Ben-Hur"</code><code class="p">,</code> <code class="s">"Casablanca"</code><code class="p">,</code> <code class="s">"Gandhi"</code><code class="p">,</code> <code class="s">"West Side Story"</code><code class="p">]</code>
<code class="n">num_oscars</code> <code class="o">=</code> <code class="p">[</code><code class="mi">5</code><code class="p">,</code> <code class="mi">11</code><code class="p">,</code> <code class="mi">3</code><code class="p">,</code> <code class="mi">8</code><code class="p">,</code> <code class="mi">10</code><code class="p">]</code>
<code class="c"># bars are by default width 0.8, so we'll add 0.1 to the left coordinates</code>
<code class="c"># so that each bar is centered</code>
<code class="n">xs</code> <code class="o">=</code> <code class="p">[</code><code class="n">i</code> <code class="o">+</code> <code class="mf">0.1</code> <code class="k">for</code> <code class="n">i</code><code class="p">,</code> <code class="n">_</code> <code class="ow">in</code> <code class="nb">enumerate</code><code class="p">(</code><code class="n">movies</code><code class="p">)]</code>
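<code class="c"># plot bars with left x-coordinates [xs], heights [num_oscars]</code>
<code class="c"># (align='edge' is needed on matplotlib 2.0+, which centers bars by default)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">bar</code><code class="p">(</code><code class="n">xs</code><code class="p">,</code> <code class="n">num_oscars</code><code class="p">,</code> <code class="n">align</code><code class="o">=</code><code class="s">'edge'</code><code class="p">)</code>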
<code class="n">plt</code><code class="o">.</code><code class="n">ylabel</code><code class="p">(</code><code class="s">"# of Academy Awards"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"My Favorite Movies"</code><code class="p">)</code>
<code class="c"># label x-axis with movie names at bar centers</code>
<code class="n">plt</code><code class="o">.</code><code class="n">xticks</code><code class="p">([</code><code class="n">i</code> <code class="o">+</code> <code class="mf">0.5</code> <code class="k">for</code> <code class="n">i</code><code class="p">,</code> <code class="n">_</code> <code class="ow">in</code> <code class="nb">enumerate</code><code class="p">(</code><code class="n">movies</code><code class="p">)],</code> <code class="n">movies</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="simple_bar_chart" class="figure">
<img src="assets/dsfs_0302.png" alt="A simple bar chart."/>
<h6><span class="label">Figure 2-2. </span>A simple bar chart</h6>
</div></figure>
<p>A bar chart can also be a good choice for plotting histograms<a data-type="indexterm" data-primary="histograms" data-secondary="plotting using bar charts" id="idp9320144"/> of bucketed numeric values, in order to visually explore how the values are <em>distributed</em>, as in <a data-type="xref" href="#bar_chart_histogram">Figure 2-3</a>:</p>
<pre data-type="programlisting" data-code-language="py" class="data-executable-true"><code class="kn">from</code> <code class="nn">collections</code> <code class="kn">import</code> <code class="n">Counter</code>
<code class="n">grades</code> <code class="o">=</code> <code class="p">[</code><code class="mi">83</code><code class="p">,</code><code class="mi">95</code><code class="p">,</code><code class="mi">91</code><code class="p">,</code><code class="mi">87</code><code class="p">,</code><code class="mi">70</code><code class="p">,</code><code class="mi">0</code><code class="p">,</code><code class="mi">85</code><code class="p">,</code><code class="mi">82</code><code class="p">,</code><code class="mi">100</code><code class="p">,</code><code class="mi">67</code><code class="p">,</code><code class="mi">73</code><code class="p">,</code><code class="mi">77</code><code class="p">,</code><code class="mi">0</code><code class="p">]</code>
<code class="n">decile</code> <code class="o">=</code> <code class="k">lambda</code> <code class="n">grade</code><code class="p">:</code> <code class="n">grade</code> <code class="o">//</code> <code class="mi">10</code> <code class="o">*</code> <code class="mi">10</code>
<code class="n">histogram</code> <code class="o">=</code> <code class="n">Counter</code><code class="p">(</code><code class="n">decile</code><code class="p">(</code><code class="n">grade</code><code class="p">)</code> <code class="k">for</code> <code class="n">grade</code> <code class="ow">in</code> <code class="n">grades</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">bar</code><code class="p">([</code><code class="n">x</code> <code class="o">-</code> <code class="mi">4</code> <code class="k">for</code> <code class="n">x</code> <code class="ow">in</code> <code class="n">histogram</code><code class="o">.</code><code class="n">keys</code><code class="p">()],</code> <code class="c"># shift each bar to the left by 4</code>
        <code class="n">histogram</code><code class="o">.</code><code class="n">values</code><code class="p">(),</code> <code class="c"># give each bar its correct height</code>
        <code class="mi">8</code><code class="p">,</code> <code class="c"># give each bar a width of 8</code>
        <code class="n">align</code><code class="o">=</code><code class="s">'edge'</code><code class="p">)</code> <code class="c"># x-coordinates are left edges (matplotlib 2.0+ centers bars by default)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">axis</code><code class="p">([</code><code class="o">-</code><code class="mi">5</code><code class="p">,</code> <code class="mi">105</code><code class="p">,</code> <code class="mi">0</code><code class="p">,</code> <code class="mi">5</code><code class="p">])</code> <code class="c"># x-axis from -5 to 105,</code>
<code class="c"># y-axis from 0 to 5</code>
<code class="n">plt</code><code class="o">.</code><code class="n">xticks</code><code class="p">([</code><code class="mi">10</code> <code class="o">*</code> <code class="n">i</code> <code class="k">for</code> <code class="n">i</code> <code class="ow">in</code> <code class="nb">range</code><code class="p">(</code><code class="mi">11</code><code class="p">)])</code> <code class="c"># x-axis labels at 0, 10, ..., 100</code>
<code class="n">plt</code><code class="o">.</code><code class="n">xlabel</code><code class="p">(</code><code class="s">"Decile"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">ylabel</code><code class="p">(</code><code class="s">"# of Students"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"Distribution of Exam 1 Grades"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="bar_chart_histogram" class="figure">
<img src="assets/dsfs_0303.png" alt="A bar chart histogram."/>
<h6><span class="label">Figure 2-3. </span>Using a bar chart for a histogram</h6>
</div></figure>
<p>The third argument to <code>plt.bar</code> specifies the bar width. Here we chose a width of 8 (which leaves a small gap between bars, since our buckets have width 10). We also shifted each bar’s left edge to 4 below its bucket value, so that (for example) the “80” bar has its left and right sides at 76 and 84, and (hence) its center at 80.</p>
<p>The call to <code>plt.axis</code> indicates that we want the x-axis to range from -5 to 105 (so that the “0” and “100” bars are fully shown), and that the y-axis should range from 0 to 5. And the call to <code>plt.xticks</code> puts x-axis labels at 0, 10, 20, …, 100.</p>
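<p>To check the arithmetic in the last two paragraphs, here is a small sketch that skips the plotting and just prints each bucket’s count and bar edges. The helper <code>bar_edges</code> and the constants <code>BAR_WIDTH</code> and <code>SHIFT</code> are hypothetical names we introduce only for this illustration.</p>

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

# bucket each grade by decile: 83 -> 80, 95 -> 90, 100 -> 100, ...
histogram = Counter(grade // 10 * 10 for grade in grades)

BAR_WIDTH = 8  # width passed to plt.bar
SHIFT = 4      # how far left of the bucket value each bar starts

def bar_edges(bucket):
    """Return the (left, right) x-coordinates of the bar drawn for a bucket."""
    left = bucket - SHIFT
    return left, left + BAR_WIDTH

for bucket in sorted(histogram):
    print(bucket, histogram[bucket], bar_edges(bucket))
# 0 2 (-4, 4)
# 60 1 (56, 64)
# 70 3 (66, 74)
# 80 4 (76, 84)
# 90 2 (86, 94)
# 100 1 (96, 104)
```

<p>The outermost edges are -4 and 104, and the tallest bar has height 4, which is why an x-range of -5 to 105 and a y-range of 0 to 5 show every bar in full.</p>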
<p>Be judicious when using <code>plt.axis()</code>. When creating bar charts it is considered especially bad form for your y-axis not to start at 0, since this is an easy way to mislead people (<a data-type="xref" href="#misleading_y_axis">Figure 2-4</a>):</p>
<pre data-type="programlisting" data-code-language="py"><code class="n">mentions</code> <code class="o">=</code> <code class="p">[</code><code class="mi">500</code><code class="p">,</code> <code class="mi">505</code><code class="p">]</code>
<code class="n">years</code> <code class="o">=</code> <code class="p">[</code><code class="mi">2013</code><code class="p">,</code> <code class="mi">2014</code><code class="p">]</code>
<code class="n">plt</code><code class="o">.</code><code class="n">bar</code><code class="p">([</code><code class="mf">2012.6</code><code class="p">,</code> <code class="mf">2013.6</code><code class="p">],</code> <code class="n">mentions</code><code class="p">,</code> <code class="mf">0.8</code><code class="p">,</code> <code class="n">align</code><code class="o">=</code><code class="s">'edge'</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">xticks</code><code class="p">(</code><code class="n">years</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">ylabel</code><code class="p">(</code><code class="s">"# of times I heard someone say 'data science'"</code><code class="p">)</code>
<code class="c"># if you don't do this, matplotlib will label the x-axis 0, 1</code>
<code class="c"># and then add a +2.013e3 off in the corner (bad matplotlib!)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">ticklabel_format</code><code class="p">(</code><code class="n">useOffset</code><code class="o">=</code><code class="bp">False</code><code class="p">)</code>
<code class="c"># misleading y-axis only shows the part above 500</code>
<code class="n">plt</code><code class="o">.</code><code class="n">axis</code><code class="p">([</code><code class="mf">2012.5</code><code class="p">,</code><code class="mf">2014.5</code><code class="p">,</code><code class="mi">499</code><code class="p">,</code><code class="mi">506</code><code class="p">])</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"Look at the 'Huge' Increase!"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="misleading_y_axis" class="figure">
<img src="assets/dsfs_0304.png" alt="Misleading y-axis."/>
<h6><span class="label">Figure 2-4. </span>A chart with a misleading y-axis</h6>
</div></figure>
<p>In <a data-type="xref" href="#non_misleading_y_axis">Figure 2-5</a>, we use more-sensible axes, and it looks far less impressive:</p>
<pre data-type="programlisting" data-code-language="py"><code class="n">plt</code><code class="o">.</code><code class="n">axis</code><code class="p">([</code><code class="mf">2012.5</code><code class="p">,</code><code class="mf">2014.5</code><code class="p">,</code><code class="mi">0</code><code class="p">,</code><code class="mi">550</code><code class="p">])</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"Not So Huge Anymore"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="non_misleading_y_axis" class="figure">
<img src="assets/dsfs_0305.png" alt="Non-misleading y-axis."/>
<h6><span class="label">Figure 2-5. </span>The same chart with a nonmisleading y-axis</h6>
</div></figure>
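<p>One rough way to see how much the truncated axis exaggerates is to compare the ratio of the bar heights as drawn with the ratio of the underlying values. The function <code>apparent_ratio</code> below is a hypothetical helper for this illustration, not anything <code>matplotlib</code> provides.</p>

```python
mentions = [500, 505]

def apparent_ratio(values, y_min):
    """Ratio of the visible bar heights when the y-axis starts at y_min."""
    first, second = values
    return (second - y_min) / (first - y_min)

true_ratio = mentions[1] / mentions[0]      # how much the data actually grew
truncated = apparent_ratio(mentions, 499)   # y-axis starting at 499
honest = apparent_ratio(mentions, 0)        # y-axis starting at 0

print(true_ratio)  # 1.01
print(truncated)   # 6.0
print(honest)      # 1.01
```

<p>With the y-axis starting at 499, the second bar is drawn six times as tall as the first, even though the count grew by only 1%.</p>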
</div></section>
<section data-type="sect1" data-pdf-bookmark="Line Charts"><div class="sect1" id="idp9222704">
<h1>Line Charts</h1>
<p><a data-type="indexterm" data-primary="data visualization" data-startref="ix_datavisbarchart" id="idp9581888"/><a data-type="indexterm" data-primary="bar charts" data-startref="ix_barchart" id="idp9583088"/>As we saw already, we can make line charts using <code>plt.plot()</code>. These are a good choice for showing <em>trends</em>, as <a data-type="indexterm" data-primary="trends, showing with line charts" id="idp9601216"/><a data-type="indexterm" data-primary="data visualization" data-secondary="line charts" id="idp9601920"/><a data-type="indexterm" data-primary="line charts" data-secondary="showing trends" id="idp9602864"/>illustrated in <a data-type="xref" href="#several_line_charts">Figure 2-6</a>:</p>
<pre data-type="programlisting" data-code-language="py"><code class="n">variance</code> <code class="o">=</code> <code class="p">[</code><code class="mi">1</code><code class="p">,</code> <code class="mi">2</code><code class="p">,</code> <code class="mi">4</code><code class="p">,</code> <code class="mi">8</code><code class="p">,</code> <code class="mi">16</code><code class="p">,</code> <code class="mi">32</code><code class="p">,</code> <code class="mi">64</code><code class="p">,</code> <code class="mi">128</code><code class="p">,</code> <code class="mi">256</code><code class="p">]</code>
<code class="n">bias_squared</code> <code class="o">=</code> <code class="p">[</code><code class="mi">256</code><code class="p">,</code> <code class="mi">128</code><code class="p">,</code> <code class="mi">64</code><code class="p">,</code> <code class="mi">32</code><code class="p">,</code> <code class="mi">16</code><code class="p">,</code> <code class="mi">8</code><code class="p">,</code> <code class="mi">4</code><code class="p">,</code> <code class="mi">2</code><code class="p">,</code> <code class="mi">1</code><code class="p">]</code>
<code class="n">total_error</code> <code class="o">=</code> <code class="p">[</code><code class="n">x</code> <code class="o">+</code> <code class="n">y</code> <code class="k">for</code> <code class="n">x</code><code class="p">,</code> <code class="n">y</code> <code class="ow">in</code> <code class="nb">zip</code><code class="p">(</code><code class="n">variance</code><code class="p">,</code> <code class="n">bias_squared</code><code class="p">)]</code>
<code class="n">xs</code> <code class="o">=</code> <code class="p">[</code><code class="n">i</code> <code class="k">for</code> <code class="n">i</code><code class="p">,</code> <code class="n">_</code> <code class="ow">in</code> <code class="nb">enumerate</code><code class="p">(</code><code class="n">variance</code><code class="p">)]</code>
<code class="c"># we can make multiple calls to plt.plot</code>
<code class="c"># to show multiple series on the same chart</code>
<code class="n">plt</code><code class="o">.</code><code class="n">plot</code><code class="p">(</code><code class="n">xs</code><code class="p">,</code> <code class="n">variance</code><code class="p">,</code> <code class="s">'g-'</code><code class="p">,</code> <code class="n">label</code><code class="o">=</code><code class="s">'variance'</code><code class="p">)</code> <code class="c"># green solid line</code>
<code class="n">plt</code><code class="o">.</code><code class="n">plot</code><code class="p">(</code><code class="n">xs</code><code class="p">,</code> <code class="n">bias_squared</code><code class="p">,</code> <code class="s">'r-.'</code><code class="p">,</code> <code class="n">label</code><code class="o">=</code><code class="s">'bias^2'</code><code class="p">)</code> <code class="c"># red dot-dashed line</code>
<code class="n">plt</code><code class="o">.</code><code class="n">plot</code><code class="p">(</code><code class="n">xs</code><code class="p">,</code> <code class="n">total_error</code><code class="p">,</code> <code class="s">'b:'</code><code class="p">,</code> <code class="n">label</code><code class="o">=</code><code class="s">'total error'</code><code class="p">)</code> <code class="c"># blue dotted line</code>
<code class="c"># because we've assigned labels to each series</code>
<code class="c"># we can get a legend for free</code>
<code class="c"># loc=9 means "top center"</code>
<code class="n">plt</code><code class="o">.</code><code class="n">legend</code><code class="p">(</code><code class="n">loc</code><code class="o">=</code><code class="mi">9</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">xlabel</code><code class="p">(</code><code class="s">"model complexity"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"The Bias-Variance Tradeoff"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="several_line_charts" class="figure">
<img src="assets/dsfs_0306.png" alt="Several line charts with a legend."/>
<h6><span class="label">Figure 2-6. </span>Several line charts with a legend</h6>
</div></figure>
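<p>The chart suggests that total error bottoms out in the middle of the complexity range. As a quick sanity check on that reading (a sketch on the same toy numbers, with <code>best_x</code> as our own illustrative name), we can find the minimum directly:</p>

```python
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]

# index (model complexity) at which the total error is smallest
best_x = min(range(len(total_error)), key=lambda i: total_error[i])

print(total_error)                  # [257, 130, 68, 40, 32, 40, 68, 130, 257]
print(best_x, total_error[best_x])  # 4 32
```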
</div></section>
<section data-type="sect1" data-pdf-bookmark="Scatterplots"><div class="sect1" id="idp9684624">
<h1>Scatterplots</h1>
<p>A scatterplot is the right choice for visualizing the relationship between two paired sets of data.<a data-type="indexterm" data-primary="data visualization" data-secondary="scatterplots" id="ix_datavisscatter"/><a data-type="indexterm" data-primary="scatterplots" id="ix_scatterplt"/> For example, <a data-type="xref" href="#friends_and_minutes">Figure 2-7</a> illustrates the relationship between the number of friends
your users have and the number of minutes they spend on the site every day:</p>
<pre data-type="programlisting" data-code-language="py"><code class="n">friends</code> <code class="o">=</code> <code class="p">[</code> <code class="mi">70</code><code class="p">,</code> <code class="mi">65</code><code class="p">,</code> <code class="mi">72</code><code class="p">,</code> <code class="mi">63</code><code class="p">,</code> <code class="mi">71</code><code class="p">,</code> <code class="mi">64</code><code class="p">,</code> <code class="mi">60</code><code class="p">,</code> <code class="mi">64</code><code class="p">,</code> <code class="mi">67</code><code class="p">]</code>
<code class="n">minutes</code> <code class="o">=</code> <code class="p">[</code><code class="mi">175</code><code class="p">,</code> <code class="mi">170</code><code class="p">,</code> <code class="mi">205</code><code class="p">,</code> <code class="mi">120</code><code class="p">,</code> <code class="mi">220</code><code class="p">,</code> <code class="mi">130</code><code class="p">,</code> <code class="mi">105</code><code class="p">,</code> <code class="mi">145</code><code class="p">,</code> <code class="mi">190</code><code class="p">]</code>
<code class="n">labels</code> <code class="o">=</code> <code class="p">[</code><code class="s">'a'</code><code class="p">,</code> <code class="s">'b'</code><code class="p">,</code> <code class="s">'c'</code><code class="p">,</code> <code class="s">'d'</code><code class="p">,</code> <code class="s">'e'</code><code class="p">,</code> <code class="s">'f'</code><code class="p">,</code> <code class="s">'g'</code><code class="p">,</code> <code class="s">'h'</code><code class="p">,</code> <code class="s">'i'</code><code class="p">]</code>
<code class="n">plt</code><code class="o">.</code><code class="n">scatter</code><code class="p">(</code><code class="n">friends</code><code class="p">,</code> <code class="n">minutes</code><code class="p">)</code>
<code class="c"># label each point</code>
<code class="k">for</code> <code class="n">label</code><code class="p">,</code> <code class="n">friend_count</code><code class="p">,</code> <code class="n">minute_count</code> <code class="ow">in</code> <code class="nb">zip</code><code class="p">(</code><code class="n">labels</code><code class="p">,</code> <code class="n">friends</code><code class="p">,</code> <code class="n">minutes</code><code class="p">):</code>
    <code class="n">plt</code><code class="o">.</code><code class="n">annotate</code><code class="p">(</code><code class="n">label</code><code class="p">,</code>
                 <code class="n">xy</code><code class="o">=</code><code class="p">(</code><code class="n">friend_count</code><code class="p">,</code> <code class="n">minute_count</code><code class="p">),</code> <code class="c"># put the label with its point</code>
                 <code class="n">xytext</code><code class="o">=</code><code class="p">(</code><code class="mi">5</code><code class="p">,</code> <code class="o">-</code><code class="mi">5</code><code class="p">),</code> <code class="c"># but slightly offset</code>
                 <code class="n">textcoords</code><code class="o">=</code><code class="s">'offset points'</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"Daily Minutes vs. Number of Friends"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">xlabel</code><code class="p">(</code><code class="s">"# of friends"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">ylabel</code><code class="p">(</code><code class="s">"daily minutes spent on the site"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="friends_and_minutes" class="figure">
<img src="assets/dsfs_0307.png" alt="A scatterplot of friends and time on the site."/>
<h6><span class="label">Figure 2-7. </span>A scatterplot of friends and time on the site</h6>
</div></figure>
<p>When you plot two comparable variables against each other,
letting <code>matplotlib</code> choose the scale
can give a misleading picture, as in <a data-type="xref" href="#scatterplot_incomparable_axes">Figure 2-8</a>:</p>
<pre data-type="programlisting" data-code-language="py"><code class="n">test_1_grades</code> <code class="o">=</code> <code class="p">[</code> <code class="mi">99</code><code class="p">,</code> <code class="mi">90</code><code class="p">,</code> <code class="mi">85</code><code class="p">,</code> <code class="mi">97</code><code class="p">,</code> <code class="mi">80</code><code class="p">]</code>
<code class="n">test_2_grades</code> <code class="o">=</code> <code class="p">[</code><code class="mi">100</code><code class="p">,</code> <code class="mi">85</code><code class="p">,</code> <code class="mi">60</code><code class="p">,</code> <code class="mi">90</code><code class="p">,</code> <code class="mi">70</code><code class="p">]</code>
<code class="n">plt</code><code class="o">.</code><code class="n">scatter</code><code class="p">(</code><code class="n">test_1_grades</code><code class="p">,</code> <code class="n">test_2_grades</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">title</code><code class="p">(</code><code class="s">"Axes Aren't Comparable"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">xlabel</code><code class="p">(</code><code class="s">"test 1 grade"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">ylabel</code><code class="p">(</code><code class="s">"test 2 grade"</code><code class="p">)</code>
<code class="n">plt</code><code class="o">.</code><code class="n">show</code><code class="p">()</code></pre>
<figure><div id="scatterplot_incomparable_axes" class="figure">
<img src="assets/dsfs_0308.png" alt="A scatterplot with uncomparable axes."/>
<h6><span class="label">Figure 2-8. </span>A scatterplot with uncomparable axes</h6>
</div></figure>
<p>If we include a call to <code>plt.axis("equal")</code>, the plot (<a data-type="xref" href="#scatterplot_equal_axes">Figure 2-9</a>) more accurately shows that most of the variation occurs on test 2.</p>
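<p>A minimal sketch of that version of the chart might look like the following. The title string, the <code>Agg</code> backend, and the output path are our assumptions (in an interactive session you would call <code>plt.show()</code> instead of saving), and the two <code>spread</code> variables are only there to confirm the claim about the variation numerically.</p>

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for running as a script
from matplotlib import pyplot as plt

test_1_grades = [ 99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]

# range of each test's grades, to compare how much each one varies
spread_1 = max(test_1_grades) - min(test_1_grades)  # 19
spread_2 = max(test_2_grades) - min(test_2_grades)  # 40

plt.scatter(test_1_grades, test_2_grades)
plt.title("Axes Are Comparable")  # hypothetical title for this variant
plt.axis("equal")                 # one unit on x takes the same length as one unit on y
plt.xlabel("test 1 grade")
plt.ylabel("test 2 grade")

# hypothetical output location; use plt.show() instead when working interactively
out_path = os.path.join(tempfile.gettempdir(), "equal_axes_scatter.png")
plt.savefig(out_path)
plt.close()
```

<p>Test 2 grades span 40 points while test 1 grades span only 19, which is the difference the equal-axes plot makes visible.</p>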
<p>That’s enough to get you started doing visualization.
We’ll learn much more about visualization throughout the book.<a data-type="indexterm" data-primary="scatterplots" data-startref="ix_scatterplt" id="idp10267632"/><a data-type="indexterm" data-primary="visualizing data" data-secondary="scatterplots" data-startref="ix_datavisscatter" id="idp10268608"/></p>
<figure><div id="scatterplot_equal_axes" class="figure">
<img src="assets/dsfs_0309.png" alt="A scatterplot with equal axes."/>
<h6><span class="label">Figure 2-9. </span>The same scatterplot with equal axes</h6>
</div></figure>
</div></section>
<section data-type="sect1" data-pdf-bookmark="For Further Exploration"><div class="sect1" id="idp9860576">
<h1>For Further Exploration</h1>
<ul>
<li>
<p><a href="http://stanford.io/1ycOjdI">seaborn</a> is built on top of <code>matplotlib</code> and allows you to easily produce prettier (and more complex) visualizations.</p>
</li>
<li>
<p><a href="http://d3js.org">D3.js</a> is a JavaScript library
for producing sophisticated interactive visualizations for the web.
Although it is not in Python, it is both trendy and widely used,
and it is well worth your while to be familiar with it.</p>
</li>
<li>
<p><a href="http://bokeh.pydata.org">Bokeh</a> is a newer library
that brings D3-style visualizations into Python.</p>
</li>
<li>
<p><a href="http://bit.ly/1ycOk1u">ggplot</a> is a Python port
of the popular R library <code>ggplot2</code>,
which is widely used for creating “publication quality” charts and graphics. It’s probably most interesting if you’re already an avid <code>ggplot2</code> user, and possibly a little opaque if you’re not.<a data-type="indexterm" data-primary="data visualization" data-startref="ix_dataviz" id="idp9927184"/></p>
</li>
</ul>
</div></section>
</div></section></body></html>