<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html> <html lang="en" xmlns:epub="http://www.idpf.org/2007/ops" xmlns="http://www.w3.org/1999/xhtml"> <head> <meta charset="utf-8"/> <title>Data Science from Scratch - A Crash Course in Python</title> <script type="text/javascript" src="http://code.jquery.com/jquery-2.1.4.min.js"></script> <script type="text/javascript" src="http://rawgit.com/oreillymedia/thebe/master/static/main-built.js"></script> <script type="text/javascript"> $(function(){ new Thebe({url:"https://oreillyorchard.com:8000/", selector:"pre[data-code-language='py']"}); }); </script> </head> <body data-type="book"> <section data-type="chapter" id="python"> <h1>A Crash Course in Python</h1> <blockquote data-type="epigraph"> <p>People are still crazy about Python after twenty-five years, which I find hard to believe.</p> <p data-type="attribution">Michael Palin</p> </blockquote> <p>All new employees <a data-type="indexterm" id="ix_Python" data-primary="Python"></a>at DataSciencester are required to go through new employee orientation, the most interesting part of which is a crash course in Python.</p> <p>This is not a comprehensive Python tutorial but instead is intended to highlight the parts of the language that will be most important to us (some of which are often not the focus of Python tutorials).</p> <section data-type="sect1" id="the-basics-wQ2He"> <h1>The Basics</h1> <section data-type="sect2" id="getting-python-w31ue"> <h2>Getting Python</h2> <p>You can download Python from <a href="https://www.python.org/">python.org</a>. But if you don’t already have Python, I recommend instead installing the <a href="https://store.continuum.io/cshop/anaconda/">Anaconda</a> distribution, <a data-type="indexterm" data-primary="Anaconda distribution of Python" id="id-vjRCb"></a>which already includes most of the libraries that you need to do data science.</p> <p>As I write this, the latest version of Python is 3.4. 
At DataSciencester, however, we use old, reliable Python 2.7. Python 3 is not backward-compatible with Python 2, and many important libraries only work well with 2.7. The data science community is still firmly stuck on 2.7, which means we will be, too. Make sure to get that version.</p> <p>If you don’t get Anaconda, make sure to install <a href="https://pypi.python.org/pypi/pip">pip</a>, which is a Python package manager <a data-type="indexterm" data-primary="pip (Python package manager)" id="id-vJ9hy"></a>that allows you to easily install third-party packages (some of which we’ll need). <a data-type="indexterm" data-primary="IPython" id="id-vYRIB"></a> It’s also worth getting <a href="http://ipython.org/">IPython</a>, which is a much nicer Python shell to work with.</p> <p>(If you installed Anaconda then it should have come with pip and IPython.)</p> <p>Just run:</p> <pre data-type="programlisting" data-code-language="dosbatch">pip install ipython</pre> <p>and then search the Internet for solutions to whatever cryptic error messages that causes.</p> </section> <section data-type="sect2" id="the-zen-of-python-x5jin"> <h2>The Zen of Python</h2> <p>Python has a somewhat Zen <a href="http://legacy.python.org/dev/peps/pep-0020/">description of its design principles</a>, which you can also find inside the Python interpreter itself by typing import this.</p> <p>One of the most discussed of these is:</p> <blockquote> <p>There should be one—and preferably only one—obvious way to do it.</p></blockquote> <p>Code written in accordance with this "obvious" way (which may not be obvious at all to a newcomer) is often described as "Pythonic." 
Although this is not a book about Python, we will occasionally contrast Pythonic and non-Pythonic ways of accomplishing the same things, and we will generally favor Pythonic solutions to our problems.</p> </section> <section data-type="sect2" id="whitespace-formatting-wGpIM"> <h2>Whitespace Formatting</h2> <p>Many languages use curly braces to delimit blocks of code. <a data-type="indexterm" data-primary="Python" data-secondary="whitespace formatting" id="id-wMzTB"></a><a data-type="indexterm" data-primary="whitespace in Python code" id="id-v7Dul"></a> Python uses indentation:</p>
<pre data-type="programlisting" data-code-language="py">for i in [1, 2, 3, 4, 5]:
    print i                     # first line in "for i" block
    for j in [1, 2, 3, 4, 5]:
        print j                 # first line in "for j" block
        print i + j             # last line in "for j" block
    print i                     # last line in "for i" block
print "done looping"</pre>
<p>This makes Python code very readable, but it also means that you have to be very careful with your formatting. Whitespace is ignored inside parentheses and brackets, which can be helpful for long-winded computations:</p>
<pre data-type="programlisting" data-code-language="py">long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)</pre>
<p>and for making code easier to read:</p>
<pre data-type="programlisting" data-code-language="py">list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

easier_to_read_list_of_lists = [ [1, 2, 3],
                                 [4, 5, 6],
                                 [7, 8, 9] ]</pre>
<p>You can also use a backslash to indicate that a statement continues onto the next line, although we’ll rarely do this:</p>
<pre data-type="programlisting" data-code-language="py">two_plus_three = 2 + \
                 3</pre>
<p>One consequence of whitespace formatting is that it can be hard to copy and paste code into the Python shell. 
For example, if you tried to paste the code:</p>
<pre data-type="programlisting" data-code-language="py">for i in [1, 2, 3, 4, 5]:

    # notice the blank line
    print i</pre>
<p>into the ordinary Python shell, you would get an:</p> <pre data-type="programlisting">IndentationError: expected an indented block</pre> <p>because the interpreter thinks the blank line signals the end of the for loop’s block.</p> <p>IPython has a magic function %paste, which correctly pastes whatever is on your clipboard, whitespace and all. This alone is a good reason to use IPython.</p> </section> <section data-type="sect2" id="modules-Q02fe"> <h2>Modules</h2> <p>Certain features of Python are not loaded by default. <a data-type="indexterm" data-primary="modules (Python)" id="id-L36fg"></a> These include both features that ship as part of the language and third-party features that you download yourself. In order to use these features, you’ll need to import the modules that contain them.</p> <p>One approach is to simply import the module itself:</p>
<pre data-type="programlisting" data-code-language="py">import re
my_regex = re.compile("[0-9]+", re.I)</pre>
<p>Here re is the module containing functions and constants for working with regular expressions. After this type of import, you can access those functions only by prefixing them with the module name, as in re.compile.</p> <p>If you already had a different re in your code you could use an alias:</p>
<pre data-type="programlisting" data-code-language="py">import re as regex
my_regex = regex.compile("[0-9]+", regex.I)</pre>
<p>You might also do this if your module has an unwieldy name or if you’re going to be typing it a lot. 
For example, when visualizing data with matplotlib, a standard convention is:</p> <pre data-type="programlisting" data-code-language="py">import matplotlib.pyplot as plt</pre> <p>If you need a few specific values from a module, you can import them explicitly and use them without qualification:</p>
<pre data-type="programlisting" data-code-language="py">from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()</pre>
<p>If you were a bad person, you could import the entire contents of a module into your namespace, which might inadvertently overwrite variables you’ve already defined:</p>
<pre data-type="programlisting" data-code-language="py">match = 10
from re import *    # uh oh, re has a match function
print match         # "&lt;function re.match&gt;"</pre>
<p>However, since you are not a bad person, you won’t ever do this.</p> </section> <section data-type="sect2" id="arithmetic-kNMir"> <h2>Arithmetic</h2> <p>Python 2.7 uses integer division by default,<a data-type="indexterm" data-primary="Python" data-secondary="arithmetic" id="id-3g9C3"></a><a data-type="indexterm" data-primary="arithmetic" data-secondary="in Python" id="id-VLXsa"></a> so that 5 / 2 equals 2. Almost always this is not what we want, so we will always start our files with:</p> <pre data-type="programlisting" data-code-language="py">from __future__ import division</pre> <p>after which 5 / 2 equals 2.5. Every code example in this book uses this new-style division. 
In the handful of cases where we need integer division, we can get it with a double slash: 5 // 2.</p> </section> <section data-type="sect2" id="functions-o9ofd"> <h2>Functions</h2> <p>A function is a rule for taking zero or more inputs and returning a corresponding output.<a data-type="indexterm" data-primary="functions (Python)" id="id-09dSe"></a><a data-type="indexterm" data-primary="Python" data-secondary="functions" id="id-4dPTg"></a> In Python, we typically define functions using def:</p>
<pre data-type="programlisting" data-code-language="py">def double(x):
    """this is where you put an optional docstring
    that explains what the function does.
    for example, this function multiplies its input by 2"""
    return x * 2</pre>
<p>Python functions are <em>first-class</em>, which means that we can assign them to variables and pass them into functions just like any other arguments:</p>
<pre data-type="programlisting" data-code-language="py">def apply_to_one(f):
    """calls the function f with 1 as its argument"""
    return f(1)

my_double = double          # refers to the previously defined function
x = apply_to_one(my_double) # equals 2</pre>
<p>It is also easy to create short anonymous functions, or lambdas:</p>
<pre data-type="programlisting" data-code-language="py">y = apply_to_one(lambda x: x + 4)      # equals 5</pre>
<p>You can assign lambdas to variables, although most people will tell you that you should just use def instead:</p>
<pre data-type="programlisting" data-code-language="py">another_double = lambda x: 2 * x       # don't do this
def another_double(x): return 2 * x    # do this instead</pre>
<p>Function parameters can also be given default arguments, which only need to be specified when you want a value other than the default:</p>
<pre data-type="programlisting" data-code-language="py">def my_print(message="my default message"):
    print message

my_print("hello")   # prints 'hello'
my_print()          # prints 'my default message'</pre>
<p>It is sometimes useful to specify arguments by name:</p> 
<pre data-type="programlisting" data-code-language="py">def subtract(a=0, b=0):
    return a - b

subtract(10, 5) # returns 5
subtract(0, 5)  # returns -5
subtract(b=5)   # same as previous</pre>
<p>We will be creating many, many functions.</p> </section> <section data-type="sect2" id="strings-ygBTV"> <h2>Strings</h2> <p>Strings can be delimited by single<a data-type="indexterm" data-primary="Python" data-secondary="strings" id="id-2gQiP"></a><a data-type="indexterm" data-primary="strings (in Python)" id="id-P0mCY"></a> or double quotation marks (but the quotes have to match):</p>
<pre data-type="programlisting" data-code-language="py">single_quoted_string = 'data science'
double_quoted_string = "data science"</pre>
<p>Python uses backslashes to encode special characters. For example:</p>
<pre data-type="programlisting" data-code-language="py">tab_string = "\t"       # represents the tab character
len(tab_string)         # is 1</pre>
<p>If you want backslashes as backslashes (which you might in Windows directory names or in regular expressions), you can create <em>raw</em> strings using r"":</p>
<pre data-type="programlisting" data-code-language="py">not_tab_string = r"\t"  # represents the characters '\' and 't'
len(not_tab_string)     # is 2</pre>
<p>You can create multiline strings using triple quotes (three double quotation marks, or three single ones):</p>
<pre data-type="programlisting" data-code-language="py">multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""</pre>
</section> <section data-type="sect2" id="exceptions-bN7t7"> <h2>Exceptions</h2> <p>When something goes wrong, Python raises an <em>exception</em>. <a data-type="indexterm" data-primary="Python" data-secondary="exceptions" id="id-D40IV"></a><a data-type="indexterm" data-primary="exceptions in Python" id="id-Ly9Ug"></a> Unhandled, these will cause your program to crash. 
You can handle them using try and except:</p>
<pre data-type="programlisting" data-code-language="py">try:
    print 0 / 0
except ZeroDivisionError:
    print "cannot divide by zero"</pre>
<p>Although in many languages exceptions are considered bad, in Python there is no shame in using them to make your code cleaner, and we will occasionally do so.</p> </section> <section data-type="sect2" id="lists-R4kuZ"> <h2>Lists</h2> <p>Probably the most fundamental data structure in Python is the list.<a data-type="indexterm" data-primary="Python" data-secondary="lists" id="id-ldeFd"></a><a data-type="indexterm" data-primary="lists (in Python)" id="id-9LGtn"></a> A list is simply an ordered collection. (It is similar to what in other languages might be called an array, but with some added functionality.)</p>
<pre data-type="programlisting" data-code-language="py">integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]

list_length = len(integer_list)     # equals 3
list_sum    = sum(integer_list)     # equals 6</pre>
<p>You can get or set the <em>n</em>th element of a list<a data-type="indexterm" data-primary="square brackets ([]), working with lists in Python" id="id-yg6hX"></a> with square brackets:</p>
<pre data-type="programlisting" data-code-language="py">x = range(10)   # is the list [0, 1, ..., 9]
zero = x[0]     # equals 0, lists are 0-indexed
one = x[1]      # equals 1
nine = x[-1]    # equals 9, 'Pythonic' for last element
eight = x[-2]   # equals 8, 'Pythonic' for next-to-last element
x[0] = -1       # now x is [-1, 1, 2, 3, ..., 9]</pre>
<p>You can also use square brackets to "slice" lists:</p>
<pre data-type="programlisting" data-code-language="py">first_three = x[:3]                 # [-1, 1, 2]
three_to_end = x[3:]                # [3, 4, ..., 9]
one_to_four = x[1:5]                # [1, 2, 3, 4]
last_three = x[-3:]                 # [7, 8, 9]
without_first_and_last = x[1:-1]    # [1, 2, ..., 8]
copy_of_x = x[:]                    # [-1, 1, 2, ..., 9]</pre>
<p>Python has an in operator<a data-type="indexterm" data-primary="in operator (Python)" id="id-34VT3"></a> to check for list membership:</p>
<pre data-type="programlisting" data-code-language="py">1 in [1, 2, 3]    # True
0 in [1, 2, 3]    # False</pre>
<p>This check involves examining the elements of the list one at a time, which means that you probably shouldn’t use it unless you know your list is pretty small (or unless you don’t care how long the check takes).</p> <p>It is easy to concatenate lists together:</p>
<pre data-type="programlisting" data-code-language="py">x = [1, 2, 3]
x.extend([4, 5, 6])     # x is now [1,2,3,4,5,6]</pre>
<p>If you don’t want to modify x you can use list addition:</p>
<pre data-type="programlisting" data-code-language="py">x = [1, 2, 3]
y = x + [4, 5, 6]       # y is [1, 2, 3, 4, 5, 6]; x is unchanged</pre>
<p>More frequently we will append to lists one item at a time:</p>
<pre data-type="programlisting" data-code-language="py">x = [1, 2, 3]
x.append(0)      # x is now [1, 2, 3, 0]
y = x[-1]        # equals 0
z = len(x)       # equals 4</pre>
<p>It is often convenient to <em>unpack</em> lists if you know how many elements they contain:</p>
<pre data-type="programlisting" data-code-language="py">x, y = [1, 2]    # now x is 1, y is 2</pre>
<p>although you will get a ValueError if you don’t have the same number of elements on both sides.</p> <p>It’s common to use an underscore for a value you’re going to throw away:</p>
<pre data-type="programlisting" data-code-language="py">_, y = [1, 2]    # now y == 2, didn't care about the first element</pre>
</section> <section data-type="sect2" id="tuples-jnBf2"> <h2>Tuples</h2> <p>Tuples are lists' immutable cousins.<a data-type="indexterm" data-primary="tuples (Python)" id="id-Kkbud"></a><a data-type="indexterm" data-primary="Python" data-secondary="tuples" id="id-NNWi6"></a> Pretty much anything you can do to a list that doesn’t involve modifying it, you can do to a tuple. 
You specify a tuple by using parentheses (or nothing) instead of square brackets:</p>
<pre data-type="programlisting" data-code-language="py">my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3      # my_list is now [1, 3]

try:
    my_tuple[1] = 3
except TypeError:
    print "cannot modify a tuple"</pre>
<p>Tuples are a convenient way to return multiple values from functions:</p>
<pre data-type="programlisting" data-code-language="py">def sum_and_product(x, y):
    return (x + y),(x * y)

sp = sum_and_product(2, 3)    # equals (5, 6)
s, p = sum_and_product(5, 10) # s is 15, p is 50</pre>
<p>Tuples (and lists) can also<a data-type="indexterm" data-primary="multiple assignment (Python)" id="id-GNjFN"></a><a data-type="indexterm" data-primary="assignment, multiple, in Python" id="id-2ELtP"></a> be used for <em>multiple assignment</em>:</p>
<pre data-type="programlisting" data-code-language="py">x, y = 1, 2     # now x is 1, y is 2
x, y = y, x     # Pythonic way to swap variables; now x is 2, y is 1</pre>
</section> <section data-type="sect2" id="dictionaries-VeViZ"> <h2>Dictionaries</h2> <p>Another fundamental data structure is a dictionary, which<a data-type="indexterm" id="ix_Pythondict" data-primary="Python" data-secondary="dictionaries"></a><a data-type="indexterm" data-primary="dictionaries (Python)" id="id-jnrSb"></a> associates <em>values</em> with <em>keys</em> and allows you to quickly <a data-type="indexterm" data-primary="key/value pairs (in Python dictionaries)" id="id-9WGhn"></a>retrieve the value corresponding to a given key:</p>
<pre data-type="programlisting" data-code-language="py">empty_dict = {}                         # Pythonic
empty_dict2 = dict()                    # less Pythonic
grades = { "Joel" : 80, "Tim" : 95 }    # dictionary literal</pre>
<p>You can look up the value for a key using square brackets:</p>
<pre data-type="programlisting" data-code-language="py">joels_grade = grades["Joel"]            # equals 80</pre>
<p>But you’ll get a KeyError if you ask for a key that’s not in the dictionary:</p>
<pre data-type="programlisting" data-code-language="py">try:
    kates_grade = grades["Kate"]
except KeyError:
    print "no grade for Kate!"</pre>
<p>You can check for <a data-type="indexterm" data-primary="in operator (Python)" id="id-QrzUg"></a>the existence of a key using in:</p>
<pre data-type="programlisting" data-code-language="py">joel_has_grade = "Joel" in grades     # True
kate_has_grade = "Kate" in grades     # False</pre>
<p>Dictionaries have a get method that returns a default value (instead of raising an exception) when you look up a key that’s not in the dictionary:</p>
<pre data-type="programlisting" data-code-language="py">joels_grade = grades.get("Joel", 0)     # equals 80
kates_grade = grades.get("Kate", 0)     # equals 0
no_ones_grade = grades.get("No One")    # default default is None</pre>
<p>You assign key-value pairs using the same square brackets:</p>
<pre data-type="programlisting" data-code-language="py">grades["Tim"] = 99                    # replaces the old value
grades["Kate"] = 100                  # adds a third entry
num_students = len(grades)            # equals 3</pre>
<p>We will frequently use dictionaries as a simple way to represent structured data:</p>
<pre data-type="programlisting" data-code-language="py">tweet = {
    "user" : "joelgrus",
    "text" : "Data Science is Awesome",
    "retweet_count" : 100,
    "hashtags" : ["#data", "#science", "#datascience", "#awesome", "#yolo"]
}</pre>
<p>Besides looking for specific keys we can look at all of them:</p>
<pre data-type="programlisting" data-code-language="py">tweet_keys   = tweet.keys()     # list of keys
tweet_values = tweet.values()   # list of values
tweet_items  = tweet.items()    # list of (key, value) tuples

"user" in tweet_keys            # True, but uses a slow list in
"user" in tweet                 # more Pythonic, uses faster dict in
"joelgrus" in tweet_values      # True</pre>
<p>Dictionary keys must be hashable, which in practice means immutable; in particular, you cannot use lists as keys. 
If you need a multipart key, you should use a tuple or figure out a way to turn the key into a string.</p> <section data-type="sect3" id="defaultdict-PbRip"> <h3>defaultdict</h3> <p>Imagine that you’re trying to count the words in a document.<a data-type="indexterm" data-primary="dictionaries (Python)" data-secondary="defaultdict" id="id-EbyCV"></a> An obvious approach is to create a dictionary in which the keys are words and the values are counts. As you check each word, you can increment its count if it’s already in the dictionary and add it to the dictionary if it’s not:</p>
<pre data-type="programlisting" data-code-language="py">word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1</pre>
<p>You could also use the "forgiveness is better than permission" approach and just handle the exception from trying to look up a missing key:</p>
<pre data-type="programlisting" data-code-language="py">word_counts = {}
for word in document:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1</pre>
<p>A third approach is to use get, which behaves gracefully for missing keys:</p>
<pre data-type="programlisting" data-code-language="py">word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1</pre>
<p>Every one of these is slightly unwieldy, which is why defaultdict is useful. A defaultdict is like a regular dictionary, except that when you try to look up a key it doesn’t contain, it first adds a value for it using a zero-argument function you provided when you created it. 
In order to use defaultdicts, you have to import them from collections:</p>
<pre data-type="programlisting" data-code-language="py">from collections import defaultdict

word_counts = defaultdict(int)          # int() produces 0
for word in document:
    word_counts[word] += 1</pre>
<p>They can also be useful with list or dict or even your own functions:</p>
<pre data-type="programlisting" data-code-language="py">dd_list = defaultdict(list)             # list() produces an empty list
dd_list[2].append(1)                    # now dd_list contains {2: [1]}

dd_dict = defaultdict(dict)             # dict() produces an empty dict
dd_dict["Joel"]["City"] = "Seattle"     # { "Joel" : { "City" : "Seattle"}}

dd_pair = defaultdict(lambda: [0, 0])
dd_pair[2][1] = 1                       # now dd_pair contains {2: [0,1]}</pre>
<p>These will be useful when we’re using dictionaries to "collect" results by some key and don’t want to have to check every time to see if the key exists yet.<a data-type="indexterm" data-primary="Python" data-secondary="dictionaries" data-startref="ix_Pythondict" id="id-dm7CM"></a></p> </section> <section data-type="sect3" id="counter-6l7FJ"> <h3>Counter</h3> <p>A Counter turns a sequence of values into a defaultdict(int)-like object mapping keys to counts.<a data-type="indexterm" data-primary="Counter (Python)" id="id-Dy0SV"></a><a data-type="indexterm" data-primary="Python" data-secondary="Counter" id="id-Lk9Hg"></a> We will primarily use it to create histograms:</p>
<pre data-type="programlisting" data-code-language="py">from collections import Counter
c = Counter([0, 1, 2, 0])          # c is (basically) { 0 : 2, 1 : 1, 2 : 1 }</pre>
<p>This gives us a very simple way to solve our word_counts problem:</p>
<pre data-type="programlisting" data-code-language="py">word_counts = Counter(document)</pre>
<p>A Counter instance has a most_common method that is frequently useful:</p>
<pre data-type="programlisting" data-code-language="py"># print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print word, count</pre>
</section> </section> <section data-type="sect2" id="sets-yBlSV"> <h2>Sets</h2> <p>Another data structure is set, which<a data-type="indexterm" data-primary="sets (Python)" id="id-2jpfP"></a><a data-type="indexterm" data-primary="Python" data-secondary="sets" id="id-PblUY"></a> represents a collection of <em>distinct</em> elements:</p>
<pre data-type="programlisting" data-code-language="py">s = set()
s.add(1)       # s is now { 1 }
s.add(2)       # s is now { 1, 2 }
s.add(2)       # s is still { 1, 2 }
x = len(s)     # equals 2
y = 2 in s     # equals True
z = 3 in s     # equals False</pre>
<p>We’ll use sets for two main reasons.<a data-type="indexterm" data-primary="in operator (Python)" data-secondary="using on sets" id="id-WNdCR"></a> The first is that in is a very fast operation on sets. If we have a large collection of items that we want to use for a membership test, a set is more appropriate than a list:</p>
<pre data-type="programlisting" data-code-language="py">stopwords_list = ["a","an","at"] + hundreds_of_other_words + ["yet", "you"]

"zip" in stopwords_list     # False, but have to check every element

stopwords_set = set(stopwords_list)
"zip" in stopwords_set      # very fast to check</pre>
<p>The second reason is to find the <em>distinct</em> items in a collection:</p>
<pre data-type="programlisting" data-code-language="py">item_list = [1, 2, 3, 1, 2, 3]
num_items = len(item_list)                # 6
item_set = set(item_list)                 # {1, 2, 3}
num_distinct_items = len(item_set)        # 3
distinct_item_list = list(item_set)       # [1, 2, 3]</pre>
<p>We’ll use sets much less frequently than dicts and lists.</p> </section> <section data-type="sect2" id="control-flow-8A0un"> <h2>Control Flow</h2> <p>As in most programming languages, you can perform an action <a data-type="indexterm" data-primary="Python" data-secondary="control flow" id="id-ybZiX"></a><a data-type="indexterm" data-primary="control flow (in Python)" id="id-beNSo"></a><a data-type="indexterm" data-primary="if statements (Python)" 
 id="id-6ZKsM"></a>conditionally using if:</p>
<pre data-type="programlisting" data-code-language="py">if 1 &gt; 2:
    message = "if only 1 were greater than two..."
elif 1 &gt; 3:
    message = "elif stands for 'else if'"
else:
    message = "when all else fails use else (if you want to)"</pre>
<p>You can also write a <em>ternary</em> if-then-else <a data-type="indexterm" data-primary="if-then-else statements (Python)" id="id-VgWHa"></a>on one line, which we will do occasionally:</p>
<pre data-type="programlisting" data-code-language="py">parity = "even" if x % 2 == 0 else "odd"</pre>
<p>Python<a data-type="indexterm" data-primary="while loops (Python)" id="id-R3qUX"></a> has a while loop:</p>
<pre data-type="programlisting" data-code-language="py">x = 0
while x &lt; 10:
    print x, "is less than 10"
    x += 1</pre>
<p>although more <a data-type="indexterm" data-primary="for loops (Python)" id="id-0Wbfe"></a><a data-type="indexterm" data-primary="in operator (Python)" data-secondary="in for loops" id="id-47aCg"></a>often we’ll use for and in:</p>
<pre data-type="programlisting" data-code-language="py">for x in range(10):
    print x, "is less than 10"</pre>
<p>If you need more-complex logic, you <a data-type="indexterm" data-primary="continue statement (Python)" id="id-747sl"></a><a data-type="indexterm" data-primary="break statement (Python)" id="id-ENWhV"></a>can use continue and break:</p>
<pre data-type="programlisting" data-code-language="py">for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break    # quit the loop entirely
    print x</pre>
<p>This will print 0, 1, 2, and 4.</p> </section> <section data-type="sect2" id="truthiness-jeJi2"> <h2>Truthiness</h2> <p>Booleans in Python work as in most<a data-type="indexterm" data-primary="Python" data-secondary="Booleans" id="id-KR7cd"></a><a data-type="indexterm" data-primary="booleans (Python)" id="id-NdEf6"></a><a data-type="indexterm" data-primary="truthiness (in Python)" id="id-zAkuJ"></a> other languages, 
except that they’re capitalized:</p>
<pre data-type="programlisting" data-code-language="py">one_is_less_than_two = 1 &lt; 2          # equals True
true_equals_false = True == False    # equals False</pre>
<p>Python uses the value None to indicate a nonexistent value.<a data-type="indexterm" data-primary="None (Python)" id="id-dBruM"></a> It is similar to other languages' null:</p>
<pre data-type="programlisting" data-code-language="py">x = None
print x == None    # prints True, but is not Pythonic
print x is None    # prints True, and is Pythonic</pre>
<p>Python lets you use any value where it expects a Boolean. The following are all "Falsy":</p> <ul> <li> <p>False</p> </li> <li> <p>None</p> </li> <li> <p>[] (an empty list)</p> </li> <li> <p>{} (an empty dict)</p> </li> <li> <p>""</p> </li> <li> <p>set()</p> </li> <li> <p>0</p> </li> <li> <p>0.0</p> </li> </ul> <p>Pretty much anything else gets treated as True. This allows you to easily use if statements to test for empty lists or empty strings or empty dictionaries or so on. It also sometimes causes tricky bugs if you’re not expecting this behavior:</p>
<pre data-type="programlisting" data-code-language="py">s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""</pre>
<p>A simpler way of doing the same is:</p>
<pre data-type="programlisting" data-code-language="py">first_char = s and s[0]</pre>
<p>since and returns its second value when the first is "truthy," and the first value when it’s not. 
Similarly, if x is either a number or possibly None:</p> <pre data-type="programlisting" data-code-language="py">safe_x = x or 0</pre> <p>is definitely a number.</p> <p>Python has an all function, which takes<a data-type="indexterm" data-primary="all function (Python)" id="id-0Xbte"></a><a data-type="indexterm" data-primary="any function (Python)" id="id-42aCg"></a> a list and returns True precisely when every element is truthy, and an any function, which returns True when at least one element is truthy:</p>
<pre data-type="programlisting" data-code-language="py">all([True, 1, { 3 }])   # True
all([True, 1, {}])      # False, {} is falsy
any([True, 1, {}])      # True, True is truthy
all([])                 # True, no falsy elements in the list
any([])                 # False, no truthy elements in the list</pre>
</section> </section> <section data-type="sect1" id="the-not-so-basics-DeWTR"> <h1>The Not-So-Basics</h1> <p>Here we’ll look at some more-advanced Python features that we’ll find useful for working with data.<a data-type="indexterm" data-primary="Python" data-secondary="sorting in" id="id-mLWtD"></a></p> <section data-type="sect2" id="sorting-XeGUb"> <h2>Sorting</h2> <p>Every Python list has a sort method that sorts it in place.<a data-type="indexterm" data-primary="sorting (in Python)" id="id-ndMfl"></a><a data-type="indexterm" data-primary="lists (in Python)" data-secondary="sort method" id="id-Kr7hd"></a> If you don’t want to mess up your list, you can use the sorted function, which returns a new list:</p>
<pre data-type="programlisting" data-code-language="py">x = [4,1,2,3]
y = sorted(x)     # is [1,2,3,4], x is unchanged
x.sort()          # now x is [1,2,3,4]</pre>
<p>By default, sort (and sorted) sort a list from smallest to largest based on naively comparing the elements to one another.</p> <p>If you want elements sorted from largest to smallest, you can specify a reverse=True parameter. 
And instead of comparing the elements themselves, you can compare the results of a function that you specify with key:</p>
<pre data-type="programlisting" data-code-language="py"># sort the list by absolute value from largest to smallest
x = sorted([-4,1,-2,3], key=abs, reverse=True)  # is [-4,3,-2,1]

# sort the words and counts from highest count to lowest
wc = sorted(word_counts.items(),
            key=lambda (word, count): count,
            reverse=True)</pre>
</section> <section data-type="sect2" id="list-comprehensions-baJF7"> <h2>List Comprehensions</h2> <p>Frequently, you’ll want to transform a list into another list, by choosing only certain elements, or by transforming elements, or both.<a data-type="indexterm" data-primary="list comprehensions (Python)" id="id-PBlhY"></a><a data-type="indexterm" data-primary="Python" data-secondary="list comprehensions" id="id-DeMiV"></a> The Pythonic way of doing this is <em>list comprehensions</em>:</p>
<pre data-type="programlisting" data-code-language="py">even_numbers = [x for x in range(5) if x % 2 == 0]  # [0, 2, 4]
squares      = [x * x for x in range(5)]            # [0, 1, 4, 9, 16]
even_squares = [x * x for x in even_numbers]        # [0, 4, 16]</pre>
<p>You can similarly turn lists into dictionaries or sets:</p>
<pre data-type="programlisting" data-code-language="py">square_dict = { x : x * x for x in range(5) }  # { 0:0, 1:1, 2:4, 3:9, 4:16 }
square_set  = { x * x for x in [1, -1] }       # { 1 }</pre>
<p>If you don’t need the value from the list, it’s conventional to use an underscore as the variable:</p>
<pre data-type="programlisting" data-code-language="py">zeroes = [0 for _ in even_numbers]      # has the same length as even_numbers</pre>
<p>A list comprehension can include <a data-type="indexterm" data-primary="for loops (Python)" data-secondary="in list comprehensions" id="id-d5LUM"></a>multiple fors:</p>
<pre data-type="programlisting" data-code-language="py">pairs = [(x, y)
         for x in range(10)
         for y in range(10)]   # 100 pairs (0,0) (0,1) ... (9,8), (9,9)</pre>
<p>and later fors can use the results of earlier ones:</p>
<pre data-type="programlisting" data-code-language="py">increasing_pairs = [(x, y)                      # only pairs with x &lt; y,
                    for x in range(10)          # range(lo, hi) equals
                    for y in range(x + 1, 10)]  # [lo, lo + 1, ..., hi - 1]</pre>
<p>We will use list comprehensions a lot.</p> </section> <section data-type="sect2" id="generators"> <h2>Generators and Iterators</h2> <p>A problem with lists is that they can easily<a data-type="indexterm" data-primary="Python" data-secondary="generators and iterators" id="id-X90CG"></a> grow very big. range(1000000) creates an actual list of 1 million elements.<a data-type="indexterm" data-primary="range function (Python)" id="id-ja2ub"></a> If you only need to deal with them one at a time, this can be a huge source of inefficiency (or of running out of memory). If you potentially only need the first few values, then calculating them all is a waste.</p> <p>A <em>generator</em> is something that you can<a data-type="indexterm" data-primary="generators (Python)" id="id-9Q1hn"></a> iterate over (for us, usually using for) but whose values are produced only as needed (<em>lazily</em>).</p> <p>One way to create generators is <a data-type="indexterm" data-primary="yield operator (Python)" id="id-r0NhE"></a>with functions and the yield operator:</p>
<pre data-type="programlisting" data-code-language="py">def lazy_range(n):
    """a lazy version of range"""
    i = 0
    while i &lt; n:
        yield i
        i += 1</pre>
<p>The following loop will consume the yielded values one at a time until none are left:</p>
<pre data-type="programlisting" data-code-language="py">for i in lazy_range(10):
    do_something_with(i)</pre>
<p>(Python actually comes with a lazy_range function <a data-type="indexterm" data-primary="xrange function (Python)" id="id-qj3Hj"></a>called xrange, and in Python 3, range itself is lazy.) 
This means you could even create an infinite sequence:</p>
<pre data-type="programlisting" data-code-language="py">def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1</pre>
<p>although you probably shouldn’t iterate over it without using some kind of break logic.</p> <div data-type="tip" id="id-VWKi1"><h6>Tip</h6> <p>The flip side of laziness is that you can only iterate through a generator once. If you need to iterate through something multiple times, you’ll need to either recreate the generator each time or use a list.</p> </div> <p>A second way to create<a data-type="indexterm" data-primary="for comprehensions (Python)" id="id-eGKIp"></a> generators is by using for comprehensions wrapped in parentheses (Python calls these <em>generator expressions</em>):</p>
<pre data-type="programlisting" data-code-language="py">lazy_evens_below_20 = (i for i in lazy_range(20) if i % 2 == 0)</pre>
<p>Recall also that every dict has an items() method that returns a list of its key-value pairs.<a data-type="indexterm" data-primary="dictionaries (Python)" data-secondary="items and iteritems methods" id="id-4lJcg"></a> More frequently we’ll use the iteritems() method, which lazily yields the key-value pairs one at a time as we iterate over it.</p> </section> <section data-type="sect2" id="randomness-PG2up"> <h2>Randomness</h2> <p>As we learn data science, we will frequently need to generate random numbers, which we can do <a data-type="indexterm" data-primary="random module (Python)" id="id-EBDTV"></a><a data-type="indexterm" data-primary="Python" data-secondary="random numbers, generating" id="id-mjQHD"></a>with the random module:</p>
<pre data-type="programlisting" data-code-language="py">import random

four_uniform_randoms = [random.random() for _ in range(4)]

#  [0.8444218515250481,      # random.random() produces numbers
#   0.7579544029403025,      # uniformly between 0 and 1
#   0.420571580830845,       # it's the random function we'll use
#   0.25891675029296335]     # most often</pre>
<p>The random module actually 
produces pseudorandom (that is, deterministic) numbers based on an internal state that you can set with random.seed if you want to get reproducible results:</p>
<pre data-type="programlisting" data-code-language="py">random.seed(10)         # set the seed to 10
print random.random()   # 0.57140259469
random.seed(10)         # reset the seed to 10
print random.random()   # 0.57140259469 again</pre>
<p>We’ll sometimes use random.randrange, which takes either 1 or 2 arguments and returns an element chosen randomly from the corresponding range():</p>
<pre data-type="programlisting" data-code-language="py">random.randrange(10)    # choose randomly from range(10) = [0, 1, ..., 9]
random.randrange(3, 6)  # choose randomly from range(3, 6) = [3, 4, 5]</pre>
<p>There are a few more methods that we’ll sometimes find convenient. random.shuffle randomly reorders the elements of a list:</p>
<pre data-type="programlisting" data-code-language="py">up_to_ten = range(10)
random.shuffle(up_to_ten)
print up_to_ten
# [2, 5, 1, 9, 7, 3, 8, 6, 4, 0]   (your results will probably be different)</pre>
<p>If you need to randomly pick one element from a list you can use random.choice:</p>
<pre data-type="programlisting" data-code-language="py">my_best_friend = random.choice(["Alice", "Bob", "Charlie"])     # "Bob" for me</pre>
<p>And if you need to randomly choose a sample of elements without replacement (i.e., with no duplicates), you can use random.sample:</p>
<pre data-type="programlisting" data-code-language="py">lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6)  # [16, 36, 10, 6, 25, 9]</pre>
<p>To choose a sample of elements <em>with</em> replacement (i.e., allowing duplicates), you can just make multiple calls to random.choice:</p>
<pre data-type="programlisting" data-code-language="py">four_with_replacement = [random.choice(range(10))
                         for _ in range(4)]
# [9, 4, 4, 2]</pre>
</section> <section data-type="sect2" id="regular-expressions-Vr0UZ"> <h2>Regular Expressions</h2> <p>Regular 
expressions provide a way of searching text. <a data-type="indexterm" data-primary="regular expressions" id="id-Xb0HG"></a><a data-type="indexterm" data-primary="Python" data-secondary="regular expressions" id="id-jW2cb"></a> They are incredibly useful but also fairly complicated, so much so that there are entire books written about them. We will explain their details the few times we encounter them; here are a few examples of how to use them in Python:</p>
<pre data-type="programlisting" data-code-language="py">import re

print all([                                 # all of these are true, because
    not re.match("a", "cat"),               # * 'cat' doesn't start with 'a'
    re.search("a", "cat"),                  # * 'cat' has an 'a' in it
    not re.search("c", "dog"),              # * 'dog' doesn't have a 'c' in it
    3 == len(re.split("[ab]", "carbs")),    # * split on a or b to ['c','r','s']
    "R-D-" == re.sub("[0-9]", "-", "R2D2")  # * replace digits with dashes
    ])                                      # prints True</pre>
</section> <section data-type="sect2" id="object-oriented-programming-055Cr"> <h2>Object-Oriented Programming</h2> <p>Like many languages, Python allows you to define <em>classes</em> that encapsulate data and the functions that operate on them.<a data-type="indexterm" data-primary="classes (Python)" id="id-rENIE"></a><a data-type="indexterm" data-primary="Python" data-secondary="object-oriented programming" id="id-ya4hX"></a> We’ll use them sometimes to make our code cleaner and simpler. It’s probably simplest to explain them by constructing a heavily annotated example.</p> <p>Imagine we didn’t have the built-in Python set. Then we might want to create our own Set class.</p> <p>What behavior should our class have? Given an instance of Set, we’ll need to be able to add items to it, remove items from it, and check whether it contains a certain value. 
<a data-type="indexterm" data-primary="member functions" id="id-Q4Dhg"></a> We’ll create all of these as <em>member</em> functions, which means we’ll access them with a dot after a Set object:</p>
<pre data-type="programlisting" data-code-language="py"># by convention, we give classes PascalCase names
class Set:

    # these are the member functions
    # every one takes a first parameter "self" (another convention)
    # that refers to the particular Set object being used

    def __init__(self, values=None):
        """This is the constructor.
        It gets called when you create a new Set.
        You would use it like
        s1 = Set()          # empty set
        s2 = Set([1,2,2,3]) # initialize with values"""

        self.dict = {} # each instance of Set has its own dict property
                       # which is what we'll use to track memberships
        if values is not None:
            for value in values:
                self.add(value)

    def __repr__(self):
        """this is the string representation of a Set object
        if you type it at the Python prompt or pass it to str()"""
        return "Set: " + str(self.dict.keys())

    # we'll represent membership by being a key in self.dict with value True
    def add(self, value):
        self.dict[value] = True

    # value is in the Set if it's a key in the dictionary
    def contains(self, value):
        return value in self.dict

    def remove(self, value):
        del self.dict[value]</pre>
<p>Which we could then use like:</p>
<pre data-type="programlisting" data-code-language="py">s = Set([1,2,3])
s.add(4)
print s.contains(4)     # True
s.remove(3)
print s.contains(3)     # False</pre>
</section> <section data-type="sect2" id="functional-tools-zWdiz"> <h2>Functional Tools</h2> <p>When passing functions around, sometimes we’ll want to partially apply (or <em>curry</em>) functions to create new functions.<a data-type="indexterm" data-primary="currying (Python)" id="id-05DSe"></a><a data-type="indexterm" data-primary="Python" data-secondary="functional tools" id="id-4VJSg"></a> As a simple example, imagine that we have a function of two variables:</p>
<pre data-type="programlisting" 
data-code-language="py">def exp(base, power):
    return base ** power</pre>
<p>and we want to use it to create a function of one variable two_to_the whose input is a power and whose output is the result of exp(2, power).</p> <p>We can, of course, do this with def, but this can sometimes get unwieldy:</p>
<pre data-type="programlisting" data-code-language="py">def two_to_the(power):
    return exp(2, power)</pre>
<p>A different approach<a data-type="indexterm" data-primary="partial functions (Python)" id="id-kXahN"></a> is to use functools.partial:</p>
<pre data-type="programlisting" data-code-language="py">from functools import partial
two_to_the = partial(exp, 2)    # is now a function of one variable
print two_to_the(3)             # 8</pre>
<p>You can also use partial to fill in later arguments if you specify their names:</p>
<pre data-type="programlisting" data-code-language="py">square_of = partial(exp, power=2)
print square_of(3)                  # 9</pre>
<p>It starts to get messy if you curry arguments in the middle of the function, so we’ll try to avoid doing that.</p> <p>We will also occasionally use map, reduce, and filter, which provide functional <a data-type="indexterm" data-primary="map function (Python)" id="id-zWbCJ"></a>alternatives to list comprehensions:</p>
<pre data-type="programlisting" data-code-language="py">def double(x):
    return 2 * x

xs = [1, 2, 3, 4]
twice_xs = [double(x) for x in xs]      # [2, 4, 6, 8]
twice_xs = map(double, xs)              # same as above
list_doubler = partial(map, double)     # *function* that doubles a list
twice_xs = list_doubler(xs)             # again [2, 4, 6, 8]</pre>
<p>You can use map with multiple-argument functions if you provide multiple lists:</p>
<pre data-type="programlisting" data-code-language="py">def multiply(x, y): return x * y

products = map(multiply, [1, 2], [4, 5])  # [1 * 4, 2 * 5] = [4, 10]</pre>
<p>Similarly, filter does the work<a data-type="indexterm" data-primary="filter function (Python)" id="id-26rfP"></a> of a list-comprehension if:</p>
<pre 
data-type="programlisting" data-code-language="py">def is_even(x):
    """True if x is even, False if x is odd"""
    return x % 2 == 0

x_evens = [x for x in xs if is_even(x)]    # [2, 4]
x_evens = filter(is_even, xs)              # same as above
list_evener = partial(filter, is_even)     # *function* that filters a list
x_evens = list_evener(xs)                  # again [2, 4]</pre>
<p>And reduce combines the first<a data-type="indexterm" data-primary="reduce function (Python)" id="id-L8ySg"></a> two elements of a list, then that result with the third, that result with the fourth, and so on, producing a single result:</p>
<pre data-type="programlisting" data-code-language="py">x_product = reduce(multiply, xs)           # = 1 * 2 * 3 * 4 = 24
list_product = partial(reduce, multiply)   # *function* that reduces a list
x_product = list_product(xs)               # again = 24</pre>
</section> <section data-type="sect2" id="enumerate-91AT8"> <h2>enumerate</h2> <p>Not infrequently, you’ll want to iterate<a data-type="indexterm" data-primary="Python" data-secondary="enumerate function" id="id-oNEt8"></a><a data-type="indexterm" data-primary="enumerate function (Python)" id="id-pGJHN"></a> over a list and use both its elements and their indexes:</p>
<pre data-type="programlisting" data-code-language="py"># not Pythonic
for i in range(len(documents)):
    document = documents[i]
    do_something(i, document)

# also not Pythonic
i = 0
for document in documents:
    do_something(i, document)
    i += 1</pre>
<p>The Pythonic solution is enumerate, which produces tuples (index, element):</p>
<pre data-type="programlisting" data-code-language="py">for i, document in enumerate(documents):
    do_something(i, document)</pre>
<p>Similarly, if we just want the indexes:</p>
<pre data-type="programlisting" data-code-language="py">for i in range(len(documents)): do_something(i)      # not Pythonic
for i, _ in enumerate(documents): do_something(i)    # Pythonic</pre>
<p>We’ll use this a lot.</p> </section> <section data-type="sect2" id="zip-and-argument-unpacking-AN9Ha"> <h2>zip and 
Argument Unpacking</h2> <p>Often we will need to zip two or more lists<a data-type="indexterm" data-primary="Python" data-secondary="zip function and argument unpacking" id="id-jBXhb"></a><a data-type="indexterm" data-primary="zip function (Python)" id="id-gnGsq"></a><a data-type="indexterm" data-primary="lists (in Python)" data-secondary="zipping and unzipping" id="id-