<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Vroom Benchmarks</title>
<script>
// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
// be compatible with the behavior of Pandoc < 2.8).
document.addEventListener('DOMContentLoaded', function(e) {
  var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
  var i, h, a;
  for (i = 0; i < hs.length; i++) {
    h = hs[i];
    if (!/^h[1-6]$/i.test(h.tagName)) continue;  // it should be a header h1-h6
    a = h.attributes;
    while (a.length > 0) h.removeAttribute(a[0].name);
  }
});
</script>
<style type="text/css"> code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} </style> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; 
padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { 
color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Vroom Benchmarks</h1> <p>vroom is a new approach to reading delimited and fixed-width data into R.</p> <p>It stems from the observation that, when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.</p> <p>Therefore, you can obtain very rapid input by first performing a fast indexing step and then using the Altrep framework, available in R versions 3.5+, to access the values in a lazy (delayed) fashion.</p> <div id="how-it-works" class="section level2"> <h2>How it works</h2> <p>The initial reading of the file simply records the location of each individual record; the actual values are not read into R. Altrep vectors are created for each column in the data, each holding a pointer to the index and the memory-mapped file. When these vectors are indexed, the value is read from the memory mapping.</p> <p>This means initial reading is extremely fast: in the real-world dataset below it takes roughly 1/4 the time of the multi-threaded <code>data.table::fread()</code>. Sampling operations are likewise extremely fast, as only the data actually included in the sample is read. This means things like the tibble print method and calls to <code>head()</code>, <code>tail()</code>, <code>x[sample(), ]</code> etc. have very low overhead. Filtering can also be fast: only the columns included in the filter selection have to be fully read, and only the data in the filtered rows needs to be read from the remaining columns. 
Grouped aggregations likewise only need to read the grouping variables and the variables aggregated.</p>
<p>Once a particular vector is fully materialized, the speed of all subsequent operations should be identical to that of a normal R vector.</p>
<p>This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once, it can be efficiently queried and subset.</p>
</div>
<div id="reading-delimited-files" class="section level1">
<h1>Reading delimited files</h1>
<p>The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom delays reading, the benchmarks also manipulate the data afterwards to provide a more realistic performance comparison.</p>
<p>Because the <code>read.delim</code> results are so much slower than the others, they are excluded from the plots but retained in the tables.</p>
<div id="taxi-trip-dataset" class="section level2">
<h2>Taxi Trip Dataset</h2>
<p>This real-world dataset is the 2013 Freedom of Information Law (FOIL) Taxi Trip Data from the NYC Taxi and Limousine Commission, originally posted at <a href="https://chriswhong.com/open-data/foil_nyc_taxi/" class="uri">https://chriswhong.com/open-data/foil_nyc_taxi/</a>. It is also hosted on <a href="https://archive.org/details/nycTaxiTripData2013">archive.org</a>.</p>
<p>The first table, <code>trip_fare_1.csv</code>, is 1.55G in size.</p>
<pre><code>#&gt; Observations: 14,776,615
#&gt; Variables: 11
#&gt; $ medallion       &lt;chr&gt; &quot;89D227B655E5C82AECF13C3F540D4CF4&quot;, &quot;0BD7C8F5B...
#&gt; $ hack_license    &lt;chr&gt; &quot;BA96DE419E711691B9445D6A6307C170&quot;, &quot;9FD8F69F0...
#&gt; $ vendor_id       &lt;chr&gt; &quot;CMT&quot;, &quot;CMT&quot;, &quot;CMT&quot;, &quot;CMT&quot;, &quot;CMT&quot;, &quot;CMT&quot;, &quot;CMT...
#&gt; $ pickup_datetime &lt;chr&gt; &quot;2013-01-01 15:11:48&quot;, &quot;2013-01-06 00:18:35&quot;, ...
#&gt; $ payment_type    &lt;chr&gt; &quot;CSH&quot;, &quot;CSH&quot;, &quot;CSH&quot;, &quot;CSH&quot;, &quot;CSH&quot;, &quot;CSH&quot;, &quot;CSH...
#&gt; $ fare_amount     &lt;dbl&gt; 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#&gt; $ surcharge       &lt;dbl&gt; 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#&gt; $ mta_tax         &lt;dbl&gt; 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#&gt; $ tip_amount      &lt;int&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#&gt; $ tolls_amount    &lt;dbl&gt; 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#&gt; $ total_amount    &lt;dbl&gt; 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...</code></pre>
<div id="taxi-benchmarks" class="section level3">
<h3>Taxi Benchmarks</h3>
<p>code: <a href="https://github.com/tidyverse/vroom/tree/main/inst/bench/taxi">bench/taxi</a></p>
<p>All benchmarks were run on an Amazon EC2 <a href="https://aws.amazon.com/ec2/instance-types/m5/">m5.4xlarge</a> instance with 16 vCPUs and an <a href="https://aws.amazon.com/ebs/">EBS</a> volume.</p>
<p>The benchmarks labeled <code>vroom_base</code> use <code>vroom</code> to read the file and base functions to manipulate it. <code>vroom_dplyr</code> uses <code>vroom</code> to read the file and dplyr functions to manipulate it. <code>data.table</code> uses <code>fread()</code> to read the file and <code>data.table</code> functions to manipulate it, and <code>readr</code> uses <code>readr</code> to read the file and <code>dplyr</code> to manipulate it. By default vroom only uses Altrep for character vectors; these runs are labeled <code>vroom(altrep: normal)</code>. The benchmarks labeled <code>vroom(altrep: full)</code> instead use Altrep vectors for all supported types, and <code>vroom(altrep: none)</code> disables Altrep entirely.</p>
<p>The following operations are performed:</p>
<ul> <li>The data is read</li> <li><code>print()</code> - <em>N.B. 
<code>read.delim</code> uses <code>print(head(x, 10))</code> because printing the whole dataset takes more than 10 minutes</em></li> <li><code>head()</code></li> <li><code>tail()</code></li> <li>Sampling 100 random rows</li> <li>Filtering for the “UNK” payment type, which matches 6,434 rows (0.0435% of the total).</li> <li>Aggregation of mean fare amount per payment type.</li> </ul> <img src="