# dataship-frame

* group
* filter
* join
* pivot
* unique values for each column
* min and max for each column (esp date)

    g = groupby("country", "type")
    g["deficiencies"].sum() / g.count();

port -> field -> aggregate

port -> type -> field -> aggregate

    result["china"]["deficiencies"] // returns sum
    result.china.deficiencies

    // type is a sub-aggregate
    result["china"]["oil tanker"]["deficiencies"] // returns sum
    result["china"]["type"]

array return

* array of reduced values
* indices match array of hierarchical categories

    result = groupby("country")
    result.levels() // => 1 (tiers, dimensions)
    result.sum("deficiencies", "detained"); // => [[6, 5, 10, 7], [2, 0, 1, 2]]
    result.china.sum("deficiencies", "detained"); // => [6, 2]
    result.china.sum("deficiencies"); // => 6
    result.china.values("deficiencies") // => [1, 1, 0, 1, 1, 0, 1, 0, 1]
    result.groups() // => ["china", "brazil", "new zealand", "korea"]
    result.china.groups() // => ["oil tanker", "cargo ship"]

    ports = f.groupby("port")
    ports.values("country") // => [["china", "china", "china"], ["brazil", "brazil"], "new zealand", "korea"]
    ports.distinct("country") // => ["china", "brazil", "new zealand", "korea"]

object return

* each call to an aggregation function returns a single value (on a leaf node)

## Priorities

* speed
* simple, intuitive interface
* deployable directly to production
* produce results usable in machine learning applications
* 1M - 10M row data sets (for now)

## Example Tasks

Should at least be able to create limited prototypes for each of these:

* domain category task (Tailwind)
  * data: pin data, with domain and category
  * pivot (two dimensional groupby)
    * dimensions: domain, category
    * groups: distinct values
    * membership: equals (if the value equals a distinct value, it is a member)
    * reduction: count
* board recommendation task (Tailwind)
  * data: pin data, with board name and pin descriptions
  * board name, vocabulary item, and occurrences
  * pivot
    * dimensions: board_name, description
    * groups:
      * board_name: distinct values
      * description: all vocabulary words, non-partitioning (membership in more than one group is allowed)
    * membership: contains (if the description contains the vocabulary word, it is a member of the group)
    * reduction: count/sum
  * dimensionality reduction
  * direct usage
    * create new sparse vectors from source
    * similarity on sparse vectors with existing set
  * text analysis
    * parsing
    * tokenization
    * lemmatization/stemming
    * document vector per board (bag-of-words (sparse?))
* user game matrix creation task (Crunch Magic)
  * data: user gaming data, with userid, gameid and hours played
  * sculpt: turn JSON into row data
    * each entry in games array is expanded into a new row: userid, gameid, hours_played
      * waylon, skyrim, 72
      * waylon, horizon, 50
      * janell, stardew, 40
  * pivot:
    * dimensions: gameid, userid
    * groups: distinct values (sparse)
    * reduction: sum of hours played
  * after framing, we have a dimensionality reduction task, using Alternating Least Squares (ALS)
* inspection data visualization task (Navis)

## Common Visualizations Data Structure

### Stacked bar chart

A display of two-dimensionally pivoted data: one axis (typically x) is one reduction dimension, the other axis is the variable reduced over, and the bars are split into groups by the second reduction dimension.

## Platform Layers

    notebook (webnotebook)
    ---------
    data management (dataship)
    ---------
    analysis (frame and webnn)
    ---------
    deployment
    ---------
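
## Sketch: user game matrix

To make the user game matrix task under Example Tasks more concrete, here is a minimal sketch of the sculpt and pivot steps in plain JavaScript. It does not use the frame API. The function names `sculpt` and `pivotSum`, and the nested shape of the input JSON, are assumptions made for illustration. Only the row fields (userid, gameid, hours_played) and the three sample rows come from the notes above.

```javascript
// Hypothetical input shape: one object per user with a nested games array.
// The nesting is an assumption; the notes only say "games array".
const users = [
  {
    userid: "waylon",
    games: [
      { gameid: "skyrim", hours_played: 72 },
      { gameid: "horizon", hours_played: 50 },
    ],
  },
  {
    userid: "janell",
    games: [{ gameid: "stardew", hours_played: 40 }],
  },
];

// sculpt: expand each entry in a user's games array into its own row.
function sculpt(users) {
  return users.flatMap((user) =>
    user.games.map((game) => ({
      userid: user.userid,
      gameid: game.gameid,
      hours_played: game.hours_played,
    }))
  );
}

// pivot: two-dimensional groupby with "equals" membership and a sum reduction.
// Only (row, column) pairs that actually occur are stored, so the result is sparse.
function pivotSum(rows, rowDim, colDim, field) {
  const cells = new Map(); // rowKey -> Map(colKey -> running sum)
  for (const row of rows) {
    const rowKey = row[rowDim];
    const colKey = row[colDim];
    if (!cells.has(rowKey)) cells.set(rowKey, new Map());
    const cols = cells.get(rowKey);
    cols.set(colKey, (cols.get(colKey) || 0) + row[field]);
  }
  return cells;
}

const rows = sculpt(users);
const matrix = pivotSum(rows, "gameid", "userid", "hours_played");

console.log(rows.length);                         // 3
console.log(matrix.get("skyrim").get("waylon"));  // 72
console.log(matrix.get("stardew").has("waylon")); // false (the cell is simply absent)
```

Nested `Map`s keep the matrix sparse, which matches the "distinct values (sparse)" note for this task. A dense array layout would likely be a better fit if the result feeds directly into ALS, but that choice is left open here.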