# dataship-frame
A data frame for JavaScript. Crunch numbers in Node and the browser.
* group
* filter
* join
* pivot
* unique values for each column
* min and max for each column (esp. date)
g = groupby("country", "type")
g["deficiencies"].sum() / g.count();
port -> field -> aggregate
port -> type -> field -> aggregate
result["china"]["deficiencies"] // returns sum
result.china.deficiencies
// type is a sub-aggregate
result["china"]["oil tanker"]["deficiencies"] // returns sum
result["china"]["type"]
array return
an array of reduced values, whose indices match the array of hierarchical categories (groups)
result = groupby("country")
result.levels() // => 1 (tiers, dimensions)
result.sum("deficiencies", "detained"); // => [[6, 5, 10, 7], [2, 0, 1, 2]]
result.china.sum("deficiencies", "detained"); // => [6, 2]
result.china.sum("deficiencies"); // => 6
result.china.values("deficiencies") // => [1, 1, 0, 1, 1, 0, 1, 0, 1]
result.groups() // => ["china", "brazil", "new zealand", "korea"]
result.china.groups() // => ["oil tanker", "cargo ship"]
ports = f.groupby("port")
ports.values("country") // => [["china", "china", "china"], ["brazil", "brazil"], ["new zealand"], ["korea"]]
ports.distinct("country") // => ["china", "brazil", "new zealand", "korea"]
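
The array-return idea can be sketched as below. This is a minimal illustration over plain row objects, not the library's implementation: the sample rows and the exact behavior of the hypothetical `groupby`, `groups`, `values`, and `sum` helpers are assumptions. Each reduction returns an array whose indices line up with the array of group names.

```javascript
// Hypothetical sample rows; column names follow the notes above.
const rows = [
  { country: "china", deficiencies: 2, detained: 1 },
  { country: "brazil", deficiencies: 1, detained: 0 },
  { country: "china", deficiencies: 4, detained: 1 },
  { country: "korea", deficiencies: 3, detained: 0 },
  { country: "brazil", deficiencies: 4, detained: 0 },
];

// Group rows by one column. Groups are ordered by first appearance.
function groupby(rows, column) {
  const index = new Map(); // group name -> position in the arrays below
  const names = [];
  const members = []; // members[i] holds the rows of group names[i]
  for (const row of rows) {
    const key = row[column];
    if (!index.has(key)) {
      index.set(key, names.length);
      names.push(key);
      members.push([]);
    }
    members[index.get(key)].push(row);
  }
  return {
    // Array of group names.
    groups: () => names.slice(),
    // One array of raw values per group, in group order.
    values: (field) => members.map((g) => g.map((r) => r[field])),
    // One array of sums per requested field, each in group order.
    sum: (...fields) =>
      fields.map((f) => members.map((g) => g.reduce((acc, r) => acc + r[f], 0))),
  };
}

const byCountry = groupby(rows, "country");
console.log(byCountry.groups()); // [ 'china', 'brazil', 'korea' ]
console.log(byCountry.sum("deficiencies", "detained")); // [ [ 6, 5, 3 ], [ 2, 0, 0 ] ]
```

Keeping the group names and the reduced values in parallel arrays makes the result easy to hand to charting or numeric code, at the cost of looking up a group's position before reading its value.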
object return
each call to an aggregation function returns a single value (on a leaf node)
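
For comparison, the object-return idea can be sketched as a nested object keyed by group, then sub-group, then field. This is only an illustration under assumptions: `groupSum` is a hypothetical helper, the sample rows are made up, and aggregates are stored on leaf nodes only (it does not also store per-country totals).

```javascript
// Hypothetical sample rows.
const inspections = [
  { country: "china", type: "oil tanker", deficiencies: 2 },
  { country: "china", type: "cargo ship", deficiencies: 1 },
  { country: "china", type: "oil tanker", deficiencies: 3 },
  { country: "brazil", type: "cargo ship", deficiencies: 4 },
];

// Build a hierarchy dims[0] -> dims[1] -> ... -> field -> sum.
function groupSum(rows, dims, fields) {
  // Null-prototype objects avoid clashes with keys such as "constructor".
  const root = Object.create(null);
  for (const row of rows) {
    let node = root;
    for (const dim of dims) {
      const key = row[dim];
      if (node[key] === undefined) node[key] = Object.create(null);
      node = node[key];
    }
    // node is now the leaf for this row's combination of groups.
    for (const field of fields) {
      node[field] = (node[field] || 0) + row[field];
    }
  }
  return root;
}

const result = groupSum(inspections, ["country", "type"], ["deficiencies"]);
console.log(result["china"]["oil tanker"]["deficiencies"]); // 5
console.log(result.brazil["cargo ship"].deficiencies); // 4
```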
## Priorities
* speed
* simple, intuitive interface
* deployable directly to production
* produce results usable in machine learning applications
* 1M - 10M row data sets (for now)
## Example Tasks
The library should at least be able to create limited prototypes for each of these:
* domain category task (Tailwind)
  * data: pin data, with domain and category
  * pivot (two dimensional groupby)
    * dimensions: domain, category
    * groups: distinct values
    * membership: equals (if the value equals a distinct value, it is a member)
    * reduction: count
* board recommendation task (Tailwind)
  * data: pin data, with board name and pin descriptions
  * board name, vocabulary item, and occurrences
  * pivot
    * dimensions: board_name, description
    * groups:
      * board_name: distinct values
      * description: all vocabulary words, non-partitioning (membership in more than one group is allowed)
    * membership: contains (if the description contains the vocabulary word, it is a member of the group)
    * reduction: count/sum
* dimensionality reduction
* direct usage
create new sparse vectors from source
similarity on sparse vectors with existing set
text analysis
parsing
tokenization
lemmatization/stemming
document vector per board (bag-of-words (sparse?))
* user game matrix creation task (Crunch Magic)
  * data: user gaming data, with userid, gameid and hours played
  * sculpt: turn JSON into row data; each entry in the games array is expanded into a new row:
    * `userid, gameid, hours_played`
    * `waylon, skyrim, 72`
    * `waylon, horizon, 50`
    * `janell, stardew, 40`
  * pivot:
    * dimensions: gameid, userid
    * groups: distinct values (sparse)
    * reduction: sum of hours played
  * after framing, we have a dimensionality reduction task, using Alternating Least Squares (ALS)
* inspection data visualization task (Navis)
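
As an illustration of the sculpt and pivot steps in the user game matrix task above, here is a minimal sketch. The shape of the input JSON (a `games` array per user) and the helper names `sculpt` and `pivotSum` are assumptions; only the three resulting rows come from the notes. The pivot is stored sparsely as nested Maps, so missing user/game combinations take no space.

```javascript
// Assumed input shape: one record per user with a nested games array.
const users = [
  {
    userid: "waylon",
    games: [
      { gameid: "skyrim", hours_played: 72 },
      { gameid: "horizon", hours_played: 50 },
    ],
  },
  { userid: "janell", games: [{ gameid: "stardew", hours_played: 40 }] },
];

// Sculpt: expand each entry of the games array into its own row.
function sculpt(records) {
  return records.flatMap((user) =>
    user.games.map((game) => ({
      userid: user.userid,
      gameid: game.gameid,
      hours_played: game.hours_played,
    }))
  );
}

// Pivot: sum `value` over two dimensions; groups are the distinct values.
function pivotSum(rows, rowDim, colDim, value) {
  const cells = new Map(); // rowDim value -> Map(colDim value -> sum)
  for (const row of rows) {
    if (!cells.has(row[rowDim])) cells.set(row[rowDim], new Map());
    const inner = cells.get(row[rowDim]);
    inner.set(row[colDim], (inner.get(row[colDim]) || 0) + row[value]);
  }
  return cells;
}

const gameRows = sculpt(users);
const matrix = pivotSum(gameRows, "gameid", "userid", "hours_played");
console.log(gameRows.length); // 3
console.log(matrix.get("skyrim").get("waylon")); // 72
```

The sparse matrix produced this way is the kind of input a factorization step such as ALS would consume; that step is not shown here.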
## Common Visualizations Data Structure
### Stacked bar chart
A stacked bar chart displays two-dimensionally pivoted data: one axis (typically x) is the first reduction dimension, the other axis is the variable reduced over, and each bar is split into segments by the second reduction dimension.
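
One possible data structure for this, sketched under assumptions: `stackedBarData` is a hypothetical helper and the sample rows are made up. It returns the x-axis categories plus one series per value of the second dimension, with each series aligned to the categories and zero-filled where a combination is missing.

```javascript
// Hypothetical sample rows.
const sample = [
  { country: "china", type: "oil tanker", deficiencies: 2 },
  { country: "china", type: "cargo ship", deficiencies: 1 },
  { country: "brazil", type: "cargo ship", deficiencies: 4 },
  { country: "china", type: "oil tanker", deficiencies: 3 },
];

// xDim: first reduction dimension (one bar per distinct value).
// stackDim: second reduction dimension (one segment per distinct value).
// field: the variable reduced over (summed here).
function stackedBarData(rows, xDim, stackDim, field) {
  const categories = [...new Set(rows.map((r) => r[xDim]))];
  const stacks = [...new Set(rows.map((r) => r[stackDim]))];
  const series = stacks.map((name) => ({
    name,
    values: categories.map(() => 0), // zero-filled, aligned to categories
  }));
  for (const row of rows) {
    const x = categories.indexOf(row[xDim]);
    const s = stacks.indexOf(row[stackDim]);
    series[s].values[x] += row[field];
  }
  return { categories, series };
}

const chart = stackedBarData(sample, "country", "type", "deficiencies");
console.log(chart.categories); // [ 'china', 'brazil' ]
console.log(chart.series[0]); // { name: 'oil tanker', values: [ 5, 0 ] }
```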
## Platform Layers
From top to bottom:

* notebook (webnotebook)
* data management (dataship)
* analysis (frame and webnn)
* deployment