jl
Version:
Command-line JSON manipulation library
283 lines (211 loc) • 8.73 kB
Markdown
node-jl
=====
([Russian translation](README-RU.md))
Utility kit for working with JSON files/streams divided by `\n` symbol.
JSON-representation of each record shouldn't contain line breaking symbol.
Install
-----
```
npm install -g jl
```
Examples
-----
We have a log file `/tmp/test.json`, that contains next records
```json
{"ts": 1416595508, "type": "click"}
{"ts": 1416478467, "type": "buy", "price": 10}
{"ts": 1416466930, "type": "click"}
{"ts": 1416622653, "type": "buy", "price": 20}
{"ts": 1416699396, "type": "click"}
{"ts": 1416624334, "type": "click"}
{"ts": 1416518859, "type": "click"}
{"ts": 1416569870, "type": "click"}
{"ts": 1416573325, "type": "click"}
{"ts": 1416682270, "type": "click"}
```
### SQL-interface
`jl-sql` utility supports query language very similar to SQL
```sh
cat /tmp/test.json \
| jl-sql 'SELECT type, COUNT(*) AS c, SUM(price) AS sum GROUP BY type ORDER BY c NUMERIC DESC'
```
```json
{"type":"click","c":8,"sum":0}
{"type":"buy","c":2,"sum":30}
```
As you can see we receive that we expect.
Please note non-standard keyword `NUMERIC` after `ORDER BY`:
without this keyword, sorting will be by string representation, not by numerical value.
#### Constants
- `NULL`
- `TRUE`
- `FALSE`
#### Operators
##### Arithmetic
- `+`
- `-`
- `*`
- `/`
- `%`
##### Comparison
- `=`, `==`
- `!=`
- `===` - Comparison with type check
- `!==` - Negative comparison with type check
- `>`
- `<`
- `>=`
- `<=`
##### Logical
- `AND`, `&&`
- `OR`, `||`
- `!`
#### Functions
##### Functions for working with data
- `FROM_UNIXTIME(unixTimestamp)` - converts unix timestamp to Date
- `UNIX_TIMESTAMP(date)` - converts Date to unix timestamp, date should be string in `'YYYY-MM-DD hh:mm:ss'` format
- `DATE(date)` - returns date in `'YYYY-MM-DD'` format
##### Misc
- `IF(expression, ifTrue, ifFalse)`
- `COALESCE(expression1[, expression2[, ...]])`
##### Aggregating
Right now, only very basic aggregating functions supported:
- `SUM(expression)` - counts summary of field's values. Non-numerical values will be ignored
- `MIN(expression)` - returns minimal field's value. Non-numerical values will be ignored
- `MAX(expression)` - returns maximal field's value. Non-numerical values will be ignored
- `COUNT(expression)` - counts amount of elements for which comparison with `expression` will be !== null or !== undefined
- `COUNT(*)` - counts amount of elements
- `HLL_COUNT_DISTINCT(arg1[, args2[, ...]])` - counts amount of unique argument combinations. Uses `O(1)` of memory and `O(N)` of CPU, but returns result with 0.1% error. For more information read [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm.
#### Restrictions
- `JOIN` is not supported(and probably never will be)
- Sorting by multiple fields is not supported
- `LIMIT` is not supported, but could be replaced with `head` and `tail`
### Low-level interface
SQL-interface only wrapper for utility kit described down below.
By combining those utilities you can do much more than with SQL.
#### Filtering and aggregation
Let's filter only events with type `buy`
```sh
cat /tmp/test.json \
| jl-filter 'type == "buy"'
```
```json
{"ts": 1416478467, "type": "buy", "price": 10}
{"ts": 1416622653, "type": "buy", "price": 20}
```
Add sum calculation for all `price` fields
```sh
cat /tmp/test.json \
| jl-filter 'type == "buy"'
| jl-sum price
```
```json
{"value":30}
```
To get just `30` you can add `jl-extract value` at the end - this command extract value of the field.
#### Sorting
Let sort log file by `ts` field
```sh
cat /tmp/test.json \
| jl-sort ts
```
```json
{"ts": 1416466930, "type": "click"}
{"ts": 1416478467, "type": "buy", "price": 10}
{"ts": 1416518859, "type": "click"}
{"ts": 1416569870, "type": "click"}
{"ts": 1416573325, "type": "click"}
{"ts": 1416595508, "type": "click"}
{"ts": 1416622653, "type": "buy", "price": 20}
{"ts": 1416624334, "type": "click"}
{"ts": 1416682270, "type": "click"}
{"ts": 1416699396, "type": "click"}
```
The standard arguments of `sort` utility is also supported: `-r`, `-n`, `-u`, `-m`, `-s`, `-T`, `-S`
#### Modification
Let's add to each object field `date`, that contains event date in UTC format
```sh
cat /tmp/test.json \
| jl-transform '{r.date = (new Date(r.ts * 1000)).toUTCString()}'
```
```json
{"ts":1416595508,"type":"click","date":"Fri, 21 Nov 2014 18:45:08 GMT"}
{"ts":1416478467,"type":"buy","price":10,"date":"Thu, 20 Nov 2014 10:14:27 GMT"}
{"ts":1416466930,"type":"click","date":"Thu, 20 Nov 2014 07:02:10 GMT"}
{"ts":1416622653,"type":"buy","price":20,"date":"Sat, 22 Nov 2014 02:17:33 GMT"}
{"ts":1416699396,"type":"click","date":"Sat, 22 Nov 2014 23:36:36 GMT"}
{"ts":1416624334,"type":"click","date":"Sat, 22 Nov 2014 02:45:34 GMT"}
{"ts":1416518859,"type":"click","date":"Thu, 20 Nov 2014 21:27:39 GMT"}
{"ts":1416569870,"type":"click","date":"Fri, 21 Nov 2014 11:37:50 GMT"}
{"ts":1416573325,"type":"click","date":"Fri, 21 Nov 2014 12:35:25 GMT"}
{"ts":1416682270,"type":"click","date":"Sat, 22 Nov 2014 18:51:10 GMT"}
```
Internal pipe
-----
The system pipes are inefficient because of serialization/deserialization of JSON each time when working with multiple `jl-`utilities because each utility receives and returns JSON using stdin/stdout.
To solve this problem `jl-`utilities supports internal piping which allows to multiple utilities interact with each other in the context of a single process without spending resources for parsing. To use internal pipes replace `|` to `\|` between `jl-`commands in the command line. For example:
```sh
cat /tmp/test.json \
| jl-filter 'type == "buy"'
| jl-sum price
```
you can change to:
```sh
cat /tmp/test.json \
| jl-filter 'type == "buy"'
\| jl-sum price
```
Performance
-----
CPU consumption of all utilities except for `jl-sort` is `O(n)`, by memory is `O(1)`.
The performance of `jl-sort` completely depends on system's `sort` realization, additional CPU consumptions also `O(n)`, and by memory `O(1)`.
By using internal pipes significantly increases performance because JSON parsing is very expensive operation.
But in another hand, all operations in pipes is computing in a single thread, so with very long pipes you should keep the balance.
Utilities
-----
#### `jl-sql` - SQL
Requires single argument - SQL and has few options:
- `-T DIR` - temporary directory which will be used for sorting if it didn't use RAM. `$TMPDIR` or `/tmp` by default. Pay attention when sorting big amount of data, by having `/tmp` on the RAM disk.
- `-S BUF` - size of sorting buffer in bytes. On buffer overflow, filesystem will be used(see `-T` option)
#### `jl-sort` - sorting
Wrapper for GNU Sort, allowing to sort JSON. Uses all advantages of `sort`, also, supports sorting in filesystem, merge-sort, stable sorting, limit buffer size.
#### `jl-filter` - filtering
Alternative to `grep`, but for JSON.
#### `jl-reduce` - aggregation and grouping by key
General utility that produces welding elements into a single value for each group.
Expects 3 JS-functions arguments:
- `-i FUNC` - accumulator initialization: gets called for each group. The function executes in a group context.
- `-u FUNC` - accumulator update: the function that will be executed for each element of a group.
The function executes in a group context and receives argument `r`, containing each element of a thread.
- `-r FUNC` - received accumulator's value: will be executed at the end of group with the passed result. The function executes in a group context.
and one required argument
- `-k KEYDEF` - key which will be used for grouping, could be a function. For correct execution thread's input should be sorted by this key in any direction.
Here is simple example of using `jl-sum` utility, which counts sum for `amount` field for each user that gets identified by `uid` field:
```sh
jl-reduce -k uid -i '{this.sum = 0}' -u '{this.sum += r.amount}' -r '{return this.sum}'
```
For thread's input containing:
```json
{"uid": 1, "amount": 10}
{"uid": 1, "amount": 11}
{"uid": 2, "amount": 12}
{"uid": 3, "amount": 13}
```
Output will be:
```json
{"key":1,"value":21}
{"key":2,"value":12}
{"key":3,"value":13}
```
If `-k` argument will be missing, convolution will go over whole thread and will return:
```json
{"value":46}
```
#### `jl-sum` - summation
Predefined `jl-reduce`, which counts sum of parameter in group
#### `jl-count` - подсчёт количества
Predefined `jl-reduce`, which counts amount of elements in group
#### `jl-transform` - modification
#### `jl-extract` - field's value extraction
#### `jl-from-csv` - CSV to JSON conversation
#### `jl-plainify` - Flattens object to a single level deep k-v