# s3-csv-to-json
Converts AWS S3 files from CSV to JSON lines via stream with support to gzip for both input and output. Ready to be used as a Node.js module, as a Lambda or via CLI.
Currently it downloads the input file from S3 and uploads the result back to S3. Support for pointing either end at a local file is planned.
## Goals
* Work with files stored on AWS S3
* Handle large files (stream: Input from S3 -> Convert -> Upload to S3)
* Support to **gzip** for both input and output
* AWS Lambda ready
## Installation
### Node.js Module
```bash
npm install s3-csv-to-json
```
### Lambda
Clone the repo and install its dependencies without the optional ones, since `aws-sdk` is already available in the Lambda runtime.
```bash
npm install --no-optional
```
After installing all dependencies, package it as a zip (e.g. `zip -r s3-csv-to-json.zip s3-csv-to-json/*`).
Lambda config:
* **Code entry type:** Upload the zip package.
* **Runtime:** Node.js 8.10+
* **Handler:** src/aws-lambda.handler
## Usage
Whether used as a module, a Lambda, or via the CLI, it expects an **input** and an **output** path on AWS S3.
```json
{
"input": "https://s3.amazonaws.com/bucket-name/path/to/input.csv.gz",
"output": "s3://bucket-name/path/to/output.jsonl.gz"
}
```
As shown above, the **s3://** protocol is also accepted for both fields (e.g. **s3**://bucket-name/path/to/input.csv).
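Normalizing both accepted URL styles into a bucket and key could be sketched like this (the helper name `parseS3Url` is hypothetical, not part of the module's API):

```javascript
// Hypothetical sketch: extract bucket and key from either accepted URL style.
function parseS3Url(url) {
  const https = url.match(/^https:\/\/s3\.amazonaws\.com\/([^/]+)\/(.+)$/);
  if (https) return { bucket: https[1], key: https[2] };
  const s3 = url.match(/^s3:\/\/([^/]+)\/(.+)$/);
  if (s3) return { bucket: s3[1], key: s3[2] };
  throw new Error(`Unsupported S3 URL: ${url}`);
}
```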
Gzip is supported for both input and output. The **.gz** extension alone determines whether the input must be decompressed and whether the output should be compressed. Without a *.gz* extension, the file is handled as plain text (*UTF-8*).
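The extension check described above amounts to something like the following (the helper name `shouldGzip` is hypothetical):

```javascript
// Hypothetical sketch: decide compression handling from the file extension alone.
function shouldGzip(path) {
  return path.toLowerCase().endsWith('.gz');
}
```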
### Node.js Module
```javascript
const s3CsvToJson = require('s3-csv-to-json');
// Assuming that the following is within an async function
const response = await s3CsvToJson({
input: "https://s3.amazonaws.com/bucket-name/path/to/input.csv.gz",
output: "s3://bucket-name/path/to/output.jsonl.gz"
});
```
### Lambda
Invoke the Lambda with an event payload like the one shown above:
```json
{
"input": "https://s3.amazonaws.com/bucket-name/path/to/input.csv.gz",
"output": "s3://bucket-name/path/to/output.jsonl.gz"
}
```
### CLI
Clone the repo and run it with [MagiCLI](https://github.com/DiegoZoracKy/magicli) via `npx`:
```bash
npx magicli ./path-to-s3-csv-to-json-module --input="s3://bucket-name/path/to/input.csv.gz" --output="s3://bucket-name/path/to/output.jsonl.gz"
```
## Example
**CSV Input:**
```
ID,NAME
1,John Doe
2,Jane Doe
```
**JSON lines output:**
```
{"ID":"1","NAME":"John Doe"}
{"ID":"2","NAME":"Jane Doe"}
```