Node.js ETL (Extract, Transform, Load) toolkit for easy data import, export or transfer between systems.

# Datapumps: Simple ETL for node.js

## Overview

Create a group of pumps to import, export, transform or transfer data.

## Usage example: csv export from mysql

* Create a group:

  ```js
  var datapumps = require('datapumps');
  var exporter = datapumps.group();
  ```

* Create a pump that loads the data from mysql:

  ```js
  // mysqlConnection is assumed to be an already opened mysql connection
  exporter.addPump('customers')
    .from(mysqlConnection.query('SELECT id,last_name,first_name FROM customer').stream({highWaterMark: 5}));
  ```

  This pump reads the query results into a buffer. The pump controls the data flow, i.e. it pauses reading the query results while the buffer is full.

* Create a pump that writes the data to csv:

  ```js
  exporter.addPump('csvWriter')
    .from(exporter.pump('customers').buffer())
    .mixin(datapumps.mixin.CsvWriterMixin({
      path: 'test.csv',
      headers: [ 'Id', 'First Name', 'Last Name' ]
    }))
    .process(function(customer) {
      this.writeRow([ customer.id, customer.first_name, customer.last_name ]);
    });
  ```

  `.from` indicates that the pump will load data from the buffer of the *customers* pump. The `CsvWriterMixin` extends the functionality of the pump: it creates a csv file with the given headers and adds the `.writeRow` method to the pump. Finally, the `.process` method (which copies data to the output buffer by default) is overridden to write rows to the csv file.

* Register a listener for the `end` event and `.start()` the group:

  ```js
  exporter
    .on('end', function() {
      console.log('CSV export complete.');
    })
    .start();
  ```

  The group emits the *end* event when all of its pumps have completed their jobs. It is also possible to get a promise for that (`.whenFinished()`).

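The flow control described for the *customers* pump (reading pauses while the buffer is full) can be illustrated without datapumps at all. The following is a minimal, synchronous sketch of the idea in plain JavaScript. `BoundedBuffer` and `pumpAll` are hypothetical names made up for this illustration; they are not the datapumps `Buffer` class or any part of its API.

```js
// Hypothetical bounded buffer (NOT the datapumps Buffer class).
class BoundedBuffer {
  constructor(size) {
    if (!(size >= 1)) throw new RangeError('size must be at least 1');
    this.size = size;
    this.items = [];
  }
  isFull() { return this.items.length >= this.size; }
  isEmpty() { return this.items.length === 0; }
  write(item) {
    if (this.isFull()) throw new Error('buffer is full');
    this.items.push(item);
  }
  read() { return this.items.shift(); }
}

// Moves every row from `rows` to `sink` through `buffer`.
// Returns how many times the source had to pause because the buffer was full.
function pumpAll(rows, buffer, sink) {
  let next = 0;
  let pauses = 0;
  while (next < rows.length || !buffer.isEmpty()) {
    // Producer side: read from the source only while the buffer has room.
    while (next < rows.length && !buffer.isFull()) buffer.write(rows[next++]);
    // Rows are left but the buffer is full: the source is paused.
    if (next < rows.length) pauses++;
    // Consumer side: take one item, which frees room for the producer.
    sink(buffer.read());
  }
  return pauses;
}

const delivered = [];
const pauses = pumpAll([1, 2, 3, 4, 5], new BoundedBuffer(2), (row) => delivered.push(row));
console.log(delivered, pauses);
```

The buffer never holds more than two rows, yet every row arrives in order. A real pump does the same thing asynchronously, which is why a small `highWaterMark` on the source stream keeps memory use bounded.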
## Pump

A pump reads data from its input and copies it to the output buffer by default:

```js
(pump = new Pump())
  .from(<put a nodejs stream or datapumps buffer here>)
  .start()
```

To access the output buffer, use the `.buffer()` method, which returns a Buffer instance:

```js
buffer = pump.buffer('output');
buffer = pump.buffer(); // equivalent to the previous line, as the default buffer of the pump is called 'output'
```

Use the `.buffers()` method when you need multiple buffers:

```js
ticketsPump
  .buffers({
    openTickets: ticketsPump.createBuffer(),
    closedTickets: ticketsPump.createBuffer(),
  });

reminderMailer = new Pump()
reminderMailer
  .from(ticketsPump.buffer('openTickets'))
  ...
```

Note that the *tickets* pump has two output buffers: *openTickets* and *closedTickets*. The *reminderMailer* pump reads data from the *openTickets* buffer of the *tickets* pump.

### Transforming data

Use the `.process()` method to set the function that processes each data item:

```js
ticketsPump
  .process(function(ticket) {
    ticket.title = 'URGENT: ' + ticket.title;
    return this.buffer('openTickets').writeAsync(ticket);
  });
```

The argument of `.process()` is a function that is executed after the pump reads a data item. The function is executed in the context of the pump object, i.e. `this` refers to the pump itself. The function should return a Promise that fulfills when the data item is processed (i.e. written into a buffer or stored elsewhere).

### Start and end of pumping

A pump is started by calling the `.start()` method. The `end` event is emitted when the input stream or buffer has ended and all output buffers have become empty.

```js
pump.on('end', function() {
  console.log('Pumped everything, and all my output buffers are empty. Bye.')
})
```

## Pump group

You often need multiple pumps to complete an ETL task.
Pump groups help you start multiple pumps in one step and also let you handle the event when every pump has ended:

```js
sendMails = datapumps.group();
sendMails.addPump('tickets')
  ...;
sendMails.addPump('reminderMailer')
  ...;
sendMails
  .start()
  .whenFinished().then(function() {
    console.log('Tickets processed.');
  });
```

The `.addPump()` method creates a new pump with the given name and returns it for configuration. `.start()` starts all pumps in the group, while `.whenFinished()` returns a Promise that fulfills when every pump has ended (note: the `end` event is also emitted).

### Encapsulation

Sometimes you wish to encapsulate a part of an ETL process and also use it elsewhere. It is possible to set an input pump and expose buffers from the group, so that the group provides the same interface as a simple pump (i.e. it has `.from()`, `.start()`, `.buffer()` methods and emits the `end` event).

Most likely, you want to extend the `datapumps.Group` class (the example is written in CoffeeScript):

```coffee
{ Group, mixin: { MysqlMixin } } = require 'datapumps'

class Notifier extends Group
  constructor: ->
    super()
    @addPump 'emailLookup'
      .mixin(MysqlMixin(connection))
      .process (data) ->
        @query('SELECT email FROM user WHERE username = ?', [ data.username ])
          .then (result) =>
            data.emailAddress = result.email
            @buffer().writeAsync data
    @addPump 'sendMail'
      .from @pump 'emailLookup'
      .process (data) ->
        ... # send email to data.emailAddress
        @buffer().writeAsync
          recipient:
            name: data.name
            email: data.emailAddress

    @setInputPump 'emailLookup'
    @expose 'output', 'sendMail/output'
```

The `Notifier` behaves like a pump, but internally it looks up email addresses using mysql and sends mail to those addresses. The output buffer of the `sendMail` pump is filled with recipient data.

Use the created class like this:

```coffee
etlProcess = datapumps.group()
etlProcess
  .addPump 'notifier', new Notifier
    .from <node stream or datapumps buffer>
etlProcess
  .addPump 'logger'
    .from etlProcess.pump('notifier').buffer()
    .process (data) ->
      # sendMail writes objects of the form { recipient: { name, email } }
      console.log "Email sent to #{data.recipient.name} (#{data.recipient.email})"
```

Please note that you cannot use the `.process` method on a group.

## Mixins

The core components of datapumps are only responsible for passing data in a flow-controlled manner. The features required for import, export or transfer are provided by mixins:

* CsvWriterMixin - Writes csv files using the fast-csv package
* ExcelWriterMixin - Writes excel xlsx workbooks
* MysqlMixin - Reads and writes on a mysql connection

For more details, see the documented source code until the docco docs become available.

If you implement new mixins, please fork datapumps and make a pull request.
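
For readers who want to write a mixin, the general pattern behind calls such as `.mixin(CsvWriterMixin({ ... }))` can be sketched in plain JavaScript. This is only an illustration under the assumption that a mixin is a function which receives the target object and adds methods to it; `FakePump`, `RowCollectorMixin`, `collectRow` and `feed` are hypothetical names and none of this is datapumps source code. Check the documented source for the contract the library actually expects.

```js
// Hypothetical stand-in for a pump: it can apply mixins and run a
// process callback with `this` bound to itself.
class FakePump {
  mixin(applyMixin) {
    applyMixin(this); // the mixin decorates this object
    return this;      // returning `this` keeps the calls chainable
  }
  process(fn) {
    this._process = fn;
    return this;
  }
  feed(item) {
    return this._process.call(this, item);
  }
}

// Hypothetical mixin factory: takes options and returns a function
// that adds state and a method to the target object.
function RowCollectorMixin(options) {
  return function (target) {
    target.rows = [options.headers];
    target.collectRow = function (row) {
      if (row.length !== options.headers.length) {
        throw new Error('row length does not match headers');
      }
      this.rows.push(row);
    };
  };
}

const demoPump = new FakePump()
  .mixin(RowCollectorMixin({ headers: ['Id', 'First Name', 'Last Name'] }))
  .process(function (customer) {
    this.collectRow([customer.id, customer.first_name, customer.last_name]);
  });

// Sample record made up for the demonstration.
demoPump.feed({ id: 1, first_name: 'Ada', last_name: 'Lovelace' });
console.log(demoPump.rows);
```

The factory-returning-a-function shape keeps configuration (here `headers`) separate from the object being extended, so the same mixin can be applied to several objects with different options.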