Data bridging software to modularize, automate and schedule the transfer of data between different sources and destinations.

# DataBridge

A framework for automated and programmatic data transfer. Separate source and destination modules allow for a high degree of customization and deployment-specific data handling.

## Installation

See [INSTALL](https://github.com/psalmody/databridge/blob/master/INSTALL.md).

## Usage

#### <a name="clusage"></a> Command-Line

Run bridge or batch. In the project directory at the command line:

```shell
node app --help
npm start -- --help
```

Re-setup:

```
node setup
```

Clean up any extraneous output files (make appropriate backups first). Takes a `-d` flag for the number of previous days' files to keep.

```
node clean
node clean -d 3
node clean -d 0
```

Manage bind variables. Follow prompts:

```
node bind
```

#### Node Module

```shell
npm install databridge --save
```

```js
var databridge = require('databridge');
```

The databridge module exposes a number of functions/objects. See the documentation at the beginning of each module for specific function reference.

```javascript
{
  setupConfig: [Function], // bin/config-setup
  config: { // config.json (if exists)
    dirs: {
      batches: '{...}/batches/',
      creds: '{...}/creds/',
      destinations: '{...}/destinations/',
      input: '{...}/input/',
      logs: '{...}/logs/',
      output: '{...}/output/',
      sources: '{...}/sources/'
    },
    logto: 'console',
    defaultBindVars: {},
    schedule: '{...}/schedule.json'
  },
  bridge: [Function], // bin/bridge
  batchParse: [Function], // bin/batch-parse
  bridgeRunner: [Function], // bin/bridge-runner
  list: {
    src: [Function], // bin/list-src
    dest: [Function], // bin/list-dest
    tables: [Function], // bin/list-tables
    batches: [Function] // bin/list-batches
  }
}
```

## Data Types

Data type detection is handled by `typeof()` and a combination of source column strings.

1. `GPA` anywhere or `_DEC` at the end of the column name will be parsed as `DECIMAL(8,2)`. (`_DEC` will be removed from the destination column name.)
2. `DATE` or `TIMESTAMP` in the column name is parsed as `DATE`.
3. For all other types, the first row of data will be used.
   1. If `typeof(value) == 'number'`, parsed as `INT`.
   2. Else parsed as `VARCHAR(255)`.

This behavior is run by `bin/col-parser` and should be customized for particular situations, especially those involving large amounts of data.

## Indexes

For SQL-based destinations, indexes are created for column names with the trailing string `_IND`. This trailing string is removed from the destination column name.

## Running as a Service / PM2

Databridge can run as a service driven by the `config.schedule` file. This requires [pm2](http://pm2.keymetrics.io/) installed globally: `npm install -g pm2`.

```shell
# start pm2 service
npm run service-start
# restart pm2
npm run service-restart
# stop pm2
npm run service-stop
```

It is possible to [run pm2 at startup](http://pm2.keymetrics.io/docs/usage/startup/).

#### Schedule / Batch Configuration

> NOTE: When using the service, it cannot prompt for bind variables if
> needed. Therefore, any sources that require bind variables will throw
> an error if bind variables are not defined. Each job object in that case
> MUST have a `binds: true` or defined `binds: {...}` attribute.

The truncate option empties the table without completely dropping and recreating it (SQL databases only). Use `"truncate": true` or `-n`.

The update option appends values to the table rather than completely deleting and recreating it (databases only). Use `"update": true` or `-u`.

Run a custom script with `"type": "script"` and the name of the file (no extension) inside `local/input/` under `"name": "script"`.

Each schedule object requires a `cron` attribute in the following format:
```
*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    │
│    │    │    │    │    └ day of week (0 - 7) (0 or 7 is Sun)
│    │    │    │    └───── month (1 - 12)
│    │    │    └────────── day of month (1 - 31)
│    │    └─────────────── hour (0 - 23)
│    └──────────────────── minute (0 - 59)
└───────────────────────── second (0 - 59, optional)
```

Example `schedule.json` file:

```json
[{
  "name": "test",
  "type": "batch",
  "cron": "*/30 * * * *"
}, {
  "type": "bridge",
  "name": "oracle employees.ferpa_certified => mssql",
  "cron": "*/10 * * * *",
  "binds": true,
  "source": "oracle",
  "destination": "mssql",
  "table": "employees.ferpa_certified",
  "truncate": true
}, {
  "type": "bridge",
  "name": "mssql surveys.population_open => csv",
  "cron": "*/20 * * * *",
  "binds": true,
  "source": "mssql",
  "table": "surveys.population_open",
  "destination": "csv"
}, {
  "type": "bridge",
  "name": "xlsx employees.ferpa_certified => mssql",
  "cron": "*/25 * * * *",
  "binds": true,
  "source": "xlsx",
  "destination": "mssql",
  "table": "employees.ferpa_certified",
  "update": true
}, {
  "type": "script",
  "name": "name of script inside input directory, no file extension",
  "cron": "0 1 * * *"
}]
```

## Testing

Install Mocha globally:

```shell
npm install -g mocha
```

All tests:

```shell
npm test
```

All destinations or sources:

```shell
mocha spec/destinations
mocha spec/sources
```

Just one destination/source:

```shell
mocha spec/destinations --one=mssql
mocha spec/sources --one=mysql
```

## Customizing sources / destinations

#### Note about `require()`

If you plan on using a separate input/output/source dir (for example, if during `npm install` you told databridge to create the "local" folder outside the main databridge folder), you'll need to either:

- Install necessary packages (like database connection packages) inside that local directory.
- Use something like `require.main.require` to include the database source or scripts from the main databridge directory.

Databridge requires modules from the source directory, so `require()` calls inside those modules resolve from a different path than the main databridge directory. See this [great GitHub Gist by branneman](https://gist.github.com/branneman/8048520) for more information and options.

#### Multiple similar sources (oracle => oracle)

One way to transfer between two separate databases of the same type is to create a source or destination that wraps the existing one under a different name (and therefore uses a different credentials/connection file). For example, a custom oracle source added as `local/sources/newsource.js` would make databridge look for `creds/newsource.js`. Here's how to do that:

```js
// include oracle source from bin/src/
module.exports = (opt, moduleCallback) => {
  const oracle = require.main.require('./bin/src/oracle')
  // pass the error, row count and columns straight through to the bridge
  oracle(opt, (e, r, c) => {
    moduleCallback(e, r, c)
  })
}
```

#### Source modules

Source modules are passed the config/opt object from the bridge.

Source modules MUST return (through the callback, in this order):

1. `null` or error.
2. Number of rows pulled from source.
3. Array of column names from source.

#### Destination modules

Destination modules are passed the config/opt object from the bridge and the column definitions generated by `bin/col-parser`.

Destination modules MUST return:

1. `null` or error.
2. Number of rows written to destination.
3. Array of column names at destination.
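
The three-value contract above can be sketched as a tiny in-memory destination. This is only an illustration under several assumptions that are not confirmed by databridge itself: the argument order `(opt, columns, moduleCallback)`, the shape of a column definition (`{ name, type }`), and the `opt.rows` field are all hypothetical stand-ins. Check the documentation at the top of an existing module in `bin/` for the real shapes before copying this.

```js
// Hypothetical destination module, for illustration only.
// ASSUMPTION: destinations are called as (opt, columns, moduleCallback);
// the real argument order in databridge may differ.
const memoryDestination = (opt, columns, moduleCallback) => {
  try {
    // ASSUMPTION: `opt.rows` is a made-up field standing in for wherever
    // the bridge actually stages the data to be written.
    const rows = Array.isArray(opt.rows) ? opt.rows : []
    // ASSUMPTION: each column definition looks like { name: 'ID', type: 'INT' }.
    const columnNames = columns.map((col) => col.name)
    // "Write" each row by picking out the known columns in order.
    const written = rows.map((row) => columnNames.map((name) => row[name]))
    // Contract: (null or error, rows written, column names at destination).
    moduleCallback(null, written.length, columnNames)
  } catch (e) {
    // On failure, report the error as the first value.
    moduleCallback(e)
  }
}

module.exports = memoryDestination
```

Catching errors and reporting them through the first callback argument, instead of throwing, keeps the module in line with the `null` or error convention that source modules also follow.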