# dga-sync README
## Sync data.gov.au datasets easily
The Australian government's data.gov.au website references a growing
collection of public, open government data resources: more
than 3700 datasets at the time of writing. In some cases data.gov.au
provides an API to access a dataset, but not always. For this and other
reasons, there are often advantages in downloading the data for local use or
re-packaging it. The dga-sync utility eases the task of synchronising
that data to a local file system.
dga-sync uses the JSON metadata stored on data.gov.au for
each dataset to ensure that data files are only downloaded if they are
newer than what has previously been downloaded. A local copy of the metadata
is also stored.
## Getting started
```
npm install dga-sync
```
### Simple usage
For each data.gov.au dataset, there is a JSON metadata file (accessed from
the JSON button on the web page) that leads to a URL of the following form:
```
http://data.gov.au/api/3/action/package_show?id=23218e8f-babe-4e37-81d1-5424a4d1c568
```
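If all you have is the metadata URL, the package ID can be pulled out of its query string. The following is a minimal sketch, assuming a Node.js version that provides the global `URL` class; `packageIdFromUrl` is a hypothetical helper and not part of dga-sync:

```javascript
// Hypothetical helper (not part of dga-sync): extract the `id` query
// parameter from a package_show metadata URL.
function packageIdFromUrl(metadataUrl) {
  var id = new URL(metadataUrl).searchParams.get('id');
  if (!id) {
    throw new Error('no id parameter in URL: ' + metadataUrl);
  }
  return id;
}

var packageId = packageIdFromUrl(
  'http://data.gov.au/api/3/action/package_show?id=23218e8f-babe-4e37-81d1-5424a4d1c568'
);
console.log(packageId);
```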
Use the `id` parameter to identify the package to sync:
```
var sync = require('dga-sync');
sync.syncByPackageId('23218e8f-babe-4e37-81d1-5424a4d1c568');
```
This is what the console output looks like (actual output is colourised where
supported):
```
fetching metadata for package ID: 23218e8f-babe-4e37-81d1-5424a4d1c568
found: "Public Barbeques"
reply lists 5 resources:
barbeque.kmz "2014 Public Barbeques" @ 2014-09-16T02:05:54.523Z
wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv "Public Barbeques CSV" @ 2014-09-16T02:05:54.523Z
wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json "Public Barbeques GeoJSON" @ 2014-09-16T02:05:54.523Z
wms?request=GetCapabilities "Public Barbeques - Preview this Dataset (WMS)" @ 2014-09-16T02:05:54.523Z
wfs?request=GetCapabilities "Public Barbeques Web Feature Service API Link" @ 2014-09-16T02:05:54.523Z
preparing to download barbeque.kmz
preparing to download wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
preparing to download wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
preparing to download wms?request=GetCapabilities
preparing to download wfs?request=GetCapabilities
downloading completed
.. moving data/._DGA_DOWNLOAD_barbeque.kmz to data/barbeque.kmz
.. moving data/._DGA_DOWNLOAD_wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv to data/wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
.. moving data/._DGA_DOWNLOAD_wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json to data/wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
.. moving data/._DGA_DOWNLOAD_wms?request=GetCapabilities to data/wms?request=GetCapabilities
.. moving data/._DGA_DOWNLOAD_wfs?request=GetCapabilities to data/wfs?request=GetCapabilities
writing download metadata to: data/._METADATA_.json
```
At this point, a directory called `data` under the current working directory
will have been created and will contain the downloaded resources plus a metadata
file created by dga-sync:
```
$ ls -lhA data
total 744K
-rw-r--r-- 1 sam sam 44K Sep 24 11:14 barbeque.kmz
-rw-r--r-- 1 sam sam 6.0K Sep 24 11:15 ._METADATA_.json
-rw-r--r-- 1 sam sam 72K Sep 24 11:15 wfs?request=GetCapabilities
-rw-r--r-- 1 sam sam 95K Sep 24 11:14 wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
-rw-r--r-- 1 sam sam 384K Sep 24 11:14 wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
-rw-r--r-- 1 sam sam 139K Sep 24 11:14 wms?request=GetCapabilities
```
The metadata file ensures that the next sync downloads only resources that
are newer than the local copies, saving bandwidth.
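The idea behind that check can be sketched as follows. This is an illustration only, not dga-sync's actual implementation: the shape of the stored metadata (a map from resource ID to timestamp) and the names `needsDownload`, `id` and `timestamp` are assumptions made for the example.

```javascript
// Illustrative only: decide which resources are newer than the local copies.
// `previous` maps a resource ID to the ISO 8601 timestamp recorded at the
// last sync (an assumed layout; the real ._METADATA_.json may differ).
function needsDownload(resources, previous) {
  return resources.filter(function (resource) {
    var known = Object.prototype.hasOwnProperty.call(previous, resource.id);
    // Download if never seen before, or if the remote timestamp is later.
    return !known ||
      Date.parse(resource.timestamp) > Date.parse(previous[resource.id]);
  });
}

var previous = {
  'barbeque.kmz': '2014-09-16T02:05:54.523Z'
};
var resources = [
  // Same timestamp as the local copy: skipped.
  { id: 'barbeque.kmz', timestamp: '2014-09-16T02:05:54.523Z' },
  // Not downloaded before: selected.
  { id: 'wms?request=GetCapabilities', timestamp: '2014-09-16T02:05:54.523Z' }
];

var stale = needsDownload(resources, previous);
console.log(stale.map(function (r) { return r.id; }));
```

Comparing parsed timestamps rather than raw strings keeps the check independent of how the dates happen to be formatted.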
### Limiting what gets downloaded
As shown above, all resources are downloaded by default. This can
be changed by adding an `idFilter` regex option. To download only the KMZ
file in our example:
```
sync.syncByPackageId(
  '23218e8f-babe-4e37-81d1-5424a4d1c568',
  {
    idFilter: /.*\.kmz$/,
    deleteUnlisted: true
  }
);
```
The use of `deleteUnlisted` is optional: it tells dga-sync to delete
previously downloaded files now excluded by the filter. The contents of
`data` are now:
```
$ ls -lhA data
total 48K
-rw-r--r-- 1 sam sam 44K Sep 24 11:14 barbeque.kmz
-rw-r--r-- 1 sam sam 1.4K Sep 24 11:26 ._METADATA_.json
```
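To see what a filter will select before running a sync, the regex can be tried against the resource names from the earlier console output. This sketch only exercises the regular expression itself; how dga-sync applies `idFilter` internally is not shown here, and the variable names are chosen for the example.

```javascript
// Resource names as they appeared in the earlier console output.
var ids = [
  'barbeque.kmz',
  'wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv',
  'wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json',
  'wms?request=GetCapabilities',
  'wfs?request=GetCapabilities'
];

var idFilter = /.*\.kmz$/;

// Names the filter accepts.
var selected = ids.filter(function (id) { return idFilter.test(id); });

// Names the filter rejects; with `deleteUnlisted: true`, previously
// downloaded files with these names are the ones expected to be removed.
var excluded = ids.filter(function (id) { return !idFilter.test(id); });

console.log('selected:', selected);
console.log('excluded:', excluded);
```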
## API
There is currently only one method:
**syncByPackageId(packageId, options, andThen)**
`packageId` - the ID of the package/dataset
`options` - an object with the following options:
- `idFieldName` - specifies the field in a resource to use as the resource
  ID [default: `'url'`]
- `idCanonicaliser` - a function that takes the resource ID (according to the
  `idFieldName` option) and creates a canonical ID for comparison
  in later sync operations [default: split the ID at each `/` and use the last part;
  this assumes that `idFieldName` is the default value of `'url'`]
- `idFilter` - applied to the (canonicalised) resource ID to choose which
  resources will be synced [default: `undefined` - that is, accept all IDs]
- `dataDestination` - the directory to store the downloaded resources in
- `deleteUnlisted` - boolean: `true` means delete extraneous files in the
  destination directory that don't correspond to a resource ID in the
  filtered list [default: `false`]
`andThen(err)` - optional callback, where `err` is any error encountered that
prevented successful completion
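The default canonicalisation described for `idCanonicaliser` (split at `/`, keep the last part) can be sketched as below, together with a hypothetical alternative that replaces `?`, `&` and `=` so that file names are easier to handle in a shell. Both functions are illustrations written for this README rather than code taken from dga-sync, and the `example.org` URLs are made up.

```javascript
// Sketch of the documented default: keep the text after the last '/'.
function lastPathPart(id) {
  var parts = id.split('/');
  return parts[parts.length - 1];
}

// Hypothetical alternative: same as above, but with query-string
// punctuation replaced by underscores so names stay unique and shell-safe.
function safeFileName(id) {
  return lastPathPart(id).replace(/[?&=]/g, '_');
}

console.log(lastPathPart('http://example.org/files/barbeque.kmz'));
console.log(safeFileName('http://example.org/geoserver/wfs?request=GetCapabilities'));

// A function like this could then be passed as an option, for example:
//   sync.syncByPackageId(packageId, { idCanonicaliser: safeFileName },
//     function (err) { if (err) { console.error(err); } });
```

Because the canonical ID is what later syncs compare against, changing the canonicaliser between runs would likely cause existing files to be treated as unknown.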