transcript-parser
Version:
Parses plaintext speech/debate/radio transcripts into JavaScript objects.
197 lines (136 loc) • 5.82 kB
Markdown
transcript-parser
=================
[](https://travis-ci.org/willshiao/transcript-parser)
[](https://coveralls.io/github/willshiao/transcript-parser?branch=master)
[](https://www.npmjs.com/package/transcript-parser)
- [Description](#description)
- [Usage](#usage)
- [Config](#config)
- [Documentation](#documentation)
* [\.parseStream()](#parsestream)
* [\.parseOneSync()](#parseonesync)
* [\.parseOne()](#parseone)
* [\.resolveAliasesSync()](#resolvealiasessync)
* [\.resolveAliases()](#resolvealiases)
- [Example](#example)
## Description
Parses plaintext speech/debate/radio transcripts into JavaScript objects. It is still in early development. Pull requests are welcome.
Tests can be run with `npm test` and a benchmark can be run with `npm run benchmark`. For a full coverage report using [Istanbul](https://github.com/gotwarlost/istanbul), run `npm run travis-test`.
Tested for Node.js >= v4.4.6
## Usage
`npm install transcript-parser`
```node
'use strict';
const fs = require('fs');
const TranscriptParser = require('transcript-parser');
const tp = new TranscriptParser();
//Synchronous example
const parsed = tp.parseOneSync(fs.readFileSync('transcript.txt', 'utf8'));
console.log(parsed);
//Asynchronous example
fs.readFile('transcript.txt', (err, data) => {
if(err) return console.error('Error:', err);
tp.parseOne(data, (err, parsed) => {
if(err) return console.error('Error:', err);
console.log(parsed);
}));
});
//Stream example
const stream = fs.createReadStream('transcript.txt', 'utf8');
tp.parseStream(stream, (err, parsed) => {
if(err) return console.error('Error:', err);
console.log(parsed);
});
```
## Config
The constructor for `TranscriptParser` accepts a settings object.
- `removeActions`
+ default: `true`
+ Specifies if the parser should remove actions (e.g. `(APPLAUSE)`).
- `removeAnnotations`
+ default: `true`
+ Specifies if the parser should remove annotations (surrounded by `[]`).
- `removeTimestamps`
+ default: `true`
+ **True if `removeAnnotations` is true**
+ Specifies if the parser should remove timestamps (in the `[##:##:##]` format).
- `removeUnknownSpeakers`
+ default: `false`
+ Specifies if the parser should remove lines that have no associated speaker.
+ If true, lines that have no associated speaker will be stored under the key `none`.
- `blacklist`
+ default: `[]`
+ A list of speakers (as strings) that the parser should ignore.
- `aliases`
+ default: `{}`
+ A object with the real name as the key and an `Array` of the aliases' regular expressions as the value.
+ Example: `{ "Mr. Robot": [ /[A-Z\ ]*SLATER[A-Z\ ]*/ ] }`
* Renames all speakers who match the regex to "Mr. Robot".
Settings can be changed after object creation by changing the corresponding properties of `tp.settings`, where `tp` is an instance of `TranscriptParser`.
## Documentation
### .parseStream()
The `parseStream()` method parses a [`Stream`](https://nodejs.org/api/stream.html) and returns an object representing it.
This is the preferred method for parsing streams asynchronously as it doesn't have to load the entire transcript into memory (unlike `parseOne()`).
#### Syntax
`tp.parseOneSync(stream, callback)`
##### Parameters
- `stream`
+ The `Readable` stream to read.
- `callback(err, data)`
+ A callback to be executed on function completion or error.
### .parseOneSync()
The `parseOneSync()` method parses a string and returns an object representing it.
#### Syntax
`tp.parseOneSync(transcript)`
##### Parameters
- `transcript`
+ The transcript, as a `string`.
### .parseOne()
The `parseOne()` method parses a string and returns an object representing it.
#### Syntax
`tp.parseOne(transcript, callback)`
##### Parameters
- `transcript`
+ The transcript, as a `string`.
- `callback(err, data)`
+ A callback to be exectuted on function completion or error.
### .resolveAliasesSync()
The `resolveAliasesSync()` method resolves all aliases specified in the configuration passed to the `TranscriptParser`'s constructor (see above).
Renames the names in the `order` list to match the new names in the transcript. Note that there is a signifigant performance penalty, so don't use this method unless you need it.
#### Syntax
`tp.resolveAliasesSync(data)`
##### Parameters
- `data`
+ The transcript object after being parsed.
### .resolveAliases()
The `resolveAliases()` method resolves all aliases specified in the configuration passed to the `TranscriptParser`'s constructor (see above).
Renames the names in the `order` list to match the new names in the transcript. Note that there is a signifigant performance penalty, so don't use this method unless you need it.
#### Syntax
`tp.resolveAliases(data, callback)`
##### Parameters
- `data`
+ The transcript object after being parsed.
- `callback(err, resolved)`
+ A callback to be executed on function completion or error.
## Example
### Input
```
A: I like Node.js.
A: I also like C#.
B: I like Node.js too!
A: I especially like the Node Package Manager.
```
### Output
```node
{
speaker: {
A: [
'I like Node.js.',
'I also like C#.',
'I especially like the Node Package Manager.'
],
B: ['I like Node.js too!']
},
order: ['A', 'A', 'B', 'A']
}
```