# Tricoteuses-Senat
## _Retrieve, clean up & handle French Sénat's open data_
## Requirements
- Node >= 22
## Installation
```bash
git clone https://git.tricoteuses.fr/logiciels/tricoteuses-senat
cd tricoteuses-senat/
```
Create a `.env` file with the PostgreSQL connection settings and other configuration variables (you can use `example.env` as a template). Then install the dependencies:
```bash
npm install
```
### Database creation (not needed when downloading with the Docker image)
#### Using Docker
```bash
docker run --name local-postgres -d -p 5432:5432 -e POSTGRES_PASSWORD=$YOUR_CUSTOM_DB_PASSWORD postgres
```
## Download data
### Basic usage
Create a folder where the data will be downloaded and run the following command to download the data and convert it into JSON files.
```bash
mkdir ../senat-data/
npm run data:download ../senat-data
```
### Available Commands
- `npm run data:download <dir>`: Download the data and convert it to JSON
- `npm run data:retrieve_documents <dir>`: Retrieve textes and rapports from the Sénat's website
- `npm run data:retrieve_agenda <dir>`: Retrieve the agenda from the Sénat's website
- `npm run data:retrieve_cr_seance <dir>`: Retrieve comptes-rendus de séance from the Sénat's data
- `npm run data:retrieve_cr_commission <dir>`: Retrieve comptes-rendus de commissions from the Sénat's website
- `npm run data:retrieve_senateurs_photos <dir>`: Retrieve sénateurs' pictures from the Sénat's website
### Filtering Options
Downloading all the data takes a long time and a lot of disk space. To reduce the load, you can restrict the download to the types of data you need.
Examples:
```bash
# Only download amendments
npm run data:download ../senat-data -- -k Ameli
# Only process data from session 2023 onwards
npm run data:download ../senat-data -- --fromSession 2023
```
### Common Options
- `--categories` or `-k <name>`: Filter by dataset categories (Available options: `All`, `Ameli`, `Debats`, `DosLeg`, `Questions`, `Sens`)
- `--fromSession <year>`: Specify the session year to retrieve data from (default: 2022)
- `--dataDir <path>` (mandatory): Path to the working directory where all data is stored
- `--silent` or `-s`: Disable logging
- `--verbose` or `-v`: Enable verbose logging
- `--commit` or `-c`: Automatically commit converted data
- `--pull` or `-p`: Pull repositories before starting
- `--clone` or `-C <url>`: Clone Git repositories from a remote group or organization
- `--remote` or `-r <name>`: Push commits to specified Git remote(s)
- `--keepDir`: Keep directories when cleaning data
- `--only-recent <days>`: Retrieve only documents created within the last N days
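
The filtering options above can be combined in a single invocation. If you want to script the download from Node, the sketch below shows one way to assemble the arguments. `buildDownloadArgs` is a hypothetical helper written for this illustration, not part of the package, and it only uses flags listed in this README.

```js
// Hypothetical helper: turns an options object into the argument list for
// `npm run data:download`. Only flags documented above are used.
function buildDownloadArgs(dataDir, { categories, fromSession, verbose = false } = {}) {
  const args = ["run", "data:download", dataDir]
  const flags = []
  if (categories !== undefined) flags.push("-k", categories)
  if (fromSession !== undefined) flags.push("--fromSession", String(fromSession))
  if (verbose) flags.push("--verbose")
  // npm forwards everything after "--" to the underlying script
  return flags.length > 0 ? [...args, "--", ...flags] : args
}

const args = buildDownloadArgs("../senat-data", { categories: "Ameli", fromSession: 2023 })
console.log(["npm", ...args].join(" "))
// prints: npm run data:download ../senat-data -- -k Ameli --fromSession 2023
```

The resulting array can be handed to Node's `child_process.spawnSync("npm", args, { stdio: "inherit" })` to run the command.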
### Options for Retrieving Documents
- `--formats <format>`: Specify document formats to retrieve (options: `xml`, `html`, `pdf`)
- `--types <type>`: Specify document types to retrieve (options: `textes`, `rapports`)
- `--parseDocuments`: Parse documents after retrieval
- `--parseAgenda`: Parse agenda after retrieval
- `--parseDebats`: Parse comptes-rendus after retrieval
#### Examples
```bash
# Retrieval of textes and rapports in specific formats
npm run data:retrieve_documents ../senat-data -- --fromSession 2022 --formats xml pdf --types textes
# Retrieval & parsing (textes in xml format only for now)
npm run data:retrieve_documents ../senat-data -- --fromSession 2022 --parseDocuments
# Retrieval & parsing of agenda
npm run data:retrieve_agenda ../senat-data -- --fromSession 2022 --parseAgenda
# Retrieval & parsing of comptes-rendus de séance
npm run data:retrieve_cr_seance ../senat-data -- --parseDebats --keepDir
# Retrieval & parsing of comptes-rendus de commissions
npm run data:retrieve_cr_commission ../senat-data -- --parseDebats --keepDir
```
## Data download using Docker
A Docker image that downloads and converts the data all at once is available. Build it locally or run it from the container registry.
Use the environment variables `FROM_SESSION` and `CATEGORIES` if needed.
```bash
docker run --pull always --name tricoteuses-senat -v ../senat-data:/app/senat-data -d git.tricoteuses.fr/logiciels/tricoteuses-senat:latest
```
## Using the data
Once the data is downloaded, you can use loaders to retrieve it.
To use loaders in your project, you can install the _@tricoteuses/senat_ package, and import the iterator functions that you need.
```bash
npm install @tricoteuses/senat
```
```js
import { iterLoadSenatQuestions } from "@tricoteuses/senat/loaders"
// Pass data directory and legislature as arguments
for (const { item: question } of iterLoadSenatQuestions("../senat-data", 17)) {
  console.log(question.id)
}
```
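
Because the loaders are iterators that yield `{ item }` objects, as in the example above, they can feed small aggregation helpers. The sketch below counts items per key. `countBy`, `fakeQuestions` and the `nature` field are assumptions made up for illustration, so check the real question shape before relying on any field name.

```js
// Hypothetical helper: counts loaded items per key.
// Works with any iterable that yields `{ item }` objects.
function countBy(loaded, keyOf) {
  const counts = new Map()
  for (const { item } of loaded) {
    const key = keyOf(item)
    counts.set(key, (counts.get(key) ?? 0) + 1)
  }
  return counts
}

// Stand-in data so the sketch runs without downloaded files.
// The `nature` field is invented for this example.
function* fakeQuestions() {
  yield { item: { id: "q1", nature: "QE" } }
  yield { item: { id: "q2", nature: "QE" } }
  yield { item: { id: "q3", nature: "QG" } }
}

const counts = countBy(fakeQuestions(), (question) => question.nature)
console.log(Object.fromEntries(counts))
// prints: { QE: 2, QG: 1 }
```

With real data, replace `fakeQuestions()` with a loader call such as `iterLoadSenatQuestions("../senat-data", 17)`.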
## Generation of raw types from SQL schema (for contributors only)
```bash
npm run data:generate_schemas ../senat-data
```
## Publishing
To publish a new version of this package onto npm, bump the package version and publish.
```bash
# Increment the version and create a new Git tag automatically (run only one of the three)
npm version patch # +0.0.1 → small fixes
npm version minor # +0.1.0 → new features
npm version major # +1.0.0 → breaking changes
npx tsc
npm publish
```
The Docker image is built automatically by a CI workflow when you push the tag to the remote repository.
```bash
git push --tags
```