# Tricoteuses-Assemblee
## _Retrieve, clean up & handle French Assemblée nationale's open data_
_Tricoteuses-Assemblee_ is free and open source software.
- [software repository](https://git.tricoteuses.fr/logiciels/tricoteuses-assemblee)
- [GNU Affero General Public License version 3 or greater](https://git.tricoteuses.fr/logiciels/tricoteuses-assemblee/-/tree/master/LICENSE.md)
## Documentation
- [Architecture](doc/architecture.md)
- [Browser Usage](doc/BROWSER_USAGE.md) - Using this package in browser/Vite projects
## Installation
```bash
git clone https://git.tricoteuses.fr/logiciels/tricoteuses-assemblee
cd tricoteuses-assemblee/
```
```bash
npm install
```
## Download and clean data
### Basic usage
Create a directory to store the data, then run the following command to download, reorganize and clean the data.
```bash
mkdir ../assemblee-data/
npm run data:download ../assemblee-data
```
### Available Commands
- `npm run data:download <dir>`: Download, reorganize, and clean data.
- `npm run data:retrieve_open_data <dir>`: Download raw data files.
- `npm run data:reorganize_data <dir>`: Reorganize raw files by entity.
- `npm run data:clean_data <dir>`: Clean and validate reorganized files.
- `npm run data:retrieve_deputes_photos <dir>`: Retrieve députés' pictures from Assemblée nationale's website.
- `npm run data:retrieve_senateurs_photos <dir>`: Retrieve sénateurs' pictures from Assemblée nationale's website.
- `npm run data:retrieve_documents <dir>`: Retrieve legislative documents from Assemblée nationale's website.
- `npm run data:retrieve_pending_amendements <dir>`: Retrieve pending amendments (those still waiting to be processed by Assemblée services) from Assemblée nationale's website.
_Notes_:
- Reorganized files (generated by the _data:reorganize_data_ command) are also available in [Tricoteuses / Data / Données brutes de l'Assemblée](https://git.en-root.org/tricoteuses/data/assemblee-brut). They are updated on a regular basis.
- Split & cleaned files (generated by the _data:clean_data_ command) are also available in [Tricoteuses / Data / Données nettoyées de l'Assemblée](https://git.en-root.org/tricoteuses/data/assemblee-nettoye) with the `_nettoye` suffix. They are updated on a regular basis.
### Filtering Options
Downloading and cleaning the full dataset takes a long time and a lot of disk space. To reduce the load, you can restrict a run to the data categories and legislatures that you need.
Examples:
```bash
# Only download amendments
npm run data:download ../assemblee-data -- -k Amendements
# Only process 16th and 17th legislatures
npm run data:download ../assemblee-data -- -l 16 -l 17
```
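When the download is launched from a script, the filter flags can be assembled programmatically. The sketch below is not part of the package: `buildDownloadArgs` is a hypothetical helper that only builds the argument list for `npm run data:download`. It places the flags after the `--` separator, which is how npm forwards arguments to the underlying script, and it only uses the `-k` and `-l` forms shown in the examples above.

```js
// Hypothetical helper (not part of @tricoteuses/assemblee): builds the
// argument list for `npm run data:download` from filter settings.
function buildDownloadArgs(dataDir, { category, legislatures = [] } = {}) {
  const flags = []
  if (category) flags.push("-k", category)
  for (const legislature of legislatures) flags.push("-l", String(legislature))
  // npm only forwards flags that come after "--" to the script.
  const separator = flags.length > 0 ? ["--"] : []
  return ["run", "data:download", dataDir, ...separator, ...flags]
}

const args = buildDownloadArgs("../assemblee-data", { legislatures: [16, 17] })
console.log(["npm", ...args].join(" "))
// prints: npm run data:download ../assemblee-data -- -l 16 -l 17
```

The resulting array can be handed to Node's `child_process.spawnSync("npm", args, { stdio: "inherit" })` to actually run the command.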
### Common Options
- `--categories` or `-k <name>`: Filter by dataset categories (available values: `ActeursEtOrganes`, `Agendas`, `Amendements`, `DossiersLegislatifs`, `Photos`, `Scrutins`, `Questions`, `ComptesRendus`)
- `--legislature` or `-l <number>`: Specify one or more legislatures to process (e.g., `-l 15 -l 16`)
- `--dataDir <path>` (mandatory): Path to the working directory where all data is stored
- `--silent` or `-s`: Disable logging
- `--verbose` or `-v`: Enable verbose logging
- `--fetch` or `-f`: Force re-download of data even if already present
- `--commit` or `-c`: Automatically commit cleaned data
- `--pull` or `-p`: Pull repositories before starting
- `--clone` or `-C <url>`: Clone Git repositories from a remote group or organization
- `--remote` or `-r <name>`: Push commits to specified Git remote(s)
- `--keepDir`: Keep Dir (Implement before cleaning data)
- `--only-recent <days>`: If files are already present, skip files older than the specified number of days and skip old legislatures (e.g. `--only-recent 30`)
If you use such options, use them in all subsequent commands too (_data:reorganize_data_ and _data:clean_data_).
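One way to keep the options consistent is to define them once and reuse them for every step. This is a minimal sketch, assuming the three steps are run separately in the order listed under Available Commands. `pipelineCommands` is a hypothetical helper, not something the package provides.

```js
// Hypothetical helper: repeats one shared flag list for each split step,
// so that every step runs with the same filters.
const STEPS = ["data:retrieve_open_data", "data:reorganize_data", "data:clean_data"]

function pipelineCommands(dataDir, sharedFlags = []) {
  return STEPS.map((step) => {
    const command = ["npm", "run", step, dataDir]
    // Flags must follow "--" so that npm passes them on to the script.
    if (sharedFlags.length > 0) command.push("--", ...sharedFlags)
    return command.join(" ")
  })
}

// Print the three commands, each restricted to the 17th legislature.
for (const command of pipelineCommands("../assemblee-data", ["-l", "17"])) {
  console.log(command)
}
```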
### Options for Cleaning Data
- `--dataset` or `-d <name>`: Clean a specific dataset only
- `--no-reset-after-commit`: Skip Git reset after committing (useful to preserve local changes)
- `--no-validate` or `-V`: Skip schema validation during cleaning
- `--fetchDocuments`: Retrieve documents
- `--parseDocuments`: Parse documents into cleaned JSON
- `--fetchVideos`: Retrieve videos
- `--fetchCrCommissions`: Retrieve and parse commission reports (comptes rendus)
### Options for Retrieving Documents
- `--full` or `-f`: Retrieve all documents, even those already downloaded
- `--document-type` or `-T <type>`: Restrict to specific document types (e.g., `PION`)
## Download using Docker
A Docker image that downloads and cleans the data all at once is available. Build it locally or run it from the container registry.
Use the environment variables `LEGISLATURE` and `CATEGORIES` if needed.
```bash
docker run --pull always --name tricoteuses-assemblee -v ../assemblee-data:/app/assemblee-data -e LEGISLATURE=17 -d git.tricoteuses.fr/logiciels/tricoteuses-assemblee:latest
```
## Using the data
Once the data is downloaded and cleaned, you can use loaders to retrieve it.
To use loaders in your project, you can install the _@tricoteuses/assemblee_ package, and import the iterator functions that you need.
```bash
npm install @tricoteuses/assemblee
```
```js
import {
  iterLoadAssembleeActeurs,
  iterLoadAssembleeOrganes,
  iterLoadAssembleeReunions,
  iterLoadAssembleeScrutins,
  iterLoadAssembleeDocuments,
  iterLoadAssembleeDossiersParlementaires,
  iterLoadAssembleeAmendements,
  iterLoadAssembleeQuestions,
  iterLoadAssembleeComptesRendus,
} from "@tricoteuses/assemblee/loaders"

// Pass data directory and legislature as arguments
for (const { acteur } of iterLoadAssembleeActeurs("../assemblee-data", 17)) {
  console.log(acteur.uid)
}
```
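Because the loaders are iterators, records can be consumed one at a time and gathered into whatever structure your project needs. The sketch below indexes acteurs by `uid`. To stay runnable without downloaded data, it uses `fakeActeurs`, a hypothetical stand-in generator that yields objects with the same `{ acteur }` shape as in the example above; its `uid` values are made-up placeholders. `indexActeursByUid` is likewise an illustrative helper, not part of the package.

```js
// Hypothetical stand-in for iterLoadAssembleeActeurs: yields objects with the
// same { acteur } shape. The uid values are made-up placeholders.
function* fakeActeurs() {
  yield { acteur: { uid: "PA0001" } }
  yield { acteur: { uid: "PA0002" } }
}

// Builds a Map from uid to acteur by consuming any iterable of { acteur }.
function indexActeursByUid(iterable) {
  const acteurByUid = new Map()
  for (const { acteur } of iterable) {
    acteurByUid.set(acteur.uid, acteur)
  }
  return acteurByUid
}

const acteurByUid = indexActeursByUid(fakeActeurs())
console.log(acteurByUid.size) // prints 2
```

In a real project, `iterLoadAssembleeActeurs("../assemblee-data", 17)` would take the place of `fakeActeurs()`.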