@tricoteuses/senat

# Tricoteuses-Senat

## _Retrieve, clean up & handle French Sénat's open data_

## Requirements

- Node >= 22

## Installation

```bash
git clone https://git.tricoteuses.fr/logiciels/tricoteuses-senat
cd tricoteuses-senat/
```

Create a `.env` file to set the PostgreSQL database information and other configuration variables (you can use `example.env` as a template). Then install the dependencies:

```bash
npm install
```

### Database creation (not needed if downloading with the Docker image)

#### Using Docker

```bash
docker run --name local-postgres -d -p 5432:5432 -e POSTGRES_PASSWORD=$YOUR_CUSTOM_DB_PASSWORD postgres
```

## Download data

### Basic usage

Create a folder where the data will be downloaded, then run the following command to download the data and convert it into JSON files.

```bash
mkdir ../senat-data/
npm run data:download ../senat-data
```

### Available Commands

- `npm run data:download <dir>`: Download the data and convert it to JSON
- `npm run data:retrieve_documents <dir>`: Retrieval of textes and rapports from Sénat's website
- `npm run data:retrieve_agenda <dir>`: Retrieval of agenda from Sénat's website
- `npm run data:retrieve_cr_seance <dir>`: Retrieval of comptes-rendus de séance from Sénat's data
- `npm run data:retrieve_cr_commission <dir>`: Retrieval of comptes-rendus de commissions from Sénat's website
- `npm run data:retrieve_senateurs_photos <dir>`: Retrieval of sénateurs' pictures from Sénat's website

### Filtering Options

Downloading all the data takes a long time and a lot of disk space. To reduce the load, you can restrict the download to the types of data you need.

Examples:

```bash
# Only download amendments
npm run data:download ../senat-data -- -k Ameli

# Only process data from session 2023 onwards
npm run data:download ../senat-data -- --fromSession 2023
```

### Common Options

- `--categories` or `-k <name>`: Filter by dataset categories (available options: `All`, `Ameli`, `Debats`, `DosLeg`, `Questions`, `Sens`)
- `--fromSession <year>`: Specify the session year to retrieve data from (default: 2022)
- `--dataDir <path>` (mandatory): Path to the working directory where all data is stored
- `--silent` or `-s`: Disable logging
- `--verbose` or `-v`: Enable verbose logging
- `--commit` or `-c`: Automatically commit converted data
- `--pull` or `-p`: Pull repositories before starting
- `--clone` or `-C <url>`: Clone Git repositories from a remote group or organization
- `--remote` or `-r <name>`: Push commits to specified Git remote(s)
- `--keepDir`: Keep directories when cleaning data
- `--only-recent <days>`: Retrieve only documents created within the last N days

### Options for Retrieving Documents

- `--formats <format>`: Specify document formats to retrieve (options: `xml`, `html`, `pdf`)
- `--types <type>`: Specify document types to retrieve (options: `textes`, `rapports`)
- `--parseDocuments`: Parse documents after retrieval
- `--parseAgenda`: Parse agenda after retrieval
- `--parseDebats`: Parse comptes-rendus after retrieval

#### Examples

```bash
# Retrieval of textes and rapports in specific formats
npm run data:retrieve_documents ../senat-data -- --fromSession 2022 --formats xml pdf --types textes

# Retrieval & parsing (textes in xml format only for now)
npm run data:retrieve_documents ../senat-data -- --fromSession 2022 --parseDocuments

# Retrieval & parsing of agenda
npm run data:retrieve_agenda ../senat-data -- --fromSession 2022 --parseAgenda

# Retrieval & parsing of comptes-rendus de séance
npm run data:retrieve_cr_seance ../senat-data -- --parseDebats --keepDir

# Retrieval & parsing of comptes-rendus de commissions
npm run data:retrieve_cr_commission ../senat-data -- --parseDebats --keepDir
```

## Data download using Docker

A Docker image that downloads and converts the data all at once is available. Build it locally or run it from the container registry. Use the environment variables `FROM_SESSION` and `CATEGORIES` if needed.

```bash
docker run --pull always --name tricoteuses-senat -v ../senat-data:/app/senat-data -d git.tricoteuses.fr/logiciels/tricoteuses-senat:latest
```

## Using the data

Once the data is downloaded, you can use loaders to read it. To use loaders in your project, install the _@tricoteuses/senat_ package and import the iterator functions that you need.

```bash
npm install @tricoteuses/senat
```

```js
import { iterLoadSenatQuestions } from "@tricoteuses/senat/loaders"

// Pass data directory and legislature as arguments
for (const { item: question } of iterLoadSenatQuestions("../senat-data", 17)) {
  console.log(question.id)
}
```

## Generation of raw types from SQL schema (for contributors only)

```bash
npm run data:generate_schemas ../senat-data
```

## Publishing

To publish a new version of this package to npm, bump the package version and publish.

```bash
# Increment version and create a new Git tag automatically (run only one of the three)
npm version patch # +0.0.1 → small fixes
npm version minor # +0.1.0 → new features
npm version major # +1.0.0 → breaking changes

npx tsc
npm publish
```

The Docker image is built automatically by a CI workflow when you push the tag to the remote repository.

```bash
git push --tags
```
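
The `+0.0.1`, `+0.1.0` and `+1.0.0` comments in the publishing commands are shorthand: a minor or major bump also resets the lower components to zero. The sketch below illustrates this for plain `MAJOR.MINOR.PATCH` versions. `bumpVersion` is a hypothetical helper written for this illustration only; it is not part of npm or of this package, and it does not handle prerelease tags.

```js
// Illustrative only: mimics how `npm version <level>` increments a plain
// MAJOR.MINOR.PATCH string. Prerelease tags (e.g. 1.0.0-beta.1) are not handled.
function bumpVersion(version, level) {
  const parts = version.split(".").map(Number)
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error(`Not a plain MAJOR.MINOR.PATCH version: ${version}`)
  }
  const [major, minor, patch] = parts
  switch (level) {
    case "patch":
      // Small fixes: only the last component changes
      return `${major}.${minor}.${patch + 1}`
    case "minor":
      // New features: patch is reset to 0
      return `${major}.${minor + 1}.0`
    case "major":
      // Breaking changes: minor and patch are reset to 0
      return `${major + 1}.0.0`
    default:
      throw new Error(`Unknown level: ${level}`)
  }
}

console.log(bumpVersion("1.4.2", "patch")) // 1.4.3
console.log(bumpVersion("1.4.2", "minor")) // 1.5.0
console.log(bumpVersion("1.4.2", "major")) // 2.0.0
```

In practice, `npm version` does this for you and also creates the Git tag; the helper only shows which number to expect before you run it.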