# FileSqueeze - Huffman Compression Tool

This project implements a custom Huffman Compression Algorithm, designed to compress text-based files such as `.txt`, `.json`, `.docx`, and `.pdf`. It uses the Huffman coding technique to reduce the size of the text by encoding characters based on their frequency of occurrence in the source data.

## Features

- **Text Compression**: Compresses text-based file formats by analyzing the frequency of characters and encoding them using the Huffman coding algorithm.
- **Binary Output**: Generates compressed files in a binary format.
- **Metadata**: Saves metadata related to the compression process, including the Huffman tree structure, to allow for decompression.
- **Compression Metrics**: Reports on the compression ratio and the original and compressed file sizes.

## Supported Formats

- `.txt`
- `.json`
- `.docx`
- `.pdf` (with consideration that only the text is compressed; embedded images are not compressed)

**Note**: The algorithm is designed for text-based formats. When handling PDFs containing images, only the text portion will be compressed.

## Setup Instructions

### Prerequisites

Before running the project, ensure you have the following dependencies installed:

- Node.js (v16 or higher)
- `npm` or `yarn` for managing packages

### Installing

1. Clone this repository:

   ```bash
   git clone https://github.com/HUMBLEF0OL/file-squeeze.git
   ```

2. Navigate to the project directory:

   ```bash
   cd file-squeeze
   ```

3. Install the required dependencies:

   ```bash
   npm install
   ```

## Usage

### Command-line Tool

1. **Compress** a file:

   Use the `filesqueeze` command with the `compress` option to compress a file.

   ```bash
   filesqueeze compress <inputFile> [--output <outputDir>]
   ```

   - `<inputFile>`: The file to be compressed (e.g., `sample.txt`).
   - `[--output <outputDir>]`: The directory to store the compressed files (defaults to `./output`).

2. **Decompress** a file:

   To decompress a previously compressed file, use the `decompress` command.

   ```bash
   filesqueeze decompress <inputDir> [--output <outputDir>]
   ```

   - `<inputDir>`: The directory containing the compressed files (`encoded.bin` and `metaData.bin`).
   - `[--output <outputDir>]`: The directory to store the decompressed files (defaults to `./output`).

### Report Generation

The project generates a compression report for each file processed. The report includes:

- **Original File Size**: Size of the file before compression.
- **Compressed File Size**: Size of the file after compression.
- **Compression Ratio**: The ratio of the original file size to the compressed file size.
- **Time Taken**: Time spent processing and compressing the file.

The report is printed to the console once compression completes.

## Compression Algorithm Overview

### 1. **Frequency Analysis**
- The algorithm starts by counting how often each character occurs in the input file.

### 2. **Priority Queue**
- A priority queue (min-heap) is built from the frequency data. The queue ensures that the least frequent characters are processed first.

### 3. **Huffman Tree Construction**
- The Huffman tree is built by combining nodes based on their frequencies. The two nodes with the lowest frequencies are merged into a parent node whose frequency is their sum, and this process is repeated until only one node (the root) remains.

### 4. **Code Generation**
- Once the tree is built, each character is assigned a binary code based on the path from the root to its leaf. Frequent characters end up closer to the root and therefore get shorter codes, which minimizes the total encoded length for a per-character prefix code.

### 5. **Serialization**
- The Huffman tree is serialized and saved in binary format for use in decompression.

### 6. **Compression and Saving**
- The input text is encoded using the generated Huffman codes. Both the compressed data and the metadata (Huffman tree) are saved to files.

### 7. **Decompression**
- The decompression process reads the serialized Huffman tree and decodes the compressed data back into its original form.
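
The steps above can be sketched end to end in a few lines of JavaScript. This is a minimal illustration under stated assumptions, not FileSqueeze's actual implementation: the function names (`countFrequencies`, `buildTree`, `buildCodes`, `encode`, `decode`) are hypothetical, a re-sorted array stands in for the min-heap, bits are kept as a string of `0`/`1` characters rather than packed into bytes, and tree serialization (step 5) is omitted.

```javascript
// Illustrative sketch only. All function names here are hypothetical
// and are not taken from the FileSqueeze source.

// Step 1: count how often each character occurs.
function countFrequencies(text) {
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);
  return freq;
}

// Steps 2-3: repeatedly merge the two least frequent nodes.
// Assumption: a re-sorted array replaces a real min-heap for brevity.
function buildTree(freq) {
  const nodes = [...freq].map(([ch, weight]) => ({ ch, weight, left: null, right: null }));
  if (nodes.length === 0) return null;
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const left = nodes.shift();
    const right = nodes.shift();
    nodes.push({ ch: null, weight: left.weight + right.weight, left, right });
  }
  return nodes[0];
}

// Step 4: walk the tree; a left edge appends "0", a right edge appends "1".
function buildCodes(node, prefix = "", codes = new Map()) {
  if (!node) return codes;
  if (node.ch !== null) {
    codes.set(node.ch, prefix || "0"); // a one-symbol input still needs one bit
    return codes;
  }
  buildCodes(node.left, prefix + "0", codes);
  buildCodes(node.right, prefix + "1", codes);
  return codes;
}

// Step 6: replace every character with its code.
function encode(text, codes) {
  let bits = "";
  for (const ch of text) bits += codes.get(ch);
  return bits;
}

// Step 7: follow the bits down from the root, emitting a character at each leaf.
function decode(bits, root) {
  if (!root) return "";
  if (root.ch !== null) return root.ch.repeat(bits.length); // one-symbol tree
  let out = "";
  let node = root;
  for (const bit of bits) {
    node = bit === "0" ? node.left : node.right;
    if (node.ch !== null) {
      out += node.ch;
      node = root;
    }
  }
  return out;
}

const text = "hello world";
const tree = buildTree(countFrequencies(text));
const codes = buildCodes(tree);
const bits = encode(text, codes);
console.log(`${bits.length} bits encoded, versus ${text.length * 8} bits at one byte per character`);
console.log(decode(bits, tree) === text); // round trip restores the input
```

The exact code assigned to each character depends on how ties between equal frequencies are broken, so two correct implementations can produce different bit strings of the same total length. This is one reason the tree (or an equivalent code table) has to be stored alongside the compressed data.
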
## Example

### Sample Input (Text File)

```txt
hello world
```

### Compressed Output (Encoded File)

- The file will be compressed into a binary file (`encoded.bin`), and metadata will be saved in a separate file (`metaData.bin`).

## Metrics Example

### File 1: `example.txt`

- **Original File Size**: 90 KB
- **Compressed File Size**: 48 KB
- **Compression Ratio**: 1.875 (original size / compressed size, i.e. 90 KB / 48 KB)

## Contributing

If you'd like to contribute to this project, feel free to open a pull request. For bug reports or suggestions, please create an issue in the GitHub repository.

## License

This project is licensed under the MIT License.

## Acknowledgments

- The core compression algorithm is based on the Huffman coding technique. You can read more about it here: [Huffman coding - Wikipedia](https://en.wikipedia.org/wiki/Huffman_coding).
- Special thanks to libraries like `pdf-lib` and `pdf-parse` for PDF text extraction and manipulation.