UNPKG

miniml

Version:

A minimal, embeddable semantic data modeling language for generating SQL queries from YAML model definitions. Inspired by LookML.

649 lines (518 loc) 16.2 kB
# MiniML A minimal, embeddable semantic data modeling language for generating SQL queries from YAML model definitions, inspired by LookML. Generates SQL queries driven by a semantic model and selected dimensions and measures. Capable of generating optimal SQL from minimal queries to very complex queries with multiple joins based on the specified configuration. Also generates metadata for each model suitable for use by AI to execute queries. Designed to be a stand-alone and light-weight library, embeddable in any server or app without requiring additional data platforming. Supports BigQuery and Snowflake dialects. ## Table of Contents - [Quick Start](#quick-start) - [Installation](#installation) - [Programmatic Usage](#programmatic-usage) - [Model Definition](#model-definition) - [Query Generation](#query-generation) - [API Reference](#api-reference) - [Examples](#examples) - [Supported SQL Dialects](#supported-sql-dialects) - [Contributing](#contributing) ## Quick Start ### 1. Define a Model Create a YAML model file (`sales.yaml`) that describes your data structure: ```yaml description: Sales fact table capturing transactional-level purchase data across products, stores, and customers. from: acme.sales default_date_range: last 7 days join: customer_join: JOIN acme.customers USING (customer_id) product_join: JOIN acme.products USING (product_id) store_join: JOIN acme.stores USING (store_id) dimensions: date: - The calendar date on which the sale occurred. - DATE(sale_date) sale_id: Unique identifier for each sale transaction. customer_id: Unique identifier for the customer making the purchase. customer_name: - Display name or label associated with the customer. - customer_name - customer_join product_id: Unique identifier for the product sold. product_name: - Display name or label associated with the product. - product_name - product_join store_id: Unique identifier for the store where the sale took place. store_name: - Display name or label of the store location. - store_name - store_join measures: total_amount: - Total sale amount generated by the transaction. - SUM(total_amount) price_avg: - Average unit price across all sales. - AVG(unit_price) count: - The count of records for the row. - COUNT(*) ``` ### 2. Produce a description of the model ```bash npx miniml sales.yaml ``` **OUTPUT** ``` ## DIMENSIONS - `date` The calendar date on which the sale occurred. - `sale_id` Unique identifier for each sale transaction. - `customer_id` Unique identifier for the customer making the purchase. - `customer_name` Display name or label associated with the customer. - `product_id` Unique identifier for the product sold. - `product_name` Display name or label associated with the product. - `store_id` Unique identifier for the store where the sale took place. - `store_name` Display name or label of the store location. ## MEASURES - `total_amount` Total sale amount generated by the transaction. - `price_avg` Average unit price across all sales. - `count` The count of records for the row. ``` ### 3. Generate a SQL query from the model ```bash npx miniml sales.yaml --dimensions=date --measures=total_amount ``` **OUTPUT** ```sql SELECT DATE(sale_date) AS date, SUM(total_amount) AS total_amount FROM acme.sales WHERE DATE(sale_date) >= DATEADD(DAY, -7, DATE(sale_date)) GROUP BY ALL ``` Add more options: ```bash npx miniml sales.yaml \ --dimensions=date,customer_name,product_name \ --measures=total_amount,price_avg,count \ --date-from=2025-01-01 \ --date-to=2025-01-31 \ --order-by=-date \ --limit=10 ``` **OUTPUT** ```sql SELECT DATE(sale_date) AS date, customer_name AS customer_name, product_name AS product_name, SUM(total_amount) AS total_amount, AVG(unit_price) AS price_avg, COUNT(*) AS count FROM acme.sales JOIN acme.customers USING (customer_id) JOIN acme.products USING (product_id) WHERE DATE(sale_date) BETWEEN '2025-01-01' AND '2025-01-31' GROUP BY ALL ORDER BY date DESC LIMIT 10 ``` ## Installation To use MiniML in your project: ```bash npm install miniml ``` ## Programmatic Usage ```typescript import { loadModel, renderQuery } from 'miniml'; // Load the model const model = await loadModel('./sales.yaml'); // Generate a SQL query const sql = renderQuery(model, { dimensions: ['date', 'customer_name'], measures: ['total_amount', 'count'], date_from: '2025-01-01', date_to: '2025-01-31', order_by: ['-date'], limit: 10 }); console.log(sql); ``` **OUTPUT** ```sql SELECT DATE(sale_date) AS date, customer_name AS customer_name, SUM(total_amount) AS total_amount, COUNT(*) AS count FROM acme.sales JOIN acme.customers USING (customer_id) WHERE DATE(sale_date) BETWEEN '2025-01-01' AND '2025-01-31' GROUP BY ALL ORDER BY date DESC LIMIT 10 ``` ## Model Definition ### Basic Structure A MiniML model is a YAML file with the following structure: ```yaml description: Human-readable description of the model dialect: bigquery | snowflake from: schema.table_name date_field: primary_date_column where: base_filter_conditions join: join_name: JOIN clause definition dimensions: dimension_name: definition measures: measure_name: definition ``` ### Dimensions Dimensions are fields used for grouping data. They can be defined in several formats: **Simple string (field name only):** ```yaml dimensions: customer_id: Unique customer identifier ``` **Array format (description, SQL, join):** ```yaml dimensions: customer_name: - Display name of the customer - customer_name - customer_join ``` **Object format:** ```yaml dimensions: date: description: The transaction date sql: DATE(created_at) join: date_dimension_join ``` ### Measures Measures are aggregated fields. They default to `SUM()` if no SQL is provided: **Simple string (defaults to SUM):** ```yaml measures: revenue: Total revenue amount # becomes SUM(revenue) AS revenue ``` **Custom aggregation:** ```yaml measures: avg_price: - Average unit price - AVG(unit_price) ``` ### Joins Define reusable join clauses: ```yaml join: customer_join: LEFT JOIN customers USING (customer_id) product_join: LEFT JOIN products USING (product_id) dimensions: customer_name: - Customer display name - customer_name - customer_join # References the join above ``` ### SQL Dialect Detection MiniML can automatically detect the SQL dialect from the file path: - Files containing "bigquery" use BigQuery syntax - Files containing "snowflake" use Snowflake syntax - Otherwise, specify explicitly with the `dialect` field **Examples:** ``` models/bigquery/sales.yaml # Automatically uses BigQuery dialect models/snowflake/users.yaml # Automatically uses Snowflake dialect models/sales.yaml # Requires explicit dialect: bigquery ``` This means you can omit the `dialect` field when your file paths include the database type. ## Query Generation ### Query Options The `renderQuery` function accepts these options: ```typescript interface MinimlQueryOptions { dimensions?: string[]; // Fields to group by measures?: string[]; // Aggregated fields to calculate date_from?: string; // Start date filter (YYYY-MM-DD) date_to?: string; // End date filter (YYYY-MM-DD) where?: string; // Additional WHERE conditions having?: string; // HAVING clause for aggregate filters order_by?: string[]; // Sort fields (prefix with - for DESC) limit?: number; // Limit result count distinct?: boolean; // Add DISTINCT to SELECT date_granularity?: string; // Date truncation (day, week, month, year) } ``` ### Date Handling **Automatic date filtering:** ```typescript // Last 30 days (default when date_field is defined) renderQuery(model, { measures: ['count'] }); // Specific date range renderQuery(model, { measures: ['count'], date_from: '2024-01-01', date_to: '2024-01-31' }); ``` **Date granularity:** ```typescript // Group by week renderQuery(model, { dimensions: ['date'], measures: ['count'], date_granularity: 'week' }); // Generates: DATE_TRUNC('WEEK', sale_date) AS date ``` ### Filtering **WHERE clause (for dimensions):** ```typescript renderQuery(model, { dimensions: ['customer_id'], measures: ['total_amount'], where: "customer_type = 'premium'" }); ``` **HAVING clause (for measures):** ```typescript renderQuery(model, { dimensions: ['customer_id'], measures: ['total_amount'], having: "total_amount > 1000" }); ``` ### Ordering ```typescript renderQuery(model, { dimensions: ['customer_id'], measures: ['total_amount'], order_by: ['customer_id', '-total_amount'] // ASC, then DESC }); ``` ## API Reference ### Core Functions #### `loadModel(file: string): Promise<MinimlModel>` Asynchronously loads and processes a YAML model file. #### `loadModelSync(file: string): MinimlModel` Synchronously loads and processes a YAML model file. #### `renderQuery(model: MinimlModel, options: MinimlQueryOptions): string` Generates a SQL query from a model and query options. ### Interfaces #### `MinimlModel` ```typescript interface MinimlModel { description: string; dialect: string; from: string; join: Record<string, string>; where: string; date_field: string; dimensions: Record<string, MinimlDef>; measures: Record<string, MinimlDef>; info: string; // Auto-generated documentation } ``` #### `MinimlDef` ```typescript interface MinimlDef { key: string; description: string; sql?: string; join?: string; } ``` ## Examples ### E-commerce Analytics ```yaml # ecommerce.yaml description: E-commerce sales and customer analytics dialect: bigquery from: ecommerce.orders date_field: order_date join: customer_join: LEFT JOIN ecommerce.customers USING (customer_id) product_join: LEFT JOIN ecommerce.products USING (product_id) where: status = 'completed' dimensions: date: Order date customer_id: Customer identifier customer_email: - Customer email address - email - customer_join product_category: - Product category - category - product_join measures: revenue: Total revenue order_count: - Number of orders - COUNT(DISTINCT order_id) avg_order_value: - Average order value - AVG(total_amount) ``` **Usage:** ```typescript const model = await loadModel('./ecommerce.yaml'); // Revenue by category, last 7 days const sql = renderQuery(model, { dimensions: ['product_category'], measures: ['revenue', 'order_count'], date_from: '2024-01-20', date_to: '2024-01-27', order_by: ['-revenue'], limit: 10 }); ``` ### Complex Joins and Calculations ```yaml # advanced.yaml description: Advanced analytics with multiple joins dialect: snowflake from: fact.sales date_field: sale_date join: customer_join: LEFT JOIN dim.customers c ON c.id = customer_id product_join: LEFT JOIN dim.products p ON p.id = product_id region_join: LEFT JOIN dim.regions r ON r.code = c.region_code dimensions: customer_segment: - Customer segment classification - c.segment - customer_join product_brand: - Product brand name - p.brand - product_join region_name: - Geographic region - r.name - region_join measures: revenue_per_customer: - Average revenue per customer - SUM(amount) / COUNT(DISTINCT customer_id) profit_margin: - Profit margin percentage - (SUM(amount) - SUM(cost)) / SUM(amount) * 100 ``` ## Supported SQL Dialects ### BigQuery - Supports `GROUP BY ALL` syntax - Uses `DATE_TRUNC` for date granularity - Compatible with BigQuery standard SQL functions ### Snowflake - Supports Snowflake-specific functions - Uses `DATE_TRUNC` for date granularity - Compatible with Snowflake SQL syntax More dialects coming soon! ## Advanced Features ### Template Variables in Info Models support Jinja-style templating in the `info` section: ```yaml info: | ## Available Dimensions {%- for dimension in dimensions %} - `{{ dimension.key }}`: {{ dimension.description }} {%- endfor %} ## Available Measures {%- for measure in measures %} - `{{ measure.key }}`: {{ measure.description }} {%- endfor %} ``` ### Filter Reference Expansion MiniML automatically expands dimension and measure references in filter clauses: ```typescript // If you have a measure: conversion_rate: "clicks / impressions" // And use it in a HAVING clause: renderQuery(model, { having: "conversion_rate > 0.05" }); // MiniML expands this to: HAVING (clicks / impressions) > 0.05 ``` ### Validation MiniML validates query parameters against model definitions and provides helpful error messages: ```typescript // This will throw an error with specific invalid keys renderQuery(model, { dimensions: ['invalid_dimension'], measures: ['invalid_measure'] }); // Error: The following keys are invalid: // - dimensions: invalid_dimension // - measures: invalid_measure ``` ## Security Because MiniML allows flexible `where` and `having` clauses to be specified in native SQL format, comprehensive SQL validation is implemented to guard against SQL injection attacks for user-provided query parameters: - **AST-based validation**: All user-provided SQL expressions are parsed and validated using node-sql-parser - **Allowlist approach**: Only safe SQL constructs are permitted (comparisons, logical operators, basic functions) - **Model-aware validation**: Column references are validated against your model schema - **Dialect-specific rules**: Validation adapts to BigQuery/Snowflake syntax differences - **Date input validation**: Date parameters (`date_from`, `date_to`) are strictly validated using regex patterns ### Security Model **Trusted Inputs**: YAML model files are considered trusted input and are not validated for SQL injection. Model files should be: - Created and maintained by trusted developers - Stored securely with appropriate access controls - Reviewed before deployment to production environments - Treated as code artifacts subject to version control and security policies **Untrusted Inputs**: All query parameters provided at runtime (`where`, `having`, `date_from`, `date_to`, etc.) are considered untrusted and undergo strict validation. ### Safe Expression Examples **WHERE clauses:** ```typescript renderQuery(model, { where: "account_name = 'Acme Corp'" }); renderQuery(model, { where: "date >= '2024-01-01' AND category_name LIKE 'Electronics%'" }); renderQuery(model, { where: "revenue > 1000 OR quantity IS NOT NULL" }); ``` **HAVING clauses:** ```typescript renderQuery(model, { having: "total_amount > 1000 AND order_count >= 5" }); ``` ### Blocked Constructs For security, the following are automatically blocked: - **Subqueries**: `SELECT`, `EXISTS`, `ANY`, `ALL` - **DDL/DML statements**: `DROP`, `ALTER`, `CREATE`, `INSERT`, `UPDATE`, `DELETE` - **System functions**: Potentially dangerous database functions - **Comments**: SQL comments that could hide malicious code - **Unauthorized columns**: References to columns not defined in your model ### Validation Errors When unsafe expressions are detected, MiniML provides helpful error messages: ```typescript // Error: "Subqueries are not allowed in WHERE clauses. Use simple comparisons instead." renderQuery(model, { where: "customer_id IN (SELECT id FROM users WHERE active = 1)" }); // Error: "Column 'user_password' not found. Available dimensions: account_name, date, category_name" renderQuery(model, { where: "user_password = 'secret'" }); ``` ## Building and Development ```bash # Install dependencies npm install # Build TypeScript npm run build # Run tests (requires Mocha setup) npx mocha test/**/*.test.js ``` ## Contributing 1. Fork the repository 2. Create a feature branch 3. Add tests for new functionality 4. Ensure all tests pass 5. Submit a pull request ## License MIT License - see LICENSE file for details.