miniml
Version:
A minimal, embeddable semantic data modeling language for generating SQL queries from YAML model definitions. Inspired by LookML.
649 lines (518 loc) • 16.2 kB
Markdown
# MiniML
A minimal, embeddable semantic data modeling language for generating SQL queries from YAML model definitions, inspired by LookML.
Generates SQL queries driven by a semantic model and selected dimensions and measures. Capable of generating optimal SQL from minimal queries to very complex queries with multiple joins based on the specified configuration.
Also generates metadata for each model suitable for use by AI to execute queries.
Designed to be a stand-alone and light-weight library, embeddable in any server or app without requiring additional data platforming.
Supports BigQuery and Snowflake dialects.
## Table of Contents
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Programmatic Usage](#programmatic-usage)
- [Model Definition](#model-definition)
- [Query Generation](#query-generation)
- [API Reference](#api-reference)
- [Examples](#examples)
- [Supported SQL Dialects](#supported-sql-dialects)
- [Contributing](#contributing)
## Quick Start
### 1. Define a Model
Create a YAML model file (`sales.yaml`) that describes your data structure:
```yaml
description: Sales fact table capturing transactional-level purchase data across products, stores, and customers.
from: acme.sales
default_date_range: last 7 days
join:
customer_join: JOIN acme.customers USING (customer_id)
product_join: JOIN acme.products USING (product_id)
store_join: JOIN acme.stores USING (store_id)
dimensions:
date:
- The calendar date on which the sale occurred.
- DATE(sale_date)
sale_id: Unique identifier for each sale transaction.
customer_id: Unique identifier for the customer making the purchase.
customer_name:
- Display name or label associated with the customer.
- customer_name
- customer_join
product_id: Unique identifier for the product sold.
product_name:
- Display name or label associated with the product.
- product_name
- product_join
store_id: Unique identifier for the store where the sale took place.
store_name:
- Display name or label of the store location.
- store_name
- store_join
measures:
total_amount:
- Total sale amount generated by the transaction.
- SUM(total_amount)
price_avg:
- Average unit price across all sales.
- AVG(unit_price)
count:
- The count of records for the row.
- COUNT(*)
```
### 2. Produce a description of the model
```bash
npx miniml sales.yaml
```
**OUTPUT**
```
## DIMENSIONS
- `date` The calendar date on which the sale occurred.
- `sale_id` Unique identifier for each sale transaction.
- `customer_id` Unique identifier for the customer making the purchase.
- `customer_name` Display name or label associated with the customer.
- `product_id` Unique identifier for the product sold.
- `product_name` Display name or label associated with the product.
- `store_id` Unique identifier for the store where the sale took place.
- `store_name` Display name or label of the store location.
## MEASURES
- `total_amount` Total sale amount generated by the transaction.
- `price_avg` Average unit price across all sales.
- `count` The count of records for the row.
```
### 3. Generate a SQL query from the model
```bash
npx miniml sales.yaml --dimensions=date --measures=total_amount
```
**OUTPUT**
```sql
SELECT
DATE(sale_date) AS date,
SUM(total_amount) AS total_amount
FROM acme.sales
WHERE DATE(sale_date) >= DATEADD(DAY, -7, DATE(sale_date))
GROUP BY ALL
```
Add more options:
```bash
npx miniml sales.yaml \
--dimensions=date,customer_name,product_name \
--measures=total_amount,price_avg,count \
--date-from=2025-01-01 \
--date-to=2025-01-31 \
--order-by=-date \
--limit=10
```
**OUTPUT**
```sql
SELECT
DATE(sale_date) AS date,
customer_name AS customer_name,
product_name AS product_name,
SUM(total_amount) AS total_amount,
AVG(unit_price) AS price_avg,
COUNT(*) AS count
FROM acme.sales
JOIN acme.customers USING (customer_id)
JOIN acme.products USING (product_id)
WHERE DATE(sale_date) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY ALL
ORDER BY date DESC
LIMIT 10
```
## Installation
To use MiniML in your project:
```bash
npm install miniml
```
## Programmatic Usage
```typescript
import { loadModel, renderQuery } from 'miniml';
// Load the model
const model = await loadModel('./sales.yaml');
// Generate a SQL query
const sql = renderQuery(model, {
dimensions: ['date', 'customer_name'],
measures: ['total_amount', 'count'],
date_from: '2025-01-01',
date_to: '2025-01-31',
order_by: ['-date'],
limit: 10
});
console.log(sql);
```
**OUTPUT**
```sql
SELECT
DATE(sale_date) AS date,
customer_name AS customer_name,
SUM(total_amount) AS total_amount,
COUNT(*) AS count
FROM acme.sales
JOIN acme.customers USING (customer_id)
WHERE DATE(sale_date) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY ALL
ORDER BY date DESC
LIMIT 10
```
## Model Definition
### Basic Structure
A MiniML model is a YAML file with the following structure:
```yaml
description: Human-readable description of the model
dialect: bigquery | snowflake
from: schema.table_name
date_field: primary_date_column
where: base_filter_conditions
join:
join_name: JOIN clause definition
dimensions:
dimension_name: definition
measures:
measure_name: definition
```
### Dimensions
Dimensions are fields used for grouping data. They can be defined in several formats:
**Simple string (field name only):**
```yaml
dimensions:
customer_id: Unique customer identifier
```
**Array format (description, SQL, join):**
```yaml
dimensions:
customer_name:
- Display name of the customer
- customer_name
- customer_join
```
**Object format:**
```yaml
dimensions:
date:
description: The transaction date
sql: DATE(created_at)
join: date_dimension_join
```
### Measures
Measures are aggregated fields. They default to `SUM()` if no SQL is provided:
**Simple string (defaults to SUM):**
```yaml
measures:
revenue: Total revenue amount # becomes SUM(revenue) AS revenue
```
**Custom aggregation:**
```yaml
measures:
avg_price:
- Average unit price
- AVG(unit_price)
```
### Joins
Define reusable join clauses:
```yaml
join:
customer_join: LEFT JOIN customers USING (customer_id)
product_join: LEFT JOIN products USING (product_id)
dimensions:
customer_name:
- Customer display name
- customer_name
- customer_join # References the join above
```
### SQL Dialect Detection
MiniML can automatically detect the SQL dialect from the file path:
- Files containing "bigquery" use BigQuery syntax
- Files containing "snowflake" use Snowflake syntax
- Otherwise, specify explicitly with the `dialect` field
**Examples:**
```
models/bigquery/sales.yaml # Automatically uses BigQuery dialect
models/snowflake/users.yaml # Automatically uses Snowflake dialect
models/sales.yaml # Requires explicit dialect: bigquery
```
This means you can omit the `dialect` field when your file paths include the database type.
## Query Generation
### Query Options
The `renderQuery` function accepts these options:
```typescript
interface MinimlQueryOptions {
dimensions?: string[]; // Fields to group by
measures?: string[]; // Aggregated fields to calculate
date_from?: string; // Start date filter (YYYY-MM-DD)
date_to?: string; // End date filter (YYYY-MM-DD)
where?: string; // Additional WHERE conditions
having?: string; // HAVING clause for aggregate filters
order_by?: string[]; // Sort fields (prefix with - for DESC)
limit?: number; // Limit result count
distinct?: boolean; // Add DISTINCT to SELECT
date_granularity?: string; // Date truncation (day, week, month, year)
}
```
### Date Handling
**Automatic date filtering:**
```typescript
// Last 30 days (default when date_field is defined)
renderQuery(model, { measures: ['count'] });
// Specific date range
renderQuery(model, {
measures: ['count'],
date_from: '2024-01-01',
date_to: '2024-01-31'
});
```
**Date granularity:**
```typescript
// Group by week
renderQuery(model, {
dimensions: ['date'],
measures: ['count'],
date_granularity: 'week'
});
// Generates: DATE_TRUNC('WEEK', sale_date) AS date
```
### Filtering
**WHERE clause (for dimensions):**
```typescript
renderQuery(model, {
dimensions: ['customer_id'],
measures: ['total_amount'],
where: "customer_type = 'premium'"
});
```
**HAVING clause (for measures):**
```typescript
renderQuery(model, {
dimensions: ['customer_id'],
measures: ['total_amount'],
having: "total_amount > 1000"
});
```
### Ordering
```typescript
renderQuery(model, {
dimensions: ['customer_id'],
measures: ['total_amount'],
order_by: ['customer_id', '-total_amount'] // ASC, then DESC
});
```
## API Reference
### Core Functions
#### `loadModel(file: string): Promise<MinimlModel>`
Asynchronously loads and processes a YAML model file.
#### `loadModelSync(file: string): MinimlModel`
Synchronously loads and processes a YAML model file.
#### `renderQuery(model: MinimlModel, options: MinimlQueryOptions): string`
Generates a SQL query from a model and query options.
### Interfaces
#### `MinimlModel`
```typescript
interface MinimlModel {
description: string;
dialect: string;
from: string;
join: Record<string, string>;
where: string;
date_field: string;
dimensions: Record<string, MinimlDef>;
measures: Record<string, MinimlDef>;
info: string; // Auto-generated documentation
}
```
#### `MinimlDef`
```typescript
interface MinimlDef {
key: string;
description: string;
sql?: string;
join?: string;
}
```
## Examples
### E-commerce Analytics
```yaml
# ecommerce.yaml
description: E-commerce sales and customer analytics
dialect: bigquery
from: ecommerce.orders
date_field: order_date
join:
customer_join: LEFT JOIN ecommerce.customers USING (customer_id)
product_join: LEFT JOIN ecommerce.products USING (product_id)
where: status = 'completed'
dimensions:
date: Order date
customer_id: Customer identifier
customer_email:
- Customer email address
- email
- customer_join
product_category:
- Product category
- category
- product_join
measures:
revenue: Total revenue
order_count:
- Number of orders
- COUNT(DISTINCT order_id)
avg_order_value:
- Average order value
- AVG(total_amount)
```
**Usage:**
```typescript
const model = await loadModel('./ecommerce.yaml');
// Revenue by category, last 7 days
const sql = renderQuery(model, {
dimensions: ['product_category'],
measures: ['revenue', 'order_count'],
date_from: '2024-01-20',
date_to: '2024-01-27',
order_by: ['-revenue'],
limit: 10
});
```
### Complex Joins and Calculations
```yaml
# advanced.yaml
description: Advanced analytics with multiple joins
dialect: snowflake
from: fact.sales
date_field: sale_date
join:
customer_join: LEFT JOIN dim.customers c ON c.id = customer_id
product_join: LEFT JOIN dim.products p ON p.id = product_id
region_join: LEFT JOIN dim.regions r ON r.code = c.region_code
dimensions:
customer_segment:
- Customer segment classification
- c.segment
- customer_join
product_brand:
- Product brand name
- p.brand
- product_join
region_name:
- Geographic region
- r.name
- region_join
measures:
revenue_per_customer:
- Average revenue per customer
- SUM(amount) / COUNT(DISTINCT customer_id)
profit_margin:
- Profit margin percentage
- (SUM(amount) - SUM(cost)) / SUM(amount) * 100
```
## Supported SQL Dialects
### BigQuery
- Supports `GROUP BY ALL` syntax
- Uses `DATE_TRUNC` for date granularity
- Compatible with BigQuery standard SQL functions
### Snowflake
- Supports Snowflake-specific functions
- Uses `DATE_TRUNC` for date granularity
- Compatible with Snowflake SQL syntax
More dialects coming soon!
## Advanced Features
### Template Variables in Info
Models support Jinja-style templating in the `info` section:
```yaml
info: |
## Available Dimensions
{%- for dimension in dimensions %}
- `{{ dimension.key }}`: {{ dimension.description }}
{%- endfor %}
## Available Measures
{%- for measure in measures %}
- `{{ measure.key }}`: {{ measure.description }}
{%- endfor %}
```
### Filter Reference Expansion
MiniML automatically expands dimension and measure references in filter clauses:
```typescript
// If you have a measure: conversion_rate: "clicks / impressions"
// And use it in a HAVING clause:
renderQuery(model, {
having: "conversion_rate > 0.05"
});
// MiniML expands this to: HAVING (clicks / impressions) > 0.05
```
### Validation
MiniML validates query parameters against model definitions and provides helpful error messages:
```typescript
// This will throw an error with specific invalid keys
renderQuery(model, {
dimensions: ['invalid_dimension'],
measures: ['invalid_measure']
});
// Error: The following keys are invalid:
// - dimensions: invalid_dimension
// - measures: invalid_measure
```
## Security
Because MiniML allows flexible `where` and `having` clauses to be specified in native SQL format, comprehensive SQL validation is implemented to guard against SQL injection attacks for user-provided query parameters:
- **AST-based validation**: All user-provided SQL expressions are parsed and validated using node-sql-parser
- **Allowlist approach**: Only safe SQL constructs are permitted (comparisons, logical operators, basic functions)
- **Model-aware validation**: Column references are validated against your model schema
- **Dialect-specific rules**: Validation adapts to BigQuery/Snowflake syntax differences
- **Date input validation**: Date parameters (`date_from`, `date_to`) are strictly validated using regex patterns
### Security Model
**Trusted Inputs**: YAML model files are considered trusted input and are not validated for SQL injection. Model files should be:
- Created and maintained by trusted developers
- Stored securely with appropriate access controls
- Reviewed before deployment to production environments
- Treated as code artifacts subject to version control and security policies
**Untrusted Inputs**: All query parameters provided at runtime (`where`, `having`, `date_from`, `date_to`, etc.) are considered untrusted and undergo strict validation.
### Safe Expression Examples
**WHERE clauses:**
```typescript
renderQuery(model, {
where: "account_name = 'Acme Corp'"
});
renderQuery(model, {
where: "date >= '2024-01-01' AND category_name LIKE 'Electronics%'"
});
renderQuery(model, {
where: "revenue > 1000 OR quantity IS NOT NULL"
});
```
**HAVING clauses:**
```typescript
renderQuery(model, {
having: "total_amount > 1000 AND order_count >= 5"
});
```
### Blocked Constructs
For security, the following are automatically blocked:
- **Subqueries**: `SELECT`, `EXISTS`, `ANY`, `ALL`
- **DDL/DML statements**: `DROP`, `ALTER`, `CREATE`, `INSERT`, `UPDATE`, `DELETE`
- **System functions**: Potentially dangerous database functions
- **Comments**: SQL comments that could hide malicious code
- **Unauthorized columns**: References to columns not defined in your model
### Validation Errors
When unsafe expressions are detected, MiniML provides helpful error messages:
```typescript
// Error: "Subqueries are not allowed in WHERE clauses. Use simple comparisons instead."
renderQuery(model, {
where: "customer_id IN (SELECT id FROM users WHERE active = 1)"
});
// Error: "Column 'user_password' not found. Available dimensions: account_name, date, category_name"
renderQuery(model, {
where: "user_password = 'secret'"
});
```
## Building and Development
```bash
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run tests (requires Mocha setup)
npx mocha test/**/*.test.js
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request
## License
MIT License - see LICENSE file for details.