MCP server on top of INSO Error Logs Service. Read-only tools (search, analytics, fingerprinting, correlations) for connecting Claude Desktop/Code and ChatGPT to error logs. Streamable HTTP transport plus stdio for running via npx.
# Developing `@inso_web/els-mcp`
🇷🇺 **Документация на русском**: [docs/CONTRIBUTING.ru.md](https://unpkg.com/browse/@inso_web/els-mcp/docs/CONTRIBUTING.ru.md)
Reference for anyone working on the package itself or running it locally
for debugging. End users only need [README.md](https://unpkg.com/browse/@inso_web/els-mcp/README.md).
## Local run
```bash
npm install
ELS_API_KEY=els_live_... npm run dev
```
`npm run dev` runs `tsx src/cli.ts` without a build step. After
`npm run build` the binary lives at `dist/cli.js` and can be started via
`node dist/cli.js`.
In stdio mode, stdout is reserved for JSON-RPC, so all logs go to stderr.
### Tests and checks
```bash
npm test # vitest run — unit tests
npm run typecheck # tsc --noEmit
```
## Environment variables
### ELS connection
| ENV | Default | Description |
|---|---|---|
| `ELS_API_KEY` | — (required) | Bearer key (`els_live_*` or `els_test_*`) |
| `ELS_BASE_URL` | dev → `http://localhost:4010`, prod → `https://api.insoweb.ru/els` | Upstream ELS endpoint |
| `MCP_LOG_LEVEL` | `info` | pino level |
| `MCP_DISABLE_TOOLS` | — | CSV of tool names to disable |
| `MCP_UPSTREAM_TIMEOUT_MS` | `30000` | Single ELS-request timeout |
### HTTP transport
| ENV | Default | Description |
|---|---|---|
| `MCP_TRANSPORT` | `stdio` | `stdio` or `http` |
| `MCP_HTTP_PORT` | `3030` | Listen port in HTTP mode |
| `MCP_PUBLIC_URL` | `https://mcp.insoweb.ru/els` | URL used in WWW-Authenticate and discovery |
| `MCP_OIDC_ISSUER` | `https://auth.insoweb.ru` | OIDC issuer |
| `MCP_OIDC_JWKS_URL` | derived | JWKS endpoint |
| `MCP_OIDC_AUDIENCE` | `els-mcp` | Expected `aud` claim |
| `MCP_OIDC_DEMO_APP_SLUG` | — | Fallback appSlug when the LK resolver is unavailable |
| `MCP_CORS_ORIGINS` | `https://claude.ai,https://chat.openai.com` | CSV of allowed origins (dev adds localhost) |
### Cache, observability, billing
| ENV | Default | Description |
|---|---|---|
| `MCP_REDIS_URL` | `redis://localhost:6379` | Redis URL |
| `MCP_CACHE_ENABLED` | `true` | Enable cache layer |
| `MCP_METRICS_ENABLED` | `true` | Expose `/els/metrics` |
| `MCP_CACHE_TTL_OVERRIDE_*` | — | Override TTL per class (seconds) |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | — | OTLP traces; no-op if unset |
| `MCP_LOG_PRETTY` | `true` in dev | Pretty-print pino |
| `MCP_REDACTION_ENABLED` | `true` | PII redaction toggle |
| `MCP_REDACTION_FIELDS` | — | CSV whitelist of fields (empty → redact everything) |
| `MCP_DATABASE_URL` | — | Postgres URL for audit/billing. No-op if empty |
| `MCP_DEFAULT_APP_ID` | `default` | Used in stdio mode |
| `MCP_DEFAULT_TIER` | `STANDARD` | Default tier for quota checks |
| `MCP_LK_API_BASE_URL` | — | LK API URL for OIDC sub→apps and appSlug→tier (optional) |
| `MCP_LK_API_TOKEN` | — | Bearer token for internal LK API |
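As a rough illustration of how the tables above translate into runtime configuration, here is a minimal sketch of an environment parser. Only the variable names and defaults come from the tables; the `McpConfig` shape and the `loadConfig` function are hypothetical and are not taken from the package source.

```typescript
// Hypothetical config loader: names and defaults mirror the ENV tables,
// everything else is an assumption made for illustration.
interface McpConfig {
  transport: "stdio" | "http";
  httpPort: number;
  upstreamTimeoutMs: number;
  cacheEnabled: boolean;
  disabledTools: string[];
}

type Env = Record<string, string | undefined>;

function loadConfig(env: Env): McpConfig {
  return {
    // Anything other than "http" falls back to the documented default, stdio.
    transport: env.MCP_TRANSPORT === "http" ? "http" : "stdio",
    httpPort: Number(env.MCP_HTTP_PORT ?? "3030"),
    upstreamTimeoutMs: Number(env.MCP_UPSTREAM_TIMEOUT_MS ?? "30000"),
    // Assumption: only the literal string "false" switches the cache off.
    cacheEnabled: env.MCP_CACHE_ENABLED !== "false",
    // MCP_DISABLE_TOOLS is a CSV list; blank entries are dropped.
    disabledTools: (env.MCP_DISABLE_TOOLS ?? "")
      .split(",")
      .map((name) => name.trim())
      .filter((name) => name.length > 0),
  };
}
```

In a Node process this would be called as `loadConfig(process.env)`.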
## HTTP transport locally
```bash
MCP_TRANSPORT=http \
MCP_HTTP_PORT=3030 \
MCP_OIDC_DEMO_APP_SLUG=acme \
ELS_API_KEY=els_live_xxx \
npm run dev
```
OIDC can be pointed at a dev INSO Auth instance:
```bash
MCP_OIDC_ISSUER=http://localhost:4002 \
MCP_OIDC_JWKS_URL=http://localhost:4002/oidc/.well-known/jwks.json \
MCP_TRANSPORT=http npm run dev
```
### Routes
| Route | Purpose | Auth |
|---|---|---|
| `POST /els/mcp` | MCP JSON-RPC (Streamable HTTP) | Requires Bearer (ELS-key or OIDC JWT) |
| `GET /els/mcp` | Long-lived SSE (server → client notifications) | Requires Bearer |
| `DELETE /els/mcp` | Terminate session | Requires Bearer |
| `GET /els/healthz` | Liveness probe (always 200) | Public |
| `GET /els/readyz` | Readiness probe (Redis ping + ELS upstream check) | Public |
| `GET /els/.well-known/oauth-protected-resource` | RFC 9728 resource metadata | Public |
| `GET /els/.well-known/mcp` | MCP discovery (tools list, transports) | Public |
| `GET /els/metrics` | Prometheus text format | Public |
### Quick curl checks
```bash
# Liveness
curl http://localhost:3030/els/healthz
# {"status":"ok"}
# Resource metadata
curl http://localhost:3030/els/.well-known/oauth-protected-resource
# MCP discovery
curl http://localhost:3030/els/.well-known/mcp
# Bearer ELS-key
curl -X POST http://localhost:3030/els/mcp \
-H "Authorization: Bearer els_live_xxx" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","clientInfo":{"name":"curl","version":"1"},"capabilities":{}}}'
# The response carries an Mcp-Session-Id header — pass it on subsequent requests.
```
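The same request can be assembled from TypeScript. The sketch below only builds the request options; the header names and the `Accept` value are copied from the curl call above, while `buildMcpRequest` and `JsonRpcRequest` are hypothetical helper names introduced for this example.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

interface McpRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds fetch-compatible options for POST /els/mcp.
// Pass sessionId on every call after `initialize`.
function buildMcpRequest(
  bearer: string,
  body: JsonRpcRequest,
  sessionId?: string,
): McpRequestInit {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${bearer}`,
    "Content-Type": "application/json",
    Accept: "application/json, text/event-stream",
  };
  if (sessionId !== undefined) {
    headers["Mcp-Session-Id"] = sessionId;
  }
  return { method: "POST", headers, body: JSON.stringify(body) };
}
```

The result can be passed straight to `fetch("http://localhost:3030/els/mcp", init)`; the session id for follow-up calls is read from the `Mcp-Session-Id` header of the `initialize` response.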
## Authentication
Two paths are supported (detected by the Bearer shape):
1. **ELS key** — `Authorization: Bearer els_(live|test)_<key>` —
passthrough to ELS. Used for CI/CD, server-to-server and debug.
2. **OIDC JWT** — `Authorization: Bearer <jwt>` — validated locally via
the INSO Auth JWKS
(`https://auth.insoweb.ru/oidc/.well-known/jwks.json`, RS256,
audience `els-mcp`, scope `errors:mcp-read`).
If the request carries neither kind of Bearer token, the server responds
with 401 and `WWW-Authenticate: Bearer realm="els-mcp",
resource_metadata="https://mcp.insoweb.ru/els/.well-known/oauth-protected-resource"`.
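Detection "by the Bearer shape" can be pictured roughly as follows. This is a sketch under stated assumptions, not the package's actual middleware: `classifyBearer` is a hypothetical name, and treating any three dot-separated base64url segments as a JWT candidate is an assumption (signature, audience and scope checks would still follow).

```typescript
type BearerKind =
  | { kind: "els-key"; env: "live" | "test" }
  | { kind: "oidc-jwt" }
  | { kind: "missing" };

function classifyBearer(authorization: string | undefined): BearerKind {
  const token = /^Bearer\s+(\S+)$/i.exec(authorization ?? "")?.[1];
  if (!token) return { kind: "missing" };

  // ELS keys follow the documented els_(live|test)_<key> pattern.
  const env = /^els_(live|test)_.+$/.exec(token)?.[1];
  if (env === "live" || env === "test") return { kind: "els-key", env };

  // Assumption: a compact JWT is three base64url segments joined by dots.
  if (/^[\w-]+\.[\w-]+\.[\w-]+$/.test(token)) return { kind: "oidc-jwt" };

  // Anything else is treated like a missing credential (401).
  return { kind: "missing" };
}
```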
### Sessions
The `Mcp-Session-Id` header is returned on the first `initialize` call
and must be sent with every subsequent request. Sessions expire after
30 minutes of inactivity. Storage is in-memory (a `Map`); a Redis-backed
variant is on the roadmap.
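A Map-backed store with an idle timeout could look roughly like this. The 30-minute value comes from the text above; the `SessionStore` class and its method names are hypothetical, and evicting lazily on lookup (rather than with a timer) is an assumption.

```typescript
const IDLE_TTL_MS = 30 * 60 * 1000; // 30 minutes of inactivity

class SessionStore {
  // sessionId -> timestamp (ms) of the last request seen for that session
  private readonly lastSeen = new Map<string, number>();

  /** Registers a new session or refreshes an existing one. */
  touch(sessionId: string, now: number = Date.now()): void {
    this.lastSeen.set(sessionId, now);
  }

  /** Returns true (and refreshes the idle timer) if the session is alive. */
  isActive(sessionId: string, now: number = Date.now()): boolean {
    const seen = this.lastSeen.get(sessionId);
    if (seen === undefined) return false;
    if (now - seen > IDLE_TTL_MS) {
      // Expired: evict lazily on lookup.
      this.lastSeen.delete(sessionId);
      return false;
    }
    this.lastSeen.set(sessionId, now);
    return true;
  }
}
```

Because the map lives in process memory, sessions in this sketch do not survive a restart and are not shared between replicas.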
### OIDC sub → appSlug resolver
If the LK API endpoint `GET /api/internal/users/{sub}/apps` is reachable,
the service resolves the user's apps and caches the result in Redis for
5 minutes. If the endpoint is unavailable, the service falls back to
`MCP_OIDC_DEMO_APP_SLUG`. When the user has multiple apps, each tool
accepts an optional `appSlug` parameter; if it is omitted, the first app
is used.
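The selection rules above can be condensed into one small function. This is an illustrative sketch: `pickAppSlug` is a hypothetical name, `null` is used here to model "LK API unreachable", and rejecting a requested slug that the user does not own is an assumption the text does not spell out.

```typescript
function pickAppSlug(
  resolvedApps: string[] | null, // null: the LK resolver could not be reached
  requested: string | undefined, // optional appSlug tool parameter
  demoSlug: string | undefined, // value of MCP_OIDC_DEMO_APP_SLUG
): string | undefined {
  // Resolver unavailable: fall back to the demo slug.
  if (resolvedApps === null) return demoSlug;

  if (requested !== undefined) {
    // Assumption: a slug outside the user's app list is refused,
    // not silently replaced by another app.
    return resolvedApps.includes(requested) ? requested : undefined;
  }

  // No explicit choice: the first resolved app wins.
  return resolvedApps[0];
}
```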
## Prompt-injection mitigation
Every string field from logs is wrapped in `<untrusted>…</untrusted>`
tags. Each tool description carries a system note instructing the LLM
**not** to follow instructions originating from such content. A regex
deny-list runs in parallel (see `src/redaction/promptInjection.ts`):
on a hit (`ignore previous instructions`, `system:`, `jailbreak`, …)
the result is flagged with `_meta.suspiciousContentBlocked = true` and
`_meta.suspiciousRule = <name>`.
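To make the two layers concrete, here is a minimal sketch of wrapping plus deny-list scanning. It is not the contents of `src/redaction/promptInjection.ts`: the rule names, the exact patterns, and the escaping of a nested closing tag are all assumptions; only the tag name and the two `_meta` fields come from the text above.

```typescript
// Hypothetical rules; the real deny-list may differ in names and patterns.
const DENY_LIST: { name: string; pattern: RegExp }[] = [
  { name: "ignore_previous", pattern: /ignore\s+(all\s+)?previous\s+instructions/i },
  { name: "system_prefix", pattern: /^\s*system\s*:/im },
  { name: "jailbreak", pattern: /jailbreak/i },
];

interface ScanMeta {
  suspiciousContentBlocked?: true;
  suspiciousRule?: string;
}

function wrapUntrusted(value: string): { text: string; _meta: ScanMeta } {
  // First matching rule decides the reported rule name.
  const hit = DENY_LIST.find((rule) => rule.pattern.test(value));
  const _meta: ScanMeta = hit
    ? { suspiciousContentBlocked: true, suspiciousRule: hit.name }
    : {};
  // Assumption: a closing tag inside the value is neutralised so that
  // log content cannot break out of the wrapper.
  const safe = value.replace(/<\/untrusted>/gi, "&lt;/untrusted&gt;");
  return { text: `<untrusted>${safe}</untrusted>`, _meta };
}
```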
## Audit log
- Append-only, schema `mcp_audit` (separate DB from ELS).
- Hash-chain: `prevHash` + `rowHash` (sha256) per `appId` partition.
- Monthly partitioning (RANGE `createdAt`). See
`prisma/migrations/init/migration.sql`.
- Non-blocking writes: if the DB is unavailable, tool calls keep working
(the failed write is logged at `warn` level and otherwise ignored).
### What is NOT logged
- Full API key (only an 8-char prefix).
- Log content (only tool-call metadata).
- Full IP (anonymised).
- Cookies, Authorization headers.
### Hash-chain integrity check
```bash
# Integrity check for app 'acme'
MCP_DATABASE_URL=postgres://... npm run audit:verify -- --app=acme
# With a date range
els-mcp verify-audit --app=acme --from=2026-05-01 --to=2026-05-17
```
Exit code `0` means the chain is intact; `1` means a break was detected (the offending row is printed).
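The idea behind the check can be sketched in a few lines. The field names `prevHash` and `rowHash` and the use of sha256 come from the audit log description; the row layout, the all-zero genesis hash, and the way fields are concatenated before hashing are assumptions, so this is not a drop-in replacement for `audit:verify`.

```typescript
import { createHash } from "node:crypto";

interface AuditRow {
  appId: string;
  payload: string; // serialised tool-call metadata (hypothetical field)
  prevHash: string;
  rowHash: string;
}

// Assumption: the first row of a partition links to an all-zero hash.
const GENESIS = "0".repeat(64);

function computeRowHash(prevHash: string, appId: string, payload: string): string {
  return createHash("sha256").update(`${prevHash}\n${appId}\n${payload}`).digest("hex");
}

function appendRow(chain: AuditRow[], appId: string, payload: string): AuditRow {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.rowHash : GENESIS;
  const row: AuditRow = { appId, payload, prevHash, rowHash: computeRowHash(prevHash, appId, payload) };
  chain.push(row);
  return row;
}

/** Returns the index of the first broken row, or -1 if the chain is intact. */
function verifyChain(chain: AuditRow[]): number {
  let expectedPrev = GENESIS;
  return chain.findIndex((row) => {
    const ok =
      row.prevHash === expectedPrev &&
      row.rowHash === computeRowHash(row.prevHash, row.appId, row.payload);
    expectedPrev = row.rowHash;
    return !ok;
  });
}
```

Editing any stored row changes its recomputed hash, so the break is reported at the first modified row.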
## Prisma setup
```bash
# Generate the client (output → node_modules/.prisma/mcp)
npm run prisma:generate
# Apply the migration (creates schemas + partitioned audit table)
psql $MCP_DATABASE_URL -f prisma/migrations/init/migration.sql
```
## Cache (Redis)
Look-aside cache for read-heavy endpoints. Each cache class has its own
TTL (see `src/cache/policies.ts`):
| Class | TTL | Tool(s) |
|---|---|---|
| `log_details` | 1h | `get_log_details` |
| `top_messages` | 2m | `top_error_messages` |
| `histogram` | 1m | `error_histogram` |
| `heatmap` | 5m | `error_heatmap` |
| `traffic_long` | 5m | `traffic_stats` |
| `search_recent` | 15s | `search_logs` |
| `list_apps` | 30s | `list_apps` |
| `stats_breakdown` | 2m | `error_stats_breakdown` |
| `baseline` | 5m | `baseline_compare` |
| `version_timeline` | 5m | `version_regression` |
| `grouped_errors` | 2m | `grouped_errors` |
All cache keys are tenant-prefixed
(`mcp:cache:{class}:{appSlug | k:keyPrefix}:{...}`), which protects
against cross-tenant data leaks.
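A key builder following that pattern might look like the sketch below. The `mcp:cache:` prefix and the `k:` marker are taken from the key format above; the `Tenant` type, the function name, and the way query parameters are serialised into the trailing `{...}` part are assumptions.

```typescript
// A tenant is identified either by app slug (OIDC path)
// or by an API-key prefix (ELS-key path).
type Tenant = { appSlug: string } | { keyPrefix: string };

function buildCacheKey(
  cacheClass: string,
  tenant: Tenant,
  params: Record<string, string | number>,
): string {
  const tenantPart = "appSlug" in tenant ? tenant.appSlug : `k:${tenant.keyPrefix}`;
  // Sort parameter names so logically identical queries share one key.
  const paramPart = Object.keys(params)
    .sort()
    .map((name) => `${name}=${params[name]}`)
    .join("&");
  return `mcp:cache:${cacheClass}:${tenantPart}:${paramPart}`;
}
```

Since the tenant segment is part of every key, two tenants issuing the same query never read each other's cached entry.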
**Graceful degradation**. If Redis is unavailable or
`MCP_CACHE_ENABLED=false`, every request transparently falls through to
ELS without errors. The Redis client is created with `lazyConnect: true`,
so the connect/PING round trip (sub-25 ms) doesn't block process startup.
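The fall-through behaviour can be sketched as a look-aside helper that treats every cache failure as a miss. `CacheClient` and `getOrLoad` are hypothetical names; a real implementation would sit on top of the Redis client and the TTL policies, neither of which is shown here.

```typescript
// Minimal interface the helper needs; a Redis wrapper would satisfy it.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function getOrLoad<T>(
  cache: CacheClient | null, // null models MCP_CACHE_ENABLED=false
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>, // the upstream ELS call
): Promise<T> {
  if (cache) {
    try {
      const hit = await cache.get(key);
      if (hit !== null) return JSON.parse(hit) as T;
    } catch {
      // Cache read failed or value was corrupt: treat it as a miss.
    }
  }
  const fresh = await load();
  if (cache) {
    try {
      await cache.set(key, JSON.stringify(fresh), ttlSeconds);
    } catch {
      // Cache write failed: the caller still gets the fresh value.
    }
  }
  return fresh;
}
```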
**Compression**. Values larger than 10 KB are gzip-compressed (prefix
`gz:`) and decompressed on read.
## Prometheus metrics
Endpoint: `GET /els/metrics`.
Key metrics:
- `mcp_requests_total{tool,status,cached}`
- `mcp_request_duration_seconds{tool}` — histogram (buckets:
0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30)
- `mcp_errors_total{tool,code}`
- `mcp_cache_hits_total{tool_class}`, `mcp_cache_misses_total{tool_class}`,
`mcp_cache_hit_ratio{tool_class}`
- `mcp_els_upstream_errors_total{endpoint,status}`
- `mcp_sse_connections_active`
- `mcp_redaction_applied_total{field}`
- `mcp_billing_events_total{appSlug,tier}`
### Prometheus scrape config
```yaml
scrape_configs:
- job_name: els-mcp
scrape_interval: 15s
metrics_path: /els/metrics
static_configs:
- targets: ['mcp-1.internal:3030', 'mcp-2.internal:3030']
```
### Grafana dashboard (minimal JSON)
```json
{
"title": "MCP — Overview",
"panels": [
{ "title": "RPS by tool",
"targets": [{ "expr": "sum by (tool) (rate(mcp_requests_total[1m]))" }] },
{ "title": "p95 latency",
"targets": [{ "expr": "histogram_quantile(0.95, sum by (tool, le) (rate(mcp_request_duration_seconds_bucket[5m])))" }] },
{ "title": "Cache hit ratio",
"targets": [{ "expr": "mcp_cache_hit_ratio" }] },
{ "title": "Upstream errors",
"targets": [{ "expr": "sum by (status) (rate(mcp_els_upstream_errors_total[5m]))" }] }
]
}
```
A full SRE dashboard with per-tool and per-tenant views is described in `07-observability.md`.
## Logs (Loki shipper)
pino logs → stderr (stdio mode) or stdout (HTTP mode) → Promtail → Loki.
```yaml
scrape_configs:
- job_name: els-mcp
static_configs:
- targets: [localhost]
labels:
job: els-mcp
service: els-mcp
__path__: /var/log/els-mcp/*.log
pipeline_stages:
- json:
expressions:
level: level
tool: tool
appSlug: appSlug
requestId: requestId
- labels:
level:
tool:
```
Sensitive fields (`*.token`, `*.apiKey`, `Authorization` headers, etc.)
are auto-replaced with `<REDACTED>` in pino logs (see
`src/observability/logger.ts`).
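For illustration, key-based redaction can be pictured as the standalone function below. It is not the logger's actual implementation (pino ships its own `redact` option, which `src/observability/logger.ts` may well use); the `redact` function and the `SENSITIVE_KEYS` set here are hypothetical and cover only the field names mentioned above.

```typescript
// Lower-cased key names to mask; a hypothetical subset for this example.
const SENSITIVE_KEYS = new Set(["token", "apikey", "authorization"]);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      // Match on the key name at any depth, case-insensitively.
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "<REDACTED>" : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```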
## OpenTelemetry tracing
Optional — toggled via `OTEL_EXPORTER_OTLP_ENDPOINT`. If unset, the SDK
isn't loaded at all (zero overhead).
Auto-instrumentation: HTTP, undici (ELS calls), ioredis, Express.
## Health endpoints
- `GET /els/healthz` — liveness (always 200 while the process is alive).
- `GET /els/readyz` — readiness: checks Redis ping + ELS upstream
reachability. Returns 503 if any dependency is down. Handlers live in
`src/http/routes/metrics.ts`.
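The readiness rule ("503 if any dependency is down") can be sketched as an aggregation over independent checks. The `readiness` function and the `Check` type are hypothetical; the real handler's response body and timeouts are not described in this document.

```typescript
// A check resolves when the dependency is healthy and rejects otherwise.
type Check = () => Promise<void>;

async function readiness(
  checks: Record<string, Check>,
): Promise<{ status: 200 | 503; failed: string[] }> {
  // Run all checks concurrently; map each outcome to null (ok) or its name (failed).
  const outcomes = await Promise.all(
    Object.entries(checks).map(([name, check]) =>
      check().then(
        () => null,
        () => name,
      ),
    ),
  );
  const failed = outcomes.filter((name): name is string => name !== null);
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```

A handler would call it with something like `{ redis: pingRedis, els: pingUpstream }`, where both function names are placeholders.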
## Publishing to npm
Releases are automated via GitLab CI:
1. Bump `version` in `package.json` (`0.y.x` → `0.y.(x+1)` for a bug fix,
`0.(y+1).0` for new features).
2. Commit the change on `main`.
3. Create a tag in the form `sdk/mcp/v<X.Y.Z>`:
```bash
git tag sdk/mcp/v0.3.2
git push origin main
git push origin sdk/mcp/v0.3.2
```
4. The GitLab job `publish:mcp` (see `.gitlab-ci.yml`) fires on the tag
and runs `npm version`, `npm run build`, `npm publish --access public`.
`NPM_TOKEN` must be set as a protected CI/CD variable in GitLab.
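Since the tag and `package.json` are bumped by hand, a small pre-push sanity check can catch a mismatch. The sketch below is a hypothetical helper, not part of the `publish:mcp` job; only the `sdk/mcp/v<X.Y.Z>` tag format comes from the steps above.

```typescript
/** Extracts "X.Y.Z" from a tag like "sdk/mcp/v0.3.2", or null if it doesn't match. */
function versionFromTag(tag: string): string | null {
  return /^sdk\/mcp\/v(\d+\.\d+\.\d+)$/.exec(tag)?.[1] ?? null;
}

/** True when the tag is well-formed and agrees with package.json's version. */
function tagMatchesPackage(tag: string, packageVersion: string): boolean {
  return versionFromTag(tag) === packageVersion;
}
```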
## Limitations / TODO
- **DCR (Dynamic Client Registration).** The rate-limit middleware is
ready (`src/http/middleware/dcrRateLimit.ts`), but the
`/oauth/register` endpoint is planned for v2. For now OIDC discovery
works without runtime client registration.
- **Mistral AI summary inside `explain_error`.** Currently the tool
returns error context without an AI wrapper (`aiAvailable=false`);
the client LLM synthesises the explanation from the returned data.
Native synthesis is on the roadmap.
- **OIDC sub → apps resolver via LK API.** The endpoint
`GET /api/internal/users/{sub}/apps` is expected on the LK backend;
until it's available the service falls back to
`MCP_OIDC_DEMO_APP_SLUG`.
- **Tier resolver via LK API.** The endpoint
`GET /api/internal/apps/{appSlug}/billing/tier` is expected on the LK
backend; until it's available the service uses `MCP_DEFAULT_TIER`.