
@inso_web/els-mcp


MCP server on top of the INSO Error Logs Service. Read-only tools (search, analytics, fingerprinting, correlations) for connecting Claude Desktop/Code and ChatGPT to error logs. Streamable HTTP transport plus stdio for launching via npx.

# Developing `@inso_web/els-mcp`

🇷🇺 **Documentation in Russian**: [docs/CONTRIBUTING.ru.md](https://unpkg.com/browse/@inso_web/els-mcp/docs/CONTRIBUTING.ru.md)

Reference for anyone working on the package itself or running it locally for debugging. End users only need [README.md](https://unpkg.com/browse/@inso_web/els-mcp/README.md).

## Local run

```bash
npm install
ELS_API_KEY=els_live_... npm run dev
```

`npm run dev` runs `tsx src/cli.ts` without a build step. After `npm run build` the binary lives at `dist/cli.js` and can be started via `node dist/cli.js`. In stdio mode, stdout is reserved for JSON-RPC, so all logs go to stderr.

### Tests and checks

```bash
npm test            # vitest run (unit tests)
npm run typecheck   # tsc --noEmit
```

## Environment variables

### ELS connection

| ENV | Default | Description |
|---|---|---|
| `ELS_API_KEY` | (required) | Bearer key (`els_live_*` or `els_test_*`) |
| `ELS_BASE_URL` | dev → `http://localhost:4010`, prod → `https://api.insoweb.ru/els` | Upstream ELS endpoint |
| `MCP_LOG_LEVEL` | `info` | pino level |
| `MCP_DISABLE_TOOLS` | none | CSV of tool names to disable |
| `MCP_UPSTREAM_TIMEOUT_MS` | `30000` | Timeout for a single ELS request |

### HTTP transport

| ENV | Default | Description |
|---|---|---|
| `MCP_TRANSPORT` | `stdio` | `stdio` or `http` |
| `MCP_HTTP_PORT` | `3030` | Listen port in HTTP mode |
| `MCP_PUBLIC_URL` | `https://mcp.insoweb.ru/els` | URL used in `WWW-Authenticate` and discovery |
| `MCP_OIDC_ISSUER` | `https://auth.insoweb.ru` | OIDC issuer |
| `MCP_OIDC_JWKS_URL` | derived | JWKS endpoint |
| `MCP_OIDC_AUDIENCE` | `els-mcp` | Expected `aud` claim |
| `MCP_OIDC_DEMO_APP_SLUG` | none | Fallback appSlug when the LK resolver is unavailable |
| `MCP_CORS_ORIGINS` | `https://claude.ai,https://chat.openai.com` | CSV of allowed origins (dev adds localhost) |

### Cache, observability, billing

| ENV | Default | Description |
|---|---|---|
| `MCP_REDIS_URL` | `redis://localhost:6379` | Redis URL |
| `MCP_CACHE_ENABLED` | `true` | Enable cache layer |
| `MCP_METRICS_ENABLED` | `true` | Expose `/els/metrics` |
| `MCP_CACHE_TTL_OVERRIDE_*` | none | Override TTL per class (seconds) |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | none | OTLP traces; no-op if unset |
| `MCP_LOG_PRETTY` | `true` in dev | Pretty-print pino |
| `MCP_REDACTION_ENABLED` | `true` | PII redaction toggle |
| `MCP_REDACTION_FIELDS` | none | CSV whitelist of fields (empty → redact everything) |
| `MCP_DATABASE_URL` | none | Postgres URL for audit/billing. No-op if empty |
| `MCP_DEFAULT_APP_ID` | `default` | Used in stdio mode |
| `MCP_DEFAULT_TIER` | `STANDARD` | Default tier for quota checks |
| `MCP_LK_API_BASE_URL` | none | LK API URL for the OIDC sub → apps and appSlug → tier resolvers (optional) |
| `MCP_LK_API_TOKEN` | none | Bearer token for the internal LK API |

## HTTP transport locally

```bash
MCP_TRANSPORT=http \
MCP_HTTP_PORT=3030 \
MCP_OIDC_DEMO_APP_SLUG=acme \
ELS_API_KEY=els_live_xxx \
npm run dev
```

OIDC can be pointed at a dev INSO Auth instance:

```bash
MCP_OIDC_ISSUER=http://localhost:4002 \
MCP_OIDC_JWKS_URL=http://localhost:4002/oidc/.well-known/jwks.json \
MCP_TRANSPORT=http npm run dev
```

### Routes

| Route | Purpose | Auth |
|---|---|---|
| `POST /els/mcp` | MCP JSON-RPC (Streamable HTTP) | Requires Bearer (ELS key or OIDC JWT) |
| `GET /els/mcp` | Long-lived SSE (server → client notifications) | Requires Bearer |
| `DELETE /els/mcp` | Terminate session | Requires Bearer |
| `GET /els/healthz` | Liveness probe (always 200) | Public |
| `GET /els/readyz` | Readiness probe (ELS upstream check) | Public |
| `GET /els/.well-known/oauth-protected-resource` | RFC 9728 resource metadata | Public |
| `GET /els/.well-known/mcp` | MCP discovery (tools list, transports) | Public |
| `GET /els/metrics` | Prometheus text format | Public |

### Quick curl checks

```bash
# Liveness
curl http://localhost:3030/els/healthz
# {"status":"ok"}

# Resource metadata
curl http://localhost:3030/els/.well-known/oauth-protected-resource

# MCP discovery
curl http://localhost:3030/els/.well-known/mcp

# Bearer ELS-key
curl -X POST http://localhost:3030/els/mcp \
  -H "Authorization: Bearer els_live_xxx" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","clientInfo":{"name":"curl","version":"1"},"capabilities":{}}}'
# The response carries an Mcp-Session-Id header; pass it on subsequent requests.
```

## Authentication

Two paths are supported (detected by the Bearer shape):

1. **ELS key**: `Authorization: Bearer els_(live|test)_<key>`, passed through to ELS. Used for CI/CD, server-to-server and debug.
2. **OIDC JWT**: `Authorization: Bearer <jwt>`, validated locally via the INSO Auth JWKS (`https://auth.insoweb.ru/oidc/.well-known/jwks.json`, RS256, audience `els-mcp`, scope `errors:mcp-read`).

If neither Bearer flavour is present, the server responds with 401 and `WWW-Authenticate: Bearer realm="els-mcp", resource_metadata="https://mcp.insoweb.ru/els/.well-known/oauth-protected-resource"`.

### Sessions

The `Mcp-Session-Id` header is returned on the first `initialize` call and must be sent with every subsequent request. Sessions expire after 30 minutes of inactivity. Storage is in-memory (Map); a Redis-backed variant is on the roadmap.

### OIDC sub → appSlug resolver

If the LK API endpoint `GET /api/internal/users/{sub}/apps` is reachable, the service resolves the user's apps and caches the result in Redis for 5 minutes. If the endpoint is unavailable, it falls back gracefully to `MCP_OIDC_DEMO_APP_SLUG`.

When the user has multiple apps, each tool accepts an optional `appSlug` parameter; if it is omitted, the first app is used.

## Prompt-injection mitigation

Every string field from logs is wrapped in `<untrusted>…</untrusted>` tags. Each tool description carries a system note instructing the LLM **not** to follow instructions originating from such content.
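The wrapping step described above could look roughly like the following. This is a minimal sketch, not the package's actual code: the names `wrapUntrusted` and `wrapString` are hypothetical, and escaping embedded tags (so that log content cannot close the wrapper early) is an assumption the original text does not confirm.

```typescript
// Hypothetical helper: names and the escaping strategy are assumptions,
// not taken from the package source.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

const OPEN = "<untrusted>";
const CLOSE = "</untrusted>";

function wrapString(value: string): string {
  // Neutralise embedded wrapper tags so log content cannot break out of the wrapper.
  const safe = value
    .split(OPEN).join("&lt;untrusted&gt;")
    .split(CLOSE).join("&lt;/untrusted&gt;");
  return `${OPEN}${safe}${CLOSE}`;
}

function wrapUntrusted(value: Json): Json {
  if (typeof value === "string") return wrapString(value);
  if (Array.isArray(value)) return value.map((item) => wrapUntrusted(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = wrapUntrusted(inner);
    }
    return out;
  }
  return value; // numbers, booleans and null pass through unchanged
}
```

Only string leaves are wrapped; object keys and non-string values are left alone, so the shape of a tool response stays the same for the client.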
In parallel, a regex deny-list runs (see `src/redaction/promptInjection.ts`). On a hit (`ignore previous instructions`, `system:`, `jailbreak`, …) the response is marked with `_meta.suspiciousContentBlocked = true` and `_meta.suspiciousRule = <name>`.

## Audit log

- Append-only, schema `mcp_audit` (separate DB from ELS).
- Hash-chain: `prevHash` + `rowHash` (sha256) per `appId` partition.
- Monthly partitioning (RANGE `createdAt`). See `prisma/migrations/init/migration.sql`.
- Non-blocking writes: if the DB is unavailable, tool calls keep working and the failure is only logged at warn level.

### What is NOT logged

- Full API key (only an 8-char prefix).
- Log content (only tool-call metadata).
- Full IP (anonymised).
- Cookies, Authorization headers.

### Hash-chain integrity check

```bash
# Integrity check for app 'acme'
MCP_DATABASE_URL=postgres://... npm run audit:verify -- --app=acme

# With a date range
els-mcp verify-audit --app=acme --from=2026-05-01 --to=2026-05-17
```

Exit code `0` means the chain is intact; `1` means a break was detected (the offending row is printed).

## Prisma setup

```bash
# Generate the client (output → node_modules/.prisma/mcp)
npm run prisma:generate

# Apply the migration (creates schemas + partitioned audit table)
psql "$MCP_DATABASE_URL" -f prisma/migrations/init/migration.sql
```

## Cache (Redis)

Look-aside cache for read-heavy endpoints.
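As an illustration of the look-aside pattern, here is a minimal sketch. The `CacheStore` interface, the `cached` helper and the in-memory `MemoryStore` are hypothetical stand-ins for a Redis client; the package's real cache layer may be structured differently.

```typescript
// Hypothetical interface standing in for a Redis client; not the package's API.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Look-aside read: try the cache first, otherwise load upstream and store the result.
async function cached<T>(
  store: CacheStore,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await store.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch {
    // Store unavailable: treat it as a miss and fall through to the upstream call.
  }
  const fresh = await load();
  try {
    await store.set(key, JSON.stringify(fresh), ttlSeconds);
  } catch {
    // A failed cache write must not fail the request.
  }
  return fresh;
}

// In-memory store used here only so the sketch runs without Redis.
class MemoryStore implements CacheStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= Date.now()) return null;
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}
```

Calling `cached(store, key, 60, loadFromUpstream)` twice within the TTL invokes `loadFromUpstream` only once; if the store throws, every call simply goes upstream.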
TTL per class (see `src/cache/policies.ts`):

| Class | TTL | Tool(s) |
|---|---|---|
| `log_details` | 1h | `get_log_details` |
| `top_messages` | 2m | `top_error_messages` |
| `histogram` | 1m | `error_histogram` |
| `heatmap` | 5m | `error_heatmap` |
| `traffic_long` | 5m | `traffic_stats` |
| `search_recent` | 15s | `search_logs` |
| `list_apps` | 30s | `list_apps` |
| `stats_breakdown` | 2m | `error_stats_breakdown` |
| `baseline` | 5m | `baseline_compare` |
| `version_timeline` | 5m | `version_regression` |
| `grouped_errors` | 2m | `grouped_errors` |

All cache keys are tenant-prefixed: `mcp:cache:{class}:{appSlug | k:keyPrefix}:{...}`. This protects against cross-tenant data leaks.

**Graceful degradation**. If Redis is unavailable or `MCP_CACHE_ENABLED=false`, every request transparently falls through to ELS without errors. The Redis connect/PING (sub-25 ms) does not block process startup (`lazyConnect: true`).

**Compression**. Values larger than 10 KB are gzip-compressed (prefix `gz:`) and decompressed on read.

## Prometheus metrics

Endpoint: `GET /els/metrics`.
Key metrics:

- `mcp_requests_total{tool,status,cached}`
- `mcp_request_duration_seconds{tool}`: histogram (buckets: 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30)
- `mcp_errors_total{tool,code}`
- `mcp_cache_hits_total{tool_class}`, `mcp_cache_misses_total{tool_class}`, `mcp_cache_hit_ratio{tool_class}`
- `mcp_els_upstream_errors_total{endpoint,status}`
- `mcp_sse_connections_active`
- `mcp_redaction_applied_total{field}`
- `mcp_billing_events_total{appSlug,tier}`

### Prometheus scrape config

```yaml
scrape_configs:
  - job_name: els-mcp
    scrape_interval: 15s
    metrics_path: /els/metrics
    static_configs:
      - targets: ['mcp-1.internal:3030', 'mcp-2.internal:3030']
```

### Grafana dashboard (minimal JSON)

```json
{
  "title": "MCP — Overview",
  "panels": [
    { "title": "RPS by tool", "targets": [{ "expr": "sum by (tool) (rate(mcp_requests_total[1m]))" }] },
    { "title": "p95 latency", "targets": [{ "expr": "histogram_quantile(0.95, sum by (tool, le) (rate(mcp_request_duration_seconds_bucket[5m])))" }] },
    { "title": "Cache hit ratio", "targets": [{ "expr": "mcp_cache_hit_ratio" }] },
    { "title": "Upstream errors", "targets": [{ "expr": "sum by (status) (rate(mcp_els_upstream_errors_total[5m]))" }] }
  ]
}
```

A full SRE dashboard with per-tool and per-tenant views is described in `07-observability.md`.

## Logs (Loki shipper)

pino logs → stderr (stdio mode) or stdout (HTTP mode) → Promtail → Loki.

```yaml
scrape_configs:
  - job_name: els-mcp
    static_configs:
      - targets: [localhost]
        labels:
          job: els-mcp
          service: els-mcp
          __path__: /var/log/els-mcp/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            tool: tool
            appSlug: appSlug
            requestId: requestId
      - labels:
          level:
          tool:
```

Sensitive fields (`*.token`, `*.apiKey`, `Authorization` headers, etc.) are auto-replaced with `<REDACTED>` in pino logs (see `src/observability/logger.ts`).

## OpenTelemetry tracing

Optional; toggled via `OTEL_EXPORTER_OTLP_ENDPOINT`. If unset, the SDK isn't loaded at all (zero overhead).
Auto-instrumentation: HTTP, undici (ELS calls), ioredis, Express.

## Health endpoints

- `GET /els/healthz`: liveness (always 200 while the process is alive).
- `GET /els/readyz`: readiness; checks Redis ping + ELS upstream reachability. Returns 503 if any dependency is down.

Handlers live in `src/http/routes/metrics.ts`.

## Publishing to npm

Releases are automated via GitLab CI:

1. Bump `version` in `package.json` (`0.3.x` → `0.3.(x+1)` for a bug-fix, `0.(x+1).0` for new features).
2. Commit the change on `main`.
3. Create a tag in the form `sdk/mcp/v<X.Y.Z>`:

   ```bash
   git tag sdk/mcp/v0.3.2
   git push origin main
   git push origin sdk/mcp/v0.3.2
   ```

4. The GitLab job `publish:mcp` (see `.gitlab-ci.yml`) fires on the tag and runs `npm version`, `npm run build`, `npm publish --access public`.

`NPM_TOKEN` must be set as a protected CI/CD variable in GitLab.

## Limitations / TODO

- **DCR (Dynamic Client Registration).** The rate-limit middleware is ready (`src/http/middleware/dcrRateLimit.ts`), but the `/oauth/register` endpoint is planned for v2. For now OIDC discovery works without runtime client registration.
- **Mistral AI summary inside `explain_error`.** Currently the tool returns error context without an AI wrapper (`aiAvailable=false`); the client LLM synthesises the explanation from the returned data. Native synthesis is on the roadmap.
- **OIDC sub → apps resolver via LK API.** The endpoint `GET /api/internal/users/{sub}/apps` is expected on the LK backend; until it's available the service falls back to `MCP_OIDC_DEMO_APP_SLUG`.
- **Tier resolver via LK API.** The endpoint `GET /api/internal/apps/{appSlug}/billing/tier` is expected on the LK backend; until it's available the service uses `MCP_DEFAULT_TIER`.
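The fallback behaviour in the last two items can be illustrated with a short sketch. Everything here is an assumption made for illustration: `resolveTier` and the `TierFetcher` type are hypothetical names, and the real resolver may handle errors, caching and timeouts differently.

```typescript
// Hypothetical types and names; the actual resolver in the package may differ.
type Tier = string;
type TierFetcher = (appSlug: string) => Promise<Tier>;

// Ask the LK API for the tier when a fetcher is configured; otherwise, or on
// any failure, fall back to the configured default (MCP_DEFAULT_TIER).
async function resolveTier(
  appSlug: string,
  fetchTier: TierFetcher | undefined,
  defaultTier: Tier,
): Promise<Tier> {
  if (!fetchTier) return defaultTier;
  try {
    return await fetchTier(appSlug);
  } catch {
    return defaultTier;
  }
}
```

With no fetcher configured, or with one that rejects, `resolveTier("acme", fetcher, "STANDARD")` resolves to `"STANDARD"`, so tool calls are not blocked by an unavailable backend.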