PR #130 opened: perf(mcp): cache + indexes + telemetry for the public MCP
Summary
Three independent fixes to make the public MCP server stop timing out under sustained use, plus the observability needed to keep it that way.
- Caching — Search, Compare, Neighborhood, Briefing, Trends, Velocity all wrap in
Cache::remember(15-30min for user-driven queries, 5min for time-sensitive briefings). Compare keys per-query so popular terms reuse the Search summary across compare calls. - Trigram indexes —
pg_trgmGIN on `raw_items.title` and `raw_items.url` so bidirectional ILIKE (`%foo%`) stops doing a seq scan over 4.7M rows. Plus a standalone `fetched_at` B-tree for the GetVelocity cross-platform CTE which filters only by time. - Timeouts — `fastcgi_read_timeout` and PHP `max_execution_time` bumped from 60s → 120s so cold-cache calls have room while the indexes finish building.
- Telemetry — `mcp_call_logs` table + `McpRequestLogger` middleware on both `/mcp/public` and `/mcp/devtrends` capturing duration, query count, cache hits, errors, status code per call.
- Dashboard — `/mcp-stats` (auth-gated) rendering p50/p95/p99 per tool, cache hit rate, error rate, recent errors.
Why now
The public MCP is hitting 504s and small (~127-byte) JSON-RPC error envelopes under sustained load. Container logs showed nginx `upstream timed out` on POST `/mcp/public` and inconsistent response sizes (127 → 7,680 bytes). Diagnosis: every Search call fans out into 4 sequential ILIKE scans of the 4.7M-row `raw_items` table, and Compare multiplies that up to 5×. No caching on the 6 most-called tools. Indexes don't cover `%foo%` lookups.
Deployment — important
Auto-migrate is currently disabled on the prod app (`AUTO_MIGRATE=false` env var set on Coolify) because `CREATE INDEX CONCURRENTLY` on the trigram indexes takes 5-15 min each on 4.7M rows. After merge:
- Deploy as normal (entrypoint will skip migrations).
- Exec into the container and run `php artisan migrate --force` manually.
- Flip `AUTO_MIGRATE` back to `true`.
Test plan
- Pest suite green (4 new tests passing, 1 skipped on sqlite for percentile_cont)
- Pint formatted
- After deploy + manual migrate, verify `raw_items_title_trgm_idx` and `raw_items_url_trgm_idx` exist in `pg_indexes`
- Hit `/mcp-stats` and verify cache hit rate climbs above 0% as repeat queries land
- Run a few `search` calls and confirm first call DB-heavy, second call cache_hit=true in logs
- Confirm no more 504s on `/mcp/public` under normal traffic