How We Manage 49 AI Agents in Production
We don't have a human engineering team. We have 49 AI agents — autonomous programs that write code, run QA, publish content, monitor infrastructure, and handle operations. They run 24/7 on cron schedules, responding to events, building features, and shipping to production without human intervention.
This is the story of how we built the observability layer to keep them working together instead of against each other — and the production incidents that made us realize we needed it.
The Fleet
Our agent fleet breaks down into five roles:
- Nano — CEO/strategist. Makes decisions, delegates work, manages memory and priorities.
- Crank — Engineering. Picks up build tasks every 4 hours, writes code, deploys to Vercel.
- Mimo — Content & distribution. Writes tweets, manages social presence, runs outreach.
- Flux — QA & testing. Builds test suites, validates deploys, catches regressions.
- Patch — Site reliability. Monitors uptime, investigates failures, handles escalations.
Plus 26 cron jobs that fire on schedules ranging from every 15 minutes to daily — heartbeats, build sprints, content runs, monitoring sweeps, and dogfood reporters.
What Broke First: The Silent Failure Problem
The first thing that went wrong wasn't a crash. It was silence. We added frontend analytics events — email_captured, monitoring_cta_clicked, suite_cta_clicked — and deployed them to production. The API silently returned 400 for event types not in the server-side allowlist. No error in the UI. No alert. No log.
We ran for days thinking our conversion tracking was live. It wasn't. Zero data collected.
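The fix had two parts: add the missing event types to the allowlist, and make rejections loud. A minimal sketch of that second part — an ingest-side check that returns rejected events with a reason instead of dropping them silently (names and shape are illustrative, not Fluq's actual handler):

```python
# Hypothetical allowlist check: unknown event types are surfaced with a
# descriptive reason instead of vanishing into a silent 400.
ALLOWED_EVENT_TYPES = {"action", "error", "decision", "heartbeat"}

def validate_events(events):
    """Split a batch into (accepted, rejected) so callers can log and alert."""
    accepted, rejected = [], []
    for event in events:
        if event.get("eventType") in ALLOWED_EVENT_TYPES:
            accepted.append(event)
        else:
            rejected.append({
                "event": event,
                "reason": f"unknown eventType {event.get('eventType')!r}, "
                          f"allowed: {sorted(ALLOWED_EVENT_TYPES)}",
            })
    return accepted, rejected

accepted, rejected = validate_events([
    {"eventType": "action", "agentId": "crank-builder"},
    {"eventType": "email_captured", "agentId": "web-frontend"},
])
# the second event is rejected with an explicit reason, not dropped
```

The point is the return shape: a rejection that carries a reason can be logged and alerted on, which is exactly what was missing when the tracking failed silently.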
What Broke Next: Agents Working Against Each Other
When you have multiple autonomous agents writing to the same codebase and deploying to the same infrastructure, conflicts are inevitable. Crank would build a feature while Flux was running tests on the previous version. Mimo would update the landing page while Crank was deploying a backend change.
Without visibility into what each agent was doing, we couldn't tell if a test failure was a real bug or a deployment race condition. We couldn't tell if a cost spike was legitimate work or a runaway loop.
This is the problem that led us to build Fluq.
The Observability Stack
Our agents now report events to a central API. Every action, every decision, every deployment gets recorded with:
- Agent identity — who did it
- Event type — what happened (action, error, decision, heartbeat)
- Resource tracking — what files/services were touched
- Cost data — estimated LLM spend per action
- Timestamps — when, how long it took
```bash
curl -X POST https://fluq.ai/api/v1/events/ingest \
  -H "Authorization: Bearer fo_your_api_key" \
  -H "Content-Type: application/json" \
  -d '[{
    "agentId": "crank-builder",
    "eventType": "action",
    "payload": {
      "description": "Deployed blog system to production"
    },
    "estimatedCostUsd": 0.85,
    "durationMs": 480000
  }]'
```

The dashboard shows real-time agent status, cost trends, event timelines, and — crucially — conflict detection. When two agents touch the same resource within a time window, we surface it as a potential conflict.
Cost Tracking: The $64 Wake-Up Call
In our first week of full fleet operation, we burned through $64 in LLM costs across 222 events. That's not catastrophic, but it's also not nothing — and it was opaque. We couldn't tell which agents were expensive, which cron jobs were wasteful, or whether the spend was proportional to value delivered.
Now every event carries estimatedCostUsd. The dashboard aggregates cost by agent, by time window, by event type. We can see that Crank's 4-hour build sprints cost ~$0.85 each while Mimo's content runs are ~$0.30. We can set budget alerts before things get out of hand.
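Once every event carries a cost field, the rollup is a one-pass aggregation. A sketch of the per-agent version, using the same field names as the ingest payload (the aggregation itself is illustrative, not Fluq's dashboard code):

```python
from collections import Counter

def cost_by_agent(events):
    """Sum estimatedCostUsd per agentId across a batch of events."""
    totals = Counter()
    for e in events:
        totals[e["agentId"]] += e.get("estimatedCostUsd", 0.0)
    return dict(totals)

spend = cost_by_agent([
    {"agentId": "crank-builder", "estimatedCostUsd": 0.85},
    {"agentId": "crank-builder", "estimatedCostUsd": 0.85},
    {"agentId": "mimo-content", "estimatedCostUsd": 0.30},
])
# two build sprints and one content run, attributed per agent
```

The same shape extends to grouping by time window or event type: swap the `agentId` key for a bucketed timestamp or the `eventType` field.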
What We Learned
Running an AI agent fleet in production is closer to managing a distributed system than managing a team. The patterns that matter:
- Observe everything. Agents don't complain. They fail silently, loop endlessly, or produce garbage — and you won't know unless you're watching.
- Track costs per agent. A runaway agent with an API key can burn through budget fast. Per-agent cost attribution is table stakes.
- Detect conflicts automatically. When agents share resources, they'll step on each other. Manual coordination doesn't scale — you need automated detection.
- Heartbeats are non-negotiable. If an agent goes silent, you need to know immediately — not when you happen to check the logs.
- Build the tool you need. We looked at AgentOps, LangSmith, and Langfuse. None of them solved the fleet coordination problem. So we built Fluq.
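The heartbeat rule above reduces to a staleness check: an agent is silent if its last heartbeat is older than its expected interval plus some slack. A minimal sketch (the intervals and slack are illustrative, not our actual cron schedule):

```python
def silent_agents(last_heartbeat_ms, now_ms, interval_ms, slack_ms=60_000):
    """Return agents whose last heartbeat is older than their expected
    interval plus slack — i.e. the ones that have gone silent."""
    return sorted(
        agent for agent, ts in last_heartbeat_ms.items()
        if now_ms - ts > interval_ms[agent] + slack_ms
    )

silent = silent_agents(
    last_heartbeat_ms={"crank-builder": 0, "patch-sre": 3_500_000},
    now_ms=3_600_000,  # one hour in
    interval_ms={"crank-builder": 900_000, "patch-sre": 900_000},  # 15 min
)
# crank-builder missed several 15-minute heartbeats; patch-sre is current
```

Run on a schedule, this turns "we happened to check the logs" into a page the moment an agent misses its window.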
Try It
Fluq is live. Free tier, no credit card. If you're running AI agents in production — whether it's 2 or 200 — you can start observing them in about 2 minutes.