NOC Intelligence and Alert Batching
The noc_intelligence.go system is the ingestion and correlation engine for Network Operations Center (NOC) alerts, most of which originate from email-based monitoring tools.
Alert Ingestion and Parsing
The system listens for incoming emails from the Sysafe or London-Hosting (LH) monitoring tools and parses each message's subject and sender to categorize the alert.
- Categories: Automatically identifies alerts as Resource, Login, Status, or Backup issues.
- SITREP Generation: Converts raw alert data into a structured SITREP (Status, Impact, History) format for AI consumption.
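The categorization and SITREP steps above could be sketched roughly as follows. This is a minimal illustration, not the code in noc_intelligence.go: the keyword table, the Category type, the SITREP struct fields, and the Categorize and BuildSITREP functions are all assumptions made for the example, and the real matching rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Category is a hypothetical type for the four alert classes named above.
type Category string

const (
	Resource Category = "Resource"
	Login    Category = "Login"
	Status   Category = "Status"
	Backup   Category = "Backup"
	Unknown  Category = "Unknown" // assumed fallback, not named in the source
)

// keywordRules is an assumed keyword table; rules are checked in order.
var keywordRules = []struct {
	cat      Category
	keywords []string
}{
	{Backup, []string{"backup"}},
	{Login, []string{"login", "ssh", "authentication"}},
	{Resource, []string{"cpu", "memory", "disk", "load"}},
	{Status, []string{"down", "offline", "unreachable", "recovered"}},
}

// Categorize returns the first category whose keyword appears in the subject.
func Categorize(subject string) Category {
	s := strings.ToLower(subject)
	for _, rule := range keywordRules {
		for _, kw := range rule.keywords {
			if strings.Contains(s, kw) {
				return rule.cat
			}
		}
	}
	return Unknown
}

// SITREP mirrors the Status / Impact / History structure described above.
type SITREP struct {
	Status  string
	Impact  string
	History []string
}

// BuildSITREP turns raw alert fields into a structured report.
func BuildSITREP(sender, subject string, history []string) SITREP {
	return SITREP{
		Status:  fmt.Sprintf("[%s] %s", Categorize(subject), subject),
		Impact:  fmt.Sprintf("reported by %s", sender),
		History: history,
	}
}

func main() {
	r := BuildSITREP("monitor@example.com", "High CPU usage on web01", nil)
	fmt.Printf("%+v\n", r)
}
```

Checking rules in a fixed order keeps the result deterministic when a subject matches more than one category, for example a backup failure caused by a full disk.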
Debounced Batching
To prevent "alert fatigue" and token exhaustion from redundant AI queries, the system uses a debounced batching mechanism.
- Initial Alert: An alert is received and logged in memory.
- Debounce Window: A 10-minute timer begins.
- Aggregation: Subsequent related alerts received within this window are grouped together.
- Dispatch: Once the window expires, the entire batch is sent to the appropriate AI agent as a single analytical session.
Intelligent Routing
The NOC Intelligence system dynamically routes batches to the correct agent based on the source infrastructure:
- Internal Servers (e.g., Gadgetzan, Nikki container) → Routed to the sasha agent (Internal Infra Specialist).
- External Servers (e.g., Sysafe, LH) → Routed to the sysadmin agent.
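The routing rule above amounts to a lookup on the source of the batch. The sketch below assumes a hypothetical RouteBatch function, agent identifiers spelled "sasha" and "sysadmin", and a hard-coded set of internal sources built from the examples in this section. Treating every unlisted source as external is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Agent identifiers; the exact strings are assumed for illustration.
const (
	AgentSasha    = "sasha"
	AgentSysadmin = "sysadmin"
)

// internalSources is an assumed lookup set based on the examples above.
var internalSources = map[string]bool{
	"gadgetzan": true,
	"nikki":     true,
}

// RouteBatch returns the agent for a batch. Any source not listed as
// internal is treated as external (an assumption of this sketch).
func RouteBatch(source string) string {
	if internalSources[strings.ToLower(strings.TrimSpace(source))] {
		return AgentSasha
	}
	return AgentSysadmin
}

func main() {
	for _, src := range []string{"Gadgetzan", "Sysafe"} {
		fmt.Printf("%s -> %s\n", src, RouteBatch(src))
	}
}
```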
Component Diagram: NOC Alert Flow
graph TD
Email[Incoming Monitoring Email] --> Parse[Extract Subject/Sender]
Parse --> Cat[Categorize Alert]
Cat --> Batch{In 10-Min Window?}
Batch -- Yes --> Group[Add to Batch]
Batch -- No --> New[Start New Batch Timer]
Group --> Timer[Timer Expires]
New --> Timer
Timer --> Route{Internal vs External?}
Route -- Internal --> Sasha[Dispatch to Sasha Agent]
Route -- External --> Sys[Dispatch to Sysadmin Agent]
Guidance for AI Agents
- Batch Context: When you receive a NOC intelligence report, keep in mind that it may cover multiple events over a 10-minute period. Look for patterns (e.g., CPU spiked, then memory dropped) rather than treating the report as a single point in time.
- Correlate: Cross-reference the NOC alert with the telemetry_poller data to get a complete picture of the cluster's health.