NOC Intelligence and Alert Batching

noc_intelligence.go implements the ingestion and correlation engine for Network Operations Center (NOC) alerts, most of which arrive as emails from monitoring tools.

Alert Ingestion and Parsing

The system listens for incoming emails from the Sysafe and London-Hosting monitoring tools and parses each message's subject and sender to categorize the alert.

Categories: Automatically identifies alerts as Resource, Login, Status, or Backup issues.
SITREP Generation: Converts raw alert data into a structured SITREP (Status, Impact, History) format for AI consumption.
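
The parsing, categorization, and SITREP steps above could be sketched roughly as follows. This is a minimal illustration, not the code in noc_intelligence.go: the keyword table, the Unknown fallback category, and the names ParseAlert, Categorize, and SITREP.Render are all assumptions introduced here. Only the four category names and the Status / Impact / History layout come from the text.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// Category is one of the four alert categories named in the text.
type Category string

const (
	CategoryResource Category = "Resource"
	CategoryLogin    Category = "Login"
	CategoryStatus   Category = "Status"
	CategoryBackup   Category = "Backup"
	CategoryUnknown  Category = "Unknown" // hypothetical fallback, not from the source
)

// categoryKeywords is an assumed keyword table; the real matching rules
// are not documented. Order matters: the first matching entry wins.
var categoryKeywords = []struct {
	cat      Category
	keywords []string
}{
	{CategoryBackup, []string{"backup"}},
	{CategoryLogin, []string{"login", "ssh", "auth"}},
	{CategoryResource, []string{"cpu", "memory", "disk", "load"}},
	{CategoryStatus, []string{"down", "offline", "unreachable", "status"}},
}

// ParseAlert extracts the sender address and subject from a raw email.
func ParseAlert(raw string) (sender, subject string, err error) {
	msg, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return "", "", err
	}
	from, err := mail.ParseAddress(msg.Header.Get("From"))
	if err != nil {
		return "", "", err
	}
	return from.Address, msg.Header.Get("Subject"), nil
}

// Categorize returns the first category whose keyword appears in the subject.
func Categorize(subject string) Category {
	s := strings.ToLower(subject)
	for _, entry := range categoryKeywords {
		for _, kw := range entry.keywords {
			if strings.Contains(s, kw) {
				return entry.cat
			}
		}
	}
	return CategoryUnknown
}

// SITREP mirrors the Status / Impact / History structure described above.
type SITREP struct {
	Status  string
	Impact  string
	History []string
}

// Render formats the SITREP as plain text suitable for an AI prompt.
func (r SITREP) Render() string {
	var b strings.Builder
	fmt.Fprintf(&b, "STATUS: %s\nIMPACT: %s\nHISTORY:\n", r.Status, r.Impact)
	for _, h := range r.History {
		fmt.Fprintf(&b, "- %s\n", h)
	}
	return b.String()
}

func main() {
	raw := "From: Monitor <alerts@example.com>\r\nSubject: High CPU usage on web01\r\n\r\nLoad is 12.5\r\n"
	sender, subject, err := ParseAlert(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	report := SITREP{
		Status:  fmt.Sprintf("%s alert from %s", Categorize(subject), sender),
		Impact:  "unknown until correlated",
		History: []string{subject},
	}
	fmt.Print(report.Render())
}
```

Substring matching is crude (for example, "download" would match both "load" and "down"), so a production matcher would likely need stricter rules per monitoring tool.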

Debounced Batching

To prevent "alert fatigue" and token exhaustion from redundant AI queries, the system uses a debounced batching mechanism.

Initial Alert: An alert is received and logged in memory.
Debounce Window: A 10-minute timer begins.
Aggregation: Subsequent related alerts received within this window are grouped together.
Dispatch: Once the window expires, the entire batch is sent to the appropriate AI agent as a single analytical session.

Intelligent Routing

The NOC Intelligence system dynamically routes batches to the correct agent based on the source infrastructure:

Internal Servers (e.g., Gadgetzan, Nikki container) → Routed to the sasha agent (Internal Infra Specialist).
External Servers (e.g., Sysafe, London-Hosting (LH)) → Routed to the sysadmin agent.
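
The routing rule above amounts to a lookup on the batch's source. The sketch below is an assumption-laden illustration: the internalSources table is built only from the examples given in the text, RouteAgent is a hypothetical name, and treating every unlisted source as external is a guess at the default behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Agent names as given in the text.
const (
	AgentSasha    = "sasha"    // Internal Infra Specialist
	AgentSysadmin = "sysadmin" // external servers
)

// internalSources is an assumed lookup table built from the document's examples.
var internalSources = map[string]bool{
	"gadgetzan": true,
	"nikki":     true,
}

// RouteAgent returns the agent that should receive a batch from source.
// Any source not listed as internal is treated as external (an assumption).
func RouteAgent(source string) string {
	if internalSources[strings.ToLower(strings.TrimSpace(source))] {
		return AgentSasha
	}
	return AgentSysadmin
}

func main() {
	for _, src := range []string{"Gadgetzan", "Sysafe", "LH"} {
		fmt.Printf("%s -> %s\n", src, RouteAgent(src))
	}
}
```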

Component Diagram: NOC Alert Flow

graph TD
    Email[Incoming Monitoring Email] --> Parse[Extract Subject/Sender]
    Parse --> Cat[Categorize Alert]
    Cat --> Batch{In 10-Min Window?}
    Batch -- Yes --> Group[Add to Batch]
    Batch -- No --> New[Start New Batch Timer]
    
    Group --> Timer[Timer Expires]
    New --> Timer
    
    Timer --> Route{Internal vs External?}
    Route -- Internal --> Sasha[Dispatch to Sasha Agent]
    Route -- External --> Sys[Dispatch to Sysadmin Agent]

Guidance for AI Agents

Batch Context: A NOC intelligence report may cover multiple events from a 10-minute window. Look for patterns across the batch (e.g., a CPU spike followed by a memory drop) rather than treating the report as a single point in time.
Correlate: Cross-reference the NOC alert with the telemetry_poller data to get a complete picture of the cluster's health.