Enrich DevOps and Cloud Data with Advanced AI

Explain anomalies, estimate blast radius, and recommend the next runbook step from your telemetry and public patterns
Incidents and cloud bills escalate when signals hide.
NiftyBot explains anomalies, estimates blast radius, and recommends the next runbook step from your telemetry and public patterns. Reliability up, spend down.
Complexity: High
Availability Blast Radius
When things fall over, minutes matter. This call estimates blast radius from regional errors and suggests the quickest traffic shift. Users feel a blip, not a blackout.
REQUEST

{ "source": { "cloud": "Azure", "cdn": "Cloudflare", "error_rate_by_region": { "East US": 0.19, "West Europe": 0.03, "Southeast Asia": 0.02 }, "critical_paths": ["/checkout", "/auth"] }, "requests": [ { "field_name": "impact_estimate" }, { "field_name": "routing_recommendation" } ] }
RESPONSE

{ "enrichments": [ { "field_name": "impact_estimate", "value": "Elevated errors confined to East US; checkout/auth impacted for ~18–22 percent of US traffic.", "confidence": 0.76, "method": "assessment", "reasoning": "Regional error rate and traffic distribution imply partial outage limited to a single Azure region." }, { "field_name": "routing_recommendation", "value": "Fail over East US to Central US with health-check gating; raise per-edge cache TTLs for static dependencies.", "confidence": 0.74, "method": "reasoning", "reasoning": "Shifts affected traffic quickly and reduces origin pressure during recovery." } ] }
Complexity: Medium
Cost Anomaly Triage
Not every spike needs a war room. This call explains an AWS cost anomaly, identifies the likely cause, and suggests the first corrective action. FinOps and SRE align on a clear first move.
REQUEST

{ "source": { "cloud": "AWS", "monitoring": "Datadog", "anomaly_window": "2025-08-28T00:00:00Z/2025-08-29T00:00:00Z", "spend_spike_pct": 38, "top_services": ["EC2", "EKS", "S3"] }, "requests": [ { "field_name": "likely_cause" }, { "field_name": "first_action" } ] }
RESPONSE

{ "enrichments": [ { "field_name": "likely_cause", "value": "EKS node group scale-out without corresponding scale-in due to pending pods from a failed job.", "confidence": 0.72, "method": "assessment", "reasoning": "Pattern matches spike timing with control plane events and sustained EC2 hours." }, { "field_name": "first_action", "value": "Drain orphaned nodes and enforce cluster autoscaler scale-in; add TTL to batch job.", "confidence": 0.7, "method": "reasoning", "reasoning": "Closes the loop on runaway capacity and prevents recurrence." } ] }