Incident Management
Learn how to effectively manage incidents in Harness AI SRE.
Overview
Incidents in Harness AI SRE help you:
- Track and manage service disruptions
- Coordinate response efforts
- Document incident timelines
- Automate remediation steps
- Generate post-mortems
Incident Components
Basic Information
incident:
  title: "[service] - [issue_type]"
  severity: [severity]
  status: [status]
  owner: [team]
  created_at: [timestamp]
Service Context
service_context:
  name: [service]
  environment: [environment]
  team: [team]
  components:
    - name: [component]
      status: [status]
Timeline Events
timeline:
  - timestamp: [timestamp]
    type: "detection"
    description: "Issue detected by [source]"
  - timestamp: [timestamp]
    type: "action"
    description: "Executed [runbook] in response"
  - timestamp: [timestamp]
    type: "update"
    description: "Status changed to [status]"
Incident Creation
From Alert
alert_to_incident:
  title: "[alert.service] - [alert.message]"
  severity: [alert.severity]
  source: [alert.source]
  context:
    service: [alert.service]
    environment: [alert.environment]
Manual Creation
manual_incident:
  title: "[service] Degradation"
  severity: P2
  owner: [team]
  description: "Customer reported issues with [service]"
SLO Violation
slo_incident:
  title: "SLO Breach - [service]"
  severity: P1
  slo:
    name: [slo_name]
    current: [current_value]
    threshold: [threshold]
Incident Management
Status Updates
status_update:
  status: [status]
  timestamp: [timestamp]
  message: "Updated status to [status]"
  notify:
    - channel: "#[service]-incidents"
    - team: [team]
Ownership Changes
ownership_change:
  from: [current_owner]
  to: [new_owner]
  reason: "Escalating to [team] for [reason]"
  notify:
    - current_owner
    - new_owner
Communication
communication:
  channel: "#[service]-incidents"
  mentions: ["@[team]"]
  updates:
    - type: "status"
      template: "ℹ️ Status Update: [message]"
    - type: "action"
      template: "⚡ Action Taken: [action]"
Best Practices
Incident Response
- Acknowledge quickly
- Assess impact
- Communicate status
- Document actions
- Update stakeholders
Documentation
- Use clear titles
- Include context
- Track timeline
- Document decisions
- Note follow-ups
Communication
- Be clear and concise
- Update regularly
- Use right channels
- Include key details
- Follow templates