Course Outline

Introduction to AIOps

  • Defining AIOps and its significance
  • Traditional monitoring compared to AIOps-driven observability
  • AIOps architecture and essential components

Collecting and Normalizing Operational Data

  • Types of observability data: metrics, logs, and traces
  • Ingesting data from diverse sources (servers, containers, cloud)
  • Utilizing agents and exporters (Prometheus, Beats, Fluentd)

Data Correlation and Anomaly Detection

  • Time series correlation and statistical approaches
  • Employing ML models for anomaly detection
  • Identifying incidents across distributed systems
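
As an illustration of the statistical approaches listed above, the following is a minimal sketch of rolling z-score anomaly detection on a single metric series. The function name, window size, threshold, and sample CPU values are assumptions chosen for the example, not part of the course material.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate strongly from the preceding window.

    Each point is compared with the mean and population standard deviation
    of the `window` points immediately before it.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical CPU utilisation samples (percent) with one obvious spike.
cpu = [50, 51, 49, 50, 52, 51, 50, 95, 51, 50]
print(rolling_zscore_anomalies(cpu))
```

A fixed threshold like this is only a baseline; ML-based detectors are typically used where seasonality or trend makes a simple window misleading.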

Alerting and Noise Reduction

  • Designing intelligent alert rules and thresholds
  • Suppression, deduplication, and alert grouping strategies
  • Integrating with Alertmanager, Slack, PagerDuty, or Opsgenie
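
The deduplication and grouping strategies above can be sketched as follows. This is a simplified illustration of the general idea, not how Alertmanager or any other listed tool implements it; the alert structure, label names, and the `group_alerts` function are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("service", "alertname")):
    """Drop exact duplicate alerts, then bucket the rest by selected labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        # Two alerts with identical label sets are treated as duplicates.
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Alerts sharing the group_by labels land in one notification group.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: one exact duplicate, two instances of the same problem.
alerts = [
    {"labels": {"alertname": "HighLatency", "service": "checkout", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "service": "checkout", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "service": "checkout", "instance": "b"}},
    {"labels": {"alertname": "DiskFull", "service": "db", "instance": "c"}},
]

for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

Grouping on a small set of labels means one notification per affected service rather than one per instance, which is the main lever for reducing alert noise.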

Root Cause Analysis and Visualization

  • Using dashboards to visualize metrics and identify trends
  • Exploring events and timelines for RCA (Root Cause Analysis)
  • Tracing issues across layers with distributed tracing tools

Automation and Remediation

  • Triggering automated scripts or workflows from incidents
  • Integrating with ITSM systems (ServiceNow, Jira)
  • Use cases: self-healing, scaling, traffic rerouting

Open Source and Commercial AIOps Platforms

  • Overview of tools: Prometheus, Grafana, ELK, Moogsoft, Dynatrace
  • Criteria for evaluating and selecting an AIOps platform
  • Demo and hands-on practice with a selected stack

Summary and Next Steps

Requirements

  • A solid understanding of IT operations and system monitoring concepts
  • Prior experience with monitoring tools or dashboards
  • Familiarity with basic log and metric formats

Audience

  • Operations teams responsible for infrastructure and applications
  • Site Reliability Engineers (SREs)
  • IT monitoring and observability teams

Duration

  • 14 hours
