Tutorial: DevOps and Observability

A walkthrough of two critical SRE use cases — deployment risk monitoring and incident early warning with SLO breach prediction. Together they cover the two most consequential moments in a production system's lifecycle: when something changes, and when something is going wrong.

What you'll build

Two deterministic monitoring systems that continuously evaluate your production environment — detecting deployment risk before it materialises and predicting SLO breaches before they occur — with every decision auditable, consistent, and reproducible across environments.

Why Observability Needs Memintel

SRE teams already have observability tooling — metrics, traces, logs, dashboards. The problem is not data. The problem is signal extraction from noise at scale.

Current alerting systems are rules-based: if error rate exceeds 1%, page someone. This approach has two well-known failure modes that every SRE team lives with:

Alert fatigue — rules fire constantly on thresholds that are technically breached but contextually normal. Engineers learn to ignore alerts. The one that matters gets missed.

Late detection — by the time a static threshold is crossed, the degradation has been underway for hours. The intervention window has already narrowed. The incident is no longer preventable — it is only recoverable.

Both failures have the same root cause: the rules were written against a specific understanding of what "bad" looks like that may not match the current system's behaviour, and nobody updated them.

Memintel addresses this differently. Instead of asking an engineer to specify what failure looks like, you express what reliability means — and the system continuously evaluates whether the trajectory of your environment is consistent with that.

The Three Roles

Role	Who they are	What they do
Data Engineer	Platform / observability team	Builds metric pipelines, writes resolver functions, delivers typed primitives from observability stack
Admin	SRE lead / platform owner	Maintains config files — primitives, guardrails — governs strategy registry and severity priors
User	SRE / on-call engineer	Expresses intent via internal tooling — creates, monitors, and manages monitoring tasks

Use Case 1 — Deployment Risk Monitoring

The Problem

Every deployment is a risk event. Even with CI/CD, canary releases, and feature flags, deployments introduce the most common source of production incidents. The challenge is that deployment risk is not binary — it is a function of what changed, what the current system state is, and what patterns previous deployments with similar characteristics have shown.

Current approaches are either too conservative (block everything) or too permissive (ship and watch). What is missing is a continuous assessment of deployment risk that accounts for the specific characteristics of this deployment, this service, and this moment in the production environment.

The Architecture Boundary

Raw signals   →  Signal Extraction  →  Primitives  →  Memintel
(CI/CD events,   (diff analysis,       (typed,          (deterministic
 metrics,         change scoring,       normalised)       risk evaluation)
 test results)    LLM on changelogs)

Primitive Design

# memintel_primitives_deployment.yaml

primitives:

  # Change characteristics
  - id: deployment.lines_changed
    type: int
    source: ci_pipeline
    entity: deployment_id
    description: Total lines of code changed in this deployment

  - id: deployment.files_changed_count
    type: int
    source: ci_pipeline
    entity: deployment_id
    description: Number of files modified in this deployment

  - id: deployment.has_db_migration
    type: boolean
    source: ci_pipeline
    entity: deployment_id
    description: True if deployment includes a database schema migration

  - id: deployment.has_dependency_update
    type: boolean
    source: ci_pipeline
    entity: deployment_id
    description: True if deployment updates external dependencies

  - id: deployment.changelog_risk_score
    type: float
    source: nlp_pipeline
    entity: deployment_id
    description: LLM-extracted risk score from PR description and commit messages, 0-1

  - id: deployment.changelog_confidence
    type: float
    source: nlp_pipeline
    entity: deployment_id
    description: Confidence of changelog risk extraction, 0-1

  # Service state at time of deployment
  - id: service.current_error_rate
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Current error rate — deploying into an already-degraded service increases risk

  - id: service.current_p99_latency_ms
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Current p99 latency in milliseconds

  - id: service.active_incident_flag
    type: boolean
    source: incident_management_pipeline
    entity: service_id
    description: True if service has an active open incident

  - id: service.deployment_failure_rate_30d
    type: float
    source: ci_pipeline
    entity: service_id
    description: Ratio of failed deployments in last 30 days, 0-1

  # Timing and environment signals
  - id: deployment.is_peak_traffic_window
    type: boolean
    source: traffic_analysis_pipeline
    entity: deployment_id
    description: True if deployment falls within peak traffic hours for this service

  - id: deployment.days_since_last_deploy
    type: int
    source: ci_pipeline
    entity: service_id
    description: Days since the last successful deployment — longer gap increases risk

  - id: deployment.test_coverage_delta
    type: float
    source: ci_pipeline
    entity: deployment_id
    description: Change in test coverage — negative means coverage decreased

  - id: deployment.similar_deployment_incident_rate
    type: float
    source: historical_analysis_pipeline
    entity: deployment_id
    description: Incident rate for historically similar deployments to this service, 0-1

Resolver Examples

# resolvers/deployment_resolvers.py

@registry.resolver("deployment.changelog_risk_score")
async def resolve_changelog_risk(entity_id: str, timestamp: datetime) -> float:
    # Pull PR description, commit messages, and diff summary
    # Pass to LLM for risk extraction — happens once at deployment creation
    deployment = await ci_api.get_deployment(entity_id)
    changelog_text = f"""
    PR title: {deployment.pr_title}
    PR description: {deployment.pr_description}
    Commits: {deployment.commit_messages}
    Changed files: {deployment.changed_files_summary}
    """
    result = await llm_client.extract_risk_score(
        text=changelog_text,
        domain="software_deployment",
        as_of=timestamp
    )
    return result.risk_score


@registry.resolver("deployment.similar_deployment_incident_rate")
async def resolve_similar_incident_rate(entity_id: str, timestamp: datetime) -> float:
    # Find historically similar deployments and compute their incident rate
    deployment = await ci_api.get_deployment(entity_id)
    result = await db.execute("""
        SELECT
            SUM(CASE WHEN d.caused_incident THEN 1 ELSE 0 END)::float
            / NULLIF(COUNT(*), 0)
        FROM deployments d
        WHERE d.service_id = $1
          AND d.has_db_migration = $2
          AND d.has_dependency_update = $3
          AND d.deployed_at BETWEEN $4 - INTERVAL '90 days' AND $4
          AND d.deployment_id != $5
    """, deployment.service_id, deployment.has_db_migration,
        deployment.has_dependency_update, timestamp, entity_id)
    return float(result.scalar() or 0)


@registry.resolver("service.current_error_rate")
async def resolve_error_rate(entity_id: str, timestamp: datetime) -> float:
    # Point-in-time error rate from metrics store
    result = await metrics_api.query(
        metric="http_error_rate",
        service=entity_id,
        at=timestamp,
        window="5m"
    )
    return float(result.value or 0)


@registry.resolver("deployment.is_peak_traffic_window")
async def resolve_peak_traffic(entity_id: str, timestamp: datetime) -> bool:
    deployment = await ci_api.get_deployment(entity_id)
    # Check if deployment time falls within service's historically peak hours
    result = await db.execute("""
        SELECT EXISTS (
            SELECT 1 FROM service_traffic_patterns
            WHERE service_id = $1
              AND day_of_week = EXTRACT(DOW FROM $2)
              AND hour_of_day = EXTRACT(HOUR FROM $2)
              AND is_peak = true
        )
    """, deployment.service_id, timestamp)
    return bool(result.scalar())

Guardrails

# memintel_guardrails_deployment.yaml

type_strategy_map:
  int:      [threshold, percentile]
  float:    [threshold, percentile, z_score]
  boolean:  [equals]

parameter_priors:
  deployment.lines_changed:
    low_severity:     { threshold: 200  }
    medium_severity:  { threshold: 500  }
    high_severity:    { threshold: 1500 }

  deployment.changelog_risk_score:
    low_severity:     { threshold: 0.4 }
    medium_severity:  { threshold: 0.6 }
    high_severity:    { threshold: 0.8 }

  deployment.similar_deployment_incident_rate:
    low_severity:     { threshold: 0.10 }
    medium_severity:  { threshold: 0.20 }
    high_severity:    { threshold: 0.35 }

  service.deployment_failure_rate_30d:
    low_severity:     { threshold: 0.10 }
    medium_severity:  { threshold: 0.20 }
    high_severity:    { threshold: 0.35 }

bias_rules:
  cautious:     high_severity
  standard:     medium_severity
  fast:         low_severity
  hotfix:       low_severity     # hotfixes bypass standard caution

global_default_strategy:   threshold
global_preferred_strategy: threshold

User Creates Tasks

// SRE lead creates deployment monitoring tasks via internal deployment portal

// Task 1 — High-risk deployment gate
const deploymentRisk = await client.tasks.create({
    intent: "Alert me when a deployment carries significant risk based on change size, service history, and current environment state",
    entityScope: "all_production_deployments",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/deployment-risk"
    },
    dryRun: true
});

// System resolves to composite condition:
// changelog_risk_score + similar_incident_rate + service_error_rate
// weighted by has_db_migration, is_peak_traffic_window, days_since_last_deploy

// Task 2 — Immediate block: deploying into active incident
const activeIncidentBlock = await client.tasks.create({
    intent: "Alert me immediately when someone attempts to deploy to a service with an active incident",
    entityScope: "all_production_deployments",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/deployment-block",
        priority: "immediate"
    }
});
// Resolves to: service.active_incident_flag = true (immediate — no threshold)

// Task 3 — Peak traffic deployment warning
const peakDeployment = await client.tasks.create({
    intent: "Alert me when a significant deployment is scheduled during peak traffic",
    entityScope: "all_production_deployments",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/deployment-timing"
    }
});
// Resolves to: is_peak_traffic_window = true AND lines_changed > 500

What the Alert Looks Like

⚠️ Deployment Risk Alert — High Risk Detected

Service:      payment-service  (v2.4.1 → v2.4.2)
Deployment:   deploy_9f3k2m
Triggered by: @sarah.chen  |  14:23 UTC

Risk Factors:
  • DB migration included          — schema change detected
  • 847 lines changed              — above medium threshold (500)
  • Changelog risk score: 0.71     — "refactored payment retry logic"
  • Similar deployments:           — 28% caused incidents in last 90 days
  • Current error rate: 0.41%      — elevated baseline (normal: 0.12%)
  • Peak traffic window:           — 14:00-16:00 is historically high traffic

Overall risk score:  0.79  (threshold: 0.65)

Recommendation: Consider delaying to off-peak window (after 18:00 UTC)
or enabling canary release with 5% traffic split.

→ Proceed anyway   → Delay deployment   → Switch to canary   → Cancel

The Deployment Risk Advantage

The deployment.similar_deployment_incident_rate primitive is what makes this genuinely intelligent rather than just threshold-based. A deployment with a database migration to a service that has had 28% of similar deployments cause incidents is categorically different from a deployment with a database migration to a service that has never had a migration-related incident. The system distinguishes between these automatically — not because a rule was written for each service, but because the historical pattern is encoded in the primitive and the intent "significant risk" incorporates it.

Use Case 2 — Incident Early Warning and SLO Breach Prediction

The Problem

SLO breaches are almost never sudden. They are the end state of a gradual degradation that began hours or days before the threshold was crossed. By the time a static alert fires, the error budget may already be significantly consumed, the on-call engineer is paged into an already-critical situation, and the investigation starts from scratch.

The SRE team does not need to be told the SLO was breached. They need to be told the SLO is going to be breached — with enough lead time to investigate and intervene before users are impacted.

The Architecture Boundary

Raw telemetry  →  Signal Extraction  →  Primitives  →  Memintel
(metrics,          (aggregation,         (typed,          (deterministic
 traces,            percentile calc,      normalised)       trajectory eval)
 logs, events)      anomaly scoring)

Primitive Design

# memintel_primitives_slo.yaml

primitives:

  # SLO and error budget signals
  - id: service.error_rate_5m
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Error rate over last 5 minutes

  - id: service.error_rate_trend_1h
    type: time_series<float>
    source: metrics_pipeline
    entity: service_id
    description: Error rate sampled every 5 minutes over last hour — enables change detection

  - id: service.error_budget_remaining_pct
    type: float
    source: slo_pipeline
    entity: service_id
    description: Percentage of monthly error budget remaining, 0-1

  - id: service.error_budget_burn_rate_1h
    type: float
    source: slo_pipeline
    entity: service_id
    description: Current error budget burn rate — how many months of budget consumed per hour

  # Latency signals
  - id: service.p99_latency_ms
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Current p99 latency in milliseconds

  - id: service.p99_latency_trend_1h
    type: time_series<float>
    source: metrics_pipeline
    entity: service_id
    description: p99 latency sampled every 5 minutes over last hour

  - id: service.p99_latency_vs_baseline_ratio
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Current p99 vs historical baseline for same day/hour pattern

  # Infrastructure signals
  - id: service.memory_utilization_pct
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Current memory utilisation percentage

  - id: service.memory_trend_2h
    type: time_series<float>
    source: metrics_pipeline
    entity: service_id
    description: Memory utilisation over last 2 hours — detects memory leaks

  - id: service.cpu_utilization_pct
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Current CPU utilisation percentage

  - id: service.saturation_score
    type: float
    source: metrics_pipeline
    entity: service_id
    description: Composite resource saturation across CPU, memory, connections, 0-1

  # Dependency signals
  - id: dependency.error_rate
    type: float
    source: metrics_pipeline
    entity: dependency_id
    description: Error rate of upstream dependency — enables cascade detection

  - id: dependency.latency_ms
    type: float
    source: metrics_pipeline
    entity: dependency_id
    description: Current latency of upstream dependency

  - id: service.downstream_impact_score
    type: float
    source: dependency_graph_pipeline
    entity: service_id
    description: Estimated blast radius if this service degrades, 0-1

  # Historical context
  - id: service.recent_deployment_flag
    type: boolean
    source: ci_pipeline
    entity: service_id
    description: True if a deployment occurred in the last 2 hours

  - id: service.similar_incident_pattern_score
    type: float
    source: historical_analysis_pipeline
    entity: service_id
    description: Similarity of current signal pattern to patterns that preceded past incidents, 0-1

Resolver Examples

# resolvers/slo_resolvers.py

@registry.resolver("service.error_budget_burn_rate_1h")
async def resolve_burn_rate(entity_id: str, timestamp: datetime) -> float:
    # Burn rate = how fast we are consuming the monthly error budget
    # A burn rate of 1.0 = consuming budget at exactly the sustainable rate
    # A burn rate of 14.4 = consuming the entire monthly budget in 2 days
    result = await db.execute("""
        WITH error_counts AS (
            SELECT
                COUNT(CASE WHEN status >= 500 THEN 1 END)::float AS errors,
                COUNT(*)::float AS total
            FROM request_log
            WHERE service_id = $1
              AND request_time BETWEEN $2 - INTERVAL '1 hour' AND $2
        )
        SELECT
            (errors / NULLIF(total, 0)) / $3 * (30 * 24)
        FROM error_counts
    """, entity_id, timestamp, await get_slo_threshold(entity_id))
    return float(result.scalar() or 0)


@registry.resolver("service.error_rate_trend_1h")
async def resolve_error_rate_trend(entity_id: str, timestamp: datetime) -> list[float]:
    # Return 12 data points — one per 5 minutes over the last hour
    result = await metrics_api.query_range(
        metric="http_error_rate",
        service=entity_id,
        start=timestamp - timedelta(hours=1),
        end=timestamp,
        step="5m"
    )
    return [float(point.value) for point in result.data_points]


@registry.resolver("service.similar_incident_pattern_score")
async def resolve_pattern_score(entity_id: str, timestamp: datetime) -> float:
    # Compare current signal fingerprint against historical pre-incident patterns
    current_signals = await get_signal_fingerprint(entity_id, timestamp)
    result = await pattern_matcher.similarity_score(
        service_id=entity_id,
        current_fingerprint=current_signals,
        compare_against="pre_incident_windows",
        lookback_days=90
    )
    return float(result.score)


@registry.resolver("service.p99_latency_vs_baseline_ratio")
async def resolve_latency_vs_baseline(entity_id: str, timestamp: datetime) -> float:
    current = await metrics_api.query(
        metric="http_p99_latency",
        service=entity_id,
        at=timestamp,
        window="5m"
    )
    # Compare against same hour same day-of-week over last 4 weeks
    baseline = await metrics_api.query(
        metric="http_p99_latency",
        service=entity_id,
        at=timestamp,
        window="5m",
        historical_baseline=True,
        baseline_weeks=4
    )
    return float(current.value / baseline.value) if baseline.value > 0 else 1.0

Guardrails

# memintel_guardrails_slo.yaml

type_strategy_map:
  float:                [threshold, percentile, z_score]
  time_series<float>:   [change, z_score, percentile]
  boolean:              [equals]

parameter_priors:
  service.error_budget_burn_rate_1h:
    low_severity:     { threshold: 2.0  }   # consuming budget 2x faster than sustainable
    medium_severity:  { threshold: 5.0  }   # consuming 30-day budget in 6 days
    high_severity:    { threshold: 14.4 }   # consuming 30-day budget in 2 days

  service.p99_latency_vs_baseline_ratio:
    low_severity:     { threshold: 1.5 }    # 50% above baseline
    medium_severity:  { threshold: 2.5 }    # 150% above baseline
    high_severity:    { threshold: 4.0 }    # 300% above baseline

  service.similar_incident_pattern_score:
    low_severity:     { threshold: 0.55 }
    medium_severity:  { threshold: 0.70 }
    high_severity:    { threshold: 0.85 }

  service.saturation_score:
    low_severity:     { threshold: 0.70 }
    medium_severity:  { threshold: 0.82 }
    high_severity:    { threshold: 0.92 }

  # Change strategy thresholds for trend detection
  service.error_rate_trend_1h:
    low_severity:     { value: 0.002, window: "30m" }   # 0.2pp increase in 30 mins
    medium_severity:  { value: 0.005, window: "20m" }   # 0.5pp increase in 20 mins
    high_severity:    { value: 0.010, window: "15m" }   # 1.0pp increase in 15 mins

bias_rules:
  proactive:    low_severity
  early:        low_severity
  standard:     medium_severity
  critical:     high_severity
  page:         high_severity

global_default_strategy:   threshold
global_preferred_strategy: change    # trend detection is preferred for SLO monitoring

User Creates Tasks

// SRE creates monitoring tasks via internal platform or Slack bot

// Task 1 — Error budget burn rate — the primary SLO signal
const burnRateTask = await client.tasks.create({
    intent: "Alert me early when error budget is burning significantly faster than sustainable",
    entityScope: "all_production_services",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/slo-burn-rate"
    }
});
// Resolves to: burn_rate > 5.0 (medium severity — consuming budget in 6 days)
// Fires well before the SLO is breached

// Task 2 — Error rate trending up
const errorTrendTask = await client.tasks.create({
    intent: "Alert me when error rate is rising steadily — before it breaches the SLO threshold",
    entityScope: "all_production_services",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/slo-error-trend"
    }
});
// Resolves to: change strategy on error_rate_trend_1h
// Fires on sustained upward trajectory, not just current level

// Task 3 — Memory leak detection
const memoryLeakTask = await client.tasks.create({
    intent: "Alert me when a service shows signs of a memory leak",
    entityScope: "all_production_services",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/slo-memory-leak"
    }
});
// Resolves to: monotonic increase strategy on memory_trend_2h
// Detects gradual linear growth that indicates a leak, not a spike

// Task 4 — Pattern matches past incidents
const patternTask = await client.tasks.create({
    intent: "Alert me when current service signals resemble patterns that preceded past incidents",
    entityScope: "all_production_services",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/slo-pattern-match"
    }
});
// Resolves to: similar_incident_pattern_score > 0.70
// This is the most powerful early warning — fires on familiar danger signs

// Task 5 — Latency degradation relative to baseline
const latencyTask = await client.tasks.create({
    intent: "Alert me when p99 latency is significantly above the historical baseline for this time of day",
    entityScope: "all_production_services",
    delivery: {
        type: "webhook",
        endpoint: "https://myapp.com/hooks/slo-latency"
    }
});
// Resolves to: p99_latency_vs_baseline_ratio > 2.5
// Baseline-adjusted — handles services with variable traffic patterns

What the Alerts Look Like

Burn rate alert — early warning:

⚠️ SLO Early Warning — Error Budget Burning Fast

Service:       checkout-service
SLO:           99.9% availability  |  Monthly budget: 43.8 minutes
Triggered:     14:47 UTC  (proactive — SLO not yet breached)

Error budget status:
  Remaining:    67.3%  (29.4 minutes)
  Burn rate:    6.2x   (consuming 30-day budget in 4.8 days)
  At this rate: SLO breach in approximately 18.3 hours

Current signals:
  Error rate:     0.41%  (SLO threshold: 0.10%)
  p99 latency:    284ms  (baseline: 112ms — 2.5x elevated)
  Saturation:     0.71   (approaching high threshold)
  Recent deploy:  Yes    (payment-service v2.4.2 — 47 mins ago)

Pattern match:  0.68 similarity to pre-incident patterns (last incident: 2023-11-14)

→ Open investigation   → Check recent deployment   → Rollback   → Silence 30m

Memory leak alert:

⚠️ SLO Early Warning — Memory Leak Detected

Service:      user-session-service
Triggered:    09:23 UTC

Memory trend (last 2 hours):
  07:23: 61.2%  →  08:23: 68.7%  →  09:23: 76.1%
  Rate of increase: +7.4% per hour (linear — consistent with leak)

At current rate:
  OOM likely:    ~3.2 hours  (12:30 UTC)
  Service impact: downstream auth-service and api-gateway affected

Recent changes:
  Last deployment: user-session-service v1.8.3  (yesterday 16:45 UTC)
  Changed: session cache TTL logic refactored

→ Profile memory   → Check heap dump   → Rollback to v1.8.2   → Scale horizontally

The SLO Monitoring Advantage

Two primitives make this qualitatively different from threshold alerting:

service.error_budget_burn_rate_1h converts the abstract SLO into a concrete rate of consumption. A burn rate of 6.2x tells the SRE that at the current pace, the entire monthly error budget will be consumed in 4.8 days — not that the error rate is 0.41%. This is the framing that drives action. Nobody acts on "error rate is 0.41% and the SLO is 0.1%." Everyone acts on "we will breach our SLO in 18 hours at this rate."

service.similar_incident_pattern_score is the most powerful primitive in this system. Instead of writing rules for every failure mode, the system learns from historical incident data what the pre-incident signal fingerprint looks like — and fires when the current fingerprint resembles it. The SRE does not need to enumerate every possible failure pattern. They express "alert me when this looks like it's going wrong" and the system matches against the evidence of what "going wrong" has looked like before.

The System Response Loop

Trigger: Every 5 minutes (metrics) OR event-driven (deployment events)
          ↓
For each service in scope:
  Resolvers fetch point-in-time values:
    → Metrics API: error_rate, latency, memory, CPU
    → SLO pipeline: budget remaining, burn rate
    → CI pipeline: recent deployment flag
    → Pattern matcher: similarity to past incidents
  Concept computation runs:
    → Composite saturation score
    → Burn rate calculation
    → Trend analysis on time-series primitives
  Condition evaluates
          ↓
  If fired:
    Alert delivered with context — not just "threshold breached"
    Decision logged with full audit trail
    Runbook link attached based on condition type
          ↓
  If not: Decision logged. No alert.
          ↓
Next 5-minute tick

Event-driven vs scheduled

Task	Best trigger	Reason
Deployment risk	Event-driven on deployment creation	Risk assessment needed before deployment proceeds
Active incident block	Event-driven on deployment creation	Immediate — must fire before deployment starts
Error budget burn rate	Every 5 minutes	Budget consumption is continuous
Error rate trend	Every 5 minutes	Trend needs consistent sampling interval
Memory leak	Every 5 minutes	Linear growth requires regular sampling to detect slope
Pattern match	Every 5 minutes	Pattern similarity changes as signals evolve

Task Management

User controls

// View all active SLO monitoring tasks
const tasks = await client.tasks.list({ owner: "current_user" });

// Silence a task during a known maintenance window
await client.tasks.pause("tsk_slo_burn_rate_checkout", {
    reason: "Planned maintenance window 02:00-04:00 UTC — expected elevated errors"
});

// Check alert performance — are we getting too many false positives?
const taskDetail = await client.tasks.get("tsk_error_rate_trend");
console.log(`Fired ${taskDetail.fire_count_7d} times in last 7 days`);
console.log(`False positive rate: ${taskDetail.false_positive_rate_7d}`);
console.log(`Avg lead time before incident: ${taskDetail.avg_lead_time_minutes} mins`);

Admin visibility

// Admin view — alert volume and quality across all services
const metrics = await adminClient.tasks.metrics({
    tag: "slo_monitoring"
});

console.log(metrics.total_alerts_7d);          // 847
console.log(metrics.false_positive_rate);       // 0.23 — 23% false positive rate
console.log(metrics.avg_lead_time_minutes);     // 34 — average 34 mins before incident
console.log(metrics.incidents_caught_early);    // 12 of 14 incidents had prior warning

// Version distribution — are teams on the latest calibrated conditions?
console.log(metrics.version_distribution);
// { "1.0": 8, "1.1": 23, "1.2": 41 }  — most teams on latest version

// Suspend tasks if metrics pipeline is unreliable
await adminClient.tasks.suspend("tsk_memory_leak_all_services", {
    reason: "Memory metrics pipeline outage — data unreliable until 15:00 UTC"
});

Calibration

Deployment risk — driven by post-deployment incident data

// A deployment was flagged as high-risk but deployed without incident — false positive
await client.feedback.decision({
    conditionId: "deployment.high_risk",
    conditionVersion: "1.1",
    entity: "deploy_9f3k2m",
    timestamp: "2024-03-15T14:23:00Z",
    feedback: "false_positive",
    note: "Deployment completed without incident — DB migrations for this service are low risk"
});

SLO monitoring — driven by incident retrospectives

// An incident occurred that was not predicted — false negative
// Memory was not flagged even though leak had been running for 3 hours
await client.feedback.decision({
    conditionId: "service.memory_leak_risk",
    conditionVersion: "1.0",
    entity: "user-session-service",
    timestamp: "2024-03-12T07:00:00Z",
    feedback: "false_negative",
    note: "Memory leak started at 07:00 — OOM occurred at 11:30 — not caught early enough"
});

// Admin reviews calibration recommendation
const cal = await adminClient.conditions.calibrate({
    conditionId: "service.memory_leak_risk",
    conditionVersion: "1.0",
});

console.log(cal.recommended_params);
// { monotonic_increase_threshold: 5.0 }  — reduced from 7.4% per hour to 5%
// Catches the leak earlier in its progression

console.log(cal.impact.delta_alerts);   // +3 per week
// Slightly more alerts but catches leaks 90 mins earlier on average

Full Lifecycle Diagram

SETUP (one time)
──────────────────────────────────────────────────────────
Data Engineer:   Configure metrics pipeline resolvers
Data Engineer:   Configure CI/CD event pipeline resolvers
Data Engineer:   Build pattern matcher against incident history
Admin:           Define primitives → memintel_primitives_deployment.yaml
                                     memintel_primitives_slo.yaml
Admin:           Define guardrails  → memintel_guardrails_deployment.yaml
                                      memintel_guardrails_slo.yaml
System:          Load config at startup


DEPLOYMENT RISK EVALUATION
──────────────────────────────────────────────────────────
Deployment event created in CI/CD
  → Memintel evaluates immediately
  → Risk score computed from change size, service history, environment state
  → If high risk: alert fires before deployment proceeds
  → Engineer decides: proceed, delay, switch to canary, or cancel


ONGOING SLO MONITORING LOOP
──────────────────────────────────────────────────────────
Every 5 minutes:
  → Resolvers fetch current metrics, SLO state, pattern scores
  → Concept computation: burn rate, trend analysis, pattern matching
  → Conditions evaluate
  → If fired: contextual alert with lead time estimate and runbook link
  → If not: decision logged, no alert


TASK MANAGEMENT
──────────────────────────────────────────────────────────
User:    View / pause / resume own tasks
         (e.g. pause during planned maintenance windows)
Admin:   View alert volume and quality across all services
Admin:   Suspend tasks during metrics pipeline outages
Admin:   Deprecate condition versions after calibration


CALIBRATION CYCLE
──────────────────────────────────────────────────────────
Ground truth: incident retrospectives, post-deployment reviews
User:    Submit feedback (false positive / false negative)
Admin:   Review calibration recommendation + lead time impact
Admin:   Approve → new condition version (immutable)
User:    Rebind task to new version (explicit, never automatic)
         ↓
Back to ONGOING LOOP with improved detection sensitivity

Application Context

Before creating primitives and tasks for this use case, define the application context so the LLM can compile accurate, domain-aware monitoring definitions.

SRE / SLO monitoring context:

{
  "domain": {
    "description": "Production reliability monitoring for a cloud-native SaaS platform. We monitor service health and SLO compliance to detect degradation before it impacts customers.",
    "entities": [
      { "name": "service",    "description": "a production microservice with defined SLOs" },
      { "name": "deployment", "description": "a code or configuration change deployed to production" }
    ],
    "decisions": ["slo_breach_risk", "deployment_block", "incident_early_warning", "memory_leak"]
  },
  "behavioural": {
    "data_cadence": "streaming",
    "meaningful_windows": { "min": "5m", "max": "24h" }
  },
  "semantic_hints": [
    { "term": "degradation",     "definition": "sustained increase in error rate or latency over at least 15 minutes" },
    { "term": "high risk deploy", "definition": "includes DB migration, dependency update, or >500 lines changed" },
    { "term": "peak traffic",    "definition": "14:00-18:00 UTC weekdays based on historical p95 traffic patterns" }
  ],
  "calibration_bias": {
    "false_negative_cost": "high",
    "false_positive_cost": "medium"
  }
}

The streaming cadence and short minimum window (5m) tell the compiler this is a real-time monitoring context — it will prefer event-driven strategies and short evaluation windows over batch approaches. The semantic hint for "degradation" means "alert me when this service shows early signs of degradation" compiles to a sustained trend condition, not a single-point threshold breach. false_negative_cost: high because a missed incident early warning is worse than a brief false page.

Role Summary

Step	Who	Deployment Risk	SLO Monitoring
Data pipeline	Data Eng.	CI/CD events, diff analysis, NLP on changelogs	Metrics pipeline, SLO engine, pattern matcher
Resolvers	Data Eng.	Changelog risk, historical incident rate, environment state	Burn rate, trend series, baseline comparison
Primitives	Admin	`memintel_primitives_deployment.yaml`	`memintel_primitives_slo.yaml`
Guardrails	Admin	`memintel_guardrails_deployment.yaml`	`memintel_guardrails_slo.yaml`
Task creation	User	SRE lead / release manager	On-call SRE
Feedback	User	Post-deployment incident reviews	Incident retrospectives
Calibration	Admin	Approves risk threshold adjustments	Approves sensitivity adjustments

Why This Architecture Fits SRE

Both use cases share the same structural property: the risk signal requires evaluating current state against historical pattern, not just against a static threshold.

Deployment risk requires evaluating this deployment against the history of similar deployments to this service — with similar change characteristics, at similar times, in similar environment states. A threshold on lines changed is noise. A threshold on "lines changed to a service whose similar deployments caused incidents 28% of the time" is signal.

SLO breach prediction requires evaluating current metrics against the trajectory — not the current value. An error rate of 0.41% below a 1.0% SLO threshold is fine in isolation. An error rate that has risen from 0.08% to 0.41% in 90 minutes, with a burn rate of 6.2x, while memory is climbing linearly and the signal pattern resembles the last three incidents — that is a page that should have gone out 30 minutes ago.

Static thresholds fire when something is already wrong. Memintel fires when something is becoming wrong — which is the only framing that gives an SRE team enough time to prevent an incident rather than just respond to one.

Next Steps

Core Concepts — understand the ψ → φ → α model in depth
Why not SQL and rules? — the architectural case for intent-based monitoring
Financial Risk Monitoring — the same architecture in finance
API Reference — full endpoint documentation

Why Observability Needs Memintel​

The Three Roles​

Use Case 1 — Deployment Risk Monitoring​

The Problem​

The Architecture Boundary​

Primitive Design​

Resolver Examples​

Guardrails​

User Creates Tasks​

What the Alert Looks Like​

The Deployment Risk Advantage​

Use Case 2 — Incident Early Warning and SLO Breach Prediction​

The Problem​

The Architecture Boundary​

Primitive Design​

Resolver Examples​

Guardrails​

User Creates Tasks​

What the Alerts Look Like​

The SLO Monitoring Advantage​

The System Response Loop​

Event-driven vs scheduled​

Task Management​

User controls​

Admin visibility​

Calibration​

Deployment risk — driven by post-deployment incident data​

SLO monitoring — driven by incident retrospectives​

Full Lifecycle Diagram​

Application Context​

Role Summary​

Why This Architecture Fits SRE​

Next Steps​

Why Observability Needs Memintel

The Three Roles

Use Case 1 — Deployment Risk Monitoring

The Problem

The Architecture Boundary

Primitive Design

Resolver Examples

Guardrails

User Creates Tasks

What the Alert Looks Like

The Deployment Risk Advantage

Use Case 2 — Incident Early Warning and SLO Breach Prediction

The Problem

The Architecture Boundary

Primitive Design

Resolver Examples

Guardrails

User Creates Tasks

What the Alerts Look Like

The SLO Monitoring Advantage

The System Response Loop

Event-driven vs scheduled

Task Management

User controls

Admin visibility

Calibration

Deployment risk — driven by post-deployment incident data

SLO monitoring — driven by incident retrospectives

Full Lifecycle Diagram

Application Context

Role Summary

Why This Architecture Fits SRE

Next Steps