
Observability

Distributed systems fail in sequences. Service A calls B, B calls C, C times out — and the root cause is three hops away from the symptom. Sifting patterns trace these chains through shared variables, detect missing recovery events, and flag SLA violations with gap constraints.

Time: ~15 minutes
Prerequisites: What is Sifting?

1. Cascade timeout

A call chain where the deepest service times out and no recovery follows.


Result: 1 match — api_gateway → auth_service → user_db timed out with no recovery. The cache_service → redis chain also timed out, but redis recovered at time 8, killing that match.

What to notice: The variable chain ?svc_a → ?svc_b → ?svc_c traces the call path through join semantics. Stage 2 reuses ?svc_b from stage 1 as a caller — this is how sifting "follows" a dependency chain. The unless after is open-ended: it checks from the timeout forward, catching any future recovery.
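The join itself can be sketched outside fabula. Below is a minimal standalone Rust model (plain tuples, none of fabula's types) of how stage 2 reusing ?svc_b chains two call edges and then filters on a timeout at the deepest service:

```rust
// Standalone model of the variable-chain join: call edges (caller, callee)
// are joined on a shared middle service, then filtered to chains whose
// deepest service timed out. This illustrates the semantics only; it is
// not fabula's implementation.
fn cascade_timeouts(
    calls: &[(&str, &str)],
    timeouts: &[&str],
) -> Vec<(String, String, String)> {
    let mut matches = Vec::new();
    for &(svc_a, b1) in calls {
        for &(b2, svc_c) in calls {
            // Join: stage 2's caller must equal stage 1's callee (?svc_b).
            if b1 == b2 && timeouts.contains(&svc_c) {
                matches.push((svc_a.to_string(), b1.to_string(), svc_c.to_string()));
            }
        }
    }
    matches
}

fn main() {
    let calls = [
        ("api_gateway", "auth_service"),
        ("auth_service", "user_db"),
        ("cache_service", "redis"),
    ];
    // redis also timed out, but its recovery event would kill that match;
    // here we model only the join, so we list user_db as the unrecovered timeout.
    let timeouts = ["user_db"];
    let found = cascade_timeouts(&calls, &timeouts);
    assert_eq!(
        found,
        vec![(
            "api_gateway".to_string(),
            "auth_service".to_string(),
            "user_db".to_string()
        )]
    );
}
```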


2. Retry storm

A service retries a failed call more than once before the downstream recovers. This creates amplified load that can worsen the original failure.


Result: 1 match — order_svc retried payment_api twice with no success between. shipping_svc also failed and retried inventory_api, but succeeded at time 6 before the second retry, so unless between kills that match.

What to notice: The negation window spans from the initial failure to the second retry. A successful call between those bounds means the retry storm was resolved. Without the negation, you'd flag every service that ever retried — the negation makes it specific to unresolved retry storms.
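The window logic itself reduces to a bounds check. A standalone Rust sketch (illustrative timestamps, none of fabula's types) of the unless between semantics:

```rust
// Standalone model of the `unless between` window: a retry-storm match
// survives only if no success event falls strictly between the initial
// failure and the second retry.
fn retry_storm(fail_t: u64, second_retry_t: u64, success_times: &[u64]) -> bool {
    !success_times
        .iter()
        .any(|&s| s > fail_t && s < second_retry_t)
}

fn main() {
    // order_svc: fail at 1, second retry at 5, no success between → match.
    assert!(retry_storm(1, 5, &[]));
    // shipping_svc: success at 6 lands inside the window (timestamps
    // illustrative), so the negation kills the match.
    assert!(!retry_storm(2, 7, &[6]));
}
```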


3. SLA breach

A request takes longer than the allowed threshold to complete. Use a gap constraint to enforce the timing bound.


Result: 1 match — req_100 took 7 ticks (1 to 8), exceeding the gap 5.. threshold. req_101 took 2 ticks (2 to 4), within bounds.

What to notice: The temporal e1 before e2 gap 5.. constraint adds a metric bound: the gap between stage 1's end and stage 2's start must be at least 5 ticks. This is STN-style bounded-difference constraint checking — you define the SLA threshold in the pattern itself, not in post-processing.
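The bound itself is a plain bounded-difference check. A standalone sketch (no fabula types) using the tick values from the example:

```rust
// Standalone model of `gap 5..`: the difference between stage 2's start
// and stage 1's end must be at least the minimum gap.
fn breaches_sla(stage1_end: u64, stage2_start: u64, min_gap: u64) -> bool {
    stage2_start.saturating_sub(stage1_end) >= min_gap
}

fn main() {
    assert!(breaches_sla(1, 8, 5)); // req_100: 7 ticks → breach
    assert!(!breaches_sla(2, 4, 5)); // req_101: 2 ticks → within bounds
}
```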


The pattern across all three examples

Pattern | Stages | Joins | Negation | Temporal
Cascade timeout | call → call → timeout | caller/callee chain | no recovery after | implicit ordering
Retry storm | fail → retry → retry | same caller/target | no success between | implicit ordering
SLA breach | start → end | same request/service | none | gap constraint (min 5 ticks)

Integration with incremental mode

In production, you feed events from your tracing pipeline into fabula's incremental engine:

let cascade_timeout = PatternBuilder::<String, MemValue>::new("cascade_timeout")
    .stage("e1", |s| {
        s.edge("e1", "type".into(), MemValue::Str("call".into()))
            .edge_bind("e1", "caller".into(), "svc_a")
            .edge_bind("e1", "callee".into(), "svc_b")
    })
    .stage("e2", |s| {
        s.edge("e2", "type".into(), MemValue::Str("call".into()))
            .edge_bind("e2", "caller".into(), "svc_b")
            .edge_bind("e2", "callee".into(), "svc_c")
    })
    .stage("e3", |s| {
        s.edge("e3", "type".into(), MemValue::Str("timeout".into()))
            .edge_bind("e3", "service".into(), "svc_c")
    })
    .unless_after("e3", |neg| {
        neg.edge("mid", "type".into(), MemValue::Str("recovery".into()))
            .edge_bind("mid", "service".into(), "svc_c")
    })
    .build();

let mut engine: SiftEngineFor<MemGraph> = SiftEngine::new();
engine.register(cascade_timeout);

let mut graph = MemGraph::new();
graph.set_time(10);

// Each span/event from your tracing system:
let source = "e1".to_string();
let label = "type".to_string();
let value = MemValue::Str("call".into());
let interval = Interval::open(1);

graph.add_str("e1", "type", "call", 1);
graph.add_ref("e1", "caller", "api_gateway", 1);
graph.add_ref("e1", "callee", "auth_service", 1);

let events = engine.on_edge_added(&graph, &source, &label, &value, &interval);
for event in &events {
    match event {
        SiftEvent::Completed {
            pattern, bindings, ..
        } => {
            // Alert: cascade_timeout detected!
            // bindings["svc_a"], bindings["svc_b"], bindings["svc_c"]
            // contain the affected services.
            println!("Alert: {} — {:?}", pattern, bindings);
        }
        SiftEvent::Negated { pattern, .. } => {
            // Recovery detected — a previously active alert is resolved.
            println!("Resolved: {}", pattern);
        }
        _ => {}
    }
}

The engine tracks partial matches across thousands of concurrent requests. When a cascade completes, you get the full call chain in the bindings. When a recovery event arrives, partial matches are automatically killed.

Mapping your data

OpenTelemetry spans map to fabula edges as follows:

Real-world field | Fabula edge
spanID | source node
operationName | label (e.g., "type")
serviceName, statusCode | target values via label edges
startTime, endTime | interval [start, end)
parentSpanID | span.parent -> parent_span edge for call-chain joins

Each span becomes a set of edges sharing the same source node. The parentSpanID edge lets patterns join child spans to their parents, enabling call-chain traversal through variable bindings.
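As a standalone sketch of that flattening step (the Span struct and the edge-triple shape below are illustrative, not fabula's actual types):

```rust
// Hypothetical flattening of an OpenTelemetry-style span into
// (source, label, value) edge triples, following the mapping table above.
struct Span {
    span_id: String,
    parent_span_id: Option<String>,
    operation_name: String,
    service_name: String,
}

fn span_to_edges(s: &Span) -> Vec<(String, String, String)> {
    let mut edges = vec![
        // operationName becomes the label edge value.
        (s.span_id.clone(), "type".to_string(), s.operation_name.clone()),
        (s.span_id.clone(), "service".to_string(), s.service_name.clone()),
    ];
    if let Some(parent) = &s.parent_span_id {
        // parentSpanID becomes a parent edge for call-chain joins.
        edges.push((s.span_id.clone(), "parent".to_string(), parent.clone()));
    }
    edges
}

fn main() {
    let s = Span {
        span_id: "span_42".to_string(),
        parent_span_id: Some("span_41".to_string()),
        operation_name: "call".to_string(),
        service_name: "auth_service".to_string(),
    };
    let edges = span_to_edges(&s);
    assert_eq!(edges.len(), 3);
    assert_eq!(edges[2].1, "parent");
}
```

All three edges share "span_42" as their source node, which is what lets a pattern treat the span as one logical event.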


Limitations and false positives

These patterns detect structural anomalies but are not immune to noise:

  • Cascade timeout: Retries that eventually succeed but slowly can still match the pattern. A call chain that recovers after 30 seconds of retries never emits a "recovery" event if the retry logic is internal to the service.
  • Retry storm: Legitimate retry bursts during deployments can trigger false positives. A rolling restart that causes brief connection failures looks identical to a real retry storm.
  • SLA breach: Clock skew between services can create phantom violations. If service A's clock runs 2 seconds ahead of service B, a 4.5-second request can appear to take 6.5 seconds.
  • Mitigation: Use metric gap constraints to tighten time windows (e.g., gap 10.. instead of gap 5.. to tolerate clock skew). Add negation windows for expected maintenance events (unless between for deployment markers). Consider confidence thresholds: only alert when surprise scoring ranks the match above a baseline.

Timestamp resolution

Fabula requires strict temporal ordering between stages. Telemetry events with identical timestamps (common in batch-exported spans) cannot be placed in consecutive stages.

If your tracing backend exports spans with coarse timestamps, buffer and assign monotonic sequence IDs before feeding to fabula. Alternatively, use batch evaluation. See Thinking in Time for details.
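A minimal sketch of the buffering step (the sequencing scheme is an illustrative assumption, not a fabula API): stable-sort a buffered batch by timestamp, then hand out strictly increasing sequence IDs so no two consecutive stages share a tick.

```rust
// Assign strictly increasing sequence IDs to a batch of events whose
// exported timestamps may collide. A stable sort preserves arrival order
// among events that share a timestamp.
fn assign_sequence_ids(timestamps: &[u64]) -> Vec<u64> {
    let mut order: Vec<usize> = (0..timestamps.len()).collect();
    order.sort_by_key(|&i| timestamps[i]); // stable in Rust's std
    let mut seq = vec![0u64; timestamps.len()];
    for (rank, &i) in order.iter().enumerate() {
        seq[i] = rank as u64 + 1;
    }
    seq
}

fn main() {
    // Three spans exported in one batch share timestamp 10.
    let seq = assign_sequence_ids(&[10, 10, 10, 12]);
    assert_eq!(seq, vec![1, 2, 3, 4]);
}
```

Feed the sequence IDs, not the raw timestamps, to fabula as the event times.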


How fabula compares

  • vs Datadog monitors: Metric thresholds and anomaly detection over time series. No structural pattern matching across distributed traces, no variable joins correlating caller/callee chains, no negation windows.
  • vs Jaeger TraceQL: Queries within a single trace (span attributes, duration filters). Limited negation, no cross-trace pattern matching, no incremental streaming. Fabula matches patterns across the full event stream, not scoped to individual traces.
  • vs Grafana alerting: Condition-based alerts on metric thresholds. No multi-step pattern detection, no temporal sequencing across services, no gap constraints for SLA enforcement.

Where to go next