Observability
Distributed systems fail in sequences. Service A calls B, B calls C, C times out — and the root cause is three hops away from the symptom. Sifting patterns trace these chains through shared variables, detect missing recovery events, and flag SLA violations with gap constraints.
| | |
|---|---|
| Time | ~15 minutes |
| Prerequisites | What is Sifting? |
1. Cascade timeout
A call chain where the deepest service times out and no recovery follows.
Result: 1 match — api_gateway → auth_service → user_db timed out with no recovery. The cache_service → redis chain also timed out, but redis recovered at time 8, killing that match.
What to notice: The variable chain ?svc_a → ?svc_b → ?svc_c traces the call path through join semantics. Stage 2 reuses ?svc_b from stage 1 as a caller; this is how sifting "follows" a dependency chain. The unless after is open-ended: it checks from the timeout forward, catching any future recovery. The full Rust builder for this pattern appears under Integration with incremental mode below.
2. Retry storm
A service retries a failed call more than once before the downstream recovers. This creates amplified load that can worsen the original failure.
Result: 1 match — order_svc retried payment_api twice with no success between. shipping_svc also failed and retried inventory_api, but succeeded at time 6 before the second retry, so unless between kills that match.
What to notice: The negation window spans from the initial failure to the second retry. A successful call between those bounds means the retry storm was resolved. Without the negation, you'd flag every service that ever retried — the negation makes it specific to unresolved retry storms.
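For concreteness, the same pattern in builder form might look like the sketch below. This is a sketch, not the library's confirmed API: the unless_between method is assumed by analogy with the unless_after call shown later in this guide, and the event and edge labels (fail, retry, success, caller, target) are illustrative.

```rust
// Sketch of the retry-storm pattern. unless_between is assumed by analogy
// with unless_after; event and edge labels are illustrative.
let retry_storm = PatternBuilder::<String, MemValue>::new("retry_storm")
    .stage("f", |s| {
        s.edge("f", "type".into(), MemValue::Str("fail".into()))
            .edge_bind("f", "caller".into(), "svc")
            .edge_bind("f", "target".into(), "dep")
    })
    .stage("r1", |s| {
        s.edge("r1", "type".into(), MemValue::Str("retry".into()))
            .edge_bind("r1", "caller".into(), "svc")
            .edge_bind("r1", "target".into(), "dep")
    })
    .stage("r2", |s| {
        s.edge("r2", "type".into(), MemValue::Str("retry".into()))
            .edge_bind("r2", "caller".into(), "svc")
            .edge_bind("r2", "target".into(), "dep")
    })
    // Negation window: no successful call by the same caller between the
    // initial failure and the second retry.
    .unless_between("f", "r2", |neg| {
        neg.edge("ok", "type".into(), MemValue::Str("success".into()))
            .edge_bind("ok", "caller".into(), "svc")
    })
    .build();
```

Both retry stages reuse the svc and dep variables, so the retries must come from the same caller against the same downstream target.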
3. SLA breach
A request takes longer than the allowed threshold to complete. Use a gap constraint to enforce the timing bound.
Result: 1 match — req_100 took 7 ticks (1 to 8), exceeding the gap 5.. threshold. req_101 took 2 ticks (2 to 4), within bounds.
What to notice: The temporal e1 before e2 gap 5.. constraint adds a metric bound: the gap between stage 1's end and stage 2's start must be at least 5 ticks. This is STN-style bounded-difference constraint checking — you define the SLA threshold in the pattern itself, not in post-processing.
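In builder form this might look like the sketch below. The before_gap method is hypothetical, standing in for the DSL's e1 before e2 gap 5.. constraint; the event types and the request variable are illustrative.

```rust
// Sketch of the SLA-breach pattern. before_gap is a hypothetical stand-in
// for the DSL constraint "e1 before e2 gap 5.."; event labels are illustrative.
let sla_breach = PatternBuilder::<String, MemValue>::new("sla_breach")
    .stage("e1", |s| {
        s.edge("e1", "type".into(), MemValue::Str("request_start".into()))
            .edge_bind("e1", "request".into(), "req")
    })
    .stage("e2", |s| {
        s.edge("e2", "type".into(), MemValue::Str("request_end".into()))
            .edge_bind("e2", "request".into(), "req")
    })
    // Bounded-difference constraint: at least 5 ticks between start and end.
    .before_gap("e1", "e2", 5..)
    .build();
```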
The pattern across all three examples
| Pattern | Stages | Joins | Negation | Temporal |
|---|---|---|---|---|
| Cascade timeout | call → call → timeout | caller/callee chain | no recovery after | implicit ordering |
| Retry storm | fail → retry → retry | same caller/target | no success between | implicit ordering |
| SLA breach | start → end | same request/service | none | gap constraint (min 5 ticks) |
Integration with incremental mode
In production, you feed events from your tracing pipeline into fabula's incremental engine:
```rust
// Import paths are illustrative; adjust them to your version of fabula.
use fabula::{Interval, MemGraph, MemValue, PatternBuilder, SiftEngine, SiftEngineFor, SiftEvent};

let cascade_timeout = PatternBuilder::<String, MemValue>::new("cascade_timeout")
    .stage("e1", |s| {
        s.edge("e1", "type".into(), MemValue::Str("call".into()))
            .edge_bind("e1", "caller".into(), "svc_a")
            .edge_bind("e1", "callee".into(), "svc_b")
    })
    .stage("e2", |s| {
        s.edge("e2", "type".into(), MemValue::Str("call".into()))
            .edge_bind("e2", "caller".into(), "svc_b")
            .edge_bind("e2", "callee".into(), "svc_c")
    })
    .stage("e3", |s| {
        s.edge("e3", "type".into(), MemValue::Str("timeout".into()))
            .edge_bind("e3", "service".into(), "svc_c")
    })
    // Kill the match if svc_c ever recovers after the timeout.
    .unless_after("e3", |neg| {
        neg.edge("mid", "type".into(), MemValue::Str("recovery".into()))
            .edge_bind("mid", "service".into(), "svc_c")
    })
    .build();

let mut engine: SiftEngineFor<MemGraph> = SiftEngine::new();
engine.register(cascade_timeout);

let mut graph = MemGraph::new();
graph.set_time(10);

// Each span/event from your tracing system becomes a set of edges:
graph.add_str("e1", "type", "call", 1);
graph.add_ref("e1", "caller", "api_gateway", 1);
graph.add_ref("e1", "callee", "auth_service", 1);

// Notify the engine of the new edge (in practice, once per edge added):
let source = "e1".to_string();
let label = "type".to_string();
let value = MemValue::Str("call".into());
let interval = Interval::open(1);
let events = engine.on_edge_added(&graph, &source, &label, &value, &interval);

for event in &events {
    match event {
        SiftEvent::Completed { pattern, bindings, .. } => {
            // Alert: cascade_timeout detected!
            // bindings["svc_a"], bindings["svc_b"], bindings["svc_c"]
            // contain the affected services.
            println!("Alert: {} — {:?}", pattern, bindings);
        }
        SiftEvent::Negated { pattern, .. } => {
            // Recovery detected: a previously active alert is resolved.
            println!("Resolved: {}", pattern);
        }
        _ => {}
    }
}
```
The engine tracks partial matches across thousands of concurrent requests. When a cascade completes, you get the full call chain in the bindings. When a recovery event arrives, partial matches are automatically killed.
Mapping your data
OpenTelemetry spans map to fabula edges as follows:
| Real-world field | Fabula edge |
|---|---|
| spanID | source node |
| operationName | value of the "type" edge |
| serviceName, statusCode | target values via label edges |
| startTime, endTime | interval [start, end) |
| parentSpanID | "parent" edge to the parent span, for call-chain joins |
Each span becomes a set of edges sharing the same source node. The parentSpanID edge lets patterns join child spans to their parents, enabling call-chain traversal through variable bindings.
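As a minimal sketch, assuming the MemGraph helpers used earlier in this guide accept &str arguments, ingesting one span might look like this. The Span struct is hypothetical, standing in for whatever shape your exporter produces.

```rust
// Hypothetical span record; fields mirror the mapping table above.
struct Span {
    span_id: String,
    parent_span_id: Option<String>,
    operation_name: String, // becomes the value of the "type" edge
    service_name: String,
    start_time: u64,
}

// Map one span to fabula edges, using the MemGraph API from the example above.
fn ingest_span(graph: &mut MemGraph, span: &Span) {
    graph.add_str(&span.span_id, "type", &span.operation_name, span.start_time);
    graph.add_ref(&span.span_id, "service", &span.service_name, span.start_time);
    if let Some(parent) = &span.parent_span_id {
        // Child-to-parent edge: enables call-chain joins across spans.
        graph.add_ref(&span.span_id, "parent", parent, span.start_time);
    }
}
```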
Limitations and false positives
These patterns detect structural anomalies but are not immune to noise:
- Cascade timeout: Retries that eventually succeed but slowly can still match the pattern. A call chain that recovers after 30 seconds of retries never emits a "recovery" event if the retry logic is internal to the service.
- Retry storm: Legitimate retry bursts during deployments can trigger false positives. A rolling restart that causes brief connection failures looks identical to a real retry storm.
- SLA breach: Clock skew between services can create phantom violations. If service A's clock runs 2 seconds ahead of service B, a 4.5-second request can appear to take 6.5 seconds.
- Mitigation: Widen gap thresholds to tolerate clock skew (e.g., `gap 10..` instead of `gap 5..`). Add negation windows for expected maintenance events, such as an `unless between` clause over deployment markers. Consider confidence thresholds: only alert when surprise scoring ranks the match above a baseline.
Fabula requires strict temporal ordering between stages. Telemetry events with identical timestamps (common in batch-exported spans) cannot be placed in consecutive stages.
If your tracing backend exports spans with coarse timestamps, buffer and assign monotonic sequence IDs before feeding to fabula. Alternatively, use batch evaluation. See Thinking in Time for details.
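A minimal tie-breaking pass, reusing the hypothetical Span struct from the mapping section, might look like this:

```rust
// Sketch: bump tied timestamps so no two events share a tick. The sort is
// stable, so spans with equal timestamps keep their arrival order.
fn assign_monotonic_times(spans: &mut [Span]) {
    spans.sort_by_key(|s| s.start_time);
    let mut last_time = 0u64;
    for span in spans.iter_mut() {
        if span.start_time <= last_time {
            span.start_time = last_time + 1; // break the tie forward one tick
        }
        last_time = span.start_time;
    }
}
```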
How fabula compares
- vs Datadog monitors: Metric thresholds and anomaly detection over time series. No structural pattern matching across distributed traces, no variable joins correlating caller/callee chains, no negation windows.
- vs Grafana Tempo TraceQL: Queries within a single trace (span attributes, duration filters). Limited negation, no cross-trace pattern matching, no incremental streaming. Fabula matches patterns across the full event stream, not scoped to individual traces.
- vs Grafana alerting: Condition-based alerts on metric thresholds. No multi-step pattern detection, no temporal sequencing across services, no gap constraints for SLA enforcement.
Where to go next
- Getting Started — Build and evaluate patterns in Rust.
- Incremental Integration — Wire fabula into your event pipeline.
- Scoring Reference — Rank alerts by surprise to reduce noise.
- Pattern Cookbook — More pattern recipes.