Ingestion Architecture
This page covers the end-to-end path telemetry takes from raw files in object storage through to optimized, queryable Parquet segments.
Important: Kafka is used exclusively for shuttling file notifications and work coordination messages between stages — raw telemetry data never flows through Kafka. Workers read and write telemetry data directly from object storage.
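To make the "notifications, not data" point concrete, a file notification might look like the following sketch. The field names and key layout are illustrative assumptions, not the actual wire format:

```python
import json

# Hypothetical shape of a file notification published to objstore.ingest.{signal}.
# Field names and key layout are illustrative; the real schema may differ.
notification = {
    "bucket": "telemetry-prod",
    "key": "otel-raw/logs/acme/edge-1/1714567200-abc.json.gz",
    "size_bytes": 1048576,
    "signal": "logs",
}

# This is what would travel through Kafka: a small pointer to the object,
# never the telemetry payload itself.
payload = json.dumps(notification).encode("utf-8")
```

The message stays a few hundred bytes regardless of how large the referenced object is, which is what keeps Kafka out of the data path.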
Ingestion Flow
1. S3 / GCS / Azure Blob: raw telemetry lands under the prefixes otel-raw/logs/, otel-raw/metrics/, and otel-raw/traces/.
2. Object-created notifications are delivered to the PubSub adapters (SQS / GCP / Azure / HTTP).
3. The adapters publish to the Kafka topic objstore.ingest.{signal}, which carries file paths only, NOT raw telemetry data.
4. boxer-ingest-{signal} batches the notifications and groups them by org, collector, and time window, then publishes to the Kafka topic segments.{signal}.ingest.
5. The ingest-{signal} worker reads the raw objects, normalizes the telemetry, writes a Parquet segment, and registers it in lrdb.
6. The cooked Parquet is written back to S3 / GCS / Azure Blob under db/{org}/{collector}/{date}/{dataset}/{hour}/tbl_{segment_id}.parquet.
7. PostgreSQL (lrdb) records the segment metadata: time bounds, org, instance, and frequency.
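The cooked-segment layout above can be sketched as a small key builder. Only the path shape comes from this page; the date/hour formatting and parameter names are assumptions:

```python
from datetime import datetime, timezone

def cooked_segment_path(org: str, collector: str, dataset: str,
                        segment_id: str, ts: datetime) -> str:
    """Build the object-store key for a cooked Parquet segment.

    Layout from the docs: db/{org}/{collector}/{date}/{dataset}/{hour}/tbl_{segment_id}.parquet
    The exact date/hour formats are assumptions for illustration.
    """
    date = ts.strftime("%Y-%m-%d")
    hour = ts.strftime("%H")
    return f"db/{org}/{collector}/{date}/{dataset}/{hour}/tbl_{segment_id}.parquet"

ts = datetime(2024, 5, 1, 13, 20, tzinfo=timezone.utc)
path = cooked_segment_path("acme", "edge-1", "logs", "0042", ts)
# db/acme/edge-1/2024-05-01/logs/13/tbl_0042.parquet
```

Partitioning by org, collector, date, dataset, and hour lets queries prune objects by prefix before touching any Parquet data.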
Compaction and Rollup
After ingest, background stages merge and downsample segments to keep query costs stable as data grows.
Compaction
1. The Kafka topic boxer.{signal}.compact feeds boxer-compact-{signal}, which groups segments for merging.
2. Grouped work is published to the Kafka topic segments.{signal}.compact.
3. The compact-{signal} worker reads the input segments, writes the merged segment, and updates the segment index.
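The grouping step in boxer-compact-{signal} could be sketched as a greedy batcher. The policy below (sort by size, batch until a target) is illustrative only, not the actual planner:

```python
def plan_compactions(segments, target_bytes):
    """Greedily batch small segments into merge groups of roughly target_bytes.

    segments: list of (segment_id, size_bytes) tuples.
    The policy here is an illustrative sketch, not the real algorithm.
    """
    plans, batch, size = [], [], 0
    for seg_id, nbytes in sorted(segments, key=lambda s: s[1]):
        if nbytes >= target_bytes:
            continue  # already big enough; no merge needed
        batch.append(seg_id)
        size += nbytes
        if size >= target_bytes:
            plans.append(batch)
            batch, size = [], 0
    if len(batch) > 1:  # merge leftovers too, if there is more than one
        plans.append(batch)
    return plans

plans = plan_compactions([("s1", 10), ("s2", 20), ("s3", 70), ("s4", 5)],
                         target_bytes=30)
# [["s4", "s1", "s2"]] — the three small segments merge; s3 is left alone
```

Merging many small segments into fewer large ones is what keeps per-query file counts, and therefore query costs, bounded.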
Rollup (metrics only)
1. The Kafka topic boxer.metrics.rollup feeds boxer-rollup-metrics.
2. Rollup work is published to the Kafka topic segments.metrics.rollup.
3. The rollup-metrics worker reads segments from the finer-grained tier, writes the next coarser tier, and updates the segment index.
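Downsampling one tier into the next can be sketched as timestamp bucketing. The aggregation function used here (mean) is an illustrative choice; the real worker may compute other aggregates:

```python
def rollup(points, coarse_interval):
    """Downsample (timestamp, value) points into coarser buckets.

    Timestamps are bucketed to multiples of coarse_interval (e.g. 10s -> 60s)
    and each bucket's values are averaged. Mean is an illustrative aggregate.
    """
    buckets = {}
    for ts, value in points:
        bucket = (ts // coarse_interval) * coarse_interval
        buckets.setdefault(bucket, []).append(value)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

coarse = rollup([(0, 1.0), (10, 3.0), (60, 5.0)], coarse_interval=60)
# {0: 2.0, 60: 5.0}
```

Queries over long time ranges can then read the coarse tier instead of scanning every fine-grained segment.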
Key Design Points
- Kafka carries notifications, not data. Kafka topics contain file paths and work coordination messages only. Raw and cooked telemetry lives exclusively in object storage — workers read and write it directly.
- Horizontally scalable. Add workers and Kafka partitions to handle more volume.
- Boxer decouples fanout from compute. High-volume notification streams are batched and grouped before reaching heavier worker stages.
- Object storage is immutable state. Workers only append new segments — they never modify existing ones. PostgreSQL tracks the mutable planning metadata.
- Compaction and rollup run continuously in the background to keep segment counts and query costs bounded.
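The fanout-decoupling point above can be sketched: boxer-style batching groups high-volume notifications by (org, collector, time window) so each message reaching a heavier worker covers many raw files. The keying below is illustrative:

```python
from collections import defaultdict

def group_notifications(notifications, window_secs=60):
    """Group file notifications by (org, collector, time window).

    Each group becomes one unit of downstream work covering many raw files.
    The window size and keying are illustrative assumptions.
    """
    groups = defaultdict(list)
    for n in notifications:
        window = (n["ts"] // window_secs) * window_secs
        groups[(n["org"], n["collector"], window)].append(n["key"])
    return dict(groups)

batched = group_notifications([
    {"org": "acme", "collector": "edge-1", "ts": 100, "key": "a.json.gz"},
    {"org": "acme", "collector": "edge-1", "ts": 110, "key": "b.json.gz"},
    {"org": "acme", "collector": "edge-2", "ts": 100, "key": "c.json.gz"},
])
# two groups: edge-1's window holds [a, b]; edge-2's holds [c]
```

Grouping this way means the number of worker invocations scales with active (org, collector) pairs per window, not with the raw notification rate.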