
Ingestion Architecture

This page covers the end-to-end path telemetry takes from raw files in object storage through to optimized, queryable Parquet segments.

Important: Kafka is used exclusively for shuttling file notifications and work coordination messages between stages — raw telemetry data never flows through Kafka. Workers read and write telemetry data directly from object storage.
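To make this concrete, here is a sketch of what a notification message on an ingest topic might look like. The field names and values are illustrative assumptions, not Lakerunner's real schema — the point is that the message carries a file path, never the telemetry payload itself:

```python
import json

# Hypothetical shape of an object-created notification as it might
# travel on an objstore.ingest.{signal} topic. Field names here are
# assumptions for illustration only.
notification = {
    "bucket": "telemetry-landing",            # assumed bucket name
    "key": "otel-raw/logs/org-1234/part-0001.json.gz",
    "signal": "logs",
    "size_bytes": 1048576,
}

encoded = json.dumps(notification).encode("utf-8")

# The message stays tiny no matter how large the referenced object is.
print(len(encoded) < 1024)  # prints True
```

The worker that consumes this message fetches the object at `key` directly from object storage; the megabyte (or gigabyte) of telemetry it points at never transits Kafka.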

Ingestion Flow

  1. Object storage (S3 / GCS / Azure Blob) holds raw telemetry under prefixes such as otel-raw/logs/, otel-raw/metrics/, and otel-raw/traces/. Each upload emits an object-created notification.
  2. PubSub adapters (SQS / GCP / Azure / HTTP) translate those notifications into Kafka messages.
  3. Kafka topic objstore.ingest.{signal} carries the notifications (file paths only, NOT raw telemetry data).
  4. boxer-ingest-{signal} batches and groups notifications by org, collector, and time window.
  5. Kafka topic segments.{signal}.ingest carries the batched work units to the workers.
  6. The ingest-{signal} worker reads the raw objects, normalizes the telemetry, writes a cooked Parquet segment back to object storage, and registers it in lrdb.
  7. Cooked segments land in object storage under db/{org}/{collector}/{date}/{dataset}/{hour}/tbl_{segment_id}.parquet.
  8. PostgreSQL (lrdb) stores the segment metadata: time bounds, org, instance, and frequency.
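Following the cooked-segment path template literally, a small helper for computing where a segment lands might look like this (the function name and argument shapes are illustrative, not Lakerunner's actual code):

```python
def cooked_segment_path(org: str, collector: str, date: str,
                        dataset: str, hour: str, segment_id: int) -> str:
    """Build the object key for a cooked Parquet segment, following the
    db/{org}/{collector}/{date}/{dataset}/{hour}/tbl_{segment_id}.parquet
    layout. Illustrative sketch, not the project's real implementation."""
    return f"db/{org}/{collector}/{date}/{dataset}/{hour}/tbl_{segment_id}.parquet"

path = cooked_segment_path("org-1234", "collector-a", "2024-06-01", "logs", "14", 42)
print(path)  # db/org-1234/collector-a/2024-06-01/logs/14/tbl_42.parquet
```

Because org, collector, date, and hour are all encoded in the key, queries can prune whole prefixes without touching the metadata database.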

Compaction and Rollup

After ingest, background stages merge and downsample segments to keep query costs stable as data grows.

Compaction
  1. Kafka topic boxer.{signal}.compact delivers compaction work notifications.
  2. boxer-compact-{signal} groups segments for merging.
  3. Kafka topic segments.{signal}.compact carries the grouped batches.
  4. The compact-{signal} worker reads the input segments, writes a merged segment, and updates the segment index.
Rollup (metrics only)
  1. Kafka topic boxer.metrics.rollup delivers rollup work notifications.
  2. boxer-rollup-metrics groups segments into rollup batches.
  3. Kafka topic segments.metrics.rollup carries the batches.
  4. The rollup-metrics worker reads segments from the finer-resolution tier, writes a downsampled segment to the coarser tier, and updates the segment index.
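At its core, the rollup step is time-bucket aggregation: datapoints from a finer tier are grouped into coarser windows. A minimal sketch, assuming (timestamp, value) pairs and sum as the aggregation (real rollups typically also keep min/max/count):

```python
from collections import defaultdict

def rollup(points, src_step_s, dst_step_s):
    """Downsample (timestamp_s, value) pairs from a fine resolution
    (src_step_s) into coarser buckets of dst_step_s, summing values.
    Sum is an assumed aggregation for illustration."""
    assert dst_step_s % src_step_s == 0  # coarse step must align with fine step
    buckets = defaultdict(float)
    for ts, value in points:
        buckets[ts - ts % dst_step_s] += value  # snap to bucket start
    return sorted(buckets.items())

# Four 10-second datapoints rolled into two 20-second buckets.
fine = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 4.0)]
print(rollup(fine, 10, 20))  # [(0, 3.0), (20, 7.0)]
```

Because each rolled-up segment is written as a new object and registered in the segment index, queries over long time ranges can read the coarse tier instead of scanning every fine-grained segment.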

Key Design Points

  1. Kafka carries notifications, not data. Kafka topics contain file paths and work coordination messages only. Raw and cooked telemetry lives exclusively in object storage — workers read and write it directly.
  2. Horizontally scalable. Add workers and Kafka partitions to handle more volume.
  3. Boxer decouples fanout from compute. High-volume notification streams are batched and grouped before reaching heavier worker stages.
  4. Object storage is immutable state. Workers only append new segments — they never modify existing ones. PostgreSQL tracks the mutable planning metadata.
  5. Compaction and rollup run continuously in the background to keep segment counts and query costs bounded.
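The batching in point 3 can be sketched as grouping incoming notifications under a (org, collector, time window) key and emitting one work unit per group. Field names and the window size are assumptions for illustration:

```python
from collections import defaultdict

def group_notifications(notifications, window_s=60):
    """Group file notifications into work units keyed by
    (org, collector, time window), as the boxer stage does before
    handing batches to workers. Field names are illustrative."""
    groups = defaultdict(list)
    for n in notifications:
        window = n["ts"] - n["ts"] % window_s  # snap to window start
        groups[(n["org"], n["collector"], window)].append(n["key"])
    return groups

batch = [
    {"org": "o1", "collector": "c1", "ts": 5,  "key": "otel-raw/logs/a"},
    {"org": "o1", "collector": "c1", "ts": 42, "key": "otel-raw/logs/b"},
    {"org": "o2", "collector": "c1", "ts": 7,  "key": "otel-raw/logs/c"},
]
units = group_notifications(batch)
print(len(units))  # 2 work units: one per distinct (org, collector, window)
```

Grouping this way means each downstream worker invocation handles files that share a destination prefix, so its output lands in a single segment rather than scattering small files across many orgs and hours.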