
StarRocks Adoption Story: Revolutionizing Data Pipelines with Real-time OLAP

nanta · 7 min read
Data Engineering blog covering Apache Kafka, Airflow, Trino, StarRocks, and modern data infrastructure. Sharing practical lessons from production.

Background

If you run data pipelines long enough, you inevitably face one question: How do you build real-time dashboards?

Our team was no different. Our existing pipeline looked like this:

Service → Kafka → Iceberg → S3 → Trino → Airflow(5min) → Dashboard

On the surface it worked fine, but the pain points in practice were clear:

  • At least 5 minutes of latency: The Airflow schedule interval was the bottleneck
  • Pipeline complexity: the real-time path alone chained 5+ components (Kafka → Flink → Redis → API → Dashboard)
  • Redundant I/O: Trino full-scanned S3 on every query
  • High development cost: Each new real-time dashboard took roughly 2 weeks to build

After adopting StarRocks, the architecture simplified to this:

Service → Kafka → StarRocks → Dashboard (sub-second latency)

Eliminating the intermediate components dramatically simplified the pipeline, and ingesting data directly from Kafka into StarRocks gave us the real-time capability we needed.

Results

After approximately 3 months of PoC and 6 months of phased rollout, we achieved the following improvements:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Dashboard latency | 5 min | < 1 sec | ~300x |
| Dashboard dev time | ~2 weeks | ~1 week | 50% reduction |
| Pipeline components | 5+ | 2 | 60% reduction |
| Query response time | 30~50 sec | 5~10 sec | 5~10x |
| Hardware cost | 128 GB x 18 nodes | 64 GB x 3 nodes | ~75% savings |

Trino is fast on raw query time, but once Airflow schedule delays, end-to-end latency, and hardware cost efficiency are factored in, StarRocks proved the better fit for real-time workloads.

Table Model Selection Guide

Choosing the right table model is the most important decision when first adopting StarRocks. A wrong choice means you will have to recreate the table later.

Decision Flow

What data are you storing?
            │
      Need UPDATE? ──[Yes]──► Primary Key (frequent UPDATEs)
            │
          [No]
            │
   Need aggregation? ──[Yes]──► Aggregate Key (auto-agg) ★
            │
          [No]
            │
            ▼
   Duplicate Key (raw data)

Model Comparison

| Model | Duplicates Allowed | UPDATE | Auto-aggregation | Best For |
|---|---|---|---|---|
| Duplicate Key | O | X | X | Logs, raw events |
| Aggregate Key | X | Auto | O | Real-time statistics ★ |
| Primary Key | X | O (fast) | X | Frequent UPDATEs |

Duplicate Key: Storing Raw Data

Use this when you need to preserve original data as-is, such as click logs, API events, or sensor data.

CREATE TABLE order_events (
    event_id BIGINT,
    event_time DATETIME,
    order_id VARCHAR(50),
    user_id BIGINT,
    event_type VARCHAR(20),
    amount DECIMAL(10, 2)
)
DUPLICATE KEY(event_id, event_time)
PARTITION BY date_trunc('day', event_time)
DISTRIBUTED BY HASH(event_id) BUCKETS 10;

Aggregate Key: Real-time Statistics ★

Data is automatically aggregated at ingestion time. This model was the key reason for adopting StarRocks.

CREATE TABLE order_stats (
    stat_time DATETIME NOT NULL COMMENT '5-minute intervals',
    region VARCHAR(20) NOT NULL,
    delivery_type VARCHAR(20) NOT NULL,
    -- Aggregate columns: aggregate functions applied automatically at ingestion
    order_count BIGINT SUM DEFAULT "0",
    total_amount DECIMAL(15, 2) SUM DEFAULT "0",
    user_bitmap BITMAP BITMAP_UNION,
    max_amount DECIMAL(10, 2) MAX
)
AGGREGATE KEY(stat_time, region, delivery_type)
PARTITION BY date_trunc('day', stat_time)
DISTRIBUTED BY HASH(stat_time) BUCKETS 10;
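Conceptually, what the Aggregate Key model does at ingestion is a keyed merge. The following toy Python sketch (our illustration, not StarRocks internals) mimics the SUM and MAX columns of the order_stats table above:

```python
# Toy model of Aggregate Key semantics: rows sharing the same key
# (stat_time, region, delivery_type) are merged at ingestion time.
# Illustration only -- StarRocks does this internally in its storage engine.

def ingest(table, row):
    key = (row["stat_time"], row["region"], row["delivery_type"])
    agg = table.setdefault(key, {"order_count": 0, "total_amount": 0.0, "max_amount": None})
    agg["order_count"] += row["order_count"]      # SUM
    agg["total_amount"] += row["total_amount"]    # SUM
    prev = agg["max_amount"]                      # MAX
    agg["max_amount"] = row["max_amount"] if prev is None else max(prev, row["max_amount"])

table = {}
for r in [
    {"stat_time": "10:00", "region": "A", "delivery_type": "std",
     "order_count": 1, "total_amount": 10.0, "max_amount": 10.0},
    {"stat_time": "10:00", "region": "A", "delivery_type": "std",
     "order_count": 1, "total_amount": 25.0, "max_amount": 25.0},
]:
    ingest(table, r)

print(table[("10:00", "A", "std")])
# {'order_count': 2, 'total_amount': 35.0, 'max_amount': 25.0}
```

Two events for the same key collapse into a single aggregated row, which is exactly why both storage and query cost drop.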

Available aggregate functions:

| Function | Purpose | Example |
|---|---|---|
| SUM | Summation | Order count, total revenue |
| MAX / MIN | Maximum / minimum value | Highest price, lowest price |
| REPLACE | Overwrite with latest value | Latest status |
| BITMAP_UNION | Exact unique count | Unique visitors |
| HLL_UNION | Approximate unique count | High-cardinality sets |

Unlike HyperLogLog, BITMAP_UNION provides exact unique counts. For business KPI dashboards where accuracy matters, always use this approach.
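A toy Python model makes the exactness argument concrete; here a bitmap is just an arbitrary-precision integer with one bit per user id (an illustration of the idea, not the StarRocks implementation):

```python
# Toy model of BITMAP_UNION semantics: each user id sets one bit in an
# arbitrary-precision integer; the union of bitmaps is bitwise OR, and
# the distinct count is the number of set bits -- exact, unlike HyperLogLog.
def bitmap_of(user_ids):
    bm = 0
    for uid in user_ids:
        bm |= 1 << uid
    return bm

shard_a = bitmap_of([1, 5, 7, 9])
shard_b = bitmap_of([5, 9, 12])
union = shard_a | shard_b            # BITMAP_UNION analogue
print(bin(union).count("1"))         # exact distinct users: 5
```

Because every id is represented explicitly, merging partial results across partitions or nodes never loses precision, only compactness.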

Primary Key: Frequent UPDATEs

Best suited for scenarios where the same key is frequently updated, such as order status tracking or inventory management.

CREATE TABLE orders (
    order_id VARCHAR(50) NOT NULL,
    status VARCHAR(20),
    amount DECIMAL(10, 2),
    updated_at DATETIME
)
PRIMARY KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 10
PROPERTIES (
    "enable_persistent_index" = "true"
);

Enabling enable_persistent_index significantly improves UPDATE performance.

Data Ingestion

Routine Load: Real-time Kafka Integration

This approach continuously ingests data from a Kafka topic. It is the method used in most real-time pipelines.

CREATE ROUTINE LOAD order_load ON orders
COLUMNS(
    order_id,
    user_id,
    timestamp_ms,
    amount,
    order_date = FROM_UNIXTIME(timestamp_ms / 1000)
)
PROPERTIES (
    "format" = "json",
    "jsonpaths" = "[\"$.orderId\",\"$.userId\",\"$.timestamp\",\"$.amount\"]"
)
FROM KAFKA (
    "kafka_broker_list" = "kafka-broker:9092",
    "kafka_topic" = "orders"
);

When combined with an Aggregate Key table, transformation and aggregation happen simultaneously at ingestion time.

CREATE ROUTINE LOAD order_stats_load ON order_stats
COLUMNS(
    timestamp_ms,
    region,
    amount,
    user_id,
    -- Round to 5-minute intervals
    stat_time = FROM_UNIXTIME(FLOOR(timestamp_ms / 1000 / 300) * 300),
    order_count = 1,
    total_amount = amount,
    -- TO_BITMAP keeps the exact ids; BITMAP_HASH would introduce hash collisions
    user_bitmap = TO_BITMAP(user_id)
)
WHERE amount > 0
PROPERTIES ("format" = "json")
FROM KAFKA (
    "kafka_broker_list" = "kafka-broker:9092",
    "kafka_topic" = "orders"
);

This single pattern replaced the aggregation logic that previously required Flink, using nothing but SQL.
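The 5-minute rounding in the COLUMNS clause is plain integer arithmetic, so it can be sanity-checked outside StarRocks. A small Python sketch of the same expression:

```python
from datetime import datetime, timezone

def to_5min_bucket(timestamp_ms: int) -> int:
    # Same arithmetic as FROM_UNIXTIME(FLOOR(timestamp_ms / 1000 / 300) * 300):
    # millis -> seconds, then floor to the nearest 300-second boundary.
    return (timestamp_ms // 1000 // 300) * 300

bucket = to_5min_bucket(1_700_000_123_456)   # arbitrary epoch millis
print(bucket, datetime.fromtimestamp(bucket, tz=timezone.utc))
```

Every event in the same 5-minute window maps to the same stat_time, so the Aggregate Key table merges them into one row per window.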

Stream Load: Bulk Data Loading

Ideal for one-time bulk loads via files or APIs.

# CSV file loading
curl --location-trusted \
  -u user:password \
  -H "label:load_$(date +%Y%m%d%H%M%S)" \
  -H "column_separator:," \
  -T data.csv \
  http://starrocks-fe:8030/api/mydb/mytable/_stream_load

Performance Tuning Tips

Thread Pool Configuration

In high-load environments with 500+ RPS of concurrent connections, the default thread pool size is insufficient.

# be.conf
pipeline_scan_thread_pool_thread_num = 32   # default: 24
pipeline_exec_thread_pool_thread_num = 32   # default: 24

Bucket Count Guidelines

| Data Size | Recommended Buckets |
|---|---|
| < 10 GB | 10 |
| 10~50 GB | 20 |
| 50~100 GB | 30 |
| > 100 GB | 50+ |

Formula for larger tables: buckets ≈ data_size_GB / 10 (for smaller tables, the table above sets a practical floor).
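Encoded as code, the guideline looks like this; the thresholds come from the table above and are our heuristic, not an official StarRocks rule:

```python
def recommended_buckets(data_size_gb: float) -> int:
    # Thresholds mirror the guideline table above (our heuristic,
    # not an official StarRocks rule); tune for your own cluster.
    if data_size_gb < 10:
        return 10
    if data_size_gb <= 50:
        return 20
    if data_size_gb <= 100:
        return 30
    return max(50, int(data_size_gb // 10))   # > 100 GB: ~1 bucket per 10 GB

print(recommended_buckets(8), recommended_buckets(75), recommended_buckets(600))
# 10 30 60
```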

Partitioning Strategy

Applying functions to partition columns prevents partition pruning. This is a more common mistake than you might think.

-- ✅ Correct: partition pruning works
WHERE event_time >= NOW() - INTERVAL 3 DAY

-- ❌ Incorrect: partition pruning disabled
WHERE DATE(event_time) >= CURRENT_DATE - 3

TTL Configuration

To automatically drop old partitions, configure TTL.

PROPERTIES (
    "partition_live_number" = "3"  -- Keep only the 3 most recent partitions
)

Operational Know-how

Materialized View Management

ASYNC refresh can stop without warning. Periodically check the status and manually recover when issues arise.

-- Check status
SHOW MATERIALIZED VIEWS;

-- Force synchronous refresh
REFRESH MATERIALIZED VIEW db.mv_name WITH SYNC MODE;

-- Reactivate a deactivated MV
ALTER MATERIALIZED VIEW db.mv_name ACTIVE;

Routine Load Monitoring

The status frequently transitions to PAUSED. Common causes include Kafka offset issues or malformed messages.

-- Check status
SHOW ROUTINE LOAD FOR db.load_job;

-- Resume
RESUME ROUTINE LOAD FOR db.load_job;
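Since PAUSED happens often, the check is worth scripting. Below is a minimal sketch of just the decision logic; the job name is a placeholder, and actually running the returned statement (e.g. over the MySQL protocol) is left to the caller:

```python
def resume_statement(job_name: str, state: str):
    """Return SQL to recover a Routine Load job, or None if nothing is needed.

    `state` is the State column of SHOW ROUTINE LOAD. Only PAUSED jobs
    can be resumed; CANCELLED jobs have to be recreated, so we flag them.
    """
    state = state.upper()
    if state == "PAUSED":
        return f"RESUME ROUTINE LOAD FOR {job_name}"
    if state == "CANCELLED":
        raise RuntimeError(f"{job_name} is CANCELLED; recreate the job")
    return None  # RUNNING, NEED_SCHEDULE, etc.: nothing to do

print(resume_statement("db.order_load", "PAUSED"))
# RESUME ROUTINE LOAD FOR db.order_load
```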

Scale-in Precautions

When scaling down nodes, you must perform a Decommission first. Removing nodes without this procedure can result in data loss.

-- 1. Check current nodes
SHOW PROC '/backends';

-- 2. Start decommission
ALTER SYSTEM DECOMMISSION BACKEND "<BE_IP>:<HEARTBEAT_PORT>";

-- 3. Wait until TabletNum reaches 0, then remove
ALTER SYSTEM DROP BACKEND "<BE_IP>:<HEARTBEAT_PORT>";
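Step 3's wait loop is easy to automate. A sketch with a pluggable tablet_count callable (a placeholder for parsing SHOW PROC '/backends'):

```python
import time

def wait_for_decommission(tablet_count, poll_seconds=30, max_polls=1000):
    """Poll until the decommissioning backend reports TabletNum == 0.

    `tablet_count` is a placeholder callable that should return the
    backend's current TabletNum, e.g. parsed from SHOW PROC '/backends'.
    Returns True when it is safe to run ALTER SYSTEM DROP BACKEND.
    """
    for _ in range(max_polls):
        if tablet_count() == 0:
            return True
        time.sleep(poll_seconds)
    return False

# Demo with a fake counter standing in for the real metadata query:
remaining = iter([42, 7, 0])
print(wait_for_decommission(lambda: next(remaining), poll_seconds=0))
# True
```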

Things to Know Before Adopting

Known Limitations

| Issue | Description | Workaround |
|---|---|---|
| Routine Load | Limited handling of malformed messages | Pre-validate on the Kafka side |
| datetime partitions | Compatibility issues with Iceberg datetime partitions | Use an alternative partitioning strategy |
| Version upgrades | Encountered bugs in 4.x releases | Always test in a staging environment |

Always thoroughly validate version upgrades in a staging environment before applying them to production. We went through several rounds of upgrades and rollbacks ourselves. Have a rollback plan ready at all times.

Adoption Checklist

Pre-deployment

  • Define use cases and requirements
  • Estimate data volume and growth rate
  • Choose table models
  • Design partitioning strategy

Post-deployment

  • Create and verify Routine Load jobs
  • Configure user permissions
  • Set data retention policies (TTL)
  • Document scale-in/out procedures
  • Build monitoring dashboards

Conclusion

Here are the key lessons we learned from adopting StarRocks:

  1. The Aggregate Key model is the centerpiece – Automatic aggregation at ingestion time optimizes both storage and query performance
  2. Use BITMAP_UNION for exact unique counts – Business KPIs demand precise numbers, not approximations
  3. Routine Load + Aggregate Key replaces Flink – You can build a real-time aggregation pipeline with SQL alone
  4. Invest in operational automation – Monitoring Materialized Views and Routine Load is essential

For real-time analytics workloads, StarRocks is a powerful option that dramatically reduces pipeline complexity. That said, it is still maturing in terms of version upgrade stability and operational robustness, so we recommend conducting thorough PoC testing and staging validation before adopting it.

Reference: StarRocks Official Docs