Core Concepts

Data Modeling in Cassandra

Learn the query-first approach to Cassandra data modeling — the key to unlocking performance at scale.

Think in Queries, Not Entities

The most important mindset shift when moving to Cassandra: model your data around your queries, not around your entities.

In relational databases, you normalize data into entities and use JOINs to combine them at query time. In Cassandra, JOINs don't exist. Instead, you denormalize — you store data in exactly the shape each query needs. This means you often have the same data in multiple tables, each optimized for a specific access pattern.

Relational approach: Design tables based on entities, then write queries.

Cassandra approach: Define your queries first, then design tables to serve them.

Partition Key Design

The partition key is the most critical design decision. It determines:

Which node stores the data
Whether queries can be efficiently served
How evenly data is distributed across the cluster

Good partition keys:

High cardinality (many distinct values)
Evenly distributed (no "hot partitions")
Match your most common query pattern
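
To make the cardinality point concrete, here is a rough simulation of how partition keys spread rows across nodes. It is only a sketch: the `node_for` helper, the MD5 hash, and the modulo placement are illustrative stand-ins, not how Cassandra's partitioner and token ring actually assign data.

```python
import hashlib
from collections import Counter


def node_for(partition_key: str, node_count: int) -> int:
    """Map a partition key to a node index via a hash (simplified stand-in for a partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    """Count how many rows land on each node, given one partition key per row."""
    counts = Counter(node_for(key, node_count) for key in partition_keys)
    return [counts.get(node, 0) for node in range(node_count)]


# Low cardinality: 10,000 rows keyed by one of only 3 regions.
low = rows_per_node([f"region-{i % 3}" for i in range(10_000)], node_count=6)

# High cardinality: 10,000 rows keyed by one of 1,000 sensor ids.
high = rows_per_node([f"sensor-{i % 1000}" for i in range(10_000)], node_count=6)

print("low cardinality :", low)
print("high cardinality:", high)
```

With only three distinct keys, at most three of the six nodes can hold any rows, however the hash behaves. The thousand distinct sensor ids should spread far more evenly.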

Example: Time-Series IoT Data

Imagine storing temperature readings from thousands of sensors:

sql

-- Bad: sensor_id alone as partition key
-- Each sensor's partition grows without bound over time, and a very chatty
-- sensor turns its partition into a hot spot on the nodes that own it
CREATE TABLE sensor_data_bad (
  sensor_id TEXT,
  reading_time TIMESTAMP,
  temperature FLOAT,
  PRIMARY KEY (sensor_id, reading_time)
);

-- Better: bucket by time period to limit partition size
CREATE TABLE sensor_data (
  sensor_id TEXT,
  date      TEXT,        -- e.g., "2025-01-15"
  reading_time TIMESTAMP,
  temperature FLOAT,
  PRIMARY KEY ((sensor_id, date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
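
With a bucketed partition key, the application has to compute the `date` value itself on every write and read. A query that spans several days needs one query per bucket. The helpers below are a minimal sketch of that bookkeeping. The names `day_bucket` and `buckets_between` are hypothetical, and using UTC days is an assumption, not something the schema requires.

```python
from datetime import date, datetime, timedelta, timezone


def day_bucket(reading_time: datetime) -> str:
    """Return the UTC day bucket ("YYYY-MM-DD") for a timezone-aware timestamp."""
    return reading_time.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_between(start: date, end: date) -> list[str]:
    """List every day bucket from start to end, inclusive (empty if end < start)."""
    days = (end - start).days
    return [(start + timedelta(days=offset)).isoformat() for offset in range(days + 1)]


reading = datetime(2025, 1, 15, 23, 59, tzinfo=timezone.utc)
print(day_bucket(reading))  # 2025-01-15

# A three-day range maps to three partitions, so three queries.
print(buckets_between(date(2025, 1, 14), date(2025, 1, 16)))
```

Pass timezone-aware datetimes, because Python treats a naive datetime as local time when converting.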

Clustering Columns

Clustering columns define the sort order of rows within a partition. This is critical for range queries.

sql

-- Query: "Get the last 100 events for user X"
CREATE TABLE user_events (
  user_id   UUID,
  event_time TIMESTAMP,
  event_type TEXT,
  data      TEXT,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- This query is efficient: rows in the partition are stored newest-first,
-- so LIMIT 100 just reads the first 100 rows with no sorting at query time
SELECT * FROM user_events
WHERE user_id = ?
LIMIT 100;
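
As an in-memory analogy for why sorted storage helps, the sketch below treats one partition as an already-sorted list. A range query then becomes a contiguous slice found by binary search rather than a scan of every row. This is an illustration only, not a description of Cassandra's storage engine, and `range_slice` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right


def range_slice(sorted_times, start, end):
    """Return the values with start <= t <= end from an ascending-sorted list."""
    lo = bisect_left(sorted_times, start)
    hi = bisect_right(sorted_times, end)
    return sorted_times[lo:hi]


# One partition's clustering values, already in sorted order.
event_times = [10, 20, 30, 40, 50, 60]

in_range = range_slice(event_times, 25, 50)
latest_three = event_times[-3:][::-1]  # newest first, like ORDER BY ... DESC with LIMIT 3

print(in_range)      # [30, 40, 50]
print(latest_three)  # [60, 50, 40]
```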

Denormalization Example

Consider a social app where users follow each other. You need two queries:

"Get everyone this user follows" (following list)
"Get everyone who follows this user" (followers list)

sql

-- Table 1: optimized for "who does user X follow?"
CREATE TABLE following (
  follower_id UUID,
  followed_id UUID,
  created_at  TIMESTAMP,
  PRIMARY KEY (follower_id, followed_id)
);

-- Table 2: optimized for "who follows user X?"
CREATE TABLE followers (
  followed_id UUID,
  follower_id UUID,
  created_at  TIMESTAMP,
  PRIMARY KEY (followed_id, follower_id)
);

-- When a follow happens, write to BOTH tables

This duplication is intentional. Without JOINs, each query needs its own table, and the application is responsible for keeping the copies in sync.
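
One way to keep the two tables in step is to generate both writes from a single function, so a follow can never be recorded in only one place by accident. The sketch below only builds the statements and their bound values. The `follow_writes` helper is hypothetical, and executing the pair (for example through a driver, possibly as a logged batch) is left out.

```python
from datetime import datetime, timezone
from uuid import UUID, uuid4

FOLLOWING_INSERT = (
    "INSERT INTO following (follower_id, followed_id, created_at) VALUES (?, ?, ?)"
)
FOLLOWERS_INSERT = (
    "INSERT INTO followers (followed_id, follower_id, created_at) VALUES (?, ?, ?)"
)


def follow_writes(follower_id: UUID, followed_id: UUID, created_at: datetime):
    """Return the (statement, parameters) pairs that record one follow in both tables."""
    return [
        (FOLLOWING_INSERT, (follower_id, followed_id, created_at)),
        (FOLLOWERS_INSERT, (followed_id, follower_id, created_at)),
    ]


alice, bob = uuid4(), uuid4()
writes = follow_writes(alice, bob, datetime.now(timezone.utc))
for statement, params in writes:
    print(statement, params)
```

Both rows share the same `created_at` value, so the two tables agree on when the follow happened.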

Consistency Levels

Cassandra lets you tune the consistency-availability tradeoff per query:

Level	Description
ONE	One replica must respond. Fastest, least consistent.
QUORUM	Majority of replicas must respond. Balanced.
ALL	All replicas must respond. Slowest, most consistent.
LOCAL_QUORUM	Quorum within the local datacenter. Recommended for multi-DC.

sql

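-- cqlsh command (not CQL itself): sets the consistency level for this cqlsh session
CONSISTENCY QUORUM;

-- The ? is a bind marker; in application code, supply the value via a prepared statement
SELECT * FROM user_events WHERE user_id = ?;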
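
The arithmetic behind these levels is small enough to sketch. A quorum is a majority of the replicas, and a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts together exceed the replication factor. The function names here are illustrative, not part of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority for the given replication factor."""
    return replication_factor // 2 + 1


def reads_overlap_writes(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)

print(q)                                 # 2
print(reads_overlap_writes(q, q, rf))    # True:  QUORUM reads + QUORUM writes
print(reads_overlap_writes(1, 1, rf))    # False: ONE reads + ONE writes
print(reads_overlap_writes(1, rf, rf))   # True:  ONE reads + ALL writes
```

This is why QUORUM for both reads and writes is a common balanced choice: it tolerates one replica being down at a replication factor of 3 while still overlapping.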

Complete Example

sql

-- Time-series table: latest readings per sensor per day
CREATE TABLE sensor_readings (
  sensor_id   TEXT,
  bucket      TEXT,        -- "2025-01-15" (day bucket)
  ts          TIMESTAMP,
  value       DOUBLE,
  unit        TEXT,
  PRIMARY KEY ((sensor_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 2592000;  -- Auto-expire after 30 days

-- Write
INSERT INTO sensor_readings (sensor_id, bucket, ts, value, unit)
VALUES ('sensor-001', '2025-01-15', toTimestamp(now()), 23.5, 'celsius');

-- Read the 10 most recent readings in the 2025-01-15 bucket (rows are stored newest-first)
SELECT * FROM sensor_readings
WHERE sensor_id = 'sensor-001'
  AND bucket = '2025-01-15'
LIMIT 10;
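
The `default_time_to_live` value above is expressed in seconds. A small helper like the hypothetical `ttl_seconds` below can derive such values from a readable duration instead of hard-coding the number.

```python
from datetime import timedelta


def ttl_seconds(**duration) -> int:
    """Convert a duration given as timedelta keyword arguments into whole seconds."""
    return int(timedelta(**duration).total_seconds())


print(ttl_seconds(days=30))  # 2592000, the value used in the table definition
print(ttl_seconds(hours=12))
```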
