Data Modeling in Cassandra
Learn the query-first approach to Cassandra data modeling — the key to unlocking performance at scale.
Think in Queries, Not Entities
The most important mindset shift when moving to Cassandra: model your data around your queries, not around your entities.
In relational databases, you normalize data into entities and use JOINs to combine them at query time. In Cassandra, JOINs don't exist. Instead, you denormalize — you store data in exactly the shape each query needs. This means you often have the same data in multiple tables, each optimized for a specific access pattern.
Relational approach: Design tables based on entities, then write queries.
Cassandra approach: Define your queries first, then design tables to serve them.
Partition Key Design
The partition key is the most critical design decision. It determines:
- Which node stores the data
- Whether queries can be efficiently served
- How evenly data is distributed across the cluster
Good partition keys:
- High cardinality (many distinct values)
- Evenly distributed (no "hot partitions")
- Match your most common query pattern
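To make the cardinality point concrete, here is a toy sketch of how hashing a partition key decides which node owns a row. It is not Cassandra's actual partitioner (the default one uses Murmur3 tokens and a token ring); the MD5-modulo mapping, the `NODES` list, and the helper names are assumptions chosen only to keep the example self-contained.

```python
import hashlib
from collections import Counter

# Hypothetical 4-node cluster; the names are illustrative only.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def owner(partition_key: str) -> str:
    """Map a partition key to a node by hashing it (toy stand-in for the token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


def rows_per_node(keys) -> Counter:
    """Count how many rows land on each node for a stream of partition keys."""
    return Counter(owner(k) for k in keys)


# High cardinality: 1,000 distinct sensor ids, one row each.
high = rows_per_node(f"sensor-{i}" for i in range(1000))

# Low cardinality: the same 1,000 rows keyed by one of only three regions.
low = rows_per_node(["eu", "us", "ap"][i % 3] for i in range(1000))

print("high cardinality:", dict(high))
print("low cardinality: ", dict(low))
```

With only three distinct key values, at most three nodes can ever receive data, however many nodes the cluster has. Many distinct keys give the hash a chance to spread rows across every node.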
Example: Time-Series IoT Data
Imagine storing temperature readings from thousands of sensors:
-- Bad: sensor_id alone as the partition key
-- Each sensor's partition grows without bound, and a sensor that reports far more often overloads the nodes that own it
CREATE TABLE sensor_data_bad (
sensor_id TEXT,
reading_time TIMESTAMP,
temperature FLOAT,
PRIMARY KEY (sensor_id, reading_time)
);
-- Better: bucket by time period to limit partition size
CREATE TABLE sensor_data (
sensor_id TEXT,
date TEXT, -- e.g., "2025-01-15"
reading_time TIMESTAMP,
temperature FLOAT,
PRIMARY KEY ((sensor_id, date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
Clustering Columns
Clustering columns define the sort order of rows within a partition. This is critical for range queries.
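As a rough mental model, a partition behaves like a list of rows kept sorted by the clustering column, so a range query is a contiguous slice rather than a scan. The sketch below models that with plain Python; the integer timestamps, the `partition` data, and the helper names are hypothetical, and it says nothing about how Cassandra lays rows out on disk.

```python
from bisect import bisect_left, bisect_right

# One partition, modeled as (event_time, event_type) rows kept sorted by event_time.
partition = [
    (100, "login"),
    (105, "view"),
    (110, "click"),
    (120, "purchase"),
    (130, "logout"),
]
times = [t for t, _ in partition]


def range_scan(start: int, end: int) -> list:
    """Return rows with start <= event_time <= end as one contiguous slice."""
    lo = bisect_left(times, start)
    hi = bisect_right(times, end)
    return partition[lo:hi]


def latest(n: int) -> list:
    """Return the n most recent rows, newest first (the order a DESC clustering yields)."""
    return list(reversed(partition))[:n]


print(range_scan(105, 120))
print(latest(2))
```

Because the rows are already in order, both lookups find their starting point by binary search and read neighbors from there. No sorting happens at query time.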
-- Query: "Get the last 100 events for user X"
CREATE TABLE user_events (
user_id UUID,
event_time TIMESTAMP,
event_type TEXT,
data TEXT,
PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
-- This query is efficient because data is pre-sorted
SELECT * FROM user_events
WHERE user_id = ?
LIMIT 100;
Denormalization Example
Consider a social app where users follow each other. You need two queries:
- "Get everyone this user follows" (following list)
- "Get everyone who follows this user" (followers list)
-- Table 1: optimized for "who does user X follow?"
CREATE TABLE following (
follower_id UUID,
followed_id UUID,
created_at TIMESTAMP,
PRIMARY KEY (follower_id, followed_id)
);
-- Table 2: optimized for "who follows user X?"
CREATE TABLE followers (
followed_id UUID,
follower_id UUID,
created_at TIMESTAMP,
PRIMARY KEY (followed_id, follower_id)
);
-- When a follow happens, write to BOTH tables
This duplication is intentional and necessary in Cassandra.
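On the application side, the dual write might look like the sketch below. It only builds the two parameterized statements; the function name `follow_writes` and the sample UUIDs are hypothetical. In a real application a driver would execute them, and grouping them in a logged batch is one option for making sure both writes eventually apply.

```python
import uuid
from datetime import datetime, timezone

FOLLOWING_INSERT = (
    "INSERT INTO following (follower_id, followed_id, created_at) VALUES (?, ?, ?)"
)
FOLLOWERS_INSERT = (
    "INSERT INTO followers (followed_id, follower_id, created_at) VALUES (?, ?, ?)"
)


def follow_writes(follower_id: uuid.UUID, followed_id: uuid.UUID, created_at=None) -> list:
    """Return the (statement, parameters) pairs needed to record one follow."""
    if created_at is None:
        created_at = datetime.now(timezone.utc)
    return [
        # Row for "who does follower_id follow?"
        (FOLLOWING_INSERT, (follower_id, followed_id, created_at)),
        # Mirror row for "who follows followed_id?"
        (FOLLOWERS_INSERT, (followed_id, follower_id, created_at)),
    ]


# Usage with made-up ids: alice follows bob.
alice = uuid.UUID(int=1)
bob = uuid.UUID(int=2)
when = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
writes = follow_writes(alice, bob, when)
for statement, params in writes:
    print(statement, params)
```

Both rows share the same `created_at`, so the two tables describe the same event from either side.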
Consistency Levels
Cassandra lets you tune the consistency-availability tradeoff per query:
| Level | Description |
|---|---|
| ONE | One replica must respond. Fastest, least consistent. |
| QUORUM | Majority of replicas must respond. Balanced. |
| ALL | All replicas must respond. Slowest, most consistent. |
| LOCAL_QUORUM | Quorum within the local datacenter. Recommended for multi-DC. |
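The levels in the table translate into simple arithmetic on the replication factor (RF). A quorum is `floor(RF / 2) + 1` replicas, and a read is guaranteed to overlap the latest acknowledged write when the replicas contacted for the read and the write add up to more than RF. The sketch below encodes that rule; the function names are illustrative, and for `LOCAL_QUORUM` the `replication_factor` argument is assumed to be the local datacenter's RF.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    if level == "ONE":
        return 1
    if level in ("QUORUM", "LOCAL_QUORUM"):
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


for w_level, r_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(w_level, r_level, reads_see_latest_write(w_level, r_level, 3))
```

With RF = 3, writing and reading at QUORUM touches 2 + 2 replicas, so at least one replica in every read has the latest write. ONE/ONE touches 1 + 1, which leaves room for a stale read.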
-- In cqlsh, set the consistency level for the rest of the session
CONSISTENCY QUORUM;
SELECT * FROM user_events WHERE user_id = ?;
Example
-- Time-series table: latest readings per sensor per day
CREATE TABLE sensor_readings (
sensor_id TEXT,
bucket TEXT, -- "2025-01-15" (day bucket)
ts TIMESTAMP,
value DOUBLE,
unit TEXT,
PRIMARY KEY ((sensor_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND default_time_to_live = 2592000; -- Auto-expire after 30 days
-- Write (the bucket must be the day that ts falls on; this example assumes now() is on 2025-01-15)
INSERT INTO sensor_readings (sensor_id, bucket, ts, value, unit)
VALUES ('sensor-001', '2025-01-15', toTimestamp(now()), 23.5, 'celsius');
-- Read the 10 most recent readings in the 2025-01-15 bucket
SELECT * FROM sensor_readings
WHERE sensor_id = 'sensor-001'
AND bucket = '2025-01-15'
LIMIT 10;
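With a day-bucketed table, the application has to compute the bucket value on every write and list the buckets to query when a read spans several days. A minimal sketch of both helpers follows; the names `day_bucket` and `buckets_between` are hypothetical, and bucketing by UTC calendar day is an assumption (any fixed convention works as long as writers and readers agree).

```python
from datetime import date, datetime, timedelta, timezone


def day_bucket(ts: datetime) -> str:
    """Return the UTC day bucket ("YYYY-MM-DD") for a timezone-aware timestamp."""
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_between(start: date, end: date) -> list:
    """Return every day bucket from start to end inclusive, oldest first."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


# A reading taken late in the evening at UTC-5 belongs to the next UTC day.
reading_time = datetime(2025, 1, 15, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(day_bucket(reading_time))

# A three-day read needs one query per bucket.
print(buckets_between(date(2025, 1, 14), date(2025, 1, 16)))
```

Each bucket returned by `buckets_between` corresponds to one partition, so a multi-day read issues one query per bucket and merges the results in the application.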