Apache Cassandra Introduction

Learn Cassandra — the distributed NoSQL database built for massive scale, high availability, and fault tolerance across multiple data centers.

What is Apache Cassandra?

Apache Cassandra is an open-source, distributed NoSQL database designed to handle large volumes of data across many commodity servers with no single point of failure. Originally developed at Facebook to power inbox search, it was open-sourced in 2008, entered the Apache Incubator in 2009, and became a top-level Apache project in 2010.

Cassandra's defining feature is its masterless, peer-to-peer architecture. Unlike a MongoDB replica set or a typical MySQL replication setup, where a single primary accepts writes, every Cassandra node can serve reads and writes. There is no leader. Data is replicated across multiple nodes automatically, and the cluster keeps serving requests when nodes fail, as long as enough replicas remain reachable to satisfy the requested consistency level.

The CAP Theorem and Cassandra

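In distributed systems, the CAP theorem states that a system can guarantee at most two of these three properties at once. In practice, when a network partition occurs, the system must choose between consistency and availability: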

  • Consistency — All nodes see the same data at the same time
  • Availability — Every request receives a response
  • Partition tolerance — The system operates despite network failures

Cassandra is generally classified as an AP system: it prioritizes Availability and Partition tolerance over strict Consistency. Its default behavior is "eventual consistency": after a write, all replicas converge to the same value, but a read may briefly return stale data from a replica that has not yet received the update.

For many real-world use cases — IoT sensor data, user activity logs, time-series metrics — eventual consistency is perfectly acceptable.
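Eventual consistency is the default rather than a fixed property: the consistency level can be chosen per session or per query. The cqlsh sketch below is illustrative and assumes a keyspace with a replication factor of 3, where QUORUM means 2 of the 3 replicas (floor(RF / 2) + 1). When both writes and reads use QUORUM, every read overlaps at least one replica that acknowledged the latest write.

```sql
-- These are cqlsh shell commands, not CQL statements.

-- Show the consistency level of the current session (cqlsh starts at ONE).
CONSISTENCY;

-- Require a majority of replicas to acknowledge each read and write.
CONSISTENCY QUORUM;

-- Return to the default: one replica is enough.
CONSISTENCY ONE;
```

On a single-node cluster, QUORUM only succeeds for keyspaces with a replication factor of 1, because a quorum of three replicas needs two live nodes.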

When to Use Cassandra

Cassandra excels when you need:

  • Extremely high write throughput — Millions of writes per second across a cluster
  • Linear horizontal scalability — Add nodes to increase capacity; performance scales linearly
  • Multi-region, multi-datacenter replication — Built-in support for geo-distributed deployments
  • High availability with no downtime — No single point of failure; rolling upgrades without outages
  • Time-series and append-heavy workloads — IoT data, logs, metrics, user activity streams
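Replication across data centers is configured per keyspace. As a rough sketch, the statement below uses NetworkTopologyStrategy, which takes one replica count per data center. The keyspace name geo_demo is hypothetical, and datacenter1 is assumed to be the data center name reported by nodetool status (it is the usual default on a single-node install). Recent Cassandra versions may reject data center names that do not exist in the cluster.

```sql
-- Hypothetical keyspace; 'datacenter1' must match a real data center name.
CREATE KEYSPACE IF NOT EXISTS geo_demo
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 1
  };

-- In a multi-region cluster, each data center gets its own entry, e.g.
--   'us_east': 3, 'eu_west': 3
-- (the names shown here are placeholders).
```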

Cassandra is a poor fit for:

  • Applications needing complex JOINs or ad-hoc queries
  • Small datasets that don't need horizontal scaling
  • Systems where strict ACID transactions are required across multiple rows

Cassandra's Data Model

Cassandra organizes data into tables, and every table declares a partition key. Cassandra hashes the partition key to a token, and that token determines which nodes store the row.

text
Keyspace (like a database)
  └── Table
        └── Row (identified by partition key + clustering columns)
              └── Columns

A Cassandra table has two types of key components:

  1. Partition key — Determines which node stores the data. All rows with the same partition key are stored together.
  2. Clustering columns — Determine the order of rows within a partition.
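The two key components can be sketched with a small time-series table. Everything named here (the demo keyspace, the sensor_readings table, and its columns) is hypothetical and chosen only for illustration. The replication factor of 1 assumes a single-node development cluster.

```sql
-- Hypothetical scratch keyspace for a single-node dev cluster.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- sensor_id is the partition key; reading_time is a clustering column.
CREATE TABLE IF NOT EXISTS demo.sensor_readings (
  sensor_id    TEXT,
  reading_time TIMESTAMP,
  temperature  DOUBLE,
  PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

INSERT INTO demo.sensor_readings (sensor_id, reading_time, temperature)
VALUES ('sensor-1', '2024-01-01 10:00:00+0000', 20.5);
INSERT INTO demo.sensor_readings (sensor_id, reading_time, temperature)
VALUES ('sensor-1', '2024-01-01 11:00:00+0000', 21.0);
INSERT INTO demo.sensor_readings (sensor_id, reading_time, temperature)
VALUES ('sensor-2', '2024-01-01 10:30:00+0000', 18.2);

-- token() exposes the hash of the partition key.
-- Both sensor-1 rows share one token, so they are stored together.
SELECT sensor_id, token(sensor_id), reading_time, temperature
FROM demo.sensor_readings;

-- Rows inside one partition come back in clustering order (newest first here).
SELECT reading_time, temperature
FROM demo.sensor_readings
WHERE sensor_id = 'sensor-1';
```

Choosing the sensor as the partition key keeps each sensor's readings on the same replicas, so the last query reads a single partition.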

Cassandra uses CQL (Cassandra Query Language) — a SQL-like language that feels familiar but has different rules and constraints.
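A few of those differences can be seen in a short cqlsh session. This is a sketch rather than a complete list, and the demo keyspace and accounts table are hypothetical. Note also that CQL has no JOIN clause, so related data is usually denormalized into the table that serves each query.

```sql
-- Hypothetical scratch keyspace for a single-node dev cluster.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.accounts (
  account_id INT PRIMARY KEY,
  email      TEXT,
  plan       TEXT
);

INSERT INTO demo.accounts (account_id, email, plan)
VALUES (1, 'alice@example.com', 'free');

-- Familiar: a lookup by partition key reads like SQL.
SELECT email, plan FROM demo.accounts WHERE account_id = 1;

-- Different: INSERT is an upsert. Re-inserting an existing key overwrites
-- the listed columns instead of raising a duplicate-key error.
INSERT INTO demo.accounts (account_id, plan) VALUES (1, 'pro');

-- Different: filtering on a non-key column is rejected unless you opt in
-- with ALLOW FILTERING, which scans partitions and suits only small tables.
SELECT * FROM demo.accounts WHERE plan = 'pro' ALLOW FILTERING;
```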

Running Cassandra

Docker (quickest):

bash
docker run -d --name cassandra -p 9042:9042 cassandra:5

# Startup can take 30 seconds or longer; cqlsh fails to connect until the node is ready
docker exec -it cassandra cqlsh

Check cluster status:

bash
docker exec -it cassandra nodetool status

Example

sql
-- Connect with cqlsh
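-- Create a keyspace. SimpleStrategy with replication_factor 1 suits the
-- single-node container above; production clusters typically use
-- NetworkTopologyStrategy with a replication factor of 3 per data center.
CREATE KEYSPACE IF NOT EXISTS myapp
  WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
  };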

USE myapp;

-- Create a table
CREATE TABLE IF NOT EXISTS users (
  user_id   UUID PRIMARY KEY,
  email     TEXT,
  name      TEXT,
  created_at TIMESTAMP
);

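-- Insert (a fixed UUID literal so the query below can find the row;
-- uuid() would generate a random one instead)
INSERT INTO users (user_id, email, name, created_at)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'alice@example.com', 'Alice', toTimestamp(now()));

-- Query (the WHERE clause must restrict the partition key)
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;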
