Alignment and Responsible Use

Constitutional AI and What It Means for Claude Code Behavior

Claude Code is powered by Claude, trained using Anthropic's Constitutional AI approach. Understanding how that training shapes Claude's behavior helps you work with it more effectively and understand why it sometimes refuses, hedges, or asks clarifying questions.

What Constitutional AI Is

Constitutional AI (CAI) is Anthropic's method for training Claude to be helpful, harmless, and honest. Rather than learning solely from human preferences, Claude was trained to critique its own outputs against a set of principles — a "constitution" — and revise them before responding.

The practical result: Claude has internalized values that guide its behavior regardless of which tool it's used through. Whether you're in Claude.ai, the Claude API, or Claude Code, the same alignment properties apply. These are not external filters that a tool can toggle — they're trained into the model.

The Three Properties: Helpful, Harmless, Honest

Helpful: Claude aims to genuinely assist with your task — understanding your actual intent, not just the literal words you used. This is why it sometimes asks clarifying questions rather than producing output that technically matches your request but misses your goal.

Harmless: Claude won't knowingly help produce outcomes that cause harm. In a coding context, this means it will decline to write malware, data exfiltration tools, or code that clearly enables abuse. This is not configurable — it's a property of the model, not of Claude Code.

Honest: Claude expresses uncertainty rather than fabricating confident answers. In a coding context, this is why it says "I'm not certain, but..." or "you should verify this" rather than asserting things it doesn't know to be true. This property directly addresses one of the most dangerous failure modes in AI coding assistance: confident-sounding wrong code.

How This Shapes Claude Code Behavior

Refusing Harmful Requests

If you ask Claude Code to write a keylogger, a script that scrapes without authorization, or code that deliberately circumvents security controls, it will decline — explaining why rather than silently not complying.

This applies even in highly technical contexts where the intent seems ambiguous. If a request looks like it could enable harm, Claude will often ask for clarification about the context before proceeding.

Flagging Issues You Didn't Ask About

Because Claude is trained toward honesty and genuine helpfulness, it will often surface problems it notices during a task even if you didn't ask about them:

Security vulnerabilities in code it's modifying nearby
Potential race conditions or edge cases in functions it's implementing
Outdated dependency versions it encounters while reading package files

This proactive reporting is a feature of constitutional training — Claude isn't executing narrow instructions, it's trying to genuinely help you produce good outcomes.

Expressing Uncertainty

Claude Code will say "I'm not confident about this" when it isn't. This matters enormously for code that can't be easily tested — authentication logic, cryptographic implementations, complex SQL queries.

When Claude hedges, take it seriously. "I believe this is correct but you should test it" is meaningfully different from a confident answer, and the distinction reflects Claude's actual epistemic state.

Asking Clarifying Questions

When your request is underspecified in ways that matter — where different interpretations would lead to meaningfully different code — Claude Code will ask before proceeding. This is a trained behavior to prevent specification failure: doing what you literally said rather than what you actually want.

text

# Example of helpful clarification behavior
$ claude "add caching to the products endpoint"

# Claude might ask:
# "Should this cache be per-user or shared across all users?
#  Shared caching would be faster but would show stale data
#  if one user updates their product list before the cache expires.
#  Which behavior is correct for your use case?"

The Trust Hierarchy in Claude Code

Claude Code operates within a layered trust structure:

Anthropic's training — The foundational values and constraints from Constitutional AI training. Cannot be overridden.
Operator configuration — Anthropic's own guidelines for how Claude Code operates as a product.
Your CLAUDE.md — Project-specific context and constraints you define. The model treats these seriously.
Session instructions — Your in-session prompts and commands.
Inferred intent — What the model believes you probably want based on context.

Lower layers cannot override higher ones. Your CLAUDE.md cannot instruct Claude to help with things that violate layer 1 or 2. Understanding this hierarchy helps set realistic expectations.

What CLAUDE.md Can and Cannot Do

Can do:

Define project-specific conventions the model should follow
Specify compliance requirements (HIPAA, SOC2, PCI) that constrain how code is written
Set security review requirements for sensitive code areas
Define coding patterns, file structures, and team standards

Cannot do:

Unlock behaviors that constitutional training prohibits
Make the model skip safety considerations for "trusted" requests
Override Anthropic's usage policies

markdown

# CLAUDE.md — What Works

## Critical Constraints
- This is a financial services application. Never generate any code
  that logs transaction amounts, account numbers, or user PII to
  any logging system. Use masked representations only.

## Security Review Required
Any changes to: payment processing, authentication, authorization,
or data export functionality must be flagged in your response as
requiring human security review before merging.

This type of CLAUDE.md works because it aligns with constitutional training — it asks Claude to be more careful and to surface concerns for human review, which reinforces rather than conflicts with its trained values.

The Alignment-Performance Tradeoff Myth

A common misconception: that safety-oriented training reduces capability. The practical evidence suggests otherwise. Claude's trained tendency to ask clarifying questions, flag uncertainty, and notice side effects makes it a better coding assistant — not a more limited one.

The alternative — an AI that confidently produces any output requested without expressing uncertainty or raising concerns — would generate more code faster and cause more silent failures, security vulnerabilities, and hard-to-diagnose bugs.

Alignment and capability are complementary in practice.

Working Effectively Within Constitutional Constraints

When Claude declines a request:

Ask yourself whether the request could plausibly enable harm from Claude's perspective
Provide context that clarifies legitimate intent: "This is for authorized penetration testing of our own systems"
If the refusal seems clearly wrong, rephrase to be more specific about the safe use case

When Claude hedges:

Treat the uncertainty flag as a signal to verify manually
Ask "what specifically are you uncertain about?" to identify the high-risk parts
For security-sensitive code, plan for human expert review regardless of Claude's confidence level

When Claude asks clarifying questions:

Answer them thoroughly — this investment in specification prevents costly rework
If the question reveals an ambiguity you hadn't considered, take time to resolve it before proceeding

Key Takeaways

Constitutional AI training gives Claude consistent values — helpful, harmless, honest — regardless of which tool uses it
Refusing harmful requests, expressing uncertainty, and asking clarifying questions are trained behaviors, not bugs
The CLAUDE.md file can define project constraints that align with constitutional values — it cannot override them
Alignment and capability are complementary: Claude's trained caution makes it a more reliable coding assistant
Treat refusals and uncertainty expressions as meaningful signals, not obstacles to route around