SQL Server, from the ground up Lesson 21 / 40

Clustered and non-clustered indexes, the way I wish someone had explained them

Why every table wants a clustered index, how to pick the column, what fragmentation actually costs, and the hidden write-amplification of getting it wrong.

The first time a senior DBA told me that a clustered index “is the table,” I nodded, smiled, and pretended I knew exactly what that meant. I then spent the next year generating execution plans that made no sense to me until, one afternoon, staring at a scary-looking “Key Lookup” in a production query, the whole thing finally clicked.

This is the explanation I wish somebody had given me on day one. If you already get it, skip to the “picking the key” section and watch me argue with twenty years of bad tutorials. If you don’t get it yet, welcome. We are going to fix that in the next ten minutes.

Tables are B-trees. Not “stored in” B-trees. They are B-trees.

SQL Server stores a table as a data structure called a B-tree: a shape that looks like an upside-down tree with a single root page at the top, some intermediate pages in the middle, and the actual rows at the bottom in what are called leaf pages. When you run SELECT * FROM dbo.Orders WHERE OrderId = 42, the engine walks from the root down through the intermediate pages, following the pointer that says “OrderId 42 is somewhere down this branch,” until it hits the leaf page that holds row 42.

The clustered index is the rule that says how that B-tree is sorted. You get exactly one per table, because a table can only be physically sorted one way at a time. If you cluster on OrderId, the rows are laid out on disk in OrderId order. If you cluster on OrderDate, they’re laid out in date order. The leaf pages of the clustered index are the rows of the table. There is no separate “table” lurking underneath.

When you write this:

CREATE TABLE dbo.Orders (
    OrderId      BIGINT        NOT NULL PRIMARY KEY,
    CustomerId   INT           NOT NULL,
    OrderDate    DATETIME2(0)  NOT NULL,
    Amount       DECIMAL(18,2) NOT NULL,
    Status       TINYINT       NOT NULL
);

…SQL Server silently creates a clustered index on OrderId, because primary keys default to clustered. That’s it. That’s your table. The clustered index and the table are the same object. If that sentence feels weirdly philosophical, read it twice and move on. It is the single most important idea in this lesson.

Non-clustered indexes are little side copies with signposts

A non-clustered index is a second B-tree that lives next to the table. Its leaf pages don’t hold rows — they hold the indexed columns plus a pointer back to the clustered index. Think of it as a small sorted notebook that says: “If you want row where CustomerId = 7, look over here in the main book, page 143.”

That pointer lookup is called a key lookup, and it shows up constantly in execution plans for people who are new to SQL Server. A query that uses a non-clustered index to find rows, then has to run back to the clustered index to grab extra columns, looks like this in the plan:

Index Seek (NonClustered) → Key Lookup (Clustered) → Nested Loops → SELECT

Every key lookup is a round-trip. Do a thousand of them and you have a problem. Do a million of them and you have a lunchtime incident.

The trick is to make the non-clustered index covering, meaning every column the query wants is in the index itself. You do that by adding columns to the key or to the INCLUDE list:

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, Amount);

Now SELECT CustomerId, OrderDate, Amount FROM dbo.Orders WHERE CustomerId = 7 can be answered entirely from the non-clustered index. No key lookup. No round-trip. The plan has a single clean seek, and your coffee stays warm.

Tables without a clustered index (aka “heaps”) are a trap

If a table has no clustered index, it’s a heap. Rows go wherever there happens to be free space on disk. Pages aren’t sorted. There’s no ordered structure.

Heaps are a trap. Every non-clustered index on a heap has to point to the physical location of the row, which is called a RID (row identifier). If the row ever moves — because you updated a variable-length column and it doesn’t fit on the original page anymore — SQL Server leaves a forwarding pointer behind so the old RID still works. If that row moves again, you can end up with chains of forwarding pointers that double your I/O and make every scan slower.

In fifteen years I have run into exactly three cases where a heap was the right answer: staging tables that are loaded and truncated in bulk, tiny lookup tables that never change, and one beautifully-engineered log table where no one ever queried anything except INSERT. That’s it. Make every other table have a clustered index, ideally from the CREATE TABLE statement. If you inherit a heap, put a clustered index on it before you do anything else.

Picking the clustered key — the four-rule checklist

The clustered key is the most important schema decision you make on a table. It dictates the physical order of every row, and every non-clustered index on that table carries a copy of it inside its own leaf pages. A wide clustered key makes every other index on the table fatter. A random clustered key fragments the whole table. A volatile clustered key forces rows to physically move every time it changes.

Kimberly Tripp wrote down the classic four-point checklist about fifteen years ago and it has not aged a day:

  1. Narrow. Every byte of the clustered key is repeated in every non-clustered index. Choose skinny data types. BIGINT (8 bytes) is great. UNIQUEIDENTIFIER (16 bytes) is twice as expensive. A composite key of five VARCHAR(50) columns is malpractice.
  2. Static. Updating a clustered-key column means physically moving the row and updating every non-clustered index. Pick a column that doesn’t change. Customer ID, order ID, timestamp-of-creation — all fine. “Current status” — not fine.
  3. Unique. If your clustered key isn’t unique, SQL Server sneaks a 4-byte “uniquifier” behind your back and stores it on every duplicate. Better to be unique on purpose than to pay for a hidden column you didn’t know existed.
  4. Ever-increasing. New rows land on the trailing page of the B-tree, which means inserts don’t split pages in the middle of the tree. This is why BIGINT IDENTITY, INT IDENTITY, and DATETIME2 on append-only event tables are the three workhorse clustered keys you’ll see in production.

This is also why people will die on the hill of “don’t cluster on a GUID.” A NEWID() GUID is wide (16 bytes), random (every insert lands in a different page), and causes constant page splits. If you must use a GUID as your business key, make it a non-clustered unique index and cluster on a BIGINT IDENTITY behind it. I have personally rescued a production system that was spending 40% of its CPU on page splits for exactly this reason. The fix took one afternoon. The bill for the preceding six months did not.

If you genuinely need GUIDs because of a distributed system, use NEWSEQUENTIALID() instead of NEWID() so at least the values are locally monotonic. It’s the compromise option, and it’s fine.

Non-clustered indexes: the “equality, range, include” pattern

When you design a non-clustered index for a specific query, here’s the trick that covers 80% of cases:

  • Put equality columns first (WHERE x = @x)
  • Then range columns (WHERE y BETWEEN @a AND @b)
  • Then ordering columns (ORDER BY z)
  • Then INCLUDE the leftover columns the query wants to SELECT

Take this query:

SELECT CustomerId, OrderDate, Amount
FROM dbo.Orders
WHERE CustomerId = @c
  AND OrderDate >= @start
ORDER BY OrderDate DESC;

The right index is (CustomerId, OrderDate) INCLUDE (Amount). CustomerId first because it’s the equality filter, OrderDate second because it’s both the range filter and the sort, and Amount in INCLUDE because the query returns it but doesn’t filter or sort on it.

Get the column order wrong — say, (OrderDate, CustomerId) — and the index is useless for this query. A range scan on OrderDate means the second column (CustomerId) is no longer in sorted order under the filter, so SQL Server can’t seek to your customer. It’ll scan or it’ll pick a different index entirely.

The write cost nobody brings up in tutorials

Here is the law of indexes: they make reads faster and writes slower. No exceptions. No free lunches.

A table with 10 non-clustered indexes pays for 11 page writes on every INSERT: the table itself, plus each index. DELETE is the same. UPDATE is the same if any of the indexed columns changed.

I have seen an ingestion pipeline go from 90 seconds to 14 minutes because a dev added two “helpful” non-clustered indexes during a Tuesday morning debugging session and forgot to remove them. The read-side query plans looked gorgeous. The pipeline was on fire by Wednesday. The fix was to drop the indexes, sprint out of the meeting room, and not make eye contact with the on-call engineer for the rest of the week.

Before you add a non-clustered index to a hot OLTP table, look up the write rate. If a table takes 50,000 inserts a minute, an extra microsecond per insert is 50 milliseconds a minute of extra CPU. That might be fine. It might not. The point is you have to look, not guess.

Fragmentation: what actually matters

Two kinds of fragmentation to know about:

  • External (logical) fragmentation — pages that are no longer in physical order. Hurts large range scans, because the disk has to jump around.
  • Internal fragmentation — pages that are only half full because of page splits. Wastes RAM and disk space.

Here’s the one query you should know by heart for finding the worst offenders:

SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name                     AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id  = ips.index_id
WHERE ips.page_count > 1000          -- ignore tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;

Then the rule of thumb:

  • Below 5% → leave it alone. Defragmenting tiny amounts is pure I/O for no benefit.
  • 5–30%ALTER INDEX ... REORGANIZE. Online, low-impact, shuffles leaf pages around. Doesn’t fully rebuild the tree.
  • Above 30%ALTER INDEX ... REBUILD. Heavier, rebuilds the entire index. Use WITH (ONLINE = ON) on Enterprise Edition; on Standard Edition this takes a schema lock, so do it in a maintenance window.

Important caveat: these thresholds are from Microsoft’s own old guidance and Paul Randal (who literally wrote the storage engine) has publicly said for years that the 30% number is basically made up. On small indexes, fragmentation doesn’t matter. On SSDs, fragmentation matters much less than it used to because the disk doesn’t care about physical order. On tables that get scanned a lot, it still matters. On tables that only ever get seeks, you can ignore it almost entirely. We will do a whole lesson on this later in the course (lesson 24: The myth of 30%), because too many DBAs are still copy-pasting the same “rebuild everything above 30%” job that was obsolete in 2011.

The biggest win is not letting fragmentation happen in the first place: ever-increasing clustered key, sensible FILLFACTOR (90% on tables that take many updates, 100% on append-only ones), and never run DBCC SHRINKDATABASE in production. It defragments the file by shuffling pages backwards, which re-fragments every index it touches. It’s the database equivalent of cleaning your room by stuffing everything into one cupboard.

The mental model that makes indexing feel easy

Once you really believe that the clustered index is the table, and that every non-clustered index carries a copy of the clustered key inside its leaves, every piece of advice in this area becomes a small consequence of that one fact:

  • Wide clustered key → every non-clustered index gets wider.
  • Random clustered key → every insert lands in a random page → constant page splits → fragmentation → sad queries.
  • Volatile clustered key → updates physically move rows → every non-clustered index has to be rewritten on each update.
  • Ten non-clustered indexes on a hot write table → every INSERT pays for eleven writes.

Indexes are a budget. Spend it on the queries that matter. Leave the rest alone. Measure before you add, measure before you drop, and never let anyone talk you into “let’s just add an index to fix this” without looking at how often that table gets written to.

Run this on your own machine

If you want to feel all of this in your hands, spin up a quick table:

USE tempdb;

CREATE TABLE dbo.LessonOrders (
    OrderId    BIGINT        NOT NULL IDENTITY(1,1) PRIMARY KEY,
    CustomerId INT           NOT NULL,
    OrderDate  DATETIME2(0)  NOT NULL DEFAULT SYSUTCDATETIME(),
    Amount     DECIMAL(18,2) NOT NULL
);

INSERT INTO dbo.LessonOrders (CustomerId, Amount)
SELECT ABS(CHECKSUM(NEWID())) % 1000, ABS(CHECKSUM(NEWID())) % 10000 / 100.0
FROM sys.all_objects a CROSS JOIN sys.all_objects b;

-- Before:
SELECT CustomerId, OrderDate, Amount
FROM dbo.LessonOrders
WHERE CustomerId = 42;
-- Look at the actual execution plan. Scan.

CREATE NONCLUSTERED INDEX IX_LessonOrders_CustomerId
    ON dbo.LessonOrders (CustomerId)
    INCLUDE (OrderDate, Amount);

-- After:
SELECT CustomerId, OrderDate, Amount
FROM dbo.LessonOrders
WHERE CustomerId = 42;
-- Seek. No key lookup. Covering index.

Turn on “Include Actual Execution Plan” in SSMS (Ctrl+M), run the query before and after you add the index, and watch the shape of the plan change. That’s the moment it clicks. Every indexing decision after today starts from this one idea.

Next lesson: covering indexes and the key-lookup tax, in proper detail.

Search