UNION vs UNION ALL

Short lesson today. Just one keyword, one bug, one rule.

Runehold’s finance team has two tables: Sales.Orders for the online channel and Retail.Sale for the physical stores. They want a unified view for revenue reporting. The query? Two selects combined with UNION. Except… UNION is slower than it needs to be in 99% of cases, and half the queries in production that use UNION should actually use UNION ALL.

Let’s fix that.

The four set operators

T-SQL has four operators that combine result sets:

UNION — rows from A + rows from B, with duplicates removed
UNION ALL — rows from A + rows from B, keeping duplicates
INTERSECT — rows that appear in both A and B
EXCEPT — rows in A that don’t appear in B

All four require:

Both sides have the same number of columns
Column types are compatible (or SQL Server will try implicit conversion, which may break SARGability)
Column names in the final result come from the first query

-- UNION: merge + remove duplicates
SELECT ProductId, Quantity FROM Sales.OrderLine
UNION
SELECT ProductId, Quantity FROM Retail.SaleLine;

-- UNION ALL: merge, keep duplicates
SELECT ProductId, Quantity FROM Sales.OrderLine
UNION ALL
SELECT ProductId, Quantity FROM Retail.SaleLine;

The only difference: UNION removes duplicates, UNION ALL doesn’t.

To remove duplicates, SQL Server has to do extra work: sort or hash every combined row, compare, and drop duplicates. That’s measurable on small data and painful on large data.

The rule:

Use UNION ALL unless you have a specific reason to deduplicate.
Use UNION only when you know there can be duplicates and you want them gone.

In most cases where you’re combining partitioned data — online orders + retail orders, archived logs + live logs, tenant A’s data + tenant B’s data — there are no duplicates possible by construction. UNION ALL is the correct choice.

Writing UNION “to be safe” is like sorting an already-sorted array “to be safe.” It’s not free.

The one place people still write UNION

Marketing at Runehold has an all-hands email list built from three sources: customers who opted in, employees, partners. Some people overlap — an employee who is also a customer.

-- Combine three lists, remove duplicates
SELECT Email FROM Sales.Customer WHERE EmailOptIn = 1
UNION
SELECT Email FROM HR.Employee
UNION
SELECT Email FROM Partners.Contact;

This is legitimate UNION. The expected output is a deduplicated list, and the duplicates across sources are genuinely possible. In this case UNION ALL would send the same person three emails. UNION is the right choice.

When in doubt: can this row legitimately exist in both inputs? If yes, you need to decide what to do with the duplicate. If no, UNION ALL is safer and faster.

INTERSECT and EXCEPT

Less frequently used but beautiful when they fit.

`INTERSECT` — what’s in both?

-- Customers who have ordered online AND walked into a store
SELECT CustomerId FROM Sales.Orders
INTERSECT
SELECT CustomerId FROM Retail.Sale;

Returns CustomerIds that appear in both result sets. Equivalent to an inner join / EXISTS but often much more readable for “overlap” questions.

`EXCEPT` — in A but not in B?

-- Customers who have ordered online but never walked into a store
SELECT CustomerId FROM Sales.Orders
EXCEPT
SELECT CustomerId FROM Retail.Sale;

Returns rows in the first query that don’t appear in the second. Identical in effect to NOT EXISTS.

Both INTERSECT and EXCEPT deduplicate, like UNION. For single-column queries they’re clean and obvious. For multi-column it can get confusing — every column participates in the “does this row match” comparison.

A classic use of EXCEPT: data reconciliation. “What rows are in the staging table that aren’t in production?” A fast, obvious way to compare two result sets.

ORDER BY goes on the LAST query only

-- WRONG: ORDER BY on a middle query
SELECT Email FROM Sales.Customer ORDER BY Email    -- error
UNION
SELECT Email FROM HR.Employee;

-- RIGHT: ORDER BY after everything
SELECT Email FROM Sales.Customer
UNION
SELECT Email FROM HR.Employee
ORDER BY Email;

Only one ORDER BY allowed, at the end. It applies to the combined result.

If you really need to sort inside a branch (say, you want TOP (10) from each source combined), use subqueries:

SELECT * FROM (
    SELECT TOP (10) Email, 'Customer' AS source
    FROM Sales.Customer
    ORDER BY CreatedAt DESC
) AS c
UNION ALL
SELECT * FROM (
    SELECT TOP (10) Email, 'Employee' AS source
    FROM HR.Employee
    ORDER BY HireDate DESC
) AS e
ORDER BY source, Email;

Each subquery does its own TOP + ORDER BY; the outer UNION ALL combines the two, and the final ORDER BY sorts the combined result.

Column names and types

Column names come from the first query. Types must be compatible or SQL Server does implicit conversion.

SELECT CustomerId AS id, 'Sales' AS source FROM Sales.Orders
UNION ALL
SELECT EmployeeId, 'HR'  FROM HR.Employee;
-- Result columns: id, source

If the column types don’t match, SQL Server will try to convert to a common type, or fail with an explicit “types are incompatible” error. Best practice: cast explicitly when it’s not obvious:

SELECT CAST(CustomerId AS INT) AS id, 'Sales' AS source FROM Sales.Orders
UNION ALL
SELECT CAST(EmployeeId AS INT), 'HR' FROM HR.Employee;

Saves a confusing error in prod when someone changes a type downstream.

Tagging the source: a common pattern

When combining data from multiple sources, add a literal column identifying where each row came from:

SELECT 'online' AS channel,
       OrderDate, Total, CustomerId
FROM Sales.Orders
UNION ALL
SELECT 'retail',
       SaleDate,  Total, CustomerId
FROM Retail.Sale;

Finance’s revenue dashboard: one table, one time series, tagged by channel. Easy to group, easy to chart, easy to debug.

Run this on your own machine

USE Runehold;
GO

-- Set up a tiny Retail.Sale for the demo
IF SCHEMA_ID('Retail') IS NULL EXEC('CREATE SCHEMA Retail AUTHORIZATION dbo');

IF OBJECT_ID('Retail.Sale', 'U') IS NULL
    CREATE TABLE Retail.Sale (
        SaleId BIGINT IDENTITY(1,1) PRIMARY KEY,
        StoreCity NVARCHAR(50) NOT NULL,
        SaleDate DATETIME2(0) NOT NULL,
        Total DECIMAL(19,4) NOT NULL,
        CustomerId INT NULL   -- some in-store sales are anonymous
    );

INSERT INTO Retail.Sale (StoreCity, SaleDate, Total, CustomerId)
VALUES ('Amsterdam', '2026-03-20', 45.00, 1),
       ('Milan',      '2026-03-21', 62.00, 2),
       ('Milan',      '2026-03-21', 18.50, NULL),
       ('Berlin',    '2026-03-22', 99.00, 3);

-- Query 1: unified revenue view with channel tag
SELECT 'online' AS channel,
       OrderDate AS tx_date,
       Total,
       CustomerId
FROM Sales.Orders
UNION ALL
SELECT 'retail',
       SaleDate,
       Total,
       CustomerId
FROM Retail.Sale
ORDER BY tx_date DESC;

-- Query 2: combined totals per channel
WITH combined AS (
    SELECT 'online' AS channel, Total FROM Sales.Orders
    UNION ALL
    SELECT 'retail',            Total FROM Retail.Sale
)
SELECT channel, COUNT(*) AS tx_count, SUM(Total) AS revenue
FROM combined
GROUP BY channel;

-- Query 3: INTERSECT — customers active on both channels
SELECT CustomerId FROM Sales.Orders WHERE CustomerId IS NOT NULL
INTERSECT
SELECT CustomerId FROM Retail.Sale  WHERE CustomerId IS NOT NULL;

-- Query 4: EXCEPT — customers who shopped online but never in-store
SELECT CustomerId FROM Sales.Orders WHERE CustomerId IS NOT NULL
EXCEPT
SELECT CustomerId FROM Retail.Sale  WHERE CustomerId IS NOT NULL;

-- Query 5: performance comparison — time this versus UNION ALL on a big set
-- (small data here, but the principle scales)
SET STATISTICS TIME ON;

SELECT OrderDate AS d FROM Sales.Orders
UNION
SELECT SaleDate  FROM Retail.Sale;     -- implicit dedupe

SELECT OrderDate AS d FROM Sales.Orders
UNION ALL
SELECT SaleDate  FROM Retail.Sale;     -- faster

SET STATISTICS TIME OFF;

Run the timings on a bigger dataset and you’ll see the cost. On a table with millions of rows, UNION vs UNION ALL is the difference between a 4-second query and a 400 ms one.

Next lesson: string functions. STRING_AGG, STRING_SPLIT, TRIM, CONCAT_WS, and the death of half the CLR functions old shops still ship around.

The four set operators