Every confused pandas user I’ve ever helped at work was confused at the same level. Not at the API surface — they could read the docs — but at the level of what these objects are. They had a vague sense that df["col"] gives you “a column,” that df.loc[5] gives you “a row,” that there is something called an “index” that mostly seems annoying. The model in their head was “a 2D table, like Excel.” When that model failed — and it fails the moment you do an arithmetic operation between two slices that don’t have the same row order — they bounced off and reached for df.values to escape into NumPy, which is where competent intuitions go to die.
Pandas has two core data structures: Series and DataFrame. They are simple. Today we explain them. The rest of Module 5 will then make sense.
Series — a 1-D labeled array
A Series is a one-dimensional array of values, plus a one-dimensional array of labels (the index) of the same length. That’s it. Every value has a label, and the label is what makes pandas different from NumPy.
import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s)
0 10
1 20
2 30
3 40
dtype: int64
The left column (0, 1, 2, 3) is the index. When you don’t specify one, pandas makes you a default RangeIndex from 0. The right column is the values. The dtype line at the bottom tells you the type of those values.
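To see that a Series really is two arrays glued together, you can pull the pieces apart. This is a minimal sketch using the same `s` as above; the printed dtype assumes the values are plain Python ints, as they are here.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# The labels and the values live in two separate objects of equal length.
print(s.index)       # the default RangeIndex
print(s.to_numpy())  # [10 20 30 40]
print(s.dtype)       # int64
```

The index and the values are independent: you can swap in a new index without touching the values, which is what makes label-based alignment possible later on.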
You can specify your own index — strings, dates, anything hashable:
revenue = pd.Series(
    [120, 135, 148, 162],
    index=["Q1", "Q2", "Q3", "Q4"],
    name="revenue_2025",
)
print(revenue)
Q1 120
Q2 135
Q3 148
Q4 162
Name: revenue_2025, dtype: int64
You can build one from a dict — the keys become the index automatically:
revenue = pd.Series({"Q1": 120, "Q2": 135, "Q3": 148, "Q4": 162})
The index is the killer feature, and it’s also what makes pandas surprising. Operations between Series align by label, not by position. Watch:
revenue_2025 = pd.Series({"Q1": 120, "Q2": 135, "Q3": 148, "Q4": 162})
revenue_2024 = pd.Series({"Q2": 100, "Q3": 110, "Q4": 125, "Q1": 95})
growth = revenue_2025 - revenue_2024
print(growth)
Q1 25
Q2 35
Q3 38
Q4 37
dtype: int64
Notice that revenue_2024 had its quarters in a different order. It didn’t matter. Pandas matched up Q1 with Q1 and so on, by label. If you’d done the same thing in NumPy, you’d have gotten nonsense — Q1 minus Q2, Q2 minus Q3 — silently. This is the trade pandas makes: a slightly more complex object, in exchange for arithmetic that means what you wanted.
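To make the "nonsense, silently" part concrete, here is a small sketch that runs the same subtraction twice: once on the Series (aligned by label) and once on the raw NumPy arrays (aligned by position). The variable names `by_label` and `by_position` are just for illustration.

```python
import pandas as pd

revenue_2025 = pd.Series({"Q1": 120, "Q2": 135, "Q3": 148, "Q4": 162})
revenue_2024 = pd.Series({"Q2": 100, "Q3": 110, "Q4": 125, "Q1": 95})

# pandas matches Q1 with Q1, Q2 with Q2, and so on.
by_label = revenue_2025 - revenue_2024

# NumPy only knows positions: first minus first, second minus second.
by_position = revenue_2025.to_numpy() - revenue_2024.to_numpy()

print(by_label.to_dict())    # {'Q1': 25, 'Q2': 35, 'Q3': 38, 'Q4': 37}
print(by_position.tolist())  # [20, 25, 23, 67]
```

Both versions run without a warning. Only one of them is the growth figure you meant.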
What happens with mismatched labels?
a = pd.Series({"x": 1, "y": 2})
b = pd.Series({"y": 10, "z": 20})
print(a + b)
x NaN
y 12.0
z NaN
dtype: float64
Labels not present in both Series produce NaN. The dtype was promoted to float64 because integers can’t be NaN in NumPy-backed pandas — a famous footgun we’ll fix when we discuss the Arrow backend in a moment.
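If NaN is not what you want for the unmatched labels, there are two common ways out, sketched here with the same `a` and `b`. Whether a missing label should count as zero or be dropped is a modelling decision, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

a = pd.Series({"x": 1, "y": 2})
b = pd.Series({"y": 10, "z": 20})

# Option 1: treat a label missing from one side as 0 before adding.
total = a.add(b, fill_value=0)
print(total.to_dict())   # {'x': 1.0, 'y': 12.0, 'z': 20.0}

# Option 2: keep only the labels present in both Series.
shared = (a + b).dropna()
print(shared.to_dict())  # {'y': 12.0}
```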
Dtypes — what kind of values
Every Series has a dtype, and which one you have controls a lot of behavior. The common ones in pandas:
- int64: 64-bit signed integers. Default when all values are whole numbers and there are no nulls.
- float64: 64-bit floating point. Default when there are decimals, or when an int column has NaN.
- object: a NumPy “this is a Python object” escape hatch. In practice, almost always means strings, sometimes mixed types. Slow.
- string[python] / string[pyarrow]: proper string dtypes. The Arrow-backed one is meaningfully faster and the modern default.
- bool: booleans.
- category: for repeating values from a small fixed set (think: country codes, status enums). Stored as integers under the hood with a lookup table; saves enormous memory on real datasets.
- datetime64[ns]: timestamps with nanosecond precision.
- timedelta64[ns]: durations.
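The category dtype is the easiest of these to see in action. The sketch below builds a made-up column of 100,000 country codes and compares its memory footprint before and after conversion; the exact byte counts depend on your pandas version and string backend, so they are printed rather than promised.

```python
import pandas as pd

# Hypothetical data: four country codes repeated many times.
codes = pd.Series(["IT", "FR", "DE", "ES"] * 25_000)
as_category = codes.astype("category")

# Bytes used by each representation (deep=True counts the strings themselves).
print(codes.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# The lookup table, and the small integers that point into it.
print(as_category.cat.categories.tolist())     # ['DE', 'ES', 'FR', 'IT']
print(as_category.cat.codes.head(4).tolist())  # [3, 2, 0, 1]
```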
Then the nullable equivalents, which are pandas extension types: Int64 and Float64 (note the capital letter at the front), plus boolean and string. These support pd.NA (a proper “missing” value) without changing the dtype. They were the bridge between old NumPy-backed pandas and the new Arrow-backed pandas; you still see them, but for new code, just go straight to the Arrow backend.
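For reference, this is what the nullable extension type looks like when you meet it in older code. It is a minimal sketch and does not need pyarrow installed.

```python
import pandas as pd

# Capital-I "Int64" is the nullable extension dtype, not NumPy's int64.
s = pd.Series([1, 2, None], dtype="Int64")

print(s.dtype)            # Int64
print(s.isna().tolist())  # [False, False, True]
print(s.sum())            # 3 (missing values are skipped by default)
```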
# Old style — int column with a missing value becomes float
pd.Series([1, 2, None])
# 0 1.0
# 1 2.0
# 2 NaN
# dtype: float64
# Arrow backend — stays an integer, gets a proper NA
pd.Series([1, 2, None], dtype="int64[pyarrow]")
# 0 1
# 1 2
# 2 <NA>
# dtype: int64[pyarrow]
That second one is what you actually want. Strings are the bigger win:
# Old style: NumPy "object" array of Python str instances
pd.Series(["alpha", "beta", "gamma"]) # dtype: object
# Arrow strings: tighter memory, ~10x faster string ops
pd.Series(["alpha", "beta", "gamma"], dtype="string[pyarrow]")
If you put pd.options.future.infer_string = True near the top of your script, then (with pyarrow installed) pandas will infer the Arrow-backed string dtype for text data automatically, instead of falling back to object. Do that. Future-you will have less to worry about.
DataFrame — a dict of aligned Series
A DataFrame is a dictionary of Series that all share the same row index. That’s the entire mental model. Every column is a Series, every column has the same length, and every column has the same row labels.
import pandas as pd
df = pd.DataFrame({
    "country": ["IT", "FR", "DE", "ES"],
    "population_m": [59, 68, 84, 47],
    "in_eurozone": [True, True, True, True],
})
print(df)
   country  population_m  in_eurozone
0       IT            59         True
1       FR            68         True
2       DE            84         True
3       ES            47         True
You created it from a dict-of-lists. The keys became column names, the values became Series, the row index defaulted to a RangeIndex from 0.
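The "dict of aligned Series" model is easiest to believe when you build a DataFrame from actual Series whose labels are in different orders. In this sketch the `area` figures and the column name `area_k_km2` are invented for illustration.

```python
import pandas as pd

population = pd.Series({"IT": 59, "FR": 68, "DE": 84})
# Hypothetical numbers, deliberately listed in a different order.
area = pd.Series({"DE": 358, "IT": 302, "FR": 552})

# pandas aligns the two Series on their labels while building the frame.
df = pd.DataFrame({"population_m": population, "area_k_km2": area})
print(df)

# Each column is a Series that shares the DataFrame's row index.
print(type(df["population_m"]).__name__)          # Series
print(df["population_m"].index.equals(df.index))  # True
```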
You can also build a DataFrame from a list of dicts (one dict per row), which is closer to what you get back from a JSON API:
rows = [
    {"country": "IT", "population_m": 59, "in_eurozone": True},
    {"country": "FR", "population_m": 68, "in_eurozone": True},
    {"country": "DE", "population_m": 84, "in_eurozone": True},
]
df = pd.DataFrame(rows)
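Real JSON records are rarely this tidy. A quick sketch of what happens when one record lacks a key (the second dict here is deliberately incomplete): pandas still builds the column and marks the gap as missing.

```python
import pandas as pd

rows = [
    {"country": "IT", "population_m": 59},
    {"country": "FR"},  # this record has no population_m key
]
df = pd.DataFrame(rows)

print(df.columns.tolist())                 # ['country', 'population_m']
print(df["population_m"].isna().tolist())  # [False, True]
```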
Or from a 2D NumPy array (then you specify columns yourself):
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, columns=["a", "b", "c"])
Or — most commonly in real life — from a file:
df = pd.read_csv("countries.csv")
df = pd.read_parquet("countries.parquet")
We do that in lesson 27.
The attributes you’ll touch every day
A DataFrame has a small set of attributes and methods that you’ll reach for constantly:
df.shape # (4, 3) — rows, columns
df.dtypes # the dtype of each column, as a Series
df.index # the row labels (a RangeIndex(0, 4) here)
df.columns # the column labels (Index(['country', 'population_m', 'in_eurozone']))
df.values # the data as a NumPy array (avoid this; reach for .to_numpy() if you need one)
df.info() # printed summary: dtypes, non-null counts, memory usage
df.head(3) # first 3 rows
df.tail(3) # last 3 rows
df.describe() # summary statistics for numeric columns
df.info() and df.head() are the two things you run on any new DataFrame, every time. They’re how you check what you actually got, before any computation.
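One reason to be wary of dumping a whole DataFrame into NumPy: an array has a single dtype, so mixed columns get flattened to the lowest common denominator. A small sketch with the countries frame from above:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["IT", "FR", "DE", "ES"],
    "population_m": [59, 68, 84, 47],
    "in_eurozone": [True, True, True, True],
})

whole = df.to_numpy()                    # strings, ints and bools together
one_col = df["population_m"].to_numpy()  # a single numeric column

print(whole.dtype)    # object
print(one_col.dtype)  # int64
```

Converting one numeric column keeps a useful dtype; converting the whole frame gives you an array of generic Python objects.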
Selecting a column gives you a Series. Selecting a list of columns gives you a DataFrame.
This trips up beginners and is critical to internalize:
df["country"] # Series — 1-D
df[["country"]] # DataFrame — 2-D, with one column
df[["country", "population_m"]] # DataFrame, two columns
Single brackets with a string return a Series. Double brackets with a list of strings, even a list of one, return a DataFrame. Many functions accept one or the other but not both, and “I passed it df["col"] and it complained it wanted a 2-D thing” is a daily occurrence at first.
The mirror happens with rows: df.loc[3] returns a Series (a row), df.loc[[3]] returns a DataFrame (a 1-row table). We’ll cover row selection properly in lesson 28.
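When in doubt, ask the object what it is. This sketch checks the type and shape of all four selections on the countries frame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["IT", "FR", "DE", "ES"],
    "population_m": [59, 68, 84, 47],
    "in_eurozone": [True, True, True, True],
})

col = df["country"]      # single brackets: one column
frame = df[["country"]]  # double brackets: a one-column table
row = df.loc[3]          # single label: one row
one_row = df.loc[[3]]    # list of labels: a one-row table

print(type(col).__name__, col.shape)          # Series (4,)
print(type(frame).__name__, frame.shape)      # DataFrame (4, 1)
print(type(row).__name__, row.shape)          # Series (3,)
print(type(one_row).__name__, one_row.shape)  # DataFrame (1, 3)
```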
The Index — boring, important
Every DataFrame has an Index for its rows, and the column names are themselves an Index for the columns. The default row index is a RangeIndex (0, 1, 2, …), but you can use anything: dates, country codes, user IDs, multi-level combinations of all three.
df = pd.DataFrame(
    {"population_m": [59, 68, 84, 47]},
    index=["IT", "FR", "DE", "ES"],
)
df.loc["IT"] # selects the IT row by label
A MultiIndex is an index with multiple levels — e.g. (country, year), (region, store, date). It’s how you do hierarchical or pivot-table-style data in pandas.
df = pd.DataFrame(
    {"sales": [100, 110, 95, 105]},
    index=pd.MultiIndex.from_tuples(
        [("IT", 2024), ("IT", 2025), ("FR", 2024), ("FR", 2025)],
        names=["country", "year"],
    ),
)
df.loc["IT"] # all rows for IT
df.loc[("IT", 2025)] # specific (country, year)
MultiIndexes are powerful and a little fiddly. We’ll meet them again in lesson 32 when we reshape data.
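Two more moves are worth knowing before lesson 32, sketched here on the same sales frame: selecting on an inner level with xs, and flattening the levels back into ordinary columns with reset_index.

```python
import pandas as pd

df = pd.DataFrame(
    {"sales": [100, 110, 95, 105]},
    index=pd.MultiIndex.from_tuples(
        [("IT", 2024), ("IT", 2025), ("FR", 2024), ("FR", 2025)],
        names=["country", "year"],
    ),
)

# Cross-section on the inner level: every country's row for 2025.
sales_2025 = df.xs(2025, level="year")
print(sales_2025["sales"].tolist())   # [110, 105]

# A full (country, year) key plus a column name gives a single value.
print(df.loc[("IT", 2025), "sales"])  # 110

# Turn both index levels back into regular columns.
flat = df.reset_index()
print(flat.columns.tolist())          # ['country', 'year', 'sales']
```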
The opinion in 2026: don’t be precious about the index. Many modern pandas tutorials and Polars users argue you should mostly leave the index as a default RangeIndex and treat your “key” columns as regular columns. It’s a cleaner mental model. Use a meaningful index when you genuinely benefit (time series, joins by label, hierarchical aggregations); skip it otherwise.
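Here is what the two styles look like side by side, as a rough sketch: the key kept as a regular column and filtered with a boolean mask, versus the key promoted to the index and looked up by label. Moving between them is one call in each direction.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["IT", "FR", "DE", "ES"],
    "population_m": [59, 68, 84, 47],
})

# Key as a regular column: filter with a boolean mask.
by_mask = df[df["country"] == "IT"]
print(by_mask["population_m"].tolist())  # [59]

# Key as the index: look up by label.
indexed = df.set_index("country")
by_label = indexed.loc["IT", "population_m"]
print(by_label)                          # 59

# reset_index moves the key back into a column with a default RangeIndex.
restored = indexed.reset_index()
print(restored.equals(df))               # True
```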
NumPy backend vs Arrow backend, with an example
To make the backend distinction concrete:
import pandas as pd
# NumPy-backed (the historical default)
df_np = pd.DataFrame({
    "name": ["Ada", "Linus", "Grace"],
    "score": [98, None, 92],
})
print(df_np.dtypes)
# name      object
# score    float64   <-- promoted to float because of the None
# dtype: object
# Arrow-backed
df_pa = pd.DataFrame({
    "name": ["Ada", "Linus", "Grace"],
    "score": [98, None, 92],
}).convert_dtypes(dtype_backend="pyarrow")
print(df_pa.dtypes)
# name     string[pyarrow]
# score     int64[pyarrow]   <-- stays an integer, with proper NA
# dtype: object
Both DataFrames behave the same for almost every operation you’ll do; the Arrow-backed one has cleaner dtypes, faster strings, and a more honest representation of missing data. Going forward, every example in this course that doesn’t specifically demonstrate a NumPy-backend behavior will assume you’re working Arrow-backed. Set dtype_backend="pyarrow" on your read_* calls and you’re there.
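A sketch of that last step, using a small inline CSV (the `csv_text` contents are made up) so it runs without a file on disk. It assumes pandas 2.0 or later, where the read_* functions accept dtype_backend, and it skips the Arrow part if pyarrow is not installed.

```python
import io

import pandas as pd

# Hypothetical CSV with one missing score.
csv_text = "name,score\nAda,98\nLinus,\nGrace,92\n"

# Default backend: the gap forces the integer column to float64.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.dtypes)

# Arrow backend: requested directly at read time.
try:
    import pyarrow  # noqa: F401  (only checking that it is available)
    df_arrow = pd.read_csv(io.StringIO(csv_text), dtype_backend="pyarrow")
    print(df_arrow.dtypes)
except ImportError:
    df_arrow = None  # pyarrow missing: nothing to compare against
```

Asking for the backend at read time avoids a second conversion pass with convert_dtypes afterwards.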
What’s next
Now that you know what you’re holding when you hold a DataFrame, lesson 27 covers how to actually get one: pd.read_csv, pd.read_parquet, the gotchas in each format, and why Parquet is almost always the right choice whenever you aren’t forced to use CSV.
Further reading
- pandas: Intro to data structures — the official introduction, worth reading once.
- pandas dtypes — the canonical reference.
- PyArrow-backed dtypes — the modern default explained properly.
See you Tuesday for the wide world of pd.read_*.