A pandas DataFrame is only useful if you can get data into it. Today we cover the pd.read_* family — the dozen or so functions that turn an external file or database query into a DataFrame, the options on each that bite people in production, and the corresponding df.to_* writers. By the end you’ll know which format to read, which arguments to set explicitly even when “the default works,” and how to deal with the fact that real-world data is full of dirty edge cases that the docs gracefully omit.
The general shape of the API is pd.read_<format>(path_or_buffer, **options). Almost every reader accepts a local path, a URL (HTTP, HTTPS, S3, GCS, Azure if pyarrow is installed), or a file-like object. They almost all accept dtype_backend="pyarrow" now. They almost all accept chunksize= for streaming large files. The differences are in the format-specific options.
CSV — the workhorse, full of traps
pd.read_csv is the function you’ll call more than any other in this course, because CSV is what other people send you. It’s also the one with the most footguns.
import pandas as pd
df = pd.read_csv("sales.csv")
That works for a clean file. It rarely is. The arguments that matter in production:
dtype — turn off type inference for known columns. By default, pandas reads each column, scans the whole thing, and guesses the dtype. This is fine for ad-hoc analysis and terrible for production pipelines: a single typo in row 50,000 turns an integer column into object, breaks downstream joins, and the bug only shows up in production. In a pipeline, declare your dtypes:
df = pd.read_csv(
"sales.csv",
dtype={
"order_id": "int64",
"customer_id": "string",
"amount": "float64",
"currency": "category",
},
)
Anything you don’t list still gets inferred. This pattern — explicit on the columns you care about, lazy on the rest — is the working compromise.
encoding — UTF-8 is the standard, but European data is often Latin-1. A CSV exported from old Excel on a German or Italian machine is frequently cp1252 (Windows Latin-1) and contains €, à, ñ, ß rendered in single bytes. UTF-8 will choke. If you see UnicodeDecodeError, try:
df = pd.read_csv("export.csv", encoding="cp1252")
# or
df = pd.read_csv("export.csv", encoding="latin-1")
If you don’t know the encoding and the file’s small, the chardet library can guess; for big files, just open it in a hex editor and look for the BOM or non-ASCII bytes.
sep — comma is the default, but European CSVs use semi-colons. Because the comma is the decimal separator in many European locales, the CSV exporter often picks ; as the field separator instead. If your “CSV” comes back as one giant column with semi-colons inside, that’s why:
df = pd.read_csv("italian_sales.csv", sep=";", decimal=",")
(Note decimal="," — yes, the decimal separator argument exists, and yes, you’ll need it.)
parse_dates — explicit, always. Pandas will not automatically parse date columns. If you ignore this, you get object dtype strings that look like dates but don’t sort, filter, or arithmetic correctly:
df = pd.read_csv(
"events.csv",
parse_dates=["created_at", "updated_at"],
date_format="ISO8601", # pandas 2.0+; faster and stricter than the old auto-parser
)
date_format="ISO8601" works for any ISO-8601 string; for non-ISO formats give the explicit strptime pattern (e.g. date_format="%d/%m/%Y").
chunksize — for files that don’t fit in memory. If you have a 50 GB CSV and 16 GB of RAM, you can’t read it whole. chunksize returns an iterator of DataFrames:
total = 0
for chunk in pd.read_csv("massive.csv", chunksize=500_000):
total += chunk["amount"].sum()
500,000 rows is a reasonable chunk size; tune it based on row width. We’ll see a full streaming example at the end.
dtype_backend="pyarrow" — for new code, just turn it on. As discussed in lesson 26, the Arrow backend gives you proper nullable types and faster strings. There’s no downside in 2026:
df = pd.read_csv("sales.csv", dtype_backend="pyarrow")
Parquet — what you should use whenever you can
CSV is the format other people send you. Parquet is the format you should use whenever the choice is yours.
Parquet is a columnar binary format from the Hadoop world (originally), now the standard for analytical data interchange. It’s columnar (so reading three columns from a thousand-column file is fast), it stores schema and dtypes (so no encoding or type-inference guessing), it’s compressed by default (typically 5-10x smaller than the equivalent CSV), and it’s the lingua franca of every modern data tool: DuckDB, Polars, Spark, Snowflake, BigQuery, Athena, every cloud data lake.
df = pd.read_parquet("sales.parquet", engine="pyarrow")
df.to_parquet("output.parquet", engine="pyarrow", compression="snappy")
A few things to know:
engine="pyarrow". The other option isfastparquet. Use pyarrow. It’s faster, more compatible, better maintained, and is what every other tool uses too.- Compression options —
snappyis the default, fast and lightweight. Usezstdfor better compression at slightly higher CPU cost; usegziponly if some downstream consumer requires it. - Reading a subset of columns is free, because of the columnar layout:
pd.read_parquet("file.parquet", columns=["a", "b"])only reads those columns from disk. - Schema is preserved, so you don’t need to re-specify dtypes on read. This alone is worth the format change.
If your team is sending CSVs around an internal pipeline, the single highest-impact engineering change you can make is “switch to Parquet.” Disk drops by 5-10x, read speed goes up by 5-20x, dtype bugs disappear. There is approximately no reason not to.
Excel — slow, sometimes necessary
Humans edit Excel files. They send them to you. You can either insist on CSVs (you will lose) or learn to read xlsx well.
df = pd.read_excel(
"report.xlsx",
sheet_name="Sales",
skiprows=3, # skip the title and blank rows that humans add at the top
header=0, # which row (after skipping) is the column header
usecols="A:F", # only these columns
)
pd.read_excel requires openpyxl for .xlsx (and xlrd for ancient .xls):
uv add openpyxl
It is slow. Reading a million-row xlsx will take a minute. If a process insists on Excel input, the standard workaround is “read the xlsx once, save as Parquet, work on Parquet thereafter.”
sheet_name=None returns a dict of {sheet_name: DataFrame}, which is useful for workbooks where each sheet is a region or month.
JSON — flat and nested
For flat JSON (a list of objects, each with the same keys), pd.read_json works:
df = pd.read_json("events.json")
JSON Lines — one JSON object per line, the standard for log/event streams — needs lines=True:
df = pd.read_json("events.jsonl", lines=True)
For nested JSON — anything where a field is itself an object or a list of objects — the right tool is pd.json_normalize. It flattens nested keys into dotted column names:
import json
import pandas as pd
with open("orders.json") as f:
raw = json.load(f)
# raw = [{"id": 1, "customer": {"name": "Ada", "country": "IT"}, "items": [...]}]
df = pd.json_normalize(
raw,
record_path="items", # explode this list to one row per item
meta=["id", ["customer", "name"], ["customer", "country"]],
)
json_normalize is the function you’ll reach for any time you call a REST API. It’s worth ten minutes with the docs and a real example to get the muscle memory.
SQL — the right way
pd.read_sql accepts either a SQL query string or a table name, plus a connection. Use SQLAlchemy for the connection — it works with every common database and is what read_sql is tested against:
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("postgresql+psycopg2://user:pass@host:5432/db")
df = pd.read_sql(
"SELECT order_id, amount, created_at FROM orders WHERE created_at >= %(since)s",
engine,
params={"since": "2025-01-01"},
parse_dates=["created_at"],
)
Always parameterize your queries (the params= argument). Concatenating user input into a SQL string is a SQL injection waiting to happen and we don’t do that here.
For very large result sets, chunksize= works the same way as read_csv:
for chunk in pd.read_sql("SELECT * FROM events", engine, chunksize=100_000):
process(chunk)
A note for 2026: a lot of analytical work has moved to DuckDB, which can query Parquet files and pandas DataFrames directly with SQL. If you’re doing in-process analytics, duckdb.query("SELECT ... FROM df").to_df() is often the cleaner answer than read_sql against a remote database. We’ll see it in passing later in the course.
Cloud storage
If pyarrow is installed, every reader accepts cloud URLs natively:
df = pd.read_parquet("s3://my-bucket/sales/2025/03/") # whole partition
df = pd.read_csv("gs://my-bucket/users.csv")
df = pd.read_parquet("az://my-container/events.parquet")
You’ll need credentials configured in the usual places (~/.aws/credentials, GOOGLE_APPLICATION_CREDENTIALS, etc.) — the same way the underlying SDK expects.
Writing — the to_* family
Every reader has a corresponding writer:
df.to_csv("out.csv", index=False) # always pass index=False unless you actually want the index in the file
df.to_parquet("out.parquet", engine="pyarrow", compression="snappy")
df.to_excel("out.xlsx", sheet_name="Results", index=False)
df.to_json("out.jsonl", orient="records", lines=True)
df.to_sql("orders", engine, if_exists="append", index=False)
The single most common mistake on writing is leaving index=True (the default), which dumps a pandas RangeIndex into the file as a phantom unnamed first column. Pass index=False unless your index carries information.
A real example: streaming a 5 GB CSV into Parquet
Putting it together — a script that reads a too-big CSV in chunks, deduplicates by an order_id, and writes a single Parquet file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
reader = pd.read_csv(
"big_orders.csv",
chunksize=500_000,
dtype={
"order_id": "int64",
"customer_id": "string",
"amount": "float64",
"currency": "category",
},
parse_dates=["created_at"],
date_format="ISO8601",
dtype_backend="pyarrow",
)
seen_ids: set[int] = set()
writer: pq.ParquetWriter | None = None
try:
for chunk in reader:
# drop within-chunk duplicates
chunk = chunk.drop_duplicates("order_id")
# drop cross-chunk duplicates
chunk = chunk[~chunk["order_id"].isin(seen_ids)]
seen_ids.update(chunk["order_id"].tolist())
table = pa.Table.from_pandas(chunk, preserve_index=False)
if writer is None:
writer = pq.ParquetWriter("orders.parquet", table.schema, compression="snappy")
writer.write_table(table)
finally:
if writer is not None:
writer.close()
This streams the CSV, never holds more than 500,000 rows in memory at once, and produces a clean Parquet file at the end. The seen_ids set is the only thing that grows with the data — and 5 million 64-bit integers is 40 MB, fine on any machine. For truly enormous files, you’d switch to a Bloom filter; for “5 GB CSV that’s a pain to handle,” this is enough.
What’s next
You can now get a DataFrame. Lesson 28 is selection and filtering — the loc / iloc / boolean-mask zoo, which is the part of pandas where the syntactic shortcuts compete with each other and a lot of clever-looking code is written for the wrong reason. We’ll find the patterns worth keeping.
Further reading
- pandas: IO tools — every reader and writer, with their full option lists.
- Apache Parquet documentation — the format, briefly.
- pyarrow.fs — the cloud filesystem layer pandas uses under the hood.
See you Friday.