Data & System Architecture, from the ground up Lesson 78 / 80

Privacy and compliance: GDPR, CCPA, data residency

Privacy regulations as architectural drivers. Right to erasure, data residency, customer-managed keys, and the consent and audit infrastructure compliance frameworks require.

Privacy regulations look like legal paperwork from the outside. From inside an architecture review, they are some of the most expensive constraints a team will ever absorb. A new privacy law in a new jurisdiction can force a rebuild of identity handling, deletion plumbing, regional deployment topology, and the audit log. The cost is not in the legal fees; it is in the engineering work that follows.

The mistake teams make is treating compliance as a check that the legal team handles after the platform is built. By the time regulators care about a system, the architecture has assumed the user’s data lives in one warehouse, gets copied into three feature stores, gets exported nightly to two analytics vendors, and gets archived monthly into immutable cold storage. None of those copies was designed with a future deletion request in mind. Going back and adding that capability is hard, slow, and produces some of the ugliest code in any data platform.

This lesson is about the privacy regulations that matter in 2026, the architectural patterns they require, and the compliance frameworks that codify the patterns into auditable controls.

The regulations that shape modern systems

A short tour. The shape of each is similar; the details differ.

GDPR, the EU’s General Data Protection Regulation, applicable since May 2018. The most influential privacy law in the world by virtue of its reach: it applies to organisations established in the EU and to any organisation that offers goods or services to, or monitors the behaviour of, people in the EU, regardless of where the organisation is based. The rights it grants users include access (give the user a copy of their data), portability (give it in a machine-readable format), rectification (let them correct it), erasure (delete it on request, with limited exceptions), and the right to object to specific kinds of processing. The principles include data minimisation (do not collect more than the purpose requires), purpose limitation (do not reuse data for a purpose unrelated to why it was collected), and storage limitation (do not keep it longer than necessary).

CCPA and CPRA, California’s privacy laws. Similar shape: rights to access, delete, and opt out of sale or sharing of personal information. The CPRA added requirements around sensitive personal information and created a dedicated regulator, the California Privacy Protection Agency (CPPA), to enforce them.

LGPD in Brazil, PIPL in China, POPIA in South Africa, the Australian Privacy Principles (APPs) in Australia, and a growing collection of US state laws (Colorado, Virginia, Connecticut, Utah, Texas, others) each carry their own version of the same core ideas with local twists. PIPL has aggressive cross-border-transfer provisions that catch foreign companies off guard.

The architecture takeaway is that “privacy” is not one regulation; it is an overlapping set of regimes the platform has to satisfy simultaneously. Most teams pick GDPR as the high-water mark and build to it, on the theory that satisfying GDPR satisfies most other regimes most of the time.

The architectural implication: the right to be forgotten

The single hardest architectural problem in privacy is right to erasure. The user submits a deletion request. Where is their data?

In a small system, “the user table” is a defensible answer. In any real platform, the data exists in a dozen places: the operational database, the data warehouse, the ML feature store, last quarter’s backup snapshot, the analytics vendor’s copy, the support tool’s case history, the marketing automation tool’s profile, the export the data team sent to a contractor in 2024. Each of those copies is a deletion target, and each has different mechanics for actually removing a record.

The architectural patterns that make this tractable, in rough order of cost:

Tag every row with the subject’s ID. Every table that contains personal data carries an explicit reference to the user it describes. Deletion becomes “find every row tagged with subject X”. Without this tag, the team has to reason about each table individually, often by joining against transient identifiers that may not survive a backup.

Tombstone propagation. When a deletion happens in the source system, a tombstone record (the user’s ID plus a “deleted at” timestamp) flows downstream through the same pipelines that originally moved the data. Each downstream system applies the tombstone. The pattern requires that pipelines are idempotent and respect tombstones, which is a non-trivial property to retrofit.
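A minimal sketch of subject tagging and tombstone application, assuming an in-memory downstream table. The `Tombstone` and `DownstreamTable` names and the `subject_id` field are hypothetical, chosen for illustration. It shows two properties the pattern depends on: replaying the same tombstone is harmless, and rows that arrive late for an already-erased subject are rejected.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Tombstone:
    subject_id: str
    deleted_at: datetime


@dataclass
class DownstreamTable:
    """Hypothetical downstream store: every row carries the subject's ID."""
    rows: list = field(default_factory=list)
    tombstones: dict = field(default_factory=dict)

    def apply_tombstone(self, stone: Tombstone) -> int:
        """Delete every row tagged with the subject; safe to call repeatedly."""
        before = len(self.rows)
        self.rows = [r for r in self.rows if r["subject_id"] != stone.subject_id]
        # Remember the tombstone so late-arriving rows are rejected as well.
        self.tombstones.setdefault(stone.subject_id, stone.deleted_at)
        return before - len(self.rows)

    def ingest(self, row: dict) -> bool:
        """Accept a row unless its subject has already been erased."""
        if row["subject_id"] in self.tombstones:
            return False
        self.rows.append(row)
        return True


table = DownstreamTable()
table.ingest({"subject_id": "u1", "event": "login"})
table.ingest({"subject_id": "u2", "event": "login"})

stone = Tombstone("u1", datetime(2026, 3, 14, tzinfo=timezone.utc))
removed_first = table.apply_tombstone(stone)  # removes u1's single row
removed_again = table.apply_tombstone(stone)  # replay removes nothing
late_accepted = table.ingest({"subject_id": "u1", "event": "replayed"})
```

Keeping the tombstone set alongside the data is a design choice in this sketch: without it, a delayed batch could silently reintroduce a deleted subject.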

Pseudonymisation. Instead of storing raw identifiers everywhere, the platform stores a pseudonymous identifier mapped through a single lookup service. Deleting the lookup entry effectively erases the link to the real person, even if downstream copies of the pseudonymous data remain. This loses some information forever (the team can no longer answer “what did this user do?”) but preserves analytical value (the team can still answer “what do users who did X look like in aggregate?”). Done well, this is the cheapest path to compliance for analytics workloads. Done poorly, it produces a fake sense of privacy because the pseudonyms are easy to re-identify.
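The lookup service can be sketched as follows. This is an illustration under assumptions, not a reference design: `PseudonymLookup` and its methods are hypothetical names, and the mapping is held in memory. The pseudonym is a random token rather than a hash of the real identifier, because a hash of an email address can be re-identified by hashing candidate addresses, which is one of the "done poorly" failure modes.

```python
import secrets
from typing import Optional


class PseudonymLookup:
    """Hypothetical lookup service: the only place real IDs and pseudonyms meet."""

    def __init__(self) -> None:
        self._by_real = {}
        self._by_pseudo = {}

    def pseudonym_for(self, real_id: str) -> str:
        """Return a stable pseudonym for the ID, creating one on first use."""
        if real_id not in self._by_real:
            pseudo = secrets.token_hex(16)  # random, not derived from real_id
            self._by_real[real_id] = pseudo
            self._by_pseudo[pseudo] = real_id
        return self._by_real[real_id]

    def resolve(self, pseudo: str) -> Optional[str]:
        """Map a pseudonym back to the real ID, or None if the link is gone."""
        return self._by_pseudo.get(pseudo)

    def erase(self, real_id: str) -> None:
        """Drop the mapping; downstream pseudonymous rows become unlinkable."""
        pseudo = self._by_real.pop(real_id, None)
        if pseudo is not None:
            del self._by_pseudo[pseudo]


lookup = PseudonymLookup()
pseudo = lookup.pseudonym_for("alice@example.com")
events = [{"subject": pseudo, "action": "checkout"}]  # what analytics stores

lookup.erase("alice@example.com")
still_linked = lookup.resolve(pseudo)  # None: the link to the person is gone
```

The analytics rows survive and still aggregate, but nothing in the platform can map them back to the person once the lookup entry is gone.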

Backup retention and exclusions. The hardest case is the immutable cold backup. The team cannot delete a single user from a backup tarball without restoring it, modifying it, and re-archiving, which is impractical. The pragmatic answer most teams adopt: backups expire on a schedule (typically 30 to 90 days), and the deletion request is honoured against backups by waiting for them to roll out of the retention window, with a documented procedure for restoring from a backup that still contains deleted data (the user is re-deleted on restore).
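The restore procedure can be sketched in a few lines. The function names and the shape of the deletion log here are assumptions for illustration. The one structural requirement the sketch relies on is that the deletion log lives outside the backup being restored; otherwise the restore would bring back an older log along with the deleted data.

```python
from datetime import datetime, timedelta, timezone


def purged_from_backups_by(deleted_at: datetime, retention_days: int) -> datetime:
    """Latest moment a backup taken before the deletion can still exist."""
    return deleted_at + timedelta(days=retention_days)


def restore_with_redeletion(backup_rows: list, erased_ids: set) -> tuple:
    """Filter a restored backup against the deletion log before it goes live."""
    kept = [r for r in backup_rows if r["subject_id"] not in erased_ids]
    redeleted = sorted({r["subject_id"] for r in backup_rows} & erased_ids)
    return kept, redeleted


backup = [
    {"subject_id": "u1", "plan": "pro"},
    {"subject_id": "u2", "plan": "free"},
]
erased = {"u1"}  # deletion log, stored outside the backup being restored

kept, redeleted = restore_with_redeletion(backup, erased)
deadline = purged_from_backups_by(datetime(2026, 3, 14, tzinfo=timezone.utc), 90)
```

With a 90-day retention window, a deletion on 14 March 2026 is fully reflected in backups by 12 June 2026, which is the kind of date the documented procedure should be able to state.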

flowchart TD
    U[Deletion request<br/>user X]
    OP[Operational DB]
    WH[Data warehouse]
    FS[Feature store]
    AN[Analytics vendor]
    SU[Support tool]
    BK[Backups<br/>retention window]
    T[Tombstone published<br/>subject ID + timestamp]
    U --> OP
    OP --> T
    T --> WH
    T --> FS
    T --> AN
    T --> SU
    T --> BK

The drawing is misleadingly tidy. In a real system, the tombstone has to flow through batch pipelines that may run daily, streaming pipelines that lag by minutes, and vendor APIs that may not accept tombstones at all. Each path is its own engineering problem, and the team’s GDPR posture is only as strong as the weakest of them.

Data residency

The second hard architectural implication is residency. Several jurisdictions require that personal data of their residents physically remain inside their borders, or at least leave only under specific conditions.

The EU is the loudest example: GDPR places conditions on transfers of personal data outside the EU/EEA. The Schrems II decision invalidated the Privacy Shield framework that previously legitimised most US transfers, and the replacement (the EU-US Data Privacy Framework) is itself contested. China’s PIPL has its own cross-border transfer regime, which may require security assessments by Chinese regulators for certain transfers. India’s DPDPA, Russia’s data-localisation law, several other regimes: each has its own rules.

The architectural consequence is that the team running a global platform cannot store everyone’s data in one region and serve everyone from it. The patterns:

Region-locked deployments. Each region has its own database, backups, analytics, log archive. Cross-region replication is intentionally restricted. EU users’ data lives in EU regions exclusively; backups go to EU buckets; access logs stay in the EU. Strongest posture, most expensive.

Region-aware data routing. A global control plane routes user requests to the appropriate region’s data plane based on residency. The data plane never communicates with the wrong region for personal data. Cloud providers offer services that make this easier (AWS Local Zones, Microsoft’s EU Data Boundary, Google’s regional service controls).
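A minimal sketch of the routing decision, assuming each user record carries a residency code. The endpoint URLs, `ResidencyError`, and both function names are hypothetical. The point it illustrates is failing closed: an unknown residency raises an error instead of falling back to a default region.

```python
class ResidencyError(Exception):
    """Raised when a request cannot be tied to a permitted region."""


# Hypothetical region endpoints; real hostnames will differ.
DATA_PLANES = {
    "eu": "https://data.eu.example.internal",
    "us": "https://data.us.example.internal",
}


def data_plane_for(residency: str) -> str:
    """Pick the data plane for a user's residency, failing closed if unknown."""
    try:
        return DATA_PLANES[residency]
    except KeyError:
        raise ResidencyError(f"no data plane for residency {residency!r}") from None


def check_write(residency: str, target_region: str) -> None:
    """Refuse to write personal data into a region other than the user's own."""
    if residency != target_region:
        raise ResidencyError(
            f"{residency} data may not be written to {target_region}"
        )


eu_endpoint = data_plane_for("eu")
check_write("eu", "eu")  # passes silently
```

A call such as `check_write("eu", "us")` raises `ResidencyError`, which is the behaviour a data plane wants at every storage boundary.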

Restricted-transfer mechanisms. When data must leave a region (typically for global SaaS vendors), the transfer happens under explicit legal mechanisms: standard contractual clauses, binding corporate rules, or the relevant adequacy decision. The architecture has to tag data with its sensitivity and origin to know which mechanism applies.
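One way to picture the tagging requirement is a lookup from (origin, destination) to the mechanism that covers the transfer, with anything unlisted blocked. The table entries, destination labels, and function names below are placeholders for illustration; which mechanism actually applies to a given transfer is a legal determination, not something the code decides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    origin: str       # jurisdiction where the data was collected
    sensitivity: str  # "personal" or "non_personal" in this sketch


# Hypothetical policy table: (origin, destination) -> covering mechanism.
TRANSFER_MECHANISMS = {
    ("eu", "eu"): "none_required",
    ("eu", "vendor_with_scc"): "standard_contractual_clauses",
    ("eu", "group_entity"): "binding_corporate_rules",
    ("eu", "adequate_country"): "adequacy_decision",
}


def mechanism_for(record: Record, destination: str) -> Optional[str]:
    """Return the covering mechanism, or None when the transfer is not covered."""
    if record.sensitivity == "non_personal":
        return "none_required"
    return TRANSFER_MECHANISMS.get((record.origin, destination))


def export(record: Record, destination: str) -> str:
    """Allow the export only when a mechanism is on file; deny by default."""
    mechanism = mechanism_for(record, destination)
    if mechanism is None:
        raise PermissionError(f"no transfer mechanism for {destination!r}")
    return mechanism


eu_profile = Record(origin="eu", sensitivity="personal")
covered_by = export(eu_profile, "vendor_with_scc")
```

Returning the mechanism name from `export` means the same call can feed the audit trail with the legal basis used for each transfer.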

A platform that ignores residency until it has to fix it is in for a long migration. Adding region-locked storage to a system that started global is structurally similar to adding multi-tenancy after the fact: doable, expensive, and visible in the system’s complexity for years afterwards.

Encryption with customer-managed keys

Several frameworks and regulators encourage or require that the customer (or the platform team, on the customer’s behalf) holds the encryption keys, not the cloud provider.

The cloud default is encryption at rest with provider-managed keys: data is encrypted, and the key sits in the provider’s KMS. This protects against stolen disks but not against the provider itself, or against an attacker who has compromised IAM credentials that can decrypt through the provider’s API.

Customer-managed keys (CMK or BYOK) put the key in a KMS the customer controls and can revoke: AWS KMS, Azure Key Vault, GCP Cloud KMS, all with customer-managed key options. Some workloads go further with external HSMs or “hold-your-own-key” patterns where the cloud provider never sees the key. The trade-off is operational complexity: lose the key, lose the data; KMS outage, data unreadable until it returns. CMK is an obligation many regulated workloads accept, not a default for general-purpose ones.
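The trade-off is easier to see in a toy model. The sketch below assumes envelope encryption (a per-object data key wrapped by a master key in the customer’s KMS), which is a common arrangement but is not specified by this lesson. `CustomerKms` is a hypothetical stand-in, not any provider’s API, and the SHAKE-256 keystream is a deliberately simple placeholder for an authenticated cipher such as AES-GCM; it should not be used to protect real data.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHAKE-256 keystream). Illustration only, not secure."""
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class CustomerKms:
    """Hypothetical customer-controlled KMS holding one master key."""

    def __init__(self) -> None:
        self._master = secrets.token_bytes(32)
        self.enabled = True  # the customer can flip this to revoke access

    def wrap(self, data_key: bytes) -> bytes:
        nonce = secrets.token_bytes(16)
        return nonce + _keystream_xor(self._master, nonce, data_key)

    def unwrap(self, wrapped: bytes) -> bytes:
        if not self.enabled:
            raise PermissionError("key disabled by customer")
        nonce, body = wrapped[:16], wrapped[16:]
        return _keystream_xor(self._master, nonce, body)


def encrypt(kms: CustomerKms, plaintext: bytes) -> dict:
    """Encrypt with a fresh data key; store only the wrapped form of that key."""
    data_key = secrets.token_bytes(32)
    nonce = secrets.token_bytes(16)
    return {
        "wrapped_key": kms.wrap(data_key),
        "nonce": nonce,
        "ciphertext": _keystream_xor(data_key, nonce, plaintext),
    }


def decrypt(kms: CustomerKms, blob: dict) -> bytes:
    """Every read goes through the KMS to recover the data key."""
    data_key = kms.unwrap(blob["wrapped_key"])
    return _keystream_xor(data_key, blob["nonce"], blob["ciphertext"])


kms = CustomerKms()
blob = encrypt(kms, b"patient record 42")
roundtrip = decrypt(kms, blob)

kms.enabled = False  # customer revokes: stored data becomes unreadable
```

After revocation, `decrypt(kms, blob)` raises `PermissionError`. That is both the control the customer wanted and the operational risk described above: a KMS outage looks exactly the same to the data plane.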

Audit and access logs

GDPR Article 5(2) and most other regimes require accountability: the platform must be able to demonstrate, on demand, who accessed what personal data, when, and for what purpose. This overlaps with the security audit log from lesson 77 but is a distinct concern with stricter retention and querying requirements.

The privacy audit log answers questions like “on 14 March 2026, this user’s record was accessed; by whom?”, “which support agents have viewed this customer’s tickets in the last 90 days?”, and “which pipelines read from the table containing this user’s row, and into which downstream tables did the data land?”

Building this after the fact is hard. Building it as part of the data plane from the start is much easier: every read of a table containing personal data emits an event; every export to a vendor is logged with the row IDs covered; every analytics query is tagged with the data subjects it touched. The volume is significant, the storage is cheap, and the regulator will eventually ask.
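A minimal sketch of a read path that cannot return personal data without emitting an audit event. The in-memory `AUDIT_LOG`, the `store` layout, and the function names are assumptions for illustration; a real sink would be append-only and access-controlled.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for an append-only, access-controlled sink


def read_personal_data(store, table, subject_id, actor, purpose):
    """Return the subject's rows and record who read them and why."""
    if not purpose:
        raise ValueError("a purpose is required to read personal data")
    rows = [r for r in store.get(table, []) if r["subject_id"] == subject_id]
    AUDIT_LOG.append(json.dumps({
        "at": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "table": table,
        "subject_id": subject_id,
        "purpose": purpose,
        "rows_returned": len(rows),
    }))
    return rows


def accesses_for(subject_id):
    """Answer the regulator's question: who touched this subject's data?"""
    events = (json.loads(line) for line in AUDIT_LOG)
    return [e for e in events if e["subject_id"] == subject_id]


store = {"tickets": [{"subject_id": "u1", "body": "refund please"}]}
rows = read_personal_data(store, "tickets", "u1", "agent-7", "support case")
history = accesses_for("u1")
```

Putting the log write inside the read function, instead of relying on callers to log, is what makes the record trustworthy: there is no code path that reads without leaving a trace.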

Consent management

Privacy is also about respecting choices: the user opted in to marketing emails but not personalised ads, opted out of sharing with third-party vendors, and revoked consent for analytics cookies. The architecture has to record those choices and propagate them to every system that acts on them.

The consent record itself is straightforward: a row per user per consent type, with a timestamp and version of the consent text shown. The hard part is propagation. Every downstream system that uses the data has to know the user’s current consent state, and a new opt-out has to reach the marketing tool, the analytics vendor, the recommendation engine, and the email sender within whatever timeframe the regulation permits (often “without undue delay”, in practice hours, not days).

Consent management platforms (OneTrust, Segment Consent Manager, Cookiebot) handle the user-facing UI and the record itself. The architectural work is the propagation layer: a stream of consent change events that downstream systems subscribe to, idempotent application of those events, and an audit log proving each system applied the change.
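The propagation layer can be sketched as versioned consent events applied by each subscriber. `ConsentEvent`, `Subscriber`, and the per-(user, type) version counter are hypothetical names and assumptions for this example. The version check is what makes application idempotent and safe against out-of-order delivery, and the sketch defaults to "not allowed" when no consent is on record, which suits opt-in regimes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentEvent:
    user_id: str
    consent_type: str
    granted: bool
    version: int  # assumed to increase with each change per (user, type)


class Subscriber:
    """Hypothetical downstream system, e.g. the email sender."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = {}        # (user, type) -> (version, granted)
        self.applied_log = []  # evidence that each change was applied

    def apply(self, event: ConsentEvent) -> bool:
        """Apply the event unless it is a duplicate or older than current state."""
        key = (event.user_id, event.consent_type)
        current = self.state.get(key)
        if current is not None and current[0] >= event.version:
            return False
        self.state[key] = (event.version, event.granted)
        self.applied_log.append(
            (self.name, event.user_id, event.consent_type, event.version)
        )
        return True

    def allowed(self, user_id: str, consent_type: str) -> bool:
        """Default to deny when no consent is on record."""
        entry = self.state.get((user_id, consent_type))
        return bool(entry and entry[1])


mailer = Subscriber("email-sender")
opt_in = ConsentEvent("u1", "marketing_email", True, version=1)
opt_out = ConsentEvent("u1", "marketing_email", False, version=2)

mailer.apply(opt_in)
mailer.apply(opt_out)
stale_applied = mailer.apply(opt_in)  # late redelivery of the old opt-in
```

The late redelivery is ignored, so the opt-out stands; `applied_log` is the per-system evidence the audit log needs.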

Compliance frameworks as architectural drivers

Beyond the privacy laws themselves, several compliance frameworks codify the controls a regulated platform needs. Each requires specific architectural decisions.

SOC 2 is the most common framework for B2B SaaS. It assesses the platform against five trust services criteria (security, availability, processing integrity, confidentiality, privacy). The controls it expects include change management with approval gates, separation of duties between development and production, encryption in the standard places, and a documented incident response process. A Type II report requires evidence that the controls actually operated over an observation period, typically 3 to 12 months.

ISO 27001 is the international equivalent, broader in scope and more prescriptive about the management system around the controls (the ISMS). Similar controls; more emphasis on the management framework.

HIPAA in the US, for protected health information. Adds requirements around PHI handling, business associate agreements with vendors, and breach notification timelines.

PCI DSS for payment card data, with network segmentation (cardholder data environment must be isolated), specific encryption standards, and quarterly vulnerability scans.

The architectural consequence is that the platform’s controls become auditable. Auditors ask for evidence: who approved this deploy, how is access reviewed, where are the encryption keys, who can read the audit log. A platform built without these controls in mind has to retrofit them when a customer’s procurement team requires the certification.
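As an example of a control expressed in a form an auditor can check, the sketch below validates a deploy record for an approval gate and separation of duties. The record fields (`author`, `approved_by`) and the function name are assumptions for illustration, not a format any framework prescribes.

```python
def deploy_violations(deploy: dict) -> list:
    """Return the control failures for one deploy record (empty means clean)."""
    problems = []
    approver = deploy.get("approved_by")
    if not approver:
        problems.append("missing approval")
    elif approver == deploy.get("author"):
        problems.append("author approved own change")
    return problems


deploys = [
    {"id": "d1", "author": "ana", "approved_by": "raj"},
    {"id": "d2", "author": "ana", "approved_by": "ana"},
    {"id": "d3", "author": "lee"},
]

# Evidence report: deploy ID mapped to any control failures found.
report = {d["id"]: deploy_violations(d) for d in deploys}
```

Run continuously against the deploy history, a check like this produces the evidence trail as a by-product instead of as a scramble before the audit.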

What good looks like

A team with a healthy privacy and compliance posture has data classified by sensitivity and tagged with the data subject’s identity. Pipelines that propagate deletions and consent changes downstream within documented timeframes. Region-locked deployments where residency requires it. Customer-managed keys for the workloads that need them. An audit log that answers regulator questions without panic. A compliance roadmap the architecture has been built to support, not retrofitted to satisfy.

None of this prevents every breach or satisfies every regulator. All of it is the difference between a platform that can defend its decisions when scrutinised and one that cannot. Lesson 53’s infrastructure-as-code discussion is relevant again: every region-lock policy, every encryption configuration, every audit-log destination is code, reviewed, version-controlled. Lesson 77’s security architecture is the foundation, but it is not sufficient: a secure platform can still fail on privacy if the team treats the regulatory layer as an afterthought.

The next lesson moves toward the social and team dimensions of architecture: how Conway’s law shapes the systems we build, and the team topologies that produce platforms that can actually support the controls and patterns this module has spent its time describing.

Citations and further reading

  • General Data Protection Regulation (Regulation (EU) 2016/679), https://eur-lex.europa.eu/eli/reg/2016/679/oj (retrieved 2026-05-01). The official consolidated text. Worth reading at least the recitals and Articles 5, 6, 17, 20, and 32.
  • California Consumer Privacy Act and California Privacy Rights Act, https://oag.ca.gov/privacy/ccpa (retrieved 2026-05-01). The official California Attorney General resource on CCPA/CPRA.
  • NIST Privacy Framework v1.0, https://www.nist.gov/privacy-framework (retrieved 2026-05-01). The NIST privacy framework, structured similarly to the cybersecurity framework, useful for translating regulation into engineering controls.
  • NIST Cybersecurity Framework 2.0, https://www.nist.gov/cyberframework (retrieved 2026-05-01). Companion to the privacy framework; its protect and detect functions cover much of the audit and access infrastructure this lesson describes.
  • CIS Controls v8, https://www.cisecurity.org/controls/v8 (retrieved 2026-05-01). The practical control list, which maps onto SOC 2 and ISO 27001 evidence requirements with reasonable fidelity.
  • European Data Protection Board, “Guidelines on data subject rights”, https://www.edpb.europa.eu/our-work-tools/general-guidance/guidelines-recommendations-best-practices_en (retrieved 2026-05-01). The supervisory authorities’ interpretations of the GDPR rights, including erasure and portability.