Automated Patient Anonymisation and Data Masking Pipeline
Problem to Solve
Health and medical researchers have long sought access to NHS data for many reasons, chief among them its
longitudinal depth. To date, privacy considerations have rightly been a barrier, as medical records are
heavily populated with sensitive Personally Identifiable Information (PII).
Moving this raw data into analytical environments creates severe security vulnerabilities, not to mention regulatory
nightmares. Manual redaction processes are completely unscalable, slow, and highly susceptible to human error.
Without a secure, automated, and real-time data masking infrastructure, healthcare providers are forced to choose
between stalling vital medical innovation or risking catastrophic compliance breaches, severe regulatory penalties,
and the exposure of sensitive patient identities.
A number of solutions have been proposed, and one of the most promising is homomorphic encryption (HE).
Although HE has moved from a purely theoretical mathematical concept into enterprise software, its widespread
adoption is still limited by heavy computational overhead and structural limitations.
The solution proposed in this project strips PII from NHS data while it is in transit to analytical environments,
enabling much-needed research to be carried out now, until HE matures into a mainstream general-purpose
technology.
This project implements and provisions an event-driven data engineering pipeline designed to resolve the conflict
between clinical data utilisation and strict patient privacy mandates. The pipeline automates the anonymisation of healthcare
data in transit, ensuring compliant datasets are available for immediate medical research while completely safeguarding
PII. This supports compliance with GDPR and the Caldicott Principles.
Objective
Design, build, and deploy a production-grade, real-time streaming anonymisation pipeline that:
• Intercepts NHS patient records at ingestion.
• Applies configurable, field-aware anonymisation strategies - pseudonymisation and redaction.
• Delivers clean, research-ready records to downstream analytical storage.
• Maintains full GDPR and NHS DSP Toolkit compliance throughout.
• Is fully automated, observable, and infrastructure-as-code driven - deployable to AWS at scale.
Architecture Diagram
To keep the diagram readable and uncluttered, it represents about 90% of the architecture. The core data flow of:
Developer Push -> GitHub Actions -> ECR -> ECS (Private Subnet) -> MSK -> Consumer/Anonymiser -> RDS
is fully shown. The remaining 10% consists of detail-level omissions that do not misrepresent the system.
They are tabulated here:
| Not Shown | Detail |
|---|---|
| DLQ | The dead-letter queue MSK topic (patient-records-dlq) and the DLQ table in RDS are core to the design but absent from the diagram |
| Consumer replicas | The diagram shows one Consumer Service box - the actual deployment runs two replicas |
| MSK brokers | The diagram shows a single MSK box - in reality it is a 2-broker, multi-AZ cluster (kafka.m5.large × 2) |
| Secrets Manager scope | The diagram labels it "DB credentials" only - it also holds PSEUDO_SECRET (the HMAC key), which is equally critical |
| KMS scope | The diagram shows KMS only on RDS and ECR - it also encrypts MSK storage, Secrets Manager values, and CloudWatch logs |
| Health endpoints | The /health/live and /health/ready HTTP endpoints on each ECS task are not represented |
Software and Cloud Engineering Trade-offs
1. Kafka (MSK) over SQS or a simpler queue
NHS record ingestion is bursty and high-volume. Kafka's consumer group model enables horizontal scaling of the processing
layer without reprocessing records, and its log retention enables replay for audit and incident investigation - both
essential in a regulated healthcare environment. SQS does not support replay.
2. Pseudonymisation as the default strategy over full redaction
Pseudonymisation preserves referential integrity - the same patient maps to the same HMAC-SHA256 token across all records,
enabling longitudinal cohort analysis. Full redaction destroys this relationship entirely. The strategy is configurable via
environment variable, allowing operators to switch to full redaction for contexts where linkage is not required.
3. ECS Fargate over Lambda
The consumer runs a continuous poll loop - a long-lived, stateful process. Lambda's stateless, time-bounded execution model
is a poor fit for Kafka consumer group membership and offset management. ECS Fargate provides persistent, scalable container
execution without EC2 management overhead.
4. Terraform over console or CDK (Cloud Development Kit)
Every infrastructure component is version-controlled, reviewable, and reproducible. Terraform is a precise, executable
specification of cloud architecture - more reliable and auditable than click-ops, and more portable than the AWS CDK, which
synthesises CloudFormation templates and so targets AWS only.
Cost Optimisations
This pipeline was fully deployed to AWS and then decommissioned via terraform destroy after successful implementation and
evidence capture - a deliberate FinOps decision.
| Resource | Cost |
|---|---|
| MSK kafka.m5.large × 2 brokers (48 hrs) | ~ $9.22 |
| RDS db.t3.micro PostgreSQL (48 hrs) | ~ $0.86 |
| ECS Fargate tasks (48 hrs) | ~ $1.50 |
| ECR storage | ~ $0.10 |
| Total one-time deployment cost | ≈ $11.70 |
| Ongoing running cost | $0.00 |
The repository, documentation, and architecture diagrams are the portfolio artifact. The design is production-ready and scales to enterprise NHS deployment under standard FinOps governance - but incurs zero persistent cloud spend as a proof of concept.
With access to an enterprise budget, I would have more scope to innovate and to build and provision better, more cost-effective infrastructure and orchestrated pipelines.
Technical Details
Presented here is a summary - for full technical details please see my GitHub Repo.
The pipeline runs entirely within a private AWS VPC. A Kafka Producer (ECS Fargate) generates synthetic NHS patient records and
publishes them to an Amazon MSK topic. A Kafka Consumer (ECS Fargate, 2 replicas) polls the raw topic, applies the anonymisation
engine, and writes clean records to both an anonymised MSK topic and a PostgreSQL RDS database. Failed records are routed to a
dead-letter queue. All infrastructure is provisioned with Terraform and deployed via GitHub Actions CI/CD.
In this v1 implementation, free-text clinical notes are scanned for PII using regex pattern matching. The pipeline is, however,
architected for spaCy / medspaCy NER integration - SPACY_MODEL, CONFIDENCE_THRESHOLD, and a full ENTITIES_TO_REDACT
entity list are already configured in config/settings.py - making NLP-based detection a straightforward next step rather than a redesign.
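As a rough illustration of the regex-based scanning described above, the sketch below scrubs email addresses, UK-style phone numbers, and titled names from a clinical note. The patterns, the function name scrub_free_text, and the sample note are hypothetical and are not taken from the repository; the project's real expressions may be stricter or broader.

```python
import re

# Hypothetical patterns for illustration only; not the project's actual expressions.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# UK-style numbers: "+44" or a leading 0, followed by 9-10 digits with optional spaces.
UK_PHONE_RE = re.compile(r"(?:\+44\s?|\b0)\d(?:\s?\d){8,9}\b")
# A title followed by one or two capitalised words, e.g. "Dr Jane Smith".
TITLED_NAME_RE = re.compile(
    r"\b(?:Mr|Mrs|Ms|Miss|Dr)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?"
)


def scrub_free_text(note: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every match of each PII pattern with the placeholder."""
    for pattern in (EMAIL_RE, UK_PHONE_RE, TITLED_NAME_RE):
        note = pattern.sub(placeholder, note)
    return note


if __name__ == "__main__":
    sample = "Dr Jane Smith called from 07700 900123, email jane.smith@example.org."
    print(scrub_free_text(sample))
```

Regexes of this kind only catch names that follow a recognisable shape, which is one reason NER-based detection is a natural next step.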
Stack:
Python 3.11 · GitHub Actions · Terraform · Amazon MSK (Kafka) · ECS Fargate · RDS PostgreSQL · ECR · KMS · Secrets Manager · CloudWatch
Anonymisation Engine
Two pluggable strategies selected via the ANONYMISATION_STRATEGY environment variable:
• Pseudonymisation - deterministic HMAC-SHA256 token replacement per field type (PAT-, NHS-, DOB-, TEL-
prefixed tokens). Same input always produces the same token within a deployment, preserving referential integrity for research.
• Redaction - full [REDACTED] replacement for all PII fields, with additional regex scanning of free-text clinical notes
for names, phone numbers, and email addresses.
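The two strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the field names, the strategy values "pseudonymise" and "redact", the 16-character token truncation, and the helper strategy_from_env are all hypothetical. Only the HMAC-SHA256 construction, the token prefixes, and the ANONYMISATION_STRATEGY and PSEUDO_SECRET names come from the description above.

```python
import hashlib
import hmac
import os
from typing import Callable, Dict, Mapping, Optional

# Hypothetical field names mapped to the token prefixes listed above.
FIELD_PREFIXES: Dict[str, str] = {
    "patient_name": "PAT-",
    "nhs_number": "NHS-",
    "date_of_birth": "DOB-",
    "phone": "TEL-",
}


def pseudonymise(record: dict, secret: bytes) -> dict:
    """Replace each PII field with a deterministic, prefixed HMAC-SHA256 token."""
    out = dict(record)
    for field, prefix in FIELD_PREFIXES.items():
        if out.get(field) is not None:
            digest = hmac.new(
                secret, str(out[field]).encode("utf-8"), hashlib.sha256
            ).hexdigest()
            out[field] = prefix + digest[:16]  # truncation length is an assumption
    return out


def redact(record: dict) -> dict:
    """Replace each PII field with a fixed [REDACTED] marker."""
    out = dict(record)
    for field in FIELD_PREFIXES:
        if field in out:
            out[field] = "[REDACTED]"
    return out


def strategy_from_env(
    environ: Optional[Mapping[str, str]] = None,
) -> Callable[[dict], dict]:
    """Pick a strategy from ANONYMISATION_STRATEGY (defaults to pseudonymisation)."""
    env = os.environ if environ is None else environ
    name = env.get("ANONYMISATION_STRATEGY", "pseudonymise").lower()
    if name == "pseudonymise":
        secret = env["PSEUDO_SECRET"].encode("utf-8")
        return lambda record: pseudonymise(record, secret)
    if name == "redact":
        return redact
    raise ValueError(f"Unknown anonymisation strategy: {name!r}")
```

Because the token depends only on the secret and the field value, two records for the same patient receive the same NHS- token, which is what keeps longitudinal linkage possible. Non-PII fields pass through unchanged.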
Reliability
Manual offset commits (enable_auto_commit=False) - offsets are committed only after successful downstream write,
preventing data loss on consumer failure. Failed records are routed to a dead-letter queue topic and persisted to a
dedicated DLQ database table with full error context for inspection and reprocessing.
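The commit-after-write behaviour can be sketched without a broker by passing the moving parts in as plain callables. Everything here is an assumption made for illustration: the function process_batch, its parameters, and the choice to treat parse and anonymisation errors as DLQ cases are not taken from the repository.

```python
import json
from typing import Callable, Iterable, Tuple


def process_batch(
    messages: Iterable[bytes],
    anonymise: Callable[[dict], dict],
    write_clean: Callable[[dict], None],
    write_dlq: Callable[[bytes, str], None],
    commit: Callable[[], None],
) -> Tuple[int, int]:
    """Process one polled batch and commit offsets only once it is safely handled.

    A record that cannot be parsed or anonymised is sent to the DLQ with its
    error and the batch carries on. If a downstream write raises, the exception
    propagates and commit() is never called, so the batch is redelivered.
    """
    ok = failed = 0
    for raw in messages:
        try:
            record = anonymise(json.loads(raw))
        except Exception as exc:  # poison record: keep it for inspection
            write_dlq(raw, repr(exc))
            failed += 1
            continue
        write_clean(record)  # may raise; offsets then stay uncommitted
        ok += 1
    commit()  # reached only after every record was written or dead-lettered
    return ok, failed
```

If the consumer is built on kafka-python, which the enable_auto_commit spelling suggests but the text does not confirm, commit would wrap KafkaConsumer.commit() and messages would be the values returned by a poll. The trade-off is at-least-once delivery: a crash between the write and the commit means some records are processed twice, so downstream writes should tolerate duplicates.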
Compliance
GDPR Articles 5, 17, 25, and 30 are addressed by design: anonymisation is mandatory in the pipeline path, raw PII never
reaches analytical storage, anonymisation strategy and version are stamped on every output record, and pseudonymisation
key rotation (via PSEUDO_SECRET replacement) invalidates all tokens to satisfy the right to erasure.
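Two of these properties are easy to show in miniature: stamping provenance onto each output record, and the effect of replacing PSEUDO_SECRET. The sketch below is illustrative only. The metadata field names, the version string, and both helper functions are hypothetical, and the token function stands in for whatever the engine actually does.

```python
import hashlib
import hmac


def token(value: str, secret: bytes) -> str:
    """Deterministic HMAC-SHA256 token for a value under a given key."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def stamp(record: dict, strategy: str, version: str) -> dict:
    """Return a copy of the record carrying anonymisation provenance fields."""
    stamped = dict(record)
    stamped["anonymisation_strategy"] = strategy  # hypothetical field name
    stamped["anonymisation_version"] = version    # hypothetical field name
    return stamped


if __name__ == "__main__":
    before = token("9434765919", b"old-key")
    after = token("9434765919", b"new-key")
    # After rotation the same NHS number maps to an unrelated token.
    print(before == after)
    print(stamp({"nhs_number": "NHS-" + after[:16]}, "pseudonymise", "1.0.0"))
```

Rotation only breaks the link to earlier tokens if the old key is destroyed rather than archived, since anyone holding it could recompute them.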
Security Posture
A post-deployment security audit identified and documented three findings (ECR tag mutability, local Terraform state, open MSK
plaintext ports) with full root cause analysis and remediation steps. None resulted in exposure of real patient data - the
pipeline operated entirely on synthetic records. Full details are in SECURITY.md in the project's GitHub repo.
Testing
153 unit tests · 89% coverage · pytest · flake8 · black · fully automated in GitHub Actions on every push.
Infrastructure
Terraform-managed: VPC, private subnets, NAT Gateway, MSK (2-broker, multi-AZ, TLS), RDS PostgreSQL 16, ECS Fargate
(producer and consumer services), ECR, KMS (single key encrypts RDS, MSK, Secrets Manager, ECR, CloudWatch), Secrets Manager,
CloudWatch (30-day log retention). All deployed to eu-west-2 (London) - the AWS region located in the UK, chosen to meet
UK data residency requirements for NHS workloads.
✨ Coda
This pipeline is designed for teams who need to share clinical data for analytics or research purposes without exposing personally identifiable information (PII).
Key capabilities:
- Generates synthetic NHS patient records (producer, version 1)
- Consumes records from Kafka in real time (consumer)
- Applies pluggable anonymisation strategies (pseudonymisation / redaction)
- Persists anonymised records to PostgreSQL with full audit trail
- Routes failed records to a dead letter queue (DLQ)
- Fully containerised with Docker Compose for local development
- CI/CD via GitHub Actions (lint → test → Docker build → push)
🏆 Milestones
For the full list of milestones and screenshots, please see the GitHub Repo README.
1. Fully Deployed AWS Infrastructure:

2. Elastic Container Cluster Services:

3. All 153 Unit Tests Passed with 89% Coverage:

4. Sample Patient Record Pre and Post Pipeline:

5. CloudWatch Monitoring - MSK Kafka:
