A live tracker documents 110 DMCA takedown notices from UK Biobank targeting 197 GitHub repositories in which researchers accidentally published health data from its cohort of 500,000 volunteers.
Key Takeaways
UK Biobank has granted access to 20,000 researchers worldwide under strict agreements that prohibit redistribution; the repeated accidental GitHub uploads breach those agreements.
The Guardian re-identified a participant using only approximate birth date and date of one major surgery, confirming real re-identification risk in leaked files.
Nearly half the targeted files are Jupyter or R notebooks; a quarter are genetic/genomic formats (PLINK, BOLT-LMM, BGEN) directly encoding participant genotypes.
UK Biobank uses DMCA copyright notices as its removal mechanism because the UK has no privacy-breach statute that compels platforms to act as quickly.
Takedown notices paused entirely from January through most of March 2026, restarting only after The Guardian’s investigation, suggesting enforcement is reactive, not continuous.
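The file-type breakdown above suggests a simple check a researcher could run before pushing. The following is a minimal sketch, not the tracker's method: the suffix lists are assumptions chosen for illustration (`.bed`, `.bim`, and `.fam` are the PLINK binary fileset, `.bgen` is BGEN), and `classify` and `flag_risky` are hypothetical helper names.

```python
from pathlib import PurePosixPath

# Assumed suffix lists for illustration only; they are not taken from the tracker.
NOTEBOOK_SUFFIXES = {".ipynb", ".rmd"}
# Note: ".bed" is also used by the unrelated UCSC BED format, so expect false positives.
GENETIC_SUFFIXES = {".bed", ".bim", ".fam", ".bgen"}


def classify(path: str) -> str:
    """Label a path as 'notebook', 'genetic', or 'other' based on its suffix."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in NOTEBOOK_SUFFIXES:
        return "notebook"
    if suffix in GENETIC_SUFFIXES:
        return "genetic"
    return "other"


def flag_risky(paths: list[str]) -> list[tuple[str, str]]:
    """Return (path, label) pairs for files that deserve a manual look before pushing."""
    labelled = ((p, classify(p)) for p in paths)
    return [(p, label) for p, label in labelled if label != "other"]


# Hypothetical file list, e.g. what a staged-files listing might contain.
staged = ["analysis.ipynb", "data/chr1.bgen", "README.md", "report.Rmd"]
flagged = flag_risky(staged)
for path, label in flagged:
    print(f"review before pushing: {path} ({label})")
```

A suffix check only narrows where to look; it says nothing about whether a flagged file actually contains participant data, and it misses data embedded in other file types.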
Hacker News Comment Review
Commenters argued that distributing sensitive data to 20,000 researchers globally without scalable compliance auditing makes leaks structurally inevitable, not just a policy failure.
The Jupyter notebook workflow was identified as the proximate technical cause: cell outputs silently embed data rows, and researchers push notebooks without clearing outputs or using .gitignore.
Debate surfaced over whether sanctions have any teeth: US equivalents (HHS, HIPAA enforcement) name and shame violators and impose corrective action plans, while UK Biobank appears limited to suspending access.
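The notebook problem commenters described comes from the `.ipynb` file format itself: each code cell stores its rendered outputs alongside its source, so a displayed table travels with the file. A minimal sketch of clearing those outputs with only the standard library follows, assuming the nbformat 4 JSON layout; `strip_outputs` and `strip_file` are hypothetical helper names, and the example notebook is invented for illustration.

```python
import json
from pathlib import Path


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered tables, plots, printed rows
            cell["execution_count"] = None  # reset the run counter as well
    return cleaned


def strip_file(path: Path) -> None:
    """Rewrite a .ipynb file in place without its stored outputs."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    cleaned = strip_outputs(notebook)
    path.write_text(json.dumps(cleaned, indent=1) + "\n", encoding="utf-8")


# Invented notebook: one code cell whose stored output embeds a data row.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "source": ["df.head()"],
            "metadata": {},
            "execution_count": 3,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "metadata": {},
                    "data": {"text/plain": ["   eid  birth_year\n0    1        1950"]},
                }
            ],
        }
    ],
}

cleaned = strip_outputs(example)
print(len(cleaned["cells"][0]["outputs"]))
```

Existing tools such as `nbstripout` or `jupyter nbconvert --clear-output` cover the same ground in practice. Clearing outputs only affects future commits; data already pushed stays in the repository history until that history is rewritten or the repository is removed.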
Notable Comments
@adwf: Points to a separate BBC report of all 500,000 participant records offered for sale on Alibaba, with an official UK Biobank response linked.
@captn3m0: Found additional live repos with Date of Birth columns in under five minutes, and linked Broad Institute ingestion scripts showing how decryption utilities were distributed to researchers.
@mil22: “they don’t even provide the data to the participants themselves” – volunteers whose data leaked cannot access their own records.