UK Biobank health data keeps ending up on GitHub

TLDR

  • A live tracker documents 110 DMCA takedowns by UK Biobank against 197 GitHub repos in which researchers accidentally published health data from its cohort of 500,000 volunteers.

Key Takeaways

  • UK Biobank granted access to 20,000 researchers worldwide under strict agreements prohibiting redistribution; repeated accidental GitHub uploads violate those agreements.
  • The Guardian re-identified a participant using only an approximate birth date and the date of one major surgery, confirming real re-identification risk in leaked files.
  • Nearly half the targeted files are Jupyter or R notebooks; a quarter are genetic/genomic formats (PLINK, BOLT-LMM, BGEN) directly encoding participant genotypes.
  • UK Biobank uses DMCA copyright notices as its removal mechanism because the UK has no privacy-breach statute that compels platforms to act as quickly.
  • Takedown notices paused entirely from January through most of March 2026, restarting only after The Guardian’s investigation, suggesting enforcement is reactive, not continuous.

Hacker News Comment Review

  • Commenters argued that distributing sensitive data to 20,000 researchers globally without scalable compliance auditing makes leaks structurally inevitable, not just a policy failure.
  • The Jupyter notebook workflow was identified as the proximate technical cause: cell outputs silently embed data rows, and researchers push notebooks without clearing outputs or using .gitignore.
  • Debate surfaced over whether sanctions have any teeth: US equivalents (HHS, HIPAA enforcement) name, shame, and impose corrective action plans, while UK Biobank appears limited to access suspension.
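
The notebook failure mode commenters described can be illustrated with a minimal sketch. An .ipynb file is JSON, and each code cell stores its rendered outputs next to its source, so a displayed table is committed along with the code. The sketch below assumes the nbformat v4 layout; the function names, the example notebook, and the placeholder row are hypothetical and are not taken from any leaked repository or from UK Biobank tooling.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat v4 notebook with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered tables, plots, text
            cell["execution_count"] = None  # reset the execution counter
    return cleaned


def has_outputs(notebook: dict) -> bool:
    """Report whether any code cell still carries stored outputs."""
    return any(
        cell.get("cell_type") == "code" and bool(cell.get("outputs"))
        for cell in notebook.get("cells", [])
    )


# Hypothetical notebook: one cell whose output embeds a (fake) data row.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["df.head(1)"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {
                        "text/plain": [
                            "   participant_id  birth_year\n0  0000000  1950"
                        ]
                    },
                }
            ],
        }
    ],
}

cleaned = strip_outputs(example)
print(has_outputs(example), has_outputs(cleaned))  # prints: True False
print("birth_year" in json.dumps(cleaned))         # prints: False
```

In practice the same effect is available from existing tools, for example `jupyter nbconvert --clear-output --inplace` or the nbstripout Git filter. Clearing outputs only protects future commits: a notebook that was already pushed with outputs remains in the repository history until that history is rewritten.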

Notable Comments

  • @adwf: Points to a separate BBC report of all 500,000 participant records offered for sale on Alibaba, with an official UK Biobank response linked.
  • @captn3m0: Found additional live repos with Date of Birth columns in under five minutes, and linked Broad Institute ingestion scripts showing how decryption utilities were distributed to researchers.
  • @mil22: “they don’t even provide the data to the participants themselves” – volunteers whose data leaked cannot access their own records.
