Overcoming privacy concerns in medical research databases
The research, conducted at MIT’s Computer Science and Artificial Intelligence Laboratory and Indiana University at Bloomington, was recently published in the journal Cell Systems.
According to the researchers, the new system reduces the chance of a privacy compromise to almost zero, addressing one of the pivotal issues facing data-sharing initiatives.
Sean Simmons, an MIT postdoc in mathematics and first author on the new paper, and Bonnie Berger, a mathematics professor at MIT and the corresponding author on the paper, told us the system is based on the ideas of differential privacy.
“The basic concept is that, by slightly perturbing analysis results, one is able to guarantee privacy for research participants,” the researchers said.
“Though these ideas had been applied to some genomic statistics, the existing technologies could not deal with the diverse ancestries present in many real world genomic data sets that are known to be critical to accurate genomic studies. Our goal was to develop methods that overcame this hurdle.”
According to the researchers, the most challenging part was determining how to overcome the effect of outliers.
“If one individual is very different from all the other individuals in a study, their inclusion can greatly affect the result, leading to privacy loss,” said Simmons and Berger. “We dealt with this by slightly modifying our definition of privacy to focus on protecting information about private disease status, a realistic goal as it is the data that is most sensitive.”
How does it work?
The system “perturbs the results” of a genomic analysis to ensure privacy, yet is still accurate enough to retain useful information.
“In particular, it allows users to determine if a particular genomic alteration is correlated with a disease of interest in a dataset, or to produce a list of locations in the genome that are highly associated with the disease,” the researchers said.
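To make the idea of perturbing a result concrete, here is a minimal sketch of the textbook Laplace mechanism from the differential privacy literature, applied to a single count. It is not the researchers' method, which additionally handles diverse ancestries. The function names, the epsilon value, and the carrier count are assumptions chosen purely for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy (hypothetical helper).

    Adding or removing one participant changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo is repeatable
    # Made-up figure: participants with the disease who carry a given variant.
    carriers_with_disease = 120
    print(private_count(carriers_with_disease, epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and stronger privacy, while a larger one keeps the released count closer to the true value. That is the trade-off between privacy and accuracy the researchers describe.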
Unlike previous methods, it can also overcome issues that cause false positives in genomic studies. Specifically, it corrects for population stratification: false positives that arise from differing ancestries in a sample.
While the research addresses privacy issues in genomic databases, the researchers said the ideas of differential privacy can be applied to almost any area where private human data is collected.
“One reason that data is not shared is due to concern over the privacy of individuals in the study,” explained Simmons and Berger. “Our approach helps overcome that particular roadblock.”