Protecting patient privacy is of the utmost importance both during and after a clinical trial. To address this, data custodians apply de-identification methods to large datasets to help protect the anonymity of patients during data analysis.
A recent study – published in Cancer Epidemiology, Biomarkers & Prevention, a journal of the American Association for Cancer Research – looked at two de-identification methods known as k-anonymization and adding a “fuzzy factor.” The researchers found that these methods were successful at reducing the risk of an individual patient’s identity being deduced from the data.
“Researchers typically get access to de-identified data, that is, data without any personal identifying information, such as names, addresses, and Social Security numbers,” said Dr. Giske Ursin, director of Cancer Registry of Norway, Institute of Population-based Research. “However, this may not be sufficient to protect the privacy of individuals participating in a research study.”
Health data shared with clinical researchers during a study can include information on diagnoses, medications, and family histories of disease. This data is just as sensitive as personal identifying information and should be protected in the same manner.
“People who have the permission to access such datasets have to abide by the laws and ethical guidelines, but there is always this concern that the data might fall into the wrong hands and be misused,” said Ursin. “As a data custodian, that’s my worst nightmare.”
Using data composed of over five million records collected from around 900,000 women who participated in the Norwegian Cervical Cancer Screening Program, Ursin and her colleagues tested the effectiveness of the two de-identification techniques. Based on the assumption that a hacker would know that personal information about an individual was included in the dataset, the researchers used a tool called ARX to calculate the risk of re-identification.
The researchers created three different datasets during their study: D1 contained the original patient data; D2 was de-identified using the k-anonymization method by changing all dates to the 15th of their month; D3 was “fuzzied” by randomly adding or subtracting one to four months from each date.
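The two date transformations described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's actual pipeline; the function names and the example date are invented for the illustration.

```python
# Sketch of the two date-based de-identification transforms described in
# the article: generalizing dates to the 15th of the month (as in D2) and
# randomly shifting dates by one to four months (as in D3).
import random
from datetime import date

def k_anonymize_date(d: date) -> date:
    """Generalize a date by moving it to the 15th of its month (D2-style)."""
    return d.replace(day=15)

def fuzzy_date(d: date, rng: random.Random) -> date:
    """Shift a date by a random offset of 1-4 months, forward or back (D3-style)."""
    offset = rng.choice([-4, -3, -2, -1, 1, 2, 3, 4])
    month_index = d.month - 1 + offset         # zero-based month arithmetic
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    return date(year, month, min(d.day, 28))   # clamp the day to avoid invalid dates

original = date(2014, 3, 7)
print(k_anonymize_date(original))              # 2014-03-15
print(fuzzy_date(original, random.Random(0)))  # shifted by 1-4 months
```

Note that both transforms preserve the approximate spacing between events, which is why investigators can still analyze intervals between procedures.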
These date-changing methods are designed to prevent someone from being able to re-identify individuals in a study; however, trial investigators are still able to make use of the data because they know the intervals between procedure dates. “We found that changing the dates using the standard procedure of k-anonymization drastically reduced the chances of re-identifying most individuals in the dataset,” said Ursin.
Patients in the D1 dataset had an average 97.1 percent risk of being re-identified by a hacker. In comparison, for those in D2 and D3, the risk of re-identification for patient records dropped to 9.7 percent and 9.8 percent, respectively. This result indicated that the de-identification methods were successful at protecting patient data.
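The study computed these risks with the ARX tool. To illustrate the underlying idea, one common measure (often called prosecutor risk) assigns each record a risk of 1 divided by the number of records sharing its quasi-identifier values, then averages over all records. The sketch below is an assumption-laden illustration, not ARX's implementation; the records and field names are invented.

```python
# Illustrative sketch of average re-identification (prosecutor) risk:
# each record's risk is 1 / (size of its equivalence class), i.e. how many
# records share the same quasi-identifier values. Smaller classes = higher risk.
from collections import Counter

def average_reid_risk(records, quasi_identifiers):
    """Mean of 1/|equivalence class| over all records."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return sum(1 / class_sizes[k] for k in keys) / len(keys)

# Hypothetical records: two share quasi-identifiers (risk 1/2 each),
# one is unique (risk 1.0).
records = [
    {"birth_year": 1970, "screening_month": "2014-03"},
    {"birth_year": 1970, "screening_month": "2014-03"},
    {"birth_year": 1985, "screening_month": "2014-03"},
]
print(average_reid_risk(records, ["birth_year", "screening_month"]))  # 0.666...
```

Generalizing or fuzzying dates makes more records share the same quasi-identifier values, enlarging equivalence classes and driving this average risk down, which matches the drop from 97.1 percent to under 10 percent reported above.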
“Every time a research group requests permission to access a dataset, data custodians should ask the question, ‘What information do they really need and what are the details that are not required to answer their research question,’ and make every effort to collapse and fuzzy the data to ensure protection of patients’ privacy,” said Ursin. “However, given the recent trend in sharing data and combining datasets for big-data analyses – which is a good development – there is always a chance of information falling into the hands of someone with malicious intent. Data custodians are, therefore, rightly concerned about potential future challenges and continue to test preventive measures.”