The vernacular for disclosing pharmaceutical clinical trial results to the public may seem just as confusing as the process! Some people use “anonymization” and “redaction” interchangeably regarding transparency and disclosure of clinical trial data. But what are the differences in technique or method behind each?
We can think of “anonymization” as the process of removing, transforming, or concealing any identifying information such as patient IDs. “Redaction” means hiding information we don’t want to disclose. But what is the correct approach, and what are the differences? Does Health Canada (HC) or the European Medicines Agency (EMA) have a preference for meeting compliance requirements?
What is clinical trial data anonymization?
- Anonymization is removing, transforming, or hiding values for variables that allow direct or indirect identification of a clinical trial volunteer from the data. “Anonymization” is the umbrella term for transforming or masking data whereas “redaction” is just one method of anonymization.
- Clinical trial documents require anonymization. Sponsors can anonymize a document that they have already submitted to a health authority.
- Sponsors can also proactively anonymize a document that they are submitting. European Union Clinical Trial Regulation (EU-CTR) submissions require this as well. Read more about the challenges posed by EU-CTR in this blog.
- Clinical datasets, as either SDTM or ADaM data, also require anonymization before they are shared. Sponsors can anonymize and voluntarily disclose these datasets, but they are currently not a required part of the submission process to EMA or HC. Regulatory and medical writers use these datasets to write clinical documents, and researchers can use them for secondary or aggregate analysis.
- Quantitative anonymization requires a risk assessment against a predetermined threshold (often 0.09) to determine the probability of re-identification of a clinical trial volunteer. The number of data fields in the dataset requiring anonymization depends on the dataset’s risk score. Higher risk scores mean sponsors must anonymize more fields. Often, a statistician or an experienced transparency specialist will assist with this calculation.
- Quantitative anonymization provides high data utility for scientists in the research community! Common anonymization methods include:
- Generalization: Replace actual data values with a substitution value or numeric range. There are two types: character and numeric. Examples include:
- banding on ages (e.g., 65 is replaced with 60-70) and
- generalization of countries (e.g., United States of America is replaced with North America)
- Offset: Sets all First Date Collected values to the Anchor Date. It then shifts all other date variables to maintain respective offsets to the First Date Collected. The following parameters must be identified:
- Anchor Date (typically, a study milestone such as start date)
- First Date Collected Domain (the domain in which First Date Collected exists)
- First Date Collected Variable (the column representing First Date Collected)
- Recoding: Overwrites actual values with a randomly generated value
- Shuffling: Randomly moves values from one row to another. Examples include the shuffling of patient IDs. Preserving the relationship between original and anonymized values across datasets for the same subject is a priority. Typically, sponsors shuffle the DM (Demographics) domain.
- Redaction: Masking a data field in a document with a black box to irreversibly obscure it. Sponsors can redact individual data fields using a Word tool. They can redact entire pages or sections of documents with a Box tool. This method offers little to no data utility.
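As a rough illustration, several of the methods above can be sketched in a few lines of Python. This is a minimal sketch, not any vendor's or regulator's implementation. The function names, the `SUBJ-` ID format, the sample records, and the choice to measure risk as one divided by the size of the smallest group of records sharing the same indirect identifiers are all assumptions made for the example.

```python
import random
from collections import Counter
from datetime import date

RISK_THRESHOLD = 0.09  # threshold quoted in the list above


def band_age(age, width=10):
    """Generalization: replace an exact age with a range such as '60-70'."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def offset_dates(dates, anchor):
    """Offset: move the earliest date to the anchor and keep the gaps between dates."""
    shift = anchor - min(dates)
    return [d + shift for d in dates]


def recode_ids(ids, seed=0):
    """Recoding: give each original ID one random replacement, reused on every row."""
    rng = random.Random(seed)
    unique = sorted(set(ids))
    codes = rng.sample(range(10_000, 100_000), len(unique))  # distinct codes
    mapping = {orig: f"SUBJ-{code}" for orig, code in zip(unique, codes)}
    return [mapping[i] for i in ids], mapping


def max_reid_risk(rows, quasi_identifiers):
    """Assumed risk measure: 1 / size of the smallest group of matching records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return 1 / min(groups.values())


# Hypothetical records: twelve volunteers who share an age band and a region.
records = [{"age": 65, "region": "North America"} for _ in range(12)]
for r in records:
    r["age_band"] = band_age(r["age"])  # 65 becomes "60-70"

risk = max_reid_risk(records, ["age_band", "region"])
print("within threshold:", risk <= RISK_THRESHOLD)

visits = [date(2021, 3, 14), date(2021, 3, 20)]
print(offset_dates(visits, anchor=date(2020, 1, 1)))

new_ids, _ = recode_ids(["1001", "1002", "1001"])
print(new_ids)
```

Under this assumed measure, a threshold of 0.09 means every combination of indirect identifiers must be shared by at least twelve records, since 1/11 is already above 0.09. Reusing one mapping in `recode_ids` is what keeps the same subject linked across datasets.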
In Europe, sponsors must upload and publish clinical trial documents to the Clinical Trial Information System (CTIS). The draft EU-CTR guidance discusses protecting protected personal data (PPD) and commercially confidential information (CCI) in these documents. It states that the suggested anonymization techniques are randomization and generalization.
What is redaction?
- Redaction is an anonymization technique that masks data entirely with an overlay or black box. Think of redaction like whiting out a word on a piece of paper.
- Use redaction on PPD, CCI, and Sponsor data. Each category of redaction uses its own overlay text (e.g., PPD versus CCI).
- The overlay box must contain regulatory-authority-specific text and color. For example, Health Canada requires that redaction boxes over patient data be light blue and contain the text “PPD.”
- Redaction is a common method for masking CCI in a document. CCI is information or data confidential to the Sponsor whose disclosure may undermine the Sponsor’s legitimate economic interest or competitive position. Examples include novel developments on products or intellectual property and drug chemical identity or exact composition. You can read more about how to identify CCI in this blog.
- Redaction is the fastest and most value-centric anonymization method. It allows for maximum data protection while requiring the least amount of time and resources. On the flip side, it provides the least data utility.
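The idea of category-specific overlay text can be illustrated with a short plain-text sketch. Real redaction tools work on the document itself and must also remove the underlying text, which this example does not attempt. The patterns, the `SUBJ-` ID format, and the sample sentence are hypothetical.

```python
import re


def redact(text, patterns):
    """Replace every match of each pattern with its category label, e.g. [PPD]."""
    for label, pattern in patterns.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text


# Hypothetical patterns: one per redaction category.
patterns = {
    "PPD": r"\bSUBJ-\d{4}\b",            # assumed subject ID format
    "CCI": r"\bformulation [A-Z]\d+\b",  # assumed internal formulation code
}

sample = "Subject SUBJ-1234 received formulation X17 on day 1."
print(redact(sample, patterns))
```

Because each match is replaced wholesale by its label, the reader learns only which category of information was removed, which is why redaction offers so little data utility.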
Do regulatory health authorities prefer one type of anonymization over the other?
Sponsors can use a range of anonymization techniques while preserving data utility! Neither EMA’s Policies 0070 and 0043 nor HC’s Public Release of Clinical Information (PRCI) specifies a preferred technique.
At times, it is impossible to avoid redaction (e.g., for CCI). At other times, it may be more beneficial to anonymize by shuffling (e.g., patient IDs). In certain therapeutic areas or sensitive patient populations, the preferred method is redaction to lessen the risk of re-identification.
The submission landscape continues to change. Both the EMA and HC are more flexible in approving anonymization techniques that maximize data utility.
Do I need clinical trial data anonymization technology to help me?
The short answer is, yes! Technology can greatly reduce the manual efforts required to anonymize regulatory documents. Retrospectively anonymizing entire dossiers, for example, may involve working with tens (sometimes even hundreds) of thousands of pages of documents. These regulatory documents include Clinical Trial Protocols, Clinical Study Reports, Case Report Forms, Statistical Analysis Plans, and more.
The effort required to review these documents, page by page, is tremendous. Utilizing a technology backed by artificial intelligence, machine learning, and natural language processing will streamline this process by identifying data requiring anonymization for you. Various technology platforms can anonymize and redact your documents with the click of a button.
Technology enables accuracy and consistency across a high volume of pages that is difficult to match through manual efforts. It may also bring innovation and efficiency to your current processes by offering proactive data and document anonymization during initial drafting.
One example of this technology is our CoAuthor software. Using technology may not eliminate the human review effort. But it can make the difference between meeting a regulatory authority submission deadline or request on time and missing it.
Certara is a leading technology and services provider in the transparency and disclosure space. We can be your next partner for all your anonymization and redaction needs. Our team of experts has years of experience in successfully supporting programs across regions. They can help you meet and exceed your compliance requirements.
Read our white paper to learn how to save time and resources while disclosing clinical trial data.
This blog was originally published on October 5, 2022, and was updated on January 17, 2025.