
How to Approach the Statistical De-Identification Process Effectively

Innovation in health care relies on the ability to figure out what the data is trying to teach us. Data analytics, including but not limited to GenAI-powered analytics, creates an insatiable demand for large, well-curated, searchable data sets. This is already a challenge: we have lots of data, but not a lot of good data. Compounding the curation challenge, a legal, policy, ethical or business risk mandate often requires that the curated data also be “de-identified.” For data sets that include Protected Health Information (PHI), de-identification must be performed in accordance with one of two methods set forth in the HIPAA regulations. And in practice, the method that typically works for data analytics is the statistical method.

The statistical method is not new. And contrary to public myth, it is not considered “less compliant” than the alternative, the so-called safe harbor method. Initially, the Office for Civil Rights, which administers HIPAA, had proposed including only the statistical method. But the regulated community wanted an easy, rinse-and-repeat standard that would not require obtaining statistical guidance in every case, which was seen as a severe transactional burden. The safe harbor method, which requires the removal of 18 enumerated fields, extends administrative ease to the regulated community, but comes at a heavy price. In many cases, the data remaining after redacting or obfuscating everything required under safe harbor de-identification is no longer fit for purpose.

Statistical de-identification is as much a tactical activity as a strategic one. There are several concrete steps the regulated community can take to get the most out of its statistical de-identification initiatives:

  • Motivation matters: Safe harbor and statistical de-identification present different strategic opportunities and compliance hurdles. Safe harbor de-identification gives a regulated party a relatively easy, self-administered method: remove the 18 enumerated fields, provided none of those fields is necessary for the intended activity. It is robotic, but also inflexible. The statistical method, in contrast, is intended to provide flexibility by looking at the actual, measurable risks of re-identification presented by a range of factors, including the data itself but also the recipient, the other information available to the recipient, and policy and contractual safeguards. It requires a governance program to make sure the parameters of the opinion are followed, but in exchange it almost always allows more data to persist into the de-identified data set.
  • Involve counsel: If this is the first time you’re doing statistical de-identification, or this statistical exercise is strategically or materially different from past opinions, the process will likely raise legal and compliance questions, and legal advice will be important.
  • Think big first: The statistical exercise is a good opportunity to involve business stakeholders to understand short- and medium-term data plans. Start by thinking about (1) the maximum data that would be helpful to persist in the de-identified data set; (2) the potential recipients of the de-identified data set, and reasonable controls around their usage; and (3) the range of possible use cases and business priorities. Working with your expert, you may need to retreat from certain data fields or purposes, but by thinking broadly at the outset, you can work more effectively with your expert.
  • More than redaction: In setting the data dictionary element of the opinion, data redaction (the removal of certain fields) is the most obvious tool. Your statistician, however, can provide more nuanced guidance, both in terms of privacy protections and retaining data utility. For example, you can explore data randomization, date shifting, adding noise to make re-identifying patterns harder to discern, synthetic data, look-alike fields, and a range of other data obfuscation techniques (a few are sketched in code after this list). Cryptographic techniques for creating private IDs must be carefully applied to ensure the IDs are not practically reversible, including by choosing appropriate cryptographic keys. Data transformation techniques need to be fit for purpose: in some cases, certain data manipulations might mean the data could not be used, for example, for certain FDA-regulated purposes. But this is part of the strategic discussion.
  • More than just tables: Statistical de-identification can be used to de-identify unstructured data, including free text, clinical notes and medical images. Technology and capabilities evolve rapidly, and unstructured data has moved from niche and only selectively tractable to a scalable option in just a few years. When considering the maximum data in the de-identified data set, it’s important to validate assumptions about what’s practically achievable so that options aren’t artificially restricted (a toy text-scrubbing sketch appears after this list).
  • Be ready to horse trade: In many cases, a well-designed statistical opinion will present you with tradeoffs on available data fields or granularity. To illustrate with a simple example, ethnicity-related data fields may be allowed, but not in certain locations where local population demographics would make them highly identifying. Instead of requiring the redaction of ethnicity or location in all cases, the opinion can permit those fields under certain parameters but “grey out” their availability in others (see the conditional-suppression sketch after this list). If you can implement the data architecture to do this, you create a menu of options for your business, allowing recipients to access certain data within a flexible framework.
  • Opinion as recipe: The data that will persist in the de-identified data set (usually called the data dictionary) is just one element in the overall opinion. The opinion will have several other ingredients — all of which matter, and you will need to comply with all of them for the opinion to be applicable. For example, the statisticians may consider the presence of certain contractual clauses or policies to be relevant to measuring risk. Or, the statistician may have taken into account the stated purpose of the de-identified data set. Just as a bread recipe wouldn’t make a loaf if you opted to forgo the yeast or ignore the water, you need to implement and comply with the opinion as a whole.
  • Build a statistical relationship: The initial lift for the opinion is the biggest. But the opinion will need to be renewed, typically every 18 months, although time frames vary. And you may find that the assumptions in the opinion need to be reviewed or changed. If your statistical expert is a strong partner, they will help you grow and adapt the opinion in line with your strategic priorities, even between renewal periods.
  • Build a crosswalk: One of the insights embedded in the HIPAA de-identification standards is the need (under either method) to refresh de-identified data over time. Institutions can implement a linking code that enables them to de-identify new data as it comes in and associate it with individuals already in the data set. Though not necessary for every purpose, longitudinal de-identified data sets are essential to many of the purposes described above (a linkage-token sketch appears after this list). Tokenization and linkage technologies can also be applied to link discrete data sets without sharing PHI or identifying elements, though it’s important to ensure the resulting linked data set still meets the HIPAA de-identification standard.
  • Data puddle or data lake: In some cases, the data you need to de-identify is discrete and will be generated on a case-by-case basis applying the opinion’s parameters. In other cases, your business may present a range of future, unspecified and/or varied data use cases. In the latter case, you may want to develop a data lake: a large, curated data set at rest that is available to provision smaller data cuts for particular projects. A well-designed opinion is equally applicable to the whole and to subsets.
  • De-identification versus data aggregation: Data aggregation is a term of art under HIPAA that involves the use of PHI from multiple covered entities for benchmarking and other joint activities. The regulated community often uses “de-identified” and “aggregated” interchangeably, but they are not the same. Make sure you know which one a particular project actually requires.
  • Invest in data tagging: Data tagging will give your organization more dexterity in the data it deems available for de-identification and will provide granularity at the field level (a field-tagging sketch appears after this list). It’s technical, operational and administrative work that might not seem glamorous, but it’s an essential building block of lucrative data sets.
  • Role of AI: It’s impossible to say anything about a health care or data topic right now without talking about AI. So we’ll just say this: AI is both a burden and a gift in de-identification. AI tools can help de-identify unstructured data (notoriously difficult) and can accelerate de-identification pipelines and data set analysis. AI can also be used to double-check statistical assumptions about residual risk. But AI tools can also change the re-identification risk calculus if they can interrogate data and surface patterns that can be leveraged for re-identification in new ways.
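
To make the obfuscation techniques above concrete, here is a minimal Python sketch of a keyed-hash private ID, per-patient date shifting and numeric noise. The key, field names and parameters are illustrative assumptions, not taken from any actual opinion; a real statistical opinion would dictate which transformations apply and with what parameters.

```python
import hashlib
import hmac
import random
from datetime import date, timedelta

# Hypothetical key; in practice it must be generated, stored and rotated
# under your security program, never hard-coded.
SECRET_KEY = b"replace-with-a-managed-256-bit-key"

def private_id(mrn: str) -> str:
    """Derive a private ID from a medical record number using a keyed
    hash (HMAC-SHA256); without the key, the mapping is not practically
    reversible."""
    return hmac.new(SECRET_KEY, mrn.encode(), hashlib.sha256).hexdigest()

def shift_date(d: date, mrn: str, max_days: int = 30) -> date:
    """Shift a date by a consistent per-patient random offset, preserving
    intervals between a patient's events while obscuring absolute dates."""
    rng = random.Random(private_id(mrn))  # deterministic per patient
    return d + timedelta(days=rng.randint(-max_days, max_days))

def add_noise(value: float, scale: float) -> float:
    """Add zero-mean Gaussian noise to blur an identifying numeric value."""
    return value + random.gauss(0.0, scale)
```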
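
For unstructured text, production systems rely on trained NLP pipelines and expert validation; the regex pass below is only a toy illustration of the scrubbing concept and would not, by itself, satisfy any de-identification standard. The patterns and placeholders are assumptions for illustration.

```python
import re

# Toy patterns; real pipelines use NER models plus expert validation.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def scrub(note: str) -> str:
    """Replace matched spans with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label.upper()}]", note)
    return note

print(scrub("Pt seen 3/14/2024, MRN: 889231, call 555-012-3456."))
# -> "Pt seen [DATE], [MRN], call [PHONE]."
```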
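
The “grey out” idea from the horse-trading bullet can be implemented as a conditional suppression rule. In this sketch, the threshold and population figures are invented; an actual opinion would specify the real parameters and geographies.

```python
# Toy rule: release ethnicity only where the local population exceeds a
# threshold; otherwise suppress ("grey out") the field. The threshold and
# population table are invented for illustration, not from any opinion.
K_THRESHOLD = 20000

population_by_zip3 = {"100": 250000, "592": 4000}  # hypothetical counts

def apply_grey_out(record: dict) -> dict:
    """Suppress the ethnicity field for records in low-population areas."""
    out = dict(record)
    if population_by_zip3.get(record.get("zip3"), 0) < K_THRESHOLD:
        out["ethnicity"] = None  # greyed out for this geography
    return out

print(apply_grey_out({"zip3": "592", "ethnicity": "X", "dx_code": "E11.9"}))
# ethnicity suppressed; dx_code persists
```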
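
A crosswalk can be built around a keyed linkage token, sketched below under simplifying assumptions. In practice, the key and the crosswalk table must stay inside the covered entity under the controls the opinion assumes; only the token travels with the de-identified data.

```python
import hashlib
import hmac

# Hypothetical key; held and managed inside the covered entity.
LINK_KEY = b"replace-with-a-managed-linkage-key"

def linkage_token(normalized_identifiers: str) -> str:
    """Keyed token (HMAC-SHA256) computed from normalized identifiers,
    e.g. 'last|first|dob'. The same patient yields the same token across
    data refreshes, so new de-identified records can be joined
    longitudinally without sharing PHI."""
    return hmac.new(
        LINK_KEY, normalized_identifiers.lower().encode(), hashlib.sha256
    ).hexdigest()

# The crosswalk (token -> internal patient ID) never leaves the
# covered entity; only the token accompanies the de-identified data.
crosswalk: dict[str, str] = {}

def register(patient_id: str, normalized_identifiers: str) -> str:
    token = linkage_token(normalized_identifiers)
    crosswalk[token] = patient_id
    return token
```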
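
Finally, field-level tagging can start as a schema annotated with an identifier category and an intended de-identification action, as in this sketch. The categories and actions are illustrative labels, not a standard taxonomy.

```python
from dataclasses import dataclass

@dataclass
class FieldTag:
    """Field-level metadata; categories and actions are illustrative."""
    name: str
    category: str     # e.g. "direct_identifier", "quasi_identifier", "clinical"
    deid_action: str  # e.g. "drop", "tokenize", "generalize", "keep"

SCHEMA_TAGS = [
    FieldTag("mrn", "direct_identifier", "tokenize"),
    FieldTag("zip", "quasi_identifier", "generalize"),
    FieldTag("dx_code", "clinical", "keep"),
]

def provisioning_plan(tags: list) -> dict:
    """Map each field to the action to apply when provisioning an extract."""
    return {t.name: t.deid_action for t in tags}

print(provisioning_plan(SCHEMA_TAGS))
# {'mrn': 'tokenize', 'zip': 'generalize', 'dx_code': 'keep'}
```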

As data demands grow, de-identification is an essential governance and strategic priority for stakeholders in the digital data economy. De-identification projects enable engineers, business leaders, compliance leaders and counsel to work together and create a conversation around data governance that pays dividends beyond the data set itself.

Jordan Collins is a results-oriented, strategic leader with over 20 years’ experience in analytic functions focused on enabling data-driven decisions at an enterprise level. He is currently the General Manager of Privacy Analytics, an IQVIA company. Privacy Analytics enables organizations to unleash the value of sensitive data for secondary purposes while managing privacy considerations. Jordan has a PhD in Philosophy from the University of Auckland, an MA in Applied Statistics from York University, an MSc in Pure Mathematics from McMaster University, and a BSc (Hon.) degree in Mathematics from Mount Allison University. Jordan has a strong analytics background, starting his career as a statistician. He has deep consulting experience with an entrepreneurial bent, having stood up his own statistical consulting practice focusing on statistical applications in healthcare as well as industrial process and business optimization. For the past 10 years he has applied these analytic skills to technical privacy challenges globally.

Jennifer Geetter is a partner in McDermott Will & Schulte’s DC office. With a practice focused primarily on the development, delivery and implementation of digital health solutions, data and research, Jennifer works closely with both adopters and developers to bring their innovative healthcare solutions to patients and providers. In order to design and deploy digital health technologies effectively, Jenn offers valuable guidance on key issues, like patient on-boarding, provider implementation, privacy and regulatory issues. She advises global life sciences, healthcare and informatics clients on legal issues attendant to digital health, biomedical innovation, research compliance, global privacy and data security laws, and financial relationship management.

This post appears through the MedCity Influencers program. Anyone can publish their perspective on business and innovation in healthcare on MedCity News through MedCity Influencers.