A PRIVACY ANALYTICS WHITE PAPER

The De-identification Maturity Model

Authors: Khaled El Emam, PhD
Waël Hassan, PhD

Table of Contents

The De-identification Maturity Model
  Introduction
  DMM Structure
Key De-identification Practice Dimension
  P1 Ad-hoc
  P2 Masking
  P3 Heuristics
  P4 Risk-based
  P5 Governance
The Implementation Dimension
  I1 The Initial Level
  I2 The Repeatable Level
  I3 The Defined Level
  I4 The Measured Level
  The Cumulative Nature of the Implementation Dimension
The Automation Dimension
  A1 Home-grown Automation
  A2 Standard Automation
Use of the De-identification Maturity Model
  The Practice Dimension and Compliance
  Scoring
  Process Improvement
Ambiguous Cases
  Case 1
  Case 2
  Case 3
  Case 4
Mapping to the Twelve Characteristics
Quantitative Assessment Scheme
References
About Privacy Analytics

The De-identification Maturity Model

Introduction

Privacy Analytics has developed the De-identification Maturity Model (DMM) as a formal framework for evaluating the maturity of anonymization services within an organization. The framework gauges an organization's readiness and experience with respect to anonymization in terms of people, processes, technologies, and consistent measurement practices. The DMM is used as a measurement tool and enables the enterprise to implement a fact-based improvement strategy.

The DMM is a successor to a previous analysis in which we identified twelve criteria for evaluating an anonymization methodology [1], and to an earlier maturity model which we developed to assess the identifiability of data [2]. The criteria used, in both instances, were based on contemporary standards. The set of twelve criteria, although useful for general evaluation purposes, poses two challenges regarding its application: (a) how can these criteria be used to evaluate the anonymization practices of organizations, and (b) do all twelve criteria need to be implemented at once? We have now developed a maturity model based on these twelve criteria: the De-identification Maturity Model.

The DMM is intended to serve a number of purposes. It can (a) be used by organizations as a yardstick to evaluate their de-identification practices, (b) provide a roadmap for improvement, helping organizations determine what they need to do next to improve their de-identification practices, and (c) allow different units or departments within a larger organization to compare their de-identification practices in a concise and objective way.

Organizations with a higher maturity score on the DMM are considered to have better and more sophisticated de-identification practices. Higher maturity scores indicate that the organization is able to: (a) defensibly ensure that the risk of re-identification is very small, (b) meet regulatory and legal requirements, (c) share more data for secondary purposes using fewer resources (greater efficiency), (d) share higher quality data that better meets the analytical needs of the data recipients, (e) de-identify data through consistent practices, and (f) better estimate the resources and time required to de-identify a data set.

DMM Structure

The DMM has five maturity levels to describe the de-identification practices that an organization has in place, with level 1 being the lowest level of maturity and level 5 the highest.

Borrowing a term from the ISO/IEC 15504 international standard [1] on software process assessment, we assume that the scope of the DMM is an organizational unit (OU for short). An OU is a general term for entities of all sizes, from small units of a few people assigned to a specific project up to whole enterprises. We therefore deliberately do not define an OU precisely, because the definition will be case specific: it may be a particular business unit within a larger enterprise, or it may be a whole ministry of health.

The DMM is a descriptive model developed through discussions with, and experiences and observations of, more than 60 OUs over the last five years. The model is intended to capture the stages that an OU passes through as it implements de-identification services. Based on our experience, the DMM describes the evolutionary stages through which OUs achieve greater levels of sophistication and efficiency.

The DMM has three dimensions, as illustrated in Figure 1. The first dimension is the nature of the de-identification methodology that the OU has in place: this is the Key De-identification Practice dimension. The second dimension captures how well these practices are being implemented. This Implementation dimension covers elements such as proper management of de-identification practices, documentation of practices, and their measurement. The third dimension, Automation, assesses the degree of automation of the de-identification process. We examine each of these dimensions in detail below.

Figure 1: The three dimensions of the De-identification Maturity Model.

The DMM can be represented in the form of a matrix, as illustrated in Figure 2. This matrix also allows for the explicit scoring of an OU's de-identification services.

Figure 2: The de-identification maturity matrix showing all three dimensions of the DMM.

Key De-identification Practice Dimension

P1 Ad-hoc

At this level, an OU does not have any defined practices for de-identification. The OU may not even realize that de-identification is necessary. If it is recognized as necessary, it is left to database administrators or analysts to figure out what to do, without much guidance. The methods used to de-identify the data are either developed in-house (e.g., invented by a programmer) or picked up by browsing the Internet. At Practice Level 1 (P1), such methods are not proven to be rigorous or defensible. OUs at this level will often lack adequate advice from their legal counsel, have a weak or non-existent privacy or compliance office, and/or have poor communication with that office.

OUs at Practice Level 1 tend to have a lot of variability in how they de-identify data, as the type and amount of de-identification applied will depend on the analyst who is performing it, and that analyst's experience and skill. They may apply various techniques to the data, such as rudimentary masking methods or other non-reviewed approaches. The quality of the data coming out of the OU will also vary, as the extent of de-identification may be indiscriminately high or low. Some OUs at Practice Level 1 recognize the low maturity of their de-identification capacities and err on the conservative side by not releasing any data. For OUs at this level that do release data, a data breach would almost certainly be considered notifiable in jurisdictions where breach notification laws are in place.

P2 Masking

OUs at this level implement only masking techniques. Masking techniques focus exclusively on direct identifiers such as names, phone numbers, and health plan numbers. They include techniques such as pseudonymization, the suppression of fields, and randomization [3]. As we have documented elsewhere [1], masking is not sufficient to ensure that the risk of re-identification is very small for the data set. Masking is necessary, but not sufficient, to protect against identity disclosure: even if good masking techniques are used, it is possible to produce a data set with a high risk of re-identification.

According to our observations, OUs often remain at Practice Level 2 for one or a combination of three reasons: (a) masking tool vendors erroneously convince them that masking is sufficient to ensure that the risk of re-identification is small, (b) the OU needs to disclose unique identifiers (such as social security numbers or health insurance numbers) to facilitate data matching but is uncomfortable doing so and consequently chooses to implement pseudonymization, and (c) the individuals tasked with implementing de-identification lack knowledge of the area and do not know about the residual risk from indirect identifiers.

In healthcare, and increasingly outside of healthcare, this method of de-identification alone will not meet the expected standards for the protection of personal information (see the mapping to existing standards below). Therefore, OUs at Practice Level 2 will also have to go through a notification process if they experience a breach, in jurisdictions where there are breach notification laws.
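To make the masking techniques concrete, the following is a minimal sketch, not the authors' tooling, of two common operations on direct identifiers: suppressing a field outright and replacing a unique identifier with a keyed pseudonym. The record layout and the use of HMAC-SHA256 are illustrative assumptions. Note that the indirect identifiers pass through untouched, which is exactly why masking alone cannot make re-identification risk very small.

```python
import hmac
import hashlib

# Illustrative only: in practice the pseudonymization key would come from
# a key management system, never be hard-coded.
SECRET_KEY = b"replace-with-a-securely-managed-key"

def pseudonymize(identifier: str) -> str:
    """Replace a unique direct identifier (e.g., an MRN) with a keyed
    pseudonym. A keyed HMAC, unlike a plain unsalted hash, cannot be
    reversed by hashing candidate identifiers without the key."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

def mask_record(record: dict) -> dict:
    """Suppress some direct identifiers and pseudonymize the unique one.
    Indirect identifiers (age, postal/ZIP code, dates of service) are left
    untouched, illustrating why masking alone is insufficient."""
    masked = dict(record)
    for field in ("name", "phone"):              # suppression of fields
        masked.pop(field, None)
    masked["mrn"] = pseudonymize(record["mrn"])  # pseudonymization
    return masked

print(mask_record({"name": "Jane Doe", "phone": "613-555-0100",
                   "mrn": "A1234567", "age": 34, "zip": "20740"}))
```

Even with this in place, the age and ZIP code remaining in the output can still single out individuals, which is what the higher practice levels address.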

P3 Heuristics

At this level, OUs have masking techniques in place and have started to add heuristic methods for protecting indirect identifiers. Heuristic methods are rules of thumb used to de-identify data. For example, the commonly used minimum cell size of five is a rule of thumb, as is the heuristic that no geographic area with fewer than 20,000 residents will be released. Often these heuristics are simply copied from another organization that is believed to be reputable or is perceived to have good practices for managing privacy. A discussion of heuristics can be found elsewhere [4].

This is a significant improvement over Practice Level 2, in that the OU is starting to look at ways to protect indirect identifiers, such as location, age, dates of service, and types of service, that can be used in combination to identify individuals. However, heuristics have two key disadvantages: (a) they do not ensure that the risk of re-identification is very small for the OU's particular data sets, and (b) they may result in too much distortion of the data set. The primary reasons for these disadvantages are that heuristics do not rely on measurement of re-identification risk and do not take the context of the use or disclosure of the data into account.

P4 Risk-based

Risk-based de-identification involves the use of empirically validated and peer-reviewed measures to determine the acceptable re-identification risk and to demonstrate that the actual risk in the data sets is at or below this acceptable risk level. In addition to measurement, there are specific techniques that take the context of the de-identification into account when deciding on an acceptable risk level. Because risk can be measured, it is possible to perform only enough de-identification on the data to meet the risk threshold, and no more (a minimal sketch of such a risk measurement follows the P5 description below).

In addition to risk measurement, OUs at this level quantify information loss, or the amount of change made to the data. By considering these two types of measures, the OU can ensure that the data experiences minimal change while still meeting the risk threshold requirement. Of course, masking techniques are still used at this level to protect direct identifiers in the data set; these are assumed to be carried over from the lower levels of this dimension.

At this level, if practices are implemented consistently, it would be relatively straightforward to make the case that no notification is required if there is a breach of the de-identified data.

P5 Governance

At the highest level of maturity, masking and risk-based de-identification are applied as described for Practice Level 4. However, there is now a governance framework in place, as well as practices to implement it. Governance practices include performing audits of data recipients, monitoring changes in regulations, and having a re-identification response process.
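To make the P4 notion of risk measurement concrete, the sketch below computes one common and simplified risk estimate: records are grouped into equivalence classes on their quasi-identifiers, the risk of a record is taken as one over its class size, and the maximum risk is compared against a threshold such as 0.2 (equivalent to the minimum cell size of five mentioned under P3). This is our illustrative assumption about the metric, not the full methodology of [3]; the field names and threshold are invented for the example.

```python
from collections import Counter

def max_reidentification_risk(records, quasi_identifiers):
    """Estimate the maximum re-identification risk of a data set as
    1 / (size of the smallest equivalence class), where an equivalence
    class is the set of records sharing the same quasi-identifier values.
    This is one simple measure; real methodologies use several."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(classes.values())

# Illustrative data: age band and 3-character postal code are the
# quasi-identifiers; the diagnosis is the non-identifying payload.
data = [
    {"age": "30-39", "postal": "K1N", "dx": "asthma"},
    {"age": "30-39", "postal": "K1N", "dx": "diabetes"},
    {"age": "40-49", "postal": "K2P", "dx": "asthma"},
]

THRESHOLD = 0.2  # corresponds to a minimum equivalence class size of five
risk = max_reidentification_risk(data, ["age", "postal"])
print(f"max risk = {risk:.2f}; "
      f"{'OK to release' if risk <= THRESHOLD else 'more de-identification needed'}")
```

Because the third record is unique on its quasi-identifiers, the maximum risk is 1.0 and the data would need further generalization or suppression before release, which is the measure-then-transform loop that distinguishes P4 from the fixed rules of P3.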

The Implementation Dimension

The de-identification practices described above can be implemented with different levels of rigor. We define four levels of implementation: Initial, Repeatable, Defined, and Measured. Note that there is no Implementation dimension for OUs at the P1 level, because there are no de-identification practices to implement; consequently, these cells are crossed out in Figure 2.

I1 The Initial Level

At the Initial level, the de-identification practices are performed by an analyst with no documented process, no specific training, and no performance measurements in place. The practice is experiential, and the quality of the de-identification performed will depend largely on the skills and effort of the analyst performing it. This leads to variability in how well and how quickly de-identification can be performed. It also means that the OU is at risk of losing a significant amount of its de-identification expertise if these analysts leave or retire; no institutional memory is being built up to maintain the practices within the OU.

I2 The Repeatable Level

At the Repeatable level, the OU has basic project management practices in place to manage the de-identification service. This means that: (a) there are roles and responsibilities defined for performing de-identification, and (b) there is a known high-level process for receiving data, de-identifying it, and then releasing it to the data users. At this level there is a basic structure in place for de-identification. Also critical at this level is the involvement of the privacy or compliance office in helping to shape de-identification practices. Since these staff have a more intimate understanding of legislation and regulations, their input is advisable at the early stages of implementing de-identification practices.

I3 The Defined Level

The Defined level of implementation means that the de-identification process is documented and that training in it is in place. Documentation is critical because it is a requirement in privacy standards. The HIPAA Privacy Rule Statistical Method, for example, explicitly mentions documentation of the process as a compliance requirement. Training ensures that the analysts performing the de-identification will be able to do it correctly and consistently.

In order to comply with regulatory requirements, the nature of the documentation also matters. The purpose of the documentation is to demonstrate to an auditor or an investigator, in the event of a potentially adversarial situation, the precise practices used to de-identify the data. An auditor or investigator will be looking at the process documentation for a number of reasons, such as: (a) there has been a breach and the regulator is investigating it; (b) there has been a patient complaint about data sharing and concerns have been expressed in the media, resulting in an audit being commissioned; and/or (c) a client of the OU has complained that the data they are receiving from the OU are not de-identified properly. The documentation is then necessary to make the case strongly and convincingly that adequate methods were used to de-identify the data. Based on our experience, a two- or three-page policy document, for example, will generally not be considered sufficient.

I4 The Measured Level

The Measured level of implementation pertains to performance measures of the de-identification process being collected and used. Measures can be based on tracking of the data sets that are released and of any data sharing agreements. For example, the OU can examine trends in data releases and their types over time to enable better resource allocation; for instance, overlaps in data requests could lead to the creation of standard data sets. Other measures can include data user satisfaction surveys and measures of response times and delays in getting data out.

The Cumulative Nature of the Implementation Dimension

The levels in the Implementation dimension are cumulative, in that it is difficult to implement a higher level without having implemented a lower one. For example, meaningful performance measures will be difficult to collect without having a defined process.

The Automation Dimension

Automation is essential for scalability. Any data set that is not trivial in size can only be de-identified using automated tools. Automation becomes critical as data sets grow larger and as de-identification needs to be performed regularly (as with data feeds). Without automation, it will be difficult to believe that there is a de-identification process in place.

A1 Home-grown Automation

An OU may attempt to develop its own scripts and tools to de-identify data sets. Based on our experience, solutions developed in-house tend to have fundamental weaknesses or to be incomplete. For example, some OUs may try to develop their own algorithms for creating pseudonyms; we have sometimes seen OUs develop their own hashing schemes for pseudonyms. We strongly advise against this, because such schemes will almost certainly not work correctly or will be easy to re-identify (one should only use NIST-approved hashing schemes). Even seemingly straightforward masking schemes can be reversed if not constructed carefully [1]; it takes a considerable amount of expertise in this area to construct effective masking techniques. We have also observed home-grown de-identification scripts that distort the data too much, because they focus on modifying data to protect privacy without considering data utility.

A2 Standard Automation

Standard automation means adopting tools that have been used more broadly by multiple organizations and have received scrutiny. These may be publicly available (open source) or commercial tools. The advantage of standard tools is transparency, in that their algorithms have likely been reviewed and evaluated by a larger community: any weaknesses have been identified and flagged, and the developers of the tools have been made aware of them and have likely addressed them.

Data masking is not a trivial task. Developing pseudonymization or randomization schemes that are guaranteed not to be reversible (or at least to have a low probability of reversal) requires significant knowledge in the area of disclosure control. The measurement of re-identification risk is an active area of research among statisticians, and the de-identification of indirect identifiers is a complex optimization problem. The metrics and algorithms of any tools adopted should be proven to be defensible.

Use of the De-identification Maturity Model

The Practice Dimension and Compliance

Across multiple jurisdictions, OUs at levels P1 and P2 in the Practice dimension are generally not compliant with de-identification regulations. Practice Level 3 OUs may pass an audit, but with major findings, because they will not be able to provide objective evidence that their de-identification techniques ensure that the re-identification risk is very small (since there is no measurement in place). The outcome of an audit or an investigation will depend on how strict the auditor is, but in general, such data cannot defensibly be claimed to be de-identified. We therefore recommend that Practice Level 3 be treated as a transitional stage on the way to higher Practice levels.

Figure 3: An example of scoring P2-I3-A1 on the maturity matrix.

Scoring

An OU scored on the DMM receives three scores: a Practice score, an Implementation score, and an Automation score. For example, an OU that has purchased a data masking tool and implemented it, and has thoroughly documented the data masking process and its justifications, would be scored at P2-I3-A2. If the masking tool was home-grown, but with the same level of documentation and training, then the score would be P2-I3-A1. This is illustrated on the maturity matrix in Figure 3, where the check marks indicate the three dimensions of the score.

The absolute minimum score needed to make any defensible claim of compliance with current standards is P4-I3-A1.

We deliberately did not attach weights to the dimensions, as we do not have a parsimonious way of doing so. Based on our experience, we would argue that an OU is best off improving its Practice score first, then focusing on Automation, followed by Implementation. Without the appropriate de-identification practices (the Practice dimension) in place, adequate privacy protection is impossible; the OU must first have reasonable de-identification practices. Then, automation is necessary for the de-identification of data sets of any substantial size. Finally, the Implementation dimension should receive attention. Therefore, the priority ranking is Practice > Automation > Implementation.

Improvements beyond the minimum required for compliance may be motivated by a drive to achieve certain outcomes, such as finding efficiencies, improving service quality, and reducing costs. For example, further improvements allow for the release of higher quality data, the release of larger volumes of data, faster data releases, and lower operating costs.
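A compact way to keep the three scores together, along with the P4-I3-A1 compliance floor noted above, is sketched below. The representation is our own illustration; the DMM itself defines no data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DMMScore:
    """A DMM assessment: Practice (1-5), Implementation (1-4), Automation (1-2).
    Illustrative representation only."""
    practice: int
    implementation: int
    automation: int

    def __str__(self) -> str:
        return f"P{self.practice}-I{self.implementation}-A{self.automation}"

    def meets_compliance_floor(self) -> bool:
        # P4-I3-A1 is the minimum for a defensible claim of compliance.
        return (self.practice >= 4 and self.implementation >= 3
                and self.automation >= 1)

score = DMMScore(practice=2, implementation=3, automation=2)
print(score, "compliant:", score.meets_compliance_floor())  # P2-I3-A2 compliant: False
```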

In large OUs there may be multiple departments and units performing de-identification, and each of these may have a different DMM score. These scores may be quite divergent, reflecting significant heterogeneity within the OU, which makes it challenging to compute or present an OU-wide score. There are three ways to approach such heterogeneity: (a) present a range for each of the three scores to reflect that heterogeneity, (b) present the average for each score, or (c) define multiple department or unit profiles and present a separate assessment for each.

The third approach requires some explanation. In a case where there are two extreme sets of practices within an OU, some very good and some very poor (according to the maturity model), it would be difficult to present a coherent picture of the whole OU. Those departments or units with high maturity scores would be represented, characterized, and described in a Pioneers profile; those with low maturity scores would be represented by a Laggards profile. Each profile's scores on the three DMM dimensions would be an average or a range of the scores of the departments or units represented.

Process Improvement

The DMM provides a roadmap for an OU to improve its de-identification practices. The Practice dimension target for an OU that only occasionally uses and discloses data for secondary purposes would be Practice Level 4; for an OU that uses and discloses data for secondary purposes on a regular basis, the target would be Practice Level 5.

Process improvement plans generally address all three dimensions, which may be worked on simultaneously or in a staggered fashion, depending on resources.

It is entirely reasonable for an OU to skip certain Practice Levels. Recall that the DMM characterizes the natural evolution of de-identification practices that an OU goes through. In a deliberate process improvement context, an OU may skip Practice Levels to move directly to the targeted state. For example, an OU at Practice Level 1 (Ad-hoc) would not deliberately move to a still non-compliant Practice Level 2 (Masking), but directly to Practice Level 4 (Risk-based). As noted above, however, the Implementation Levels (Initial, Repeatable, Defined, and Measured) are cumulative; skipping Implementation Levels is not recommended and will not work very well.

Ambiguous Cases

We consider below some grey areas or edge cases that have been encountered in practice. These are intended to help interpret the DMM.

Case 1

Assume that an OU is using a masking technique, but that technique is known to be reversible. For example, the method used for creating a pseudonym from a social security number or medical record number has weaknesses such that one can infer the original identifier from the pseudonym. Weaknesses with other masking techniques have been described elsewhere [1]. Would that OU be considered at Practice Level 1 (Ad-hoc) or Practice Level 2 (Masking)? In general, we would consider this OU to still be at Practice Level 1, since the approach used for masking would be considered ad-hoc. To be a true Practice Level 2 OU, the masking methods must be known to be strong, in that they cannot be easily reverse engineered by an adversary.

Case 2

A Practice Level 2 (Masking) OU has implemented pseudonymization but has not implemented other forms of masking on direct identifiers. Would that OU still be considered at Practice Level 2? If all of the direct identifiers have not been determined and masked properly, then this OU is considered to be at Practice Level 1 (Ad-hoc); pseudonymization by itself may not be sufficient. Also, OUs will sometimes create pseudonyms for some direct identifiers but not for others, a form of partial pseudonymization. Again, this would still keep the OU at Practice Level 1.

Case 3

A series of algorithms and checklists has been developed and implemented by an OU's staff to de-identify indirect identifiers. The algorithms always de-identify the data sets the same way and do not take the context into account. Because the context is not accounted for, this would be considered a Practice Level 3 (Heuristics) OU. Being able to adjust the parameters of the de-identification to account for the data use and disclosure context is essential to the DMM definition of Practice Level 4 (Risk-based).

Case 4

An OU needs to disclose data to allow the data recipient to link the data set with another data set. The two data sets do not have a unique identifier in common, so probabilistic linkage is necessary. This means that indirect identifiers such as the patient's date of birth, postal code, and date of admission need to be disclosed without any de-identification. The necessity to disclose fields suitable for probabilistic linkage does not, however, mean that the data set can be considered to have a very small risk of re-identification: unless the data custodian can demonstrate that the measured risk is very small, no such claim can be made. Furthermore, there are alternative approaches for secure probabilistic matching that do not require the sharing of the indirect identifiers.

Mapping to the Twelve Characteristics

In an earlier analysis we documented twelve characteristics of anonymization methodologies that are mentioned in contemporary standards, such as guidance and best practice documents from regulators [1], [5]. The mapping below indicates how the DMM maps to these characteristics. There are three conclusions one can draw from it.

First, the DMM covers the twelve characteristics and is therefore consistent with existing standards.

Second, the DMM covers some practices that are not mentioned in the standards. There are two main reasons for this: (a) the standards describe a high-maturity OU, whereas the DMM covers low-maturity as well as high-maturity OUs, so we discuss some of the practices we see in low-maturity OUs; and (b) the standards do not cover some practices that we believe are critical for the effective implementation of de-identification, such as automation (the Automation dimension) and performance measurement (the I4 level). These are practical requirements that enable the scaling of de-identification, and we have observed that large-scale de-identification is becoming the norm because of the volume of health data being used and disclosed for secondary purposes.

Third, the union of the current standards describes a P5-I3-A1 OU. As noted above, we consider P5 practices most suitable for OUs that perform a significant amount of de-identification, and the standards do not discuss performance measurement or automation.

MAPPING THE DMM TO THE TWELVE CHARACTERISTICS

Criterion: Is the methodology documented?
Mapping: I3. This is a requirement of the Defined Level of the Implementation dimension.

Criterion: Has the methodology received external or independent scrutiny?
Mapping: P5. This is part of governance.

Criterion: Does the methodology require and have a process for the explicit identification of the data custodian and the data recipients?
Mapping: P2 onwards. Customizing the fields to mask or to de-identify can depend on the data recipient.

Criterion: Does the methodology require and have a process for the identification of plausible adversaries and plausible attacks on the data?
Mapping: P4. This kind of practice would mostly be relevant in a risk-based de-identification approach.

Criterion: Does the methodology require and have a process for the determination of direct identifiers and quasi-identifiers?
Mapping: This is generally considered in the structure of the Practice dimension.

Criterion: Does the methodology have a process for identifying mitigating controls to manage any residual risks?
Mapping: P4. This kind of practice would mostly be relevant in a risk-based de-identification approach.

Criterion: Does the methodology require the measurement of actual re-identification risks for different attacks from the data?
Mapping: P4. This kind of practice would mostly be relevant in a risk-based de-identification approach.

Criterion: Is it possible to set, in a defensible way, re-identification risk thresholds?
Mapping: P4. This kind of practice would mostly be relevant in a risk-based de-identification approach.

Criterion: Is there a process and template for the implementation of the re-identification risk assessment and de-identification?
Mapping: P4. This kind of practice would mostly be relevant in a risk-based de-identification approach.

Criterion: Does the methodology provide a set of data transformations to apply?
Mapping: I3. A documented methodology with appropriate training would explain the data transformations.

Criterion: Does the methodology consider data utility?
Mapping: P2 and P4. Masking methodologies usually have a subjective measure of data utility, and risk-based methodologies a more objective measure.

Criterion: Does the methodology address de-identification governance?
Mapping: Elements of governance are covered in Practice Level 5 and also in Implementation Level 3.

Quantitative Assessment Scheme

We now discuss a scoring scheme for the maturity model. This is the scheme we have been using, and we have found that it provides results with face validity. Recall that the objective of using the DMM is often to identify weaknesses and put in place a roadmap for improvement; a scoring scheme that supports that objective would be considered acceptable.

An assessment can be performed by the OU itself or by an external assessor. It consists of a series of questions for each of the dimensions, each answered Yes (scored 4), No (scored 0), or Planned (scored 1). For each dimension in the DMM, start with the Level 2 questions: score them and take their average. If the average score is less than or equal to 3, then the OU is at Level 1 on that dimension. If the average is greater than 3, proceed to the next level's questions, and so on. (A sketch of this procedure in code follows the checklists below.)

KEY DE-IDENTIFICATION PRACTICE DIMENSION

Level 2 - Masking
- Does the OU's de-identification methodology require and have a process for the determination of direct identifiers and quasi-identifiers?
- Does the OU have practices for either deleting direct identifiers or creating pseudonyms from unique direct identifiers (such as MRNs, health insurance numbers, SSNs, or SINs)?
- Does the OU have practices for the randomization of other non-unique direct identifiers?
- Is data utility explicitly considered when deciding which direct identifiers to mask and how?
- Does the OU require and have a process for the explicit identification of the data custodian and the data recipients?

Level 3 - Heuristics
All of the Level 2 questions, plus:
- Does the OU have a checklist of indirect identifiers to always remove or generalize (similar in concept to the US HIPAA Privacy Rule Safe Harbor list)?
- Does the OU use general rules of thumb to generalize or aggregate certain indirect identifiers (e.g., never release more than three characters of the postal code) that are always applied for all data releases?

Level 4 - Risk-based
All of the Level 2 questions, plus:
- Does the OU require and have a process for the identification of plausible adversaries and plausible attacks on the data?
- Does the OU have a process for identifying mitigating controls to manage any residual risks?
- Does the OU require the measurement of actual re-identification risks for different attacks from the data?
- Is it possible to set, in a defensible way, re-identification risk thresholds?
- Is there a process and template for the implementation of the re-identification risk assessment and de-identification?

Level 5 - Governance
All of the Level 4 questions, plus:
- Does the OU conduct audits of data recipients (or require third-party audits) to ensure that the conditions of data release are being satisfied?
- Does the OU have an explicit process for monitoring changes in relevant regulations and precedents (e.g., court cases or privacy commissioner orders)?
- Is there a process for dealing with an attempted or successful re-identification of a released data set?
- Has the OU subjected its de-identification practices to external review and scrutiny to ensure that they are defensible?
- Does the OU commission re-identification testing to ensure that realistic re-identification attacks will have a very small probability of success?
- Does the OU monitor multiple data releases to the same recipients for overlapping variables that may increase the risk of re-identification?

IMPLEMENTATION DIMENSION

Level 2 - Repeatable
- Does the OU have someone responsible for de-identification?
- Does the OU have clearly defined expected inputs and outputs for the de-identification?
- Are there templates for data requests and for de-identification reports, certificates, and agreements?
- Is the privacy or compliance function involved in defining or reviewing the de-identification practices?

Level 3 - Defined
- Is the de-identification process documented?
- Does the OU have evidence that the documented process is followed?
- Are all analysts who need to perform or advise on de-identification activities receiving appropriate training?

Level 4 - Measured
- Does the OU collect and store performance data, for example, on the number and size of de-identified data sets, the types of fields, the types of data recipients, and how long de-identification takes?
- Is trend analysis performed on the collected performance data to understand how performance is changing over time?
- Are actions taken by the OU to improve performance based on the trend analysis?
- Are satisfaction surveys of data recipients conducted and analyzed?

AUTOMATION DIMENSION

Level 2 - Standard Automation
- Does the OU use off-the-shelf data masking tools?
- Does the OU use off-the-shelf data de-identification tools?
- Is the functioning of the masking tools transparent (i.e., documented and reviewed)?
- Is the functioning of the de-identification tools transparent (i.e., documented and reviewed)?
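The level-determination procedure described above is mechanical enough to express in code. The following is a minimal sketch under the stated rules (Yes = 4, No = 0, Planned = 1; advance past a level only when the average score exceeds 3). The function name and data layout are our own, not part of the assessment scheme.

```python
# Scores for each response, as defined in the assessment scheme.
RESPONSE_SCORES = {"yes": 4, "no": 0, "planned": 1}

def dimension_level(responses_by_level: dict) -> int:
    """Determine an OU's level on one DMM dimension.

    responses_by_level maps each assessed level (starting at 2) to the
    list of Yes/No/Planned answers for that level's questions. Starting
    at Level 2, the OU advances past a level only if the average score
    of that level's answers exceeds 3; otherwise it sits one level below.
    """
    level = 1
    for assessed_level in sorted(responses_by_level):
        answers = responses_by_level[assessed_level]
        avg = sum(RESPONSE_SCORES[a.lower()] for a in answers) / len(answers)
        if avg <= 3:
            break
        level = assessed_level
    return level

# Example: Level 2 masking answers average 3.4 (> 3), but the Level 3
# answers average well below 3, placing the OU at Practice Level 2.
practice_answers = {
    2: ["yes", "yes", "yes", "planned", "yes"],
    3: ["yes", "no", "no", "no", "no", "no", "no"],
}
print("Practice level:", dimension_level(practice_answers))  # Practice level: 2
```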

References

1. International Standards Organization, http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=60555. Accessed April 2013.
2. K. El Emam, Risky Business: Sharing Health Data while Protecting Privacy. Trafford, 2013.
3. K. El Emam, "Risk-based De-identification of Health Data," IEEE Security and Privacy, vol. 8, no. 3, pp. 64-67, 2010.
4. K. El Emam, Guide to the De-identification of Personal Health Information. CRC Press (Auerbach), 2013.
5. K. El Emam, "Heuristics for De-identifying Health Data," IEEE Security and Privacy, pp. 72-75, 2008.
6. K. El Emam, The Twelve Characteristics of a De-identification Methodology. Privacy Analytics Inc.

Contact Us

Privacy Analytics, Inc.
800 King Edward Avenue, Suite 3042
Ottawa, ON K1N 6N5
Telephone: +1.613.369.4313
Email: info@privacyanalytics.ca
www.privacyanalytics.ca

Copyright 2013 Privacy Analytics, Inc. All Rights Reserved.

About Privacy Analytics

Privacy Analytics Inc. is a world-renowned developer of data anonymization solutions. Its proprietary, integrated de-identification and masking software, PARAT, enables secondary users of personal information to have granular de-identified data that retains a high level of usefulness. For health information, PARAT operationalizes the HIPAA Privacy Rule De-identification Standard, enabling users to quickly and efficiently anonymize large quantities of data while complying with legal standards.

Privacy Analytics resulted from the commercialization of the research efforts of the Electronic Health Information Laboratory (EHIL) of the University of Ottawa, which over the past decade has produced more than 150 peer-reviewed papers. The work of EHIL has been influential in the development of regulations and guidance worldwide.

With PARAT software, organizations can realize the value of the sensitive data they hold while still protecting the privacy of individuals when conducting critical research and complex analytics. They can facilitate innovation for the improvement of society and still meet the most stringent legal, privacy, and compliance regulations.

Privacy Analytics offers education dedicated to data anonymization and re-identification risk management: Sharing Personal Health Information for Secondary Purposes: An Enterprise Risk Management Framework.

Privacy Analytics is proud to be recognized by the Ontario Information and Privacy Commissioner as a Privacy by Design Organizational Ambassador. The company is committed to the Privacy by Design objectives of ensuring privacy and gaining personal control over one's information and, for organizations, gaining a sustainable competitive advantage.