Collecting and accessing data is critical to meeting the mission of government. Data provides the basis for informed decision-making and advanced analytics that influence everything from logistics to policy and population health. However, when data contains personally identifiable information (PII) or protected health information (PHI), complicating questions arise: how can we use data that is highly representative of real-world datasets while protecting the sensitivity of PII/PHI? One solution for balancing data utility and privacy protection is using synthetic data in place of real-world data. Synthetic data is artificially generated data that is not linked to real people, events, or circumstances but mirrors real-world information and preserves the statistical properties of an original dataset. As government increasingly looks to synthetic data to enable artificial intelligence applications (e.g., generating data to train models), accelerate software development (e.g., generating data for edge cases to proactively ensure robust software), and safely democratize access to sensitive population health data (e.g., generating realistic data to securely facilitate population health research), best-practice policies and risk mitigations must be established. This workshop sought to broaden the conversation toward a comprehensive synthetic data policy that would advance, empower, and protect our government as synthetic data use grows.
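The idea of preserving statistical properties without linkage to real individuals can be illustrated with a minimal sketch. The example below (an assumption for illustration, not a method discussed at the workshop) fits the mean and covariance of a toy two-column "real" dataset and then samples entirely new records from the fitted distribution; the field names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" dataset: 1,000 records with two correlated
# numeric fields (e.g., age and an annual-visit count).
real = rng.multivariate_normal(
    mean=[50.0, 6.0],
    cov=[[100.0, 12.0], [12.0, 4.0]],
    size=1000,
)

# Estimate the statistical properties of the original data ...
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# ... then draw entirely new records from the fitted distribution.
# No synthetic row corresponds to any real individual, yet the
# column means and correlations closely track the original data.
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1000)

print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
```

Real generators (and the privacy guarantees around them) are far more sophisticated, but the core trade captured here is the same: the closer the synthetic distribution tracks the original, the higher the utility, and the more carefully leakage must be assessed.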
LMI Chief Technology Officer Sharon Hays opened the workshop with remarks on data privacy, noting that for many the topic immediately evokes healthcare data and the laws protecting PII and PHI. Data privacy, however, goes much further: we rarely think about the personal information we provide through everyday use of smartphones and social media applications. Following Dr. Hays’ opening remarks, Col Bobby Saxon, deputy chief information officer for CMS, took the stage to discuss the ACT-IAC Health Community of Interest and its recent commissioning of a white paper on synthetic data policy opportunities, then introduced the workshop’s first keynote speaker, Kshemendra Paul, chief data officer for the VA.
As the first keynote, Mr. Paul discussed the importance of using data to drive better outcomes for veterans, noting that the VA’s top priority is suicide prevention. The key statistical aggregates the VA uses in its mission include demographic data such as location, utilization of benefits and services, periods of service, social determinants of health, and clinical cohort information. Fully realizing the analytic potential of the VA’s data in service of veterans, he argued, must start with removing the barriers that keep the larger research community from accessing the VA’s rich data sources: “We must overcome the cultural and siloed stovepipes that prevent data access to the research community.”
Keynote speaker Carolyn Clancy built on this narrative by sharing the importance of synthetic data in the VHA’s COVID-19 mitigation strategy and its role in continuing operations for the veteran community during the pandemic. With normal operations disrupted, the VHA needed rapid access to data to create new operating procedures. Synthetic data addressed the issue of data scarcity by creating new data while protecting the sensitivity of the original data. Dr. Clancy went on to share that the VHA is a learning health care system and that synthetic data not only allows rapid access to data for fast and effective decision-making, but also accelerates learning, the testing of new ideas, and the linking of learning to workforce training.
During the first panel session, Assessing the Current State, Haley Hunter-Zinck, Ph.D., of Sage Bionetworks discussed the definition of synthetic data, key opportunities for its application, its critical vulnerabilities, and current state-of-the-art methods in today’s health communities. The National Institute of Standards and Technology’s Gary Howarth, Ph.D., spoke about the tradeoff between data privacy and data utility and how the field is developing validation methods to quantify both the accuracy and the privacy protection of synthetic data. The U.S. Census Bureau’s Roland Rodriguez engaged attendees on the Bureau’s history of using synthetic data, highlighting awareness and mitigation of the risks across government. Glenn Schmitz of the Virginia Department of Behavioral Health and Developmental Services educated attendees on the role of synthetic data in advancing mental health treatment research.
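The privacy–utility tradeoff the panel described can be made concrete with a toy validation sketch. The checks below are simplified stand-ins (my assumption, not metrics named by the panelists): a utility score comparing column means, and a naive privacy score based on how close each synthetic record sits to its nearest real record.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a real dataset and a synthetic copy drawn from
# the same assumed distribution -- illustration only.
real = rng.normal(loc=[50.0, 6.0], scale=[10.0, 2.0], size=(500, 2))
synthetic = rng.normal(loc=[50.0, 6.0], scale=[10.0, 2.0], size=(500, 2))

# Utility check: how closely do per-column means match?
utility_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()

# Naive privacy check: distance from each synthetic record to its
# nearest real record; exact or near-duplicates would hint at leakage.
dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
nearest = dists.min(axis=1)

print(f"max mean gap: {utility_gap:.3f}")
print(f"closest synthetic/real pair: {nearest.min():.3f}")
```

Production-grade validation uses richer measures (distributional distances, membership-inference resistance, formal differential-privacy accounting), but the tension is visible even here: pushing `utility_gap` toward zero generally pulls synthetic records closer to real ones.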
The second panel session, Enabling the Way Forward, began with Michael B. Hawes of the U.S. Census Bureau discussing how to leverage open-source tools and initiatives and the role of accessible tools in advancing and safely applying synthetic data. James B.D. Joshi, Ph.D., of the National Science Foundation followed by highlighting the role of synthetic data and of the analytics community as a force multiplier for advancing the science and application of synthetic data involving PII/PHI. Herbert Wong, Ph.D., of the Agency for Healthcare Research and Quality highlighted the role of synthetic data in meeting directives to improve healthcare and evidence-based policymaking, lessons learned from the Synthetic Healthcare Database for Research, and thoughts on how synthetic data may bridge efforts across federal, state, and local communities. Amanda Purnell, Ph.D., of the VHA closed out the panel by covering the role of academia: successes and guidance for working with partners outside government, and how best to leverage their expertise and advancements as contributions to the policy development effort. Dr. Purnell left attendees with a reminder: “The goal here is that we are doing something meaningful. It’s not about the technology, it’s about how we get to important information that helps us better understand a specific question that has value for veterans so that we can solve problems and improve the quality of life for the veterans we serve.”
From both sets of panel presentations and the post-presentation discussions, a handful of shared themes emerged. One was the need for better governance of data and data policy: synthetic data lacks standardized validation metrics, and procedures around it remain siloed by institution or project. Another was the importance of data tools; the field needs more free, user-friendly, open-source tools for synthetic data generation and validation, as well as more options for creating multimodal data. Panelists also agreed that improved partnerships among government, industry, and academia would help resolve some of these present-day issues and serve as a force multiplier, advancing data sharing, model sharing, and data-as-a-service. Many panelists called tiered access critical to developing data policy and procedures that account for privacy protection, use cases, and control over how synthetic data is used. Finally, all panelists agreed that a delicate balance exists between data accuracy and data privacy, and that a clearly communicated, agreed-upon level of risk exposure should be established.
LMI’s Vice President of Digital and Analytic Solutions, Kristen Cheman, closed the workshop by thanking ACT-IAC and the Health Community of Interest, the keynote speakers, panelists, and attendees for their contributions. She echoed the need for policy surrounding synthetic data, reminding participants that this broad and critical challenge will impact all of government.