The untapped potential of clinical free-text: understanding privacy risk in electronic health records

Author: Arlene Casey, Vivensa Foundation Senior Research Fellow | Principal NLP Data Scientist

Clinical free-text data (typed notes from medical practitioners) is a treasure trove that constitutes 70% to 80% of health information, but often remains inaccessible for research purposes. Through our efforts at DataLoch, we aim to revolutionise health research, and ultimately patient care, by creating a pathway for data access which preserves patient confidentiality.

Privacy Concerns: a major hurdle for access to clinical free-text

Despite the wealth of information embedded in unstructured clinical reports, the majority of health research relies on structured data.

One of the significant hurdles in providing clinical free-text for research lies in privacy concerns. Traditionally, manual redaction processes have been required to ensure patient confidentiality. However, these processes are time-consuming, and therefore may not able to be undertaken at the scale required or always completed within standard research project timescales.

Our work addresses this critical issue head-on, focussing on the removal of identifiers – both direct (names, dates of birth) and indirect (potentially revealing too much information, or an accumulation of information, that increases the risk of a patient being identified, such as where a medical event took place and/or detailed family information) – to pave the way for more efficient and secure access to unstructured data.

Understanding Public Perspectives

Public perceptions around using free-text data for research and managing privacy risks is integral to our work and is a foundational element of our approach.

We combined insights across our engagement work to provide a nuanced understanding of participant deliberations concerning free-text patient data and Trusted Research Environment (TRE) reliability. These public perspectives have fed directly into our approach to understanding and identifying privacy risks in free-text.

Direct Privacy Risk: a collaborative approach

Whilst there is existing research in AI models to address direct risks, often these models are not publicly available or are built on non-UK based data meaning they do not work well when applied to NHS data. The Scottish Safe Haven Network has worked together to propose a national standard on what direct identifiers are and are labelling a large set of hospital records that will be used to train a de-identification model to work across the Scottish population.

Indirect Privacy Risks: a novel approach

In a first-of-its-kind effort, our project has explored indirect privacy risks within clinical free-text — an area largely unexplored in previous research. By extracting one year's worth of hospital discharge summaries from major NHS Lothian hospitals, the team applied innovative Natural Language Processing (NLP) approaches to identify and understand potential privacy risks associated with indirect identifiers.

List of indirect identifiers and descriptive examples. For those using screen readers, please go to the main body of the article for a link to a plain-text version of this information. — List of indirect identifiers and descriptive examples. For those using screen readers, please refer to this document for the text.

Through meticulous review and analysis, the team identified six main categories of indirect privacy risks: Unique Medical Events, Social, Locations and Organisations, Mental Health, Police, and Crime. The occurrence of these risks was not isolated: often, where one category surfaced within a sentence, a cascade of others could be found in subsequent reports. The age range of patients and the frequency of hospital attendance emerged as crucial factors influencing the cumulative risk of identifiability. This comprehensive categorisation underscores the necessity of extending privacy considerations beyond the scope of current research on direct identifiers.

From Data Extraction to Visualisation: a comprehensive approach

The team developed a prototype dashboard aiming to revolutionise the process at the data access stage, effectively de-risking free-text when it is shared for research purposes. The prototype helped specialists explore how they could apply a proportionate-based risk approach to better understand data privacy risks and make more-informed decisions.

The team continue to refine the prototype dashboard. Future iterations will encapsulate and track the de-risk stages, incorporating valuable insights from public engagement to ensure that semi-automated tools maintain a balance with human checks and audit trails.

The Road Ahead: bridging a gap in health data access for research

As the work successfully achieves its aims — exploring NLP-based approaches to understand risk, developing a prototype dashboard, and understanding public perspectives — we envision a future where clinical free-text is a valuable resource made securely accessible for innovative health research leading to improved patient care.

Our initiative not only breaks down barriers in accessing unstructured health data but also sets a precedent for future research endeavours seeking to harness the full potential of clinical notes in an ethical and informed manner.

Acknowledgement

This work would not have been possible without funding from DARE UK and Research Data Scotland as well as our collaborative partners across the Scottish Safe Haven Network and our academic partners at the universities of Edinburgh, Aberdeen, and Sussex.

This article summarises Arlene's presentation at Scotland’s Health Research and Innovation Conference in October 2023 – that was awarded best talk – which delved into our Scottish Safe Haven Network collaboration addressing the challenges of creating an access pathway to clinical free-text.

Discover more about our Natural Language Processing programme of work:

Natural Language Processing

Case Studies

Image credit: sturti via Getty Images

Case Studies

Developing improved assistance for homelessness services

In 2024, we played a key role in bringing complementary services together to greatly improve the support available for those at risk of experiencing homelessness.

Developing improved assistance for homelessness services

In 2024, we played a key role in bringing complementary services together to greatly improve the support available for those at risk of experiencing homelessness.

Tackling environmental impacts on childhood respiratory health

For a novel study looking at the connections between energy poverty and preschool respiratory infections, we have played a key role in fostering additional partnerships to expand the project and achieve greater real-world impact.