Our Natural Language Programme Lead Arlene Casey is leading a 12-month project that explores the use of new technologies to reliably de-identify free-text notes from health, education, and other administrative systems. The goal is to develop a toolkit that would allow information from free-text notes to be made safely available for cutting-edge research.

The core challenge

Administrative data are created when people access public services, including healthcare.

Currently, administrative data used for research are predominantly structured (also known as coded) data, such as disease codes, air-quality measures, and exam results. Structured data are used because identifiable information – such as names, addresses, and dates of birth or death – is easily removed from a data extract before it is made available to researchers. At DataLoch, we already enable valuable projects by providing de-identified extracts of structured health data.

Free-text (unstructured) data contain rich contextual information – symptom descriptions, educational backgrounds, life events – which is critical for understanding complex societal situations and outcomes. However, free-text data are rarely shared for research purposes due to the unpredictable presence of indirect identifiers, including recognisable locations, social circumstances, and familial relationships. When these indirect identifiers appear individually, the privacy risk is relatively small, but in combination they may reveal someone’s identity.

Developing a possible solution

Large Language Models (LLMs) could support the assessment of privacy risks within free-text extracts and therefore provide the foundation for a pathway through which these data could be made available for research. The STAR-TRE project involves three key steps:

1. Assess how LLMs could be integrated within existing de-identification processes.

2. Enhance our prototype privacy risk dashboard to support risk assessments of free-text extracts.

3. Test and improve the privacy risk dashboard through use cases requiring data from multiple domains (e.g. health, housing, education).

We will engage members of the public alongside specialists in data services and information governance to ensure the toolkit we develop is trustworthy and can be adopted by other Trusted Research Environments that provide de-identified data extracts for research.

Potential impact

The anticipated impacts are significant. Our work will safely unlock currently inaccessible data for vital research into sticky problems such as social inequalities, care in later life, and offending and rehabilitation. With stronger research across diverse domains, service managers and policymakers will have improved evidence to support decisions that ultimately benefit everyone in society. The ambition is to safely enable access to more data, accelerating opportunities for new discoveries.

Acknowledgement

Funded by DARE UK [grant number: UKRI3005], the STAR-TRE (Safe and Trustworthy Assessment of Risk in Trusted Research Environments for Sensitive Free-Text De-Identification) project is led by DataLoch at the University of Edinburgh, working in collaboration with the Scottish Safe Haven Network and the University of Sussex.

Discover more

A public webinar introducing STAR-TRE and the other seven Next-Gen Catalysts will be held on Wednesday 4 March 2026, 1-3pm. Find out more and register through the link below.

Next-Gen Catalysts webinar registration page

 

The DataLoch NLP programme