The primary care data from GPs that are hosted by the DataLoch service provide huge opportunities for population-level observational research on large cohorts, as well as for NHS service management and service improvement. However, the data can be tricky to use, and there are many complexities and subtleties that are useful to be aware of.

The benefits of digital records

The first piece of good news is that the data are digitised, so there is no need to decipher anyone’s handwriting. The second piece of good news is that the data are coded, which enables many data analysis tasks to be carried out without reading the typed-in (free-text) notes for each patient encounter. This not only simplifies things enormously, but also ensures that identifiable information (such as patient and family names) does not need to be revealed to either analysts or researchers. These data are coded using Read codes.

What are Read codes? 

Read codes are a clinical terminology first invented in the 1980s by Dr James Read. There have been three versions, with primary care data in Scotland using version 2 (known as Read 2). In Read 2, 5-character codes are used to represent a huge list of events, diagnoses, test results and observations, including:

  • Encounters with primary care
    • Seen in GP’s surgery (9N11.)
    • Telephone encounter (9N31.)
  • Readings or test results
    • Blood pressure (246..)
    • Liver function test (44D6.)
  • Lifestyle conditions
    • Ex-smoker (137S.)
    • Teetotaller (1361.)
  • Symptoms
    • Chest pain (182..)
  • Condition diagnosis
    • Asthma (H33..)


For some of these codes, the presence of the code itself carries all the information, but for others – such as readings or test results – additional information is stored in other data fields or in the associated free-text field. 
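
As a rough illustration of the distinction above, the following Python sketch looks up a few of the example codes and attaches an optional value for codes that need one. The `CodedEvent` class, its `value` field and the format check are assumptions made for illustration only; they do not describe how DataLoch or any GP system actually stores the data.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Descriptions taken from the examples listed above.
READ2_EXAMPLES = {
    "9N11.": "Seen in GP's surgery",
    "9N31.": "Telephone encounter",
    "246..": "Blood pressure",
    "44D6.": "Liver function test",
    "137S.": "Ex-smoker",
    "1361.": "Teetotaller",
    "182..": "Chest pain",
    "H33..": "Asthma",
}

# Assumed shape of a Read 2 code: exactly five characters, made up of
# letters or digits followed by optional trailing dots.
READ2_PATTERN = re.compile(r"^(?=.{5}$)[0-9A-Za-z]+\.*$")


@dataclass
class CodedEvent:
    """Hypothetical record: a Read code plus an optional associated value."""
    read_code: str
    value: Optional[str] = None


def describe(event: CodedEvent) -> str:
    """Return a readable label, appending the value when one is present."""
    if not READ2_PATTERN.match(event.read_code):
        raise ValueError(f"Not a 5-character Read 2 code: {event.read_code!r}")
    label = READ2_EXAMPLES.get(event.read_code, "Unknown code")
    return label if event.value is None else f"{label}: {event.value}"


# The asthma code carries all the information on its own, whereas the
# blood pressure code only makes sense alongside the recorded reading.
print(describe(CodedEvent("H33..")))
print(describe(CodedEvent("246..", "120/80")))
```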

What makes things complicated?

Coding systems help significantly, but when bringing data together within a service like DataLoch, there are still many challenges to address. 

Which data fields are used and the format in which the additional data are stored can vary depending on which computer system a GP practice uses, how the system is configured for that practice and, sometimes, each health care professional’s personal preferences. This results in ‘messy’ data that require some processing before they can be easily used for other purposes.

For some categories of data, DataLoch has already performed this processing and produced cleaner, more research-ready data, for example for blood pressure results, ethnicity, BMI, alcohol consumption, and smoking status. These data are within the DataLoch Observations tables in our Metadata Catalogue: see our About the Data page for more.


“Data matters. Clinicians and service-designers need to be aware of not just what data gets collated, but also what gets missed out. Crucially, what are the circumstances we need to create for the correct data to be accurately recorded at the appropriate time?”

Dr Peter Cairns, GP, Wester Hailes Medical Practice and Clinical Advisor, DataLoch


The crucial role of health care professionals’ expertise in addressing challenges

Our recent work on smoking-status data provides some good examples of the challenges we face.

Initially, we tested mappings from Read codes to smoking status used in publications, but this confusingly resulted in many people being classified as both an ex-smoker and a current smoker on the same day. Investigating this inconsistency, we found that most of these cases arose because the Read code for smoking-cessation advice (8CAL.) was treated as proxy evidence of someone being a current smoker. However, in the period 2004-2016, the Quality and Outcomes Framework (QOF) asked GPs to give smoking-cessation advice to ex-smokers. Hence, Read codes indicating smoking-cessation advice (and therefore suggesting people were current smokers) were being recorded on the same day as Read codes designating the patient as an ex-smoker.

On discovering this, and after talking with GPs, we dropped Read codes related to smoking-cessation advice from the list of codes that mapped to the patient being a current smoker. 
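
The kind of check that surfaces this problem can be sketched in a few lines of Python. This is a minimal illustration rather than DataLoch's actual processing: the record layout, patient identifiers and dates are invented, and the two mappings contain only the codes mentioned in this article.

```python
from collections import defaultdict
from datetime import date

# Hypothetical mapping in the style described above: smoking-cessation
# advice (8CAL.) is treated as proxy evidence of current smoking.
PUBLISHED_STYLE_MAPPING = {
    "137S.": "ex-smoker",
    "8CAL.": "current smoker",
}

# Revised mapping: cessation-advice codes no longer imply any status.
REVISED_MAPPING = {
    code: status
    for code, status in PUBLISHED_STYLE_MAPPING.items()
    if code != "8CAL."
}


def conflicting_days(records, mapping):
    """Return (patient_id, day) pairs with more than one derived status."""
    statuses = defaultdict(set)
    for patient_id, day, read_code in records:
        status = mapping.get(read_code)
        if status is not None:
            statuses[(patient_id, day)].add(status)
    return sorted(key for key, found in statuses.items() if len(found) > 1)


# Invented example records: (patient_id, date recorded, Read code).
records = [
    ("P001", date(2010, 5, 4), "137S."),  # recorded as an ex-smoker...
    ("P001", date(2010, 5, 4), "8CAL."),  # ...and given cessation advice
    ("P002", date(2011, 2, 1), "137S."),
]

print(conflicting_days(records, PUBLISHED_STYLE_MAPPING))
print(conflicting_days(records, REVISED_MAPPING))
```

With the first mapping, patient P001 is flagged as having two statuses on the same day; with the revised mapping, no conflicts remain.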

It is important to understand not only computer systems, but also the (potentially competing) motivations of the humans involved in generating health care data. The impact of incentives under the QOF can be seen in other GP data with significant peaks in data recording for some Read codes.

Finding out more about medical codes

Although there are challenges, there are many code lists, such as the HDR UK phenotype library, that provide starting points for identifying the appropriate codes for a specific analysis. Additionally, Read codes can be viewed by downloading the NHS Read browser (registration with NHS Digital TRUD required) and accessing the ‘5-byte Version 2 Read Codes (Scottish)’ version.

Despite the potential pitfalls, GP Read-coded data offer a huge opportunity to support vital research. The DataLoch team is ready to work with researchers to understand the data and provide secure access to data extracts for projects that are in the public interest.