This article was co-authored by Dr. Julia Ive, Head of NLP Research, and Dr. Vibhor Gupta, Founder, at Pangaea Data.
Electronic Health Records (EHRs) are often seen as a burden by clinicians, administrators, and healthcare analysts. The textual data, such as doctors’ notes, discharge summaries and family histories, which comprises around 80% of an EHR, nevertheless contains information that is key to improving patient outcomes. Emerging Artificial Intelligence (AI) techniques allow this information to be extracted at low cost. This AI can assist clinicians in their work, learn from them while seeking help as needed, and iteratively improve the validity of its results. Furthermore, all this can be done while respecting patient privacy. This capability to process unstructured textual data may transform medical record-keeping and clinical practice.
At Pangaea Data we develop tools to assist doctors in clinical tasks and to improve patient outcomes. We have medical doctors on our team and work closely with other health professionals who spend 30% of their valuable time diligently capturing notes about patients. They are often frustrated that the information they capture is not used to its full potential and are sceptical about the usefulness of AI in their clinical work. From our perspective things look quite different: our novel unsupervised AI has been shown to extract intelligence from textual health data (unstructured data) at 50 times the speed and with 87.5% accuracy, 30% higher than the best known alternative methodologies based on text mining or brute-force natural language processing (NLP).
We have successfully raised two funding rounds within a year to further our research and rapid growth. Through this work we are helping clinicians and scientific researchers to accurately find the 60% of patients with chronic diseases who are not properly diagnosed, and to identify patients early and at scale, from drug discovery to clinical trials and epidemiological studies. In this article, we will also explain how we are easing common concerns about using AI.
What is unstructured data?
Over the last decade, EHRs (clinical notes, discharge summaries, and the like) have been replacing paper-based medical records to optimise data storage and management. For administrative purposes (mainly billing), some parts of EHRs are completed in a structured, standardised way: clinicians choose options such as diagnoses, medications, and symptoms from lists and complete onscreen forms. Most of the information in an EHR (up to 80%), however, remains textual data (written free text). This data has no predefined structure or delineated content other than broad category outlines. The text may include symptoms, patient and family history, social circumstances, summaries of diagnoses, tests performed, medications prescribed, and so on.
Free text is very complex and difficult for computers to analyse. As an example, a computer can easily identify a patient’s substance abuse status from a checkbox on an EHR form (if one exists). The computer, however, cannot deduce a “yes or no” answer from the textual description, “Her mother drinks but she denies all alcohol herself.” To do this, the computer would first need to relate both the verb “drinks” and the expression “all alcohol” to “substance abuse” and determine that only the second refers to the patient. Second, the computer would need to detect the negation implicit in the word “denies” and recognise that this word reports the patient’s own statement. Finally, even then the computer could not confidently pick a yes/no answer, since the sentence simply describes a patient statement which may or may not be true. This simple example outlines the challenge inherent in transforming unstructured textual data into useful medical information.
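To make this concrete, here is a toy sketch of why naive keyword matching goes wrong on exactly this sentence. All terms and rules below are invented for illustration; real clinical NLP systems are far more sophisticated:

```python
# Toy illustration of why keyword matching fails on clinical free text.
# The lexicons here are invented for demonstration purposes only.

ALCOHOL_TERMS = {"drinks", "alcohol", "etoh"}   # naive term list
NEGATION_CUES = {"denies", "denied", "no", "not"}

note = "Her mother drinks but she denies all alcohol herself."

# Step 1: a naive matcher only checks whether any term occurs.
naive_hit = any(term in note.lower() for term in ALCOHOL_TERMS)
print(f"Naive keyword match: {naive_hit}")  # True -> wrongly flags the patient

# Step 2: even a crude negation check changes the answer for "alcohol",
# but it still cannot tell that "drinks" refers to the mother, not the
# patient, or that "denies" only reports the patient's own claim.
words = note.lower().replace(".", "").split()
for i, word in enumerate(words):
    if word in ALCOHOL_TERMS:
        window = words[max(0, i - 3):i]  # look a few words back
        negated = any(cue in window for cue in NEGATION_CUES)
        print(f"term={word!r} negated={negated}")
```

The matcher flags the sentence as positive, and the negation window catches “denies … alcohol” but not the fact that “drinks” describes the mother, which is precisely the gap between surface matching and understanding.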
Why do we need unstructured data?
The accumulation of healthcare data has reached unprecedented levels. NHS datasets alone record billions of patient interactions every year. Moreover, new COVID-19–related services have dramatically increased this data (e.g., through Summary Care Records and self-isolation notes). Millions of pounds are invested in improving the ways data is collected, processed and used. This has led to the belief that switching fully to entering only structured information would solve the problem of using this information efficiently. The reality, however, is that this is not possible for several reasons, including the fact that it is not natural for humans to convey information by filling in numerical values or picking options from a preset list. Filling in more structured and standardised EHRs would result in an even higher administrative burden and more clinician burnout, even with carefully thought-out form designs. Studies already report that clinicians spend 35% of their time documenting patient data and wish they could spend more of it with patients.
Moreover, there is a whole spectrum of personalised patient information that cannot fit into form checkboxes and drop-down lists: the personal, social, and cultural circumstances of the patient, all the nuances of patient history and treatment, and the clinician’s own impressions and intuition regarding the diagnosis and treatment plan. This type of information can be crucial for understanding and interpreting a medical condition and its context. For example, early detection and prevention of Alzheimer’s disease relies on a patient history of memory loss, agitation, anxiety, and depression, all captured as textual information. Free-text information also plays a major role in comprehending heterogeneous clinical conditions such as asthma and allergies, whose manifestations are so varied that only natural language can describe them.
How can AI tools unlock the potential of unstructured data?
Textual clinical data does contain enough important information to drastically improve patient outcomes, reduce the cost of care, and reduce healthcare providers’ burden (important goals in the NHS E/I Long Term Plan), if the data can be properly accessed and analysed. The area of artificial intelligence (AI) that applies algorithms to identify and extract information from textual data is usually referred to as Natural Language Processing (NLP). Essentially, NLP techniques allow computers to understand and analyse free text by simulating human intelligence with computer algorithms. This form of AI is designed to learn from existing data and to assist humans in solving problems that currently require direct human intervention (e.g., clinical decision-making). For example, current AI cannot triage patients or conduct telemedicine interviews on its own – these require human interaction. AI can, however, provide algorithmic support to healthcare providers to improve both telemedicine interactions and patient triaging.
Currently, most clinical text mining (deriving information from text) methodologies can only ‘understand’ certain words or short phrases in the text, defined by a dictionary of terms. For example, an algorithm can detect terms such as ‘anosmia’ and ‘olfactory dysfunction’. To help this algorithm deduce a diagnosis, however, expert clinicians would need to read through a substantial number of clinical texts and annotate each term as it relates to a particular diagnosis. Once a large number of such annotations is collected, a computer algorithm can learn from them and match the keywords/phrases to the diagnosis.
Such basic algorithms are not very effective because their coverage is limited. This is analogous to a human who has a limited vocabulary in a foreign language and is asked to make decisions based on it. For example, olfactory dysfunction can be described in text in many different ways (‘lost smell’, ‘perceived odour’, etc.), and most of these words and expressions will simply be missed by a text mining algorithm. Add to that the complexity of then linking these terms to patients with olfactory dysfunction across different diagnoses (COVID-19, Alzheimer’s, etc.). Rules for every variation of a clinical characteristic or symptom would have to be identified and defined with help from clinicians so that the machine could learn from them. This laborious process of defining rules defeats the very purpose of AI.
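A minimal sketch makes the dictionary approach and its blind spot concrete. The tiny term list below is invented for illustration:

```python
# Minimal sketch of dictionary-based clinical text mining.
# Every surface term must be listed explicitly, mapped to a concept.

TERM_DICTIONARY = {
    "anosmia": "olfactory dysfunction",
    "olfactory dysfunction": "olfactory dysfunction",
    "loss of smell": "olfactory dysfunction",
}

def match_terms(text: str) -> list[tuple[str, str]]:
    """Return (surface term, concept) pairs found by exact substring lookup."""
    lowered = text.lower()
    return [(term, concept)
            for term, concept in TERM_DICTIONARY.items()
            if term in lowered]

note = "Patient reports anosmia since Tuesday; says she lost her sense of smell."
print(match_terms(note))
# Only 'anosmia' is found; the paraphrase 'lost her sense of smell' is
# missed, because it is not in the dictionary, illustrating the
# coverage problem described above.
```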
Recent advances in AI help alleviate these deficiencies through so-called embeddings: machine-readable representations that allow meaning to be extracted from words and phrases without defining rules upfront. For example, ‘anosmia’ and ‘lost smell’ would be recognised as related, similar to the way a native speaker of English would understand and relate those phrases. The use of embedding techniques removes the need for costly human annotation and constant human involvement in creating new AI models.

This does not mean that the help of clinicians can be dispensed with. Quite the opposite: AI requires human-assisted “training” as it learns from free text in EHRs and improves itself in the process, although this will require less and less human intervention over time. Current AI algorithms are capable of assessing when they need human help, which is usually the case for new, previously unseen situations. Such requests will not add to the administrative burden: the cases are formulated as short texts and feedback is submitted in the form of a short answer. The final decision, expertise and legal responsibility remain with the clinical experts. The AI can then assist clinicians in their daily work, freeing more time to spend with patients and providing suggestions for clinical decision-making, e.g. alerting them to possible diagnoses, suggesting treatments, and helping to fill in forms.
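To illustrate the embedding idea, here is a minimal sketch using the open-source sentence-transformers package and the publicly available ‘all-MiniLM-L6-v2’ model. This is a generic demonstration of the technique, not our production system:

```python
# Minimal sketch of how embeddings relate paraphrases without
# hand-written rules, using a general-purpose open-source model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = ["anosmia", "lost sense of smell", "elevated blood pressure"]
embeddings = model.encode(phrases, convert_to_tensor=True)

# Cosine similarity in embedding space: paraphrases score high,
# unrelated clinical phrases score low, with no dictionary needed.
scores = util.cos_sim(embeddings, embeddings)
print(f"anosmia vs lost sense of smell:    {scores[0][1]:.2f}")  # high
print(f"anosmia vs elevated blood pressure: {scores[0][2]:.2f}")  # low
```

No one told the model that these two phrases are related; that knowledge comes from the patterns it learned during pre-training on large volumes of text.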
There is one important concern that must be addressed: patient privacy. The creation and further development of AI models requires sharing data between data holders, researchers and industry. Here, AI algorithms can help by removing identifying information (e.g., names, addresses, numbers) from the text. For more complex cases, where a person could be recognised from some other description, AI offers technologies that can rewrite this information and obfuscate the identifying features.
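As a rough illustration of the simpler, removal-based step, the sketch below uses spaCy’s general-purpose named-entity recogniser to redact obvious identifiers. Real clinical de-identification uses models trained specifically for protected health information, and the rewriting of indirect identifiers mentioned above is a harder problem not shown here:

```python
# Toy de-identification sketch: redact entities found by a general-purpose
# NER model. Assumes spaCy and the en_core_web_sm model are installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def redact(text: str) -> str:
    doc = nlp(text)
    redacted = text
    # Replace detected entities with their label, working right-to-left
    # so earlier character offsets stay valid after each replacement.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "DATE", "ORG", "FAC", "CARDINAL"}:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(redact("John Smith, seen at St Mary's on 3 March 2021, reports chest pain."))
# e.g. "[PERSON], seen at [ORG] on [DATE], reports chest pain."
```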
Given these new technologies, unstructured data becomes a key to drastically improving patient outcomes, reducing the cost of healthcare, and easing the burden on clinicians. Modern AI technologies are powerful enough to extract knowledge from unstructured data, which can be particularly helpful in:
- Early diagnosis: Success relies heavily on family history, social information and sparse mentions of first symptoms, all usually recorded in free text;
- Personalised treatment schemes: The choice of optimal treatment is based on groups of individual metabolic and physiological characteristics (phenotypes) or the presence or absence of comorbidities. This heterogeneous information is most commonly available as free text;
- Rare disease research and treatment: These diseases represent a large group of heterogeneous conditions that can remain undiagnosed for a long time. The accumulated volume of clinical text may contain very sparse information on such cases, and AI provides powerful tools to analyse that volume efficiently;
- Improved clinical decision-making: AI can serve as an important decision-support tool, learning from the decisions of more experienced doctors and producing suggestions that are especially helpful for less experienced clinicians;
- Improved recruitment for clinical trials: Subjects can be matched to trial descriptions (as sketched after this list); lengthy searches and a lack of subjects are main causes of clinical trial failure. Better analysis of previous studies, as reported in scientific and clinical free text, also improves the design of new studies;
- Population care planning: Free-text information is very useful for extracting statistics for resource management (equipment and supplies);
- Reduced administrative burden for clinicians: Free text is a natural way for humans to convey information, whereas entering tabular (structured) information into EHRs is particularly laborious. AI tools are capable of extracting this information from free text and reducing the administrative pressure;
- Improved information sharing between care providers: AI tools can efficiently summarise only the salient information from large volumes of free text and promote efficient exchange between teams and care providers.
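As one worked example from the list above, trial recruitment can reuse the same embedding-similarity idea shown earlier. The texts and threshold below are invented for illustration, and a real matching system would also handle structured eligibility criteria:

```python
# Toy sketch of matching patient notes to a trial description by
# embedding similarity. Notes, trial text and threshold are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

trial = "Recruiting adults with type 2 diabetes and poorly controlled HbA1c."
notes = [
    "58-year-old with T2DM, HbA1c 9.2% despite metformin.",
    "Healthy 30-year-old attending for travel vaccination.",
]

trial_emb = model.encode(trial, convert_to_tensor=True)
note_embs = model.encode(notes, convert_to_tensor=True)
scores = util.cos_sim(trial_emb, note_embs)[0]

for note, score in zip(notes, scores):
    # Illustrative cut-off; a real system would calibrate this carefully.
    flag = "candidate" if score > 0.4 else "unlikely"
    print(f"{score:.2f} {flag}: {note}")
```

The diabetic note scores far higher against the trial description than the unrelated one, even though it shares almost no exact words with it (“T2DM” vs “type 2 diabetes”).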
In conclusion, advances in AI using embedding algorithms hold the promise of translating vital, but previously inaccessible, textual data into a rich data source. The use of this nuanced patient information opens transformative possibilities that could fundamentally change how records are kept and medicine is practised, not only driving time and cost efficiencies but also enabling higher quality, more personalised and attentive care.
Biography – Dr. Julia Ive, Head of NLP Research at Pangaea Data
Julia Ive heads Natural Language Processing research at Pangaea Data. Julia is also a Research Fellow at the Data Science Institute at Imperial College London. She holds a PhD in Machine Translation from LIMSI-CNRS (France). Recently she has been working mainly on text generation, multimodality and classification, both in general and within the clinical domain. Julia recently led the EPSRC Healtex pilot project on creating privacy-safe artificial medical records for use in research.