
Summary of CLEF Technologies
- Information extraction from texts to acquire data
- Integration of clinical information and development of “chronicles”
- Privacy, confidentiality, consent, and security
- User-oriented query formulation and a “WYSIWYM” interface
- Knowledge resources and metadata
- e-Science infrastructure
- Links to the new NHS infrastructure
Information Extraction & Language Technology
Doctors dictate. Much of the key information in clinical records continues, and will continue for the foreseeable future, to be contained in unstructured or at best minimally structured texts. Hence a major part of CLEF is devoted to adapting and evaluating mechanisms for information extraction from text. Four features of the cancer domain make information extraction feasible:
- The very limited sublanguage, even more so than for medicine as a whole
- Much of the specialized information is in common with molecular biology which is a major target for current text extraction efforts
- The well defined list of index events and signs that allows the template for extraction to be well defined
- The existence of multiple reports for most events.
The existence of multiple reports is particularly important and has not been widely noted elsewhere to the best of our knowledge. Cancer patients are seen over a long period of time and their records summarized repeatedly so that there are many parallel or near parallel texts – often 150 or more text documents per patient. What may be unclear or ambiguous in one text can be refined from others. This is particularly important when dealing with records from a referral hospital where the system usually will start in the “middle of the story”.
For example, the first document might simply mention breast cancer in the past while concentrating on the current recurrence. A later summary might give a date for a mastectomy but no details of the tumour type. Eventually, perhaps after information from the referring hospital has been received, a definitive statement of the time, tumour, spread, and treatment might be found. Subsequent notes might again refer to the initial cancer vaguely while concentrating on current concerns. By cross-checking information, the picture of the overall “chronicle” gradually comes into focus, although still with varying degrees of certainty. For the architecture, this means that extracting information from one document may involve reference to the repository as a whole.
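The cross-checking described above can be illustrated with a minimal sketch. It is not the CLEF extraction pipeline: the function name `consolidate`, the attribute names, the document identifiers, and the dates are all invented for illustration. Each document contributes whatever attributes it states about an event; gaps are filled from other documents, and disagreements are kept for review rather than silently overwritten.

```python
def consolidate(mentions):
    """Merge per-document mentions of events into one record per event.

    mentions: iterable of (doc_id, event_name, attributes_dict).
    Returns {event_name: {"attrs", "evidence", "conflicts"}}.
    """
    events = {}
    for doc_id, event, attrs in mentions:
        entry = events.setdefault(
            event, {"attrs": {}, "evidence": {}, "conflicts": []}
        )
        for name, value in attrs.items():
            if name not in entry["attrs"]:
                # First document to state this attribute fills the gap.
                entry["attrs"][name] = value
                entry["evidence"][name] = [doc_id]
            elif entry["attrs"][name] == value:
                # Agreement from another document strengthens the evidence.
                entry["evidence"][name].append(doc_id)
            else:
                # Disagreement is recorded, not resolved automatically.
                entry["conflicts"].append((name, value, doc_id))
    return events


# Invented example data mirroring the narrative above.
mentions = [
    ("clinic-letter", "primary_breast_cancer", {}),
    ("later-summary", "primary_breast_cancer", {"surgery_date": "1998-03-01"}),
    ("referral-notes", "primary_breast_cancer",
     {"surgery_date": "1998-03-01", "tumour_type": "ductal carcinoma"}),
]
record = consolidate(mentions)["primary_breast_cancer"]
print(record["attrs"])
print(record["evidence"])
```

Keeping the list of supporting documents per attribute is one simple way to reflect the "varying degrees of certainty" mentioned in the text.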
Dealing with terminology is an essential step in natural language processing in technical domains. The CLEF team has made considerable progress in implementing and using Termino, a large-scale terminology resource for biomedical language processing, and has already incorporated Termino into the AMBIT system, where it works with a term parser to perform more complete term recognition. The team is now extending the Termino data model so that information about morphological variation can be stored, and building term induction modules so that Termino content can be acquired automatically from corpora, in addition to being derived from manually created resources such as UMLS.
To learn more about it please read: http://www.cs.brandeis.edu/%7Ejamesp/biolink2004/papers/pdf/BIO010.pdf
http://www.nesc.ac.uk/events/ahm2003/AHMCD/pdf/090.pdf
CLEF Chronicle
At the heart of CLEF is the compilation of a single coherent “chronicle” for each patient from the distributed, heterogeneous information that makes up the medical record. At one level, the chronicle provides a clear presentation to clinicians and researchers of the course of one patient’s illness. At another, chronicles are data structures that can be easily aligned on “index events” (diagnosis, first treatment, relapse, etc.) and aggregated for statistical analysis to answer questions such as: “Of patients with breast cancer with a particular genetic profile, how does the time to first recurrence compare between those treated with Tamoxifen and those treated with a new proposed drug regimen?” “How many dropped out of each treatment and why?” “How many required supplementary therapy for the side effects of treatment and why?”
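As a rough illustration of aligning chronicles on index events, the sketch below measures the interval from first treatment to first relapse and aggregates it by treatment group. The chronicle layout (a dict with a `treatment` label and a list of `(event, date)` pairs), the function names, and all the data are assumptions made for this example, not the CLEF data model. It also ignores patients who have not relapsed, which a real analysis would need to handle with proper survival methods.

```python
from datetime import date
from statistics import median


def days_between(chronicle, start_event, end_event):
    """Days from the first start_event to the first later end_event, or None."""
    starts = sorted(d for name, d in chronicle["events"] if name == start_event)
    if not starts:
        return None
    ends = sorted(
        d for name, d in chronicle["events"]
        if name == end_event and d > starts[0]
    )
    return (ends[0] - starts[0]).days if ends else None


def median_interval_by_group(chronicles, group_key, start_event, end_event):
    """Align chronicles on start_event and take the median interval per group."""
    groups = {}
    for chronicle in chronicles:
        interval = days_between(chronicle, start_event, end_event)
        if interval is not None:  # patients without the end event are skipped
            groups.setdefault(chronicle[group_key], []).append(interval)
    return {group: median(values) for group, values in groups.items()}


# Invented chronicles for illustration only.
chronicles = [
    {"treatment": "tamoxifen",
     "events": [("first_treatment", date(2000, 1, 1)), ("relapse", date(2000, 1, 31))]},
    {"treatment": "tamoxifen",
     "events": [("first_treatment", date(2001, 3, 1)), ("relapse", date(2001, 4, 20))]},
    {"treatment": "regimen_b",
     "events": [("first_treatment", date(2002, 6, 1)), ("relapse", date(2002, 6, 21))]},
    {"treatment": "regimen_b",
     "events": [("first_treatment", date(2003, 6, 1))]},
]
result = median_interval_by_group(chronicles, "treatment", "first_treatment", "relapse")
print(result)
```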
The classic problem for electronic health records is to maintain a faithful, secure, non-repudiable record of what healthcare workers have heard, seen, thought and done. The CLEF EHR repositories follow standards designed to achieve these aims, e.g. OpenEHR, CEN standard 13606, and the associated development of “archetypes”. However, the central issue for CLEF is different: to infer a single coherent view of each patient’s history from the myriad documents and data in the EHR, and to align it with those of other similar patients in aggregates for querying and research.
CLEF is interested not only in the literal information in the documents but also in their clinical significance: not only what was done but also why. It is not enough to know that the report of a bone scan claimed “only osteoporotic changes”. It is necessary to recognise that this indicates that there are “no bony metastases found”. It is not enough to know that the patient was taken off chemotherapy; it is important to know what side effect or concurrent illness intervened. Assembling the chronicle is therefore a knowledge-intensive task that relies on inferences. The reliability of these inferences may vary, and it is essential to record not only the inferences but also the evidence on which they were based and their reliability.
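One way to picture "recording the inference together with its evidence and reliability" is a small record type. This is a sketch under stated assumptions: the class names `Evidence` and `Inference`, the 0 to 1 reliability scale, the document identifier, and the score of 0.9 are all hypothetical and do not come from CLEF.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    doc_id: str   # hypothetical identifier of the source document
    quote: str    # the literal text the inference rests on


@dataclass
class Inference:
    statement: str            # the clinically significant reading
    evidence: List[Evidence]  # what it was inferred from
    reliability: float        # assumed scale: 0.0 (guess) to 1.0 (certain)

    def __post_init__(self):
        # An inference without evidence or with an out-of-range score is rejected.
        if not self.evidence:
            raise ValueError("an inference must cite at least one piece of evidence")
        if not 0.0 <= self.reliability <= 1.0:
            raise ValueError("reliability must lie between 0 and 1")


# The bone scan example from the text; the score is invented.
no_mets = Inference(
    statement="no bony metastases found",
    evidence=[Evidence("bone-scan-report", "only osteoporotic changes")],
    reliability=0.9,
)
print(no_mets.statement, "<-", no_mets.evidence[0].quote)
```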
Read the full article: http://www.allhands.org.uk/2004/proceedings/papers/118.pdf
Security and Confidentiality
The key ethico-legal goal of CLEF is to provide mechanisms and policies to ensure that patient privacy and confidentiality are preserved while delivering a repository of medically rich information for the purposes of scientific research. This requires policy/organizational safeguards and a multilevel technical framework.
Royal Marsden Hospital (RMH) is one of the main providers of pseudonymised patient records to the project. An approach has been developed by which real patient records (comprising structured data sets and narrative letters and reports) can be suitably pseudonymised for removal from RMH and inclusion in the CLEF Electronic Health Record Repository. The process provides multiple layers for the protection of patient confidentiality and privacy:
- pseudonymisation – the removal of patient, geographical and organisational identifiers at source;
- depersonalisation – methods of access via language extraction and generation that conceal or remove potentially identifying information;
- security – policies and technical measures for the supervision and maintenance of the pseudonymous Electronic Health Record repository as if it contained identified patient records, in conformance with NHS and international standards, including privacy-enhancing technologies to reduce the risk of re-identification through queries;
- oversight – specific policies for controlling access to the CLEF repository and handling requests to link researchers back to real patients;
- monitoring – organisational and technical measures to identify potential threats and intrusions.
Read the full article: http://www.clinical-escience.org/industrial/sep2003/AHM2003-DKalra-CLEF-Security-Confidentiality-Paper.pdf
The need for pseudonymised records
To achieve the aims of the CLEF project, we will need to process potentially sensitive medical information about cancer patients. Ethically, the source hospital or hospitals cannot release such information to us if it could potentially cause harm to patients. However, without the information, key parts of the project cannot make progress. Pseudonymisation of a document is essentially a three-step process:
1. Identify the sections of the document that contain potentially identifying information
2. Find and mark the identifying information in the document (noting the type of information, and maybe linking instances which relate to the same individual)
3. Replace the identifying information with pseudonymous information in a consistent way, i.e. so that instances which related to the same individual in the original text bear the same relation to each other in the pseudonymised version.
The simplest scheme replaces each identifier with a tag naming its class, such as “[HOSPITAL]”. This is fairly straightforward, but information is lost in the conversion, as it is no longer apparent from the text whether all the “[HOSPITAL]” tags refer to the same hospital or to different ones. An improvement on this scheme would be:
- Add a tag to the class-based identifiers, to distinguish between different entities of the same class: “[HOSPITAL-1]”, “[HOSPITAL-2]”, etc.
To enable the building of an accurate longitudinal record of care, these cross-references would need to be consistent across all the documents in a single patient’s record. If it were desirable for the CLEF system to answer queries such as “show me 5-year survival rates for this type of cancer, grouped by the hospital where they were treated,” then the identifiers would need to be consistent across all patients’ records.
In an ideal world, the whole pseudonymisation process would be carried out by a fully automatic tool. In practice, steps 1 and 3 could be fully automated, but step 2 will generally require human intervention, if the pseudonymisation is to be 100% complete and accurate.
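Step 3 with numbered class tags can be sketched as follows. This is an illustration, not the tool used in CLEF. The function names, the `registry` mapping, and the hospital names are invented, and `mark` is only a stand-in for step 2, which, as noted above, generally needs human review. The sketch also assumes that marked spans do not overlap.

```python
import re


def mark(text, known_names):
    """Stand-in for step 2: find exact occurrences of already-known names.

    known_names: list of (class_label, name). Returns (start, end, class, key).
    """
    spans = []
    for cls, name in known_names:
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), cls, name))
    return spans


def pseudonymise(text, spans, registry):
    """Step 3: replace each marked span with a consistent numbered class tag.

    registry maps (class, original) -> tag and is shared across documents so
    that the same entity always receives the same tag.
    """
    pieces, position = [], 0
    for start, end, cls, key in sorted(spans):
        tag = registry.get((cls, key))
        if tag is None:
            # Next free number within this class, e.g. [HOSPITAL-2].
            number = sum(1 for (c, _) in registry if c == cls) + 1
            tag = f"[{cls}-{number}]"
            registry[(cls, key)] = tag
        pieces.append(text[position:start])
        pieces.append(tag)
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


# Invented names and documents for illustration.
names = [("HOSPITAL", "St Elsewhere"), ("HOSPITAL", "Northtown General")]
registry = {}
doc1 = "Seen at St Elsewhere; referred from Northtown General; reviewed again at St Elsewhere."
doc2 = "Transferred to Northtown General."
out1 = pseudonymise(doc1, mark(doc1, names), registry)
out2 = pseudonymise(doc2, mark(doc2, names), registry)
print(out1)
print(out2)
```

Passing one registry per patient gives consistency within a single record; sharing one registry across all patients would give the cross-patient consistency needed for queries grouped by hospital. The registry itself links tags back to real names, so it would have to be protected as strictly as the original records.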
Question answering and report generation
For the data in the CLEF Repository to be useful, it must be easily accessible to scientists and clinicians. CLEF is experimenting with a variety of textual and graphical query interfaces to the repository. The interface to the repository of health records and chronicles is being designed around a language-generation technique known as WYSIWYM (“What you see is what you meant”), supplemented by various visual and graphic presentations. The WYSIWYM interface allows users to expand a natural-language-like query progressively to produce queries of arbitrary complexity, and then summarises the results, again in generated natural language. The next stage of the project will include user studies to ensure that the interface meets users’ priorities.
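The idea of progressively expanding a query can be pictured with a toy model. It is only a sketch of the general principle, not the CLEF interface: the `Node` class, its methods, and the slot names are hypothetical. The query is held as a tree; unfilled slots are shown to the user as bracketed placeholders in generated text, and each choice replaces one placeholder with a fragment that may open further slots.

```python
class Node:
    """A query fragment: a text template whose {named} gaps are child slots."""

    def __init__(self, template, *slot_names):
        self.template = template
        self.children = {name: None for name in slot_names}

    def fill(self, slot, node):
        """Expand one open slot with a new fragment and return that fragment."""
        if slot not in self.children:
            raise KeyError(slot)
        self.children[slot] = node
        return node

    def render(self):
        """Generate the feedback text; open slots appear as [placeholders]."""
        parts = {
            name: (child.render() if child is not None else f"[{name}]")
            for name, child in self.children.items()
        }
        return self.template.format(**parts)


query = Node(
    "How many patients with {condition} received {treatment}?",
    "condition", "treatment",
)
step0 = query.render()

condition = query.fill("condition", Node("breast cancer and {profile}", "profile"))
step1 = query.render()

query.fill("treatment", Node("Tamoxifen"))
condition.fill("profile", Node("a particular genetic profile"))
step2 = query.render()

for text in (step0, step1, step2):
    print(text)
```

Because the user only ever edits the underlying tree through choices offered by the system, the text shown is always a rendering of a well-formed query rather than free text that must be parsed.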
Initial requirements for accessing the repository have been defined in CLEF, and the identification and modelling of a limited set of query types is now supported by the CLEF workbench. ICD codes and associated terms for diseases have been compiled into a local knowledge base for use in text generation.
A generic WYSIWYM interface for EHR querying has also been implemented, together with a communication procedure between the query interface and the EHR; both are currently functional. Summary reports of patient records are needed to give clinicians a quick overview of a patient's history. The report generator developed in CLEF creates summaries that present this information from different perspectives: chronologically, focused on various types of event, or user-tailored. The application is at an advanced stage of development, and a preliminary evaluation suggests it may be of real use in clinical care.
At this point, we have reached a stage in the implementation of the summary generator that allows us to present to clinicians preliminary summaries of patient records. This should enable us to receive input from medical experts more easily.
Read the full article: http://www.clinical-escience.org/industrial/sep2003/AHM2003-ALR-CLEF-Poster-Paper.pdf
Metadata in the Repository
The CLEF repository requires at least four types of metadata:
- Resource information: what is in the repository so that it can be found
- Provenance information: where information comes from
- Usage and workflow information: how the information has been used, including information allowing monitoring potential compromises of privacy
- Annotations on certainty and evidence: what inferences have been made on the basis of what evidence with what confidence
The first three appear analogous to metadata within myGrid and related projects. The fourth is more specific to CLEF. Metadata standards also need to take into account emerging standards for annotating clinical trials and other areas of biomedicine.