Metadata Standardization and Augmentation
What is metadata standardization?
Metadata standardization is the process of normalizing metadata by mapping raw text to ontology terms, each of which carries an identifier. For example, a disease name like ‘heart attack’ and a taxonomic name like ‘Mus musculus’ often have synonyms, such as ‘myocardial infarction’ or ‘house mouse’. Without linking these synonyms through an ontology, a search for one name can miss records that use the other, making results less complete and less accurate.
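The mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Discovery Portal's implementation: the `SYNONYMS` table and the `standardize` function are hypothetical and hand-built for this example, and the output keys (`identifier`, `name`, `originalName`) are assumed names. A real pipeline would query an ontology rather than a hard-coded table. Disease names can be handled the same way with a disease ontology.

```python
from typing import Optional

# Hypothetical, hand-built synonym table for illustration only:
# lowercase raw text -> (NCBI Taxonomy identifier, preferred scientific name).
SYNONYMS = {
    "mus musculus": ("NCBITaxon:10090", "Mus musculus"),
    "house mouse": ("NCBITaxon:10090", "Mus musculus"),
    "homo sapiens": ("NCBITaxon:9606", "Homo sapiens"),
    "human": ("NCBITaxon:9606", "Homo sapiens"),
}


def standardize(raw: str) -> Optional[dict]:
    """Map raw text to a standardized term, or return None if no match is known."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(raw.lower().split())
    match = SYNONYMS.get(key)
    if match is None:
        return None
    identifier, label = match
    # Keep the original text so the mapping stays traceable.
    return {"identifier": identifier, "name": label, "originalName": raw}
```

With this table, ‘house mouse’ and ‘Mus musculus’ both resolve to the same identifier, so a search on either name can retrieve the same records.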
What is metadata augmentation?
Metadata augmentation is the process of enriching the metadata associated with datasets. This involves adding information, context, or attributes to the existing metadata to improve the discoverability, interpretability, and usability of the resources.
Augmenting metadata is valuable because metadata fields are often missing or incomplete at the source. By enriching and standardizing metadata, the Discovery Portal can fill some of these gaps and give users more of the information they need.
How is metadata in the Discovery Portal standardized?
To improve discoverability for researchers, the Discovery Portal is committed to standardizing and augmenting metadata fields, in particular ‘citation’, ‘funding’, ‘species’, ‘infectiousAgent’, and ‘healthCondition’. The Discovery Portal uses API calls to PubMed and NIH RePORTER to standardize ‘citation’ and ‘funding’ metadata when an appropriate identifier (a PMID or an NIH grant ID) is available.
For ‘species’ and ‘infectiousAgent’ metadata, the Discovery Portal uses Text2Term to map raw text to NCBI Taxonomy IDs, then enriches each match with information from the UniProt database, including common names, scientific names, and URLs. Once a term is standardized, a taxonomy-based heuristic determines whether it remains in the ‘species’ field or is used to augment the ‘infectiousAgent’ field.
For ‘healthCondition’ metadata, the Discovery Portal standardizes values using a hierarchy of ontologies from the NCATS Biomedical Data Translator Program, including the Mondo Disease Ontology (MONDO), Human Phenotype Ontology (HPO), Human Disease Ontology (DOID), and NCI Thesaurus (NCIT). The enriched metadata is stored in a SQLite database, providing a consistent and detailed repository for standardizing new documents.
For ‘topicCategory’ metadata, the Discovery Portal uses Text2Term to map field values to the EDAM Topics ontology.
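The role of the SQLite database as a repository for previously standardized terms can be sketched as a simple lookup cache. The source does not describe the actual schema, so everything here is an assumption made for illustration: the table name `standardized_terms`, its columns, and the helper functions `open_cache`, `remember`, and `lookup`. The sketch uses only Python's standard `sqlite3` module and an in-memory database.

```python
import sqlite3
from typing import Optional, Tuple


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache of standardized terms. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS standardized_terms (
               raw_text   TEXT PRIMARY KEY,
               identifier TEXT NOT NULL,
               name       TEXT NOT NULL
           )"""
    )
    return conn


def remember(conn: sqlite3.Connection, raw_text: str, identifier: str, name: str) -> None:
    """Store the standardized form of a raw value, keyed by its normalized text."""
    conn.execute(
        "INSERT OR REPLACE INTO standardized_terms VALUES (?, ?, ?)",
        (raw_text.strip().lower(), identifier, name),
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, raw_text: str) -> Optional[Tuple[str, str]]:
    """Return (identifier, name) for a raw value seen before, else None."""
    return conn.execute(
        "SELECT identifier, name FROM standardized_terms WHERE raw_text = ?",
        (raw_text.strip().lower(),),
    ).fetchone()
```

Under these assumptions, a new document whose ‘species’ value has been seen before can be standardized with a local lookup, and only unseen values need to go through the full mapping step.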
How does the Discovery Portal augment metadata?
The Discovery Portal augments missing metadata through a multistep approach that combines APIs, text-processing services, and biological databases. In the first pass, the Discovery Portal uses citation mapping together with PubTator downloads of pre-extracted disease and taxonomy terms to supplement missing values for ‘healthCondition’, ‘species’, or ‘infectiousAgent’. The citation mapping, in conjunction with API calls to NIH RePORTER, is also used to augment missing funding information. Records that still have missing metadata after this citation-based pass go on to a second approach.
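The citation-based first pass can be sketched as follows. This is a simplified illustration under stated assumptions, not the Discovery Portal's code: the record layout (a `pmids` list on each record), the `annotations_by_pmid` dictionary standing in for pre-processed PubTator terms, and the function name `augment_from_citations` are all hypothetical. The PMID used in the usage note is a placeholder.

```python
from typing import Dict, List, Tuple

# Fields the first pass tries to fill, as named in the text above.
AUGMENTABLE = ("healthCondition", "species", "infectiousAgent")


def augment_from_citations(
    record: dict,
    annotations_by_pmid: Dict[str, Dict[str, List[str]]],
) -> Tuple[dict, List[str]]:
    """Fill missing fields from citation-linked terms; report what is still missing."""
    augmented = dict(record)  # shallow copy so the input record is left unchanged
    for field in AUGMENTABLE:
        if augmented.get(field):
            continue  # values supplied by the source are kept as they are
        terms: List[str] = []
        for pmid in record.get("pmids", []):
            for term in annotations_by_pmid.get(pmid, {}).get(field, []):
                if term not in terms:  # de-duplicate while preserving order
                    terms.append(term)
        if terms:
            augmented[field] = terms
    still_missing = [f for f in AUGMENTABLE if not augmented.get(f)]
    return augmented, still_missing
```

For example, a record with `pmids` of `["111"]` and no ‘healthCondition’ value would pick up any disease terms listed under `"111"` in `annotations_by_pmid`. Any field named in `still_missing` is what a later text-based pass would need to handle.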
The second approach uses EXTRACT to identify potential disease and taxonomy terms in the raw text of the ‘name’ and ‘description’ fields. Non-specific terms are then filtered out, and the remaining terms are standardized so that they are consistent. For missing ‘topicCategory’ metadata, the Discovery Portal uses ChatGPT 3.5 to add relevant biomedical research domain categories and then applies the Text2Term-based standardization pipeline.
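The filtering step can be illustrated with a short sketch. The source does not say how non-specific terms are identified, so this example assumes a simple stoplist. The `NON_SPECIFIC` set and the `filter_terms` function are hypothetical, and the input list stands in for terms returned by an extraction service.

```python
from typing import Iterable, List

# Assumed stoplist of terms too generic to be useful as metadata values.
NON_SPECIFIC = {"disease", "infection", "syndrome", "organism", "bacteria", "virus"}


def filter_terms(candidates: Iterable[str]) -> List[str]:
    """Drop generic and duplicate terms, returning lowercase terms in first-seen order."""
    seen = set()
    kept: List[str] = []
    for term in candidates:
        # Normalize case and whitespace so variants of one term compare equal.
        normalized = " ".join(term.lower().split())
        if not normalized or normalized in NON_SPECIFIC or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept
```

The terms that survive this filter would then be passed to the standardization step so that each one maps to a single ontology identifier.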
How to determine if a metadata field in the Discovery Portal has been augmented?
For every dataset, metadata fields that have been augmented are indicated in the Metadata Compatibility Score badge with the