Introduction

Ontology learning [2] is the process of building ontologies automatically from text. Several ontology learning systems [3,1], such as Text2Onto, DODDLE-OWL and DL-Learner, have been developed. Most of these systems support learning of classes, subclasses and taxonomic relationships, but they do not support mining of more expressive axioms from text, such as union, intersection, quantifier and cardinality relations among the concepts. A richer and more expressive ontology can be very useful to downstream applications such as recommendation systems and question answering systems.

We propose a mechanism to extract union and intersection axioms from text with the help of an ontology and its relevant text corpus. Examples of intersection and union axioms are given below as Axioms 1 and 2. Axiom 1 models the information that a Mixed Glioma (a type of tumor) is a combination of Astrocytoma and Oligodendroglioma. Axiom 2 captures the information that a Tumor can be either Benign, PreMalignant or Malignant.


Mixed Glioma ⊑ Astrocytoma ⊓ Oligodendroglioma (1)

Tumor ⊑ Benign ⊔ PreMalignant ⊔ Malignant (2)

Consider Axiom (1). It can be critical when researching the missing properties of a disease identified as a type of Mixed Glioma, based on the other disease classes in the axiom. Enriching ontologies with such axioms therefore increases the information they can provide.





Proposed Framework

The architecture of the proposed framework is shown below:

[Figure: Architecture for extracting union and intersection axioms from the text.]

As depicted in the architecture, "Text" refers to all the articles extracted from PubMed Central based on their relevance to the concepts of the Disease Ontology. From these articles, we extracted all the medically relevant entities, "[E1, E2, E3, ...]", using the spaCy models for biomedical text processing and the MetaMap application for recognizing those entities in the UMLS Metathesaurus.
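
As a minimal sketch of this entity-extraction step, assuming the scispaCy biomedical model `en_core_sci_sm` (the exact model, and deferring the MetaMap resolution, are assumptions rather than details from the paper):

```python
import spacy  # requires the scispaCy model package to be installed

# Load a biomedical spaCy model; scispaCy's en_core_sci_sm is one option.
nlp = spacy.load("en_core_sci_sm")

def extract_entities(article_text):
    """Return the medically relevant entity mentions found in one article."""
    doc = nlp(article_text)
    # In the full pipeline, each mention would additionally be resolved
    # against the UMLS Metathesaurus via MetaMap.
    return [ent.text for ent in doc.ents]

entities = extract_entities(
    "Mixed glioma is a tumor composed of astrocytoma and oligodendroglioma cells."
)
print(entities)  # e.g. ['Mixed glioma', 'tumor', 'astrocytoma', ...]
```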

Now, every entity is compared against every concept of the Disease Ontology in two steps.

Semantic Similarity

First, each entity-concept pair (e, c) is compared based on the cosine similarity of their real-valued word embeddings, generated from two different models:

  1. BioWordVec [4]: captures the contextual information around an entity from unlabelled biomedical text using the MeSH vocabulary.
  2. Custom Word2Vec: contains word embeddings trained on the same articles that were extracted earlier from PubMed Central.

When the pair passes the threshold under both models, it is passed on to the second step, a graph search over the UMLS Metathesaurus.
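
Below is a minimal sketch of this filtering step, assuming gensim `KeyedVectors` files for both models; the file names and the threshold value are illustrative placeholders, not values from the paper:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical paths: pretrained BioWordVec vectors and a custom Word2Vec
# model trained on the PubMed Central articles.
biowordvec = KeyedVectors.load_word2vec_format("biowordvec.bin", binary=True)
custom_w2v = KeyedVectors.load_word2vec_format("pmc_word2vec.bin", binary=True)

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def passes_similarity(entity, concept, threshold=0.7):
    """True if (entity, concept) clears the threshold under BOTH models.
    Multi-word terms would need extra handling (e.g. averaging word vectors)."""
    for model in (biowordvec, custom_w2v):
        if entity not in model or concept not in model:
            return False
        if cosine(model[entity], model[concept]) < threshold:
            return False
    return True
```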

UMLS Graph Search

The UMLS Metathesaurus is a semantic network of biomedical entities drawn from more than 200 vocabularies. It provides definitions as well as taxonomic and non-taxonomic relations for every entity in the network.

Within the Metathesaurus, the pair (e, c) is compared under two scenarios:

Scenarios

[Figure: (a) Scenario-1, where entity e is a descendant of concept c; (b) Scenario-2, where entity e and concept c have a common ancestor L.]

  • In Scenario-1, the entity e is a descendant of the concept c itself.
  • In Scenario-2, the pair has a least common ancestor L. Using UMLS MetaMap, we compare the number of semantic groups shared by the pairs (e, L) and (c, L); a semantic group is a broader category that a term (an entity or concept in our case) can belong to, according to MetaMap. We then compare the UMLS-generated context vectors of e, c and L, taken two at a time, using the cosine measure. Based on these scores, we determine whether e is an instance of c (see the sketch after this list).

Note that if a pair (e, c) is added to the ontology, the entity e is also added as an instance of all the parent classes of c.
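
The decision procedure for both scenarios can be sketched as follows. Everything on the `umls` object (`ancestors`, `depth`, `semantic_groups`, `context_vector`) is a hypothetical wrapper around the Metathesaurus, the thresholds are illustrative, and `cosine` is the helper from the previous sketch:

```python
def is_instance_of(e, c, umls, min_shared_groups=1, ctx_sim_min=0.5):
    """Decide whether entity e should become an instance of concept c."""
    ancestors_e = umls.ancestors(e)

    # Scenario 1: e is a descendant of c in the Metathesaurus hierarchy.
    if c in ancestors_e:
        return True

    # Scenario 2: e and c have a least common ancestor L.
    common = ancestors_e & umls.ancestors(c)
    if not common:
        return False
    L = max(common, key=umls.depth)  # deepest shared ancestor

    # Compare the semantic groups shared by (e, L) and by (c, L).
    shared_e = umls.semantic_groups(e) & umls.semantic_groups(L)
    shared_c = umls.semantic_groups(c) & umls.semantic_groups(L)
    if len(shared_e) < min_shared_groups or len(shared_c) < min_shared_groups:
        return False

    # Pairwise cosine similarity of the UMLS context vectors of e, c and L.
    vecs = [umls.context_vector(x) for x in (e, c, L)]
    sims = [cosine(vecs[i], vecs[j]) for i in range(3) for j in range(i + 1, 3)]
    return min(sims) >= ctx_sim_min
```

When `is_instance_of` accepts a pair, e would then be propagated upward as an instance of every parent class of c, as noted above.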

Axiom Formation

We now have a list of concepts with their corresponding sets of instances (entities). Using these instances, we compute unions and intersections of the concepts; the number of candidate concept sets is given by the formula below:


totalsets = C(n,1) + C(n,2) + C(n,3) + C(n,4) + C(n,5) + C(n,6)


where C(n, i) denotes the number of sets of i concepts chosen from the n candidate concepts. Each of these sets is compared with every candidate concept to check for union and intersection axioms. For example, to obtain Axiom 1, the intersection of the instances of Astrocytoma and Oligodendroglioma is compared with the instances of Mixed Glioma; if the latter is a subset of the former, we can add this axiom to the ontology. We considered such combinations up to a size of 6; the size was determined experimentally by comparing the F1 scores obtained with sets of different sizes.
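
A minimal sketch of this axiom search, assuming `instances` maps each concept name to its set of instances (this representation is an assumption; the subset test mirrors the example above):

```python
from itertools import combinations

def find_axioms(instances, max_size=6):
    """Yield candidate axioms as (kind, target, concept_combo) triples,
    where kind is 'intersection' (target ⊑ c1 ⊓ ... ⊓ ci) or
    'union' (target ⊑ c1 ⊔ ... ⊔ ci)."""
    concepts = list(instances)
    # Size-1 combinations (the C(n,1) term) reduce to plain subclass
    # checks and are skipped in this sketch.
    for i in range(2, max_size + 1):
        for combo in combinations(concepts, i):
            inter = set.intersection(*(instances[c] for c in combo))
            union = set.union(*(instances[c] for c in combo))
            for target in concepts:
                if target in combo or not instances[target]:
                    continue
                if instances[target] <= inter:
                    yield ("intersection", target, combo)
                if instances[target] <= union:
                    yield ("union", target, combo)
```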






Results and Discussion

Dataset Discussion

We could not find any biomedical ontology that has union and intersection axioms along with concepts that have instances, which we need to evaluate our model. Moreover, there is no dataset pairing such ontologies with corresponding text corpora in the medical domain. We chose the Disease Ontology because it is rich in such axioms, but it lacks concept-instance pairs. We therefore made an approximation and preprocessed the ontology: each leaf concept is treated as an instance of its directly connected parent class, and subsequently as an instance of all the ancestor classes up the hierarchy. This process is executed for all the leaf nodes in the ontology.
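
A minimal sketch of this preprocessing, assuming the hierarchy is given as a dict mapping each concept to the set of its direct parents (the representation is an assumption):

```python
def leaves_to_instances(parents):
    """Treat every leaf concept as an instance of each of its ancestors.
    parents: dict mapping each concept to the set of its direct parents.
    Returns a dict mapping each concept to its derived instance set."""
    # A leaf is a concept that is not the parent of any other concept.
    all_parents = {p for ps in parents.values() for p in ps}
    leaves = [n for n in parents if n not in all_parents]

    instances = {}
    for leaf in leaves:
        seen = set()
        stack = list(parents[leaf])
        while stack:  # walk up to every ancestor of the leaf
            anc = stack.pop()
            if anc in seen:
                continue
            seen.add(anc)
            instances.setdefault(anc, set()).add(leaf)
            stack.extend(parents.get(anc, ()))
    return instances
```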

Evaluations

We extracted 739 articles from PubMed Central based on the concepts in the Disease Ontology. The ontology contains a total of 10,085 intersection axioms and 323 union axioms. Based on the extracted articles and the Disease Ontology, the F1 score is 0.142 for union axioms and 0.1908 for intersection axioms.

Future Scope

  • A proper dataset consisting of a complete, axiom-rich ontology with a corresponding text corpus is not available. Constructing such a dataset and defining baselines are therefore part of our future work.
  • To improve the architecture, we are working on incorporating English-language pattern heuristics, applying syntactic rules to extract axioms from unstructured text more robustly.
  • We are also working on fine-tuning deep learning models such as BERT [5] to improve the contextual word embeddings for the target word.




References

  1. Asim, M.N., Wasim, M., Khan, M.U.G., Mahmood, W., Abbasi, H.M.: A Survey of Ontology Learning Techniques and Applications. Database (2018)
  2. Lehmann, J., Völker, J.: An Introduction to Ontology Learning. In: Perspectives on Ontology Learning, pp. ix–xvi. IOS Press (2014)
  3. Wong, W., Liu, W., Bennamoun, M.: Ontology Learning from Text: A Look Back and into the Future. ACM Computing Surveys (2012)
  4. Zhang, Y., Chen, Q., Yang, Z., Lin, H., Lu, Z.: BioWordVec, Improving Biomedical Word Embeddings with Subword Information and MeSH. Scientific Data (Dec 2019)
  5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT 2019 (Jun 2019)
  6. Karadeniz, I., Özgür, A.: Linking Entities through an Ontology using Word Embeddings and Syntactic Re-ranking. BMC Bioinformatics (Dec 2019)