Imposing Connectivity on an SN Partition
Basis
McCray et al.16 presented a partition of the SN into 15 groups, with each group representing a subject area. Six principles that such a partition should satisfy were proposed. One of them, semantic validity, can be assessed by seeing if a group's semantic types together with their IS-A links form a connected subgraph of the SN. We refer to this as the “connectivity property.” Because the SN's IS-A hierarchy consists of two trees, such a connected subgraph in the current SN must form a tree with a unique root.
In the analysis of McCray et al.,16 it was noted that: “In some cases, it was not possible to resolve anomalies in our attempt to create a coherent and semantically valid set of groupings.” In fact, some of the partition's groups do not satisfy the connectivity property. Such groups contain a forest, comprising two or more trees, or perhaps isolated semantic types. (Some groups have both.) For example, the Physiology† group contains a forest of two trees (Fig. 1). There are no hierarchical relationships between a semantic type of one tree and a semantic type of the other tree. Therefore, the Physiology group is not connected.
In our previous work,18 we presented an alternative partition of the SN based on the sets of relationships exhibited by the semantic types. In our technique, we required that the hierarchy of each group of the partition be a tree exhibiting the connectivity property. In this way, we obtained a partition that is strictly semantically uniform. A difference between the partition of McCray et al.16 and that of Chen et al.18 is that the connectivity is only a preferred, not required, property in McCray et al.,16 whereas it is required and enforced in Chen et al.18
In our semantic technique, we use the partition of McCray et al.16 as a basis for augmenting the SN's hierarchy, and, in particular, for identifying new viable IS-A relationships. The basic idea is to bridge the gap between the two partitioning techniques by imposing the connectivity property on the partition of McCray et al.16 To convert the disconnected groups of McCray et al.16 into connected groups, we need to identify and insert additional IS-As. This will yield a first version of our desired multiple subsumption hierarchy and an accompanying partition. Analysis of the definitions of semantic types within each disconnected group will guide the introduction of the new IS-A links. In this context, we will use four kinds of transformations with respect to the groups of McCray et al.16 Another methodology using exact string matching will then be used in a following subsection.
Four transformations to identify new IS-A links
The possible transformations that can be applied to disconnected groups to make them connected are listed in the following. The choice of which transformation to use is based on reviews of the definitions of all semantic types within a group.
- IS-A Addition Transformation: Identify a viable IS-A and add it to transform the group into a connected subtree.
- Split Transformation: Split a group into several groups, each of which is either a rooted tree structure or can be transformed into a rooted tree structure by adding IS-A relationships.
- Root-addition Transformation: Create a new semantic type that will be an ancestor of all roots in the group. Make the new semantic type the group's root by adding additional IS-A relationships from all the roots of the group's connected components (either directly or through more new semantic types, if necessary).
- Root-moving Transformation: Locate a semantic type (from another group) that is a lowest common ancestor of the roots of all the disconnected group's subtrees and/or isolated semantic types. Move that lowest common ancestor into the disconnected group, making it the root of the group and thereby connecting the group. Also, move all the new root's existing descendants into the group.
The new network obtained by applying these transformations is called the Enriched Semantic Network (ESN). It has a DAG structure rather than a two-tree structure. We now demonstrate the various transformations and show their impact on different disconnected groups.
As demonstrations of the IS-A addition transformation and split transformation, we consider the group Disorders (Fig. 2). This group contains 12 semantic types, 11 of which belong to three trees rooted at Pathologic Function, Anatomical Abnormality, and Finding, respectively. Injury or Poisoning is an isolated member of the group. Clearly, Disorders does not satisfy connectivity.
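The connectivity test described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses only an excerpt of the Disorders group, and the placement of Disease or Syndrome directly under Pathologic Function is an assumption made for the example. The function and variable names are hypothetical.

```python
from collections import defaultdict

# Excerpt of the Disorders group. Only these five semantic types are named in
# the text; the remaining members of the group are omitted from this sketch.
GROUP = {
    "Pathologic Function",
    "Disease or Syndrome",
    "Anatomical Abnormality",
    "Finding",
    "Injury or Poisoning",
}

# child -> parent IS-A links. A type with no entry here is treated as having
# its parent outside the group. The first link is an assumption.
IS_A = {
    "Disease or Syndrome": "Pathologic Function",
    "Injury or Poisoning": "Phenomenon or Process",  # parent outside the group
}

def group_root(sem_type, group, is_a):
    """Follow IS-A links upward for as long as the parent stays in the group."""
    current = sem_type
    while is_a.get(current) in group:
        current = is_a[current]
    return current

def components(group, is_a):
    """Map each in-group root to the set of members of its tree."""
    trees = defaultdict(set)
    for sem_type in group:
        trees[group_root(sem_type, group, is_a)].add(sem_type)
    return dict(trees)

def is_connected(group, is_a):
    """The connectivity property holds when the group has exactly one root."""
    return len(components(group, is_a)) == 1

trees = components(GROUP, IS_A)
print(sorted(trees))              # the roots of the forest
print(is_connected(GROUP, IS_A))  # the excerpt has several roots
```

Grouping members by their in-group root also yields the candidate pieces for a split transformation, since each root identifies one tree of the forest.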
The IS-A addition transformation is first applied to this group to connect Injury or Poisoning to the tree rooted at Pathologic Function. Actually, Injury or Poisoning should have a subsumption relationship to Disease or Syndrome and inherit its semantic relationships. Thus, we add an IS-A link to capture this.
In the original SN, Disease or Syndrome is a descendant of Phenomenon or Process. The original IS-A from Injury or Poisoning to Phenomenon or Process can therefore be removed, since it can be inferred transitively through the new IS-A link from Injury or Poisoning to Disease or Syndrome.
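The redundancy argument can be checked mechanically: an IS-A link is redundant when its target is still reachable after the link itself is ignored. The sketch below illustrates this under stated assumptions. The path from Disease or Syndrome up to Phenomenon or Process is collapsed to two hypothetical steps, because the text states only that the former is a descendant of the latter. All names in the code are illustrative.

```python
# child -> set of parents. A set is used because the enriched network is a DAG
# in which a semantic type may have more than one parent.
# The intermediate path above Disease or Syndrome is an assumption.
PARENTS = {
    "Injury or Poisoning": {"Phenomenon or Process", "Disease or Syndrome"},
    "Disease or Syndrome": {"Pathologic Function"},
    "Pathologic Function": {"Phenomenon or Process"},
}

def reachable(start, target, parents, skip_edge=None):
    """Search upward from start, optionally ignoring one (child, parent) edge."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if (node, parent) == skip_edge:
                continue
            if parent == target:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False

def is_redundant(child, parent, parents):
    """True if parent can still be reached from child without the direct link."""
    return reachable(child, parent, parents, skip_edge=(child, parent))

print(is_redundant("Injury or Poisoning", "Phenomenon or Process", PARENTS))
print(is_redundant("Injury or Poisoning", "Disease or Syndrome", PARENTS))
```

In this toy graph the first link is reported as redundant and the second is not, which mirrors the decision to drop the original link and keep the new one.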
At this point, the group is still a collection of disconnected trees. To rectify this, we apply the split transformation to form three new groups. According to the definitions of the 12 semantic types, we find that Pathologic Function and its six descendant semantic types, including the new descendant Injury or Poisoning, emphasize phenomenon or process and are in the Event tree, whereas the remaining semantic types emphasize an entity or object and are in the Entity tree. Furthermore, Anatomical Abnormality and its children are descendants of Physical Object, whereas Finding and its child are conceptual entities. So, it is natural to partition this group into three smaller connected groups, each comprising a tree. These groups, Pathologic Function, Anatomical Abnormality, and Finding, are shown in Figure 3. Note that using a root-addition transformation for all or any two trees is not an option because this new root could not be placed anywhere in the SN due to the differences in the contents of the trees. The new groups are named after their roots.
Figure 3. Three new groups: (a) Pathologic Function, (b) Anatomical Abnormality, and (c) Finding (via IS-A Addition transformation and Split transformation).
In the next example, the Anatomy group undergoes a root-addition transformation; that is, we add new semantic types to make the group connected. The group contains a tree of seven semantic types rooted at Anatomical Structure and four isolated semantic types: Body Substance, Body System, Body Location or Region, and Body Space or Junction (Fig. 4). In carrying out this transformation, we follow the analysis of Michael et al.19 for definitions of anatomical concepts. For example, the new semantic type Material Physical Anatomical Entity is defined as “IS-A Physical Anatomical Entity which has a mass.”19 Body Substance is not an Anatomical Structure because it does not have a three-dimensional shape, but it is a Material Physical Anatomical Entity because it has mass. Thus, both Body Substance and Anatomical Structure are made children of the new semantic type Material Physical Anatomical Entity.
Furthermore, Body Space or Junction is not a Material Physical Anatomical Entity, but it is a Physical Anatomical Entity (defined in Michael et al.19 to have spatial dimensions). Hence, both Body Space or Junction and Material Physical Anatomical Entity are made children of the newly introduced Physical Anatomical Entity, which in turn IS-A Physical Object. The original IS-A from Anatomical Structure to Physical Object is cut because it can be inferred transitively from the new IS-A path that leads from Anatomical Structure through Material Physical Anatomical Entity and Physical Anatomical Entity to Physical Object.
On the other hand, Body Location or Region and Body System do not have either mass or spatial dimension and thus cannot be descendants of Physical Anatomical Entity. Nevertheless, both obviously should belong to the Anatomy group. Following Michael et al.,19 we introduce the new semantic type Conceptual Anatomical Entity, which in turn IS-A Conceptual Entity, to complement Physical Anatomical Entity and serve as the parent of Body Location or Region and Body System.
Finally, the new semantic type Anatomical Entity is added as the parent of both Physical Anatomical Entity and Conceptual Anatomical Entity. In turn, Anatomical Entity IS-A Entity. In this way, the whole Anatomy group is transformed into the new group Anatomical Entity (Fig. 5). The dashed rectangles in the figure represent the newly added semantic types, and the dashed arrows represent the newly added IS-A links. We note that each of the four new semantic types should have at least the corresponding concepts suggested in Michael et al.19 assigned to it. (These concepts have been submitted to the NLM for inclusion in the next UMLS release [Rosse C, personal communication, 2002].)
In the next example, the root-moving transformation is applied to the disconnected Procedures group to make it connected. The group contains seven semantic types, with two trees rooted at Health Care Activity and Research Activity, respectively, and the isolated Educational Activity (Fig. 6). These three are children of Occupational Activity, which has another child, Governmental or Regulatory Activity. Occupational Activity and Governmental or Regulatory Activity, in turn, belong to the Activities and Behaviors group. In the context of the UMLS, these five semantic types refer to health care-related issues; they describe activities of health care professionals. Thus, Occupational Activity, the lowest common ancestor of the seven semantic types in the group, and its child Governmental or Regulatory Activity are moved to this group. By doing this, the group is transformed into the new connected group Occupational Activity (Fig. 7).
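The root-moving step amounts to finding the lowest common ancestor of the group's roots and then collecting the types that must move with it. The following sketch shows one way to do this on the Procedures example. It is an illustration only: the group is reduced to its three roots, the parent of Occupational Activity ("Activity") is an assumption added to give the tree a top, and the helper names are hypothetical.

```python
# child -> parent links from the Procedures example.
PARENT = {
    "Health Care Activity": "Occupational Activity",
    "Research Activity": "Occupational Activity",
    "Educational Activity": "Occupational Activity",
    "Governmental or Regulatory Activity": "Occupational Activity",
    "Occupational Activity": "Activity",  # assumed parent
}

def proper_ancestors(node, parent):
    """Return the ancestors of node ordered from nearest to farthest."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def lowest_common_ancestor(nodes, parent):
    """Nearest proper ancestor shared by all nodes, or None if there is none."""
    chains = [proper_ancestors(n, parent) for n in nodes]
    common = set(chains[0]).intersection(*chains[1:])
    for candidate in chains[0]:  # ordered bottom-up, so the first hit is lowest
        if candidate in common:
            return candidate
    return None

def descendants(root, parent):
    """All types that have root among their proper ancestors."""
    return {c for c in parent if root in proper_ancestors(c, parent)}

roots = ["Health Care Activity", "Research Activity", "Educational Activity"]
new_root = lowest_common_ancestor(roots, PARENT)
moved = ({new_root} | descendants(new_root, PARENT)) - set(roots)
print(new_root)
print(sorted(moved))
```

With the three roots as the only group members in this reduced example, the types that move are the new root and its one remaining child.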
String Matching
Additional IS-A links can be found by using string matching involving names and definitions of various semantic types in the SN. To be more formal, we define a string match as follows:
Definition (string match) A string match from a semantic type T1 to another semantic type T2 is a triple (T1; T2; S) such that S is a string appearing both in the definition of T1 and in the name of T2. S is called the common string and must contain one or more (not necessarily consecutive) complete words (ignoring case).
For example, the definition of Plant contains the word “organism,” which happens to be the name of a semantic type. Hence, a string match (Plant, Organism, “organism”) exists.
The motivation for using this kind of string matching to find viable new IS-A links is based on the evaluation of string matches among the 132 pairs of semantic types that currently have IS-A relationships between them in the SN. By analyzing the definitions of the children in the pairs, we found that there are string matches from 88 children to their respective parents. The string match (Plant, Organism, “organism”) is one of them. Thus, the sensitivity of this approach with known IS-A links is 67%. This finding leads us to the following observation.
Observation If T1 IS-A T2, then there is a high likelihood of a string match from T1 to T2.
This leads us to formulate the converse hypothesis.
Hypothesis If there is a string match from one semantic type to another, then it is likely to imply a viable subsumption relationship between them.
Based on this hypothesis, we developed the string matching method to identify additional viable IS-A relationships not already appearing in the SN. Our methodology is a human–computer interactive methodology and contains three steps:
- Preprocess names and definitions of semantic types to obtain the input file;
- Apply the “AllMatches” algorithm to the input file to get all string matches; and
- Manually review all resulting string matches and determine which constitute additional viable IS-A links between semantic types.
In step 1, we use some common techniques from the data mining and information retrieval fields.20
- Stop-words: All stop-words such as “a,” “the,” “of,” “for,” “with,” and so on are removed from names and definitions.
- Verb variant processing: All verbs and verb variants are removed from definitions of semantic types. In the string matching, we do not consider verbs and verb variants. The reason is that most semantic types' names consist only of nouns, adjectives, and adverbs.
- Lexical normalization: The Specialist Lexicon (coupled with highly efficient “lexical variant generator” code)21 is applied to normalize word variations. All adjectives and adverbs are converted to their noun forms, and all plurals are converted to singular forms. Also, uppercase letters are changed to the corresponding lowercase letters.
In step 2, the following AllMatches algorithm is used to find string matches between any two semantic types not currently connected by a single IS-A link or a path of such links. The input file to the algorithm contains the names and definitions of semantic types after the preprocessing step.
In the description of the AllMatches algorithm, we assume that T1, T2, …, Tm are all semantic types in the SN. (In the 2002 version, m = 134.) We use the notation DEF(Ti) to represent the definition of the semantic type Ti, 1 ≤ i ≤ m, after preprocessing. NAME(Ti) is used to represent the name of Ti, in the form of a string, after preprocessing. For example, suppose Ti = Anatomical Structure, which is defined as: “a normal or pathological part of the anatomy or structural organization of an organism.” After preprocessing, NAME(Ti) = “anatomy structure” and DEF(Ti) = “normal pathology part anatomy structure organization organism.”
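The effect of preprocessing on the Anatomical Structure example can be reproduced with a small sketch. This is not the Specialist Lexicon tooling: the stop-word set and the adjective-to-noun table below are hypothetical stand-ins that cover only this one example, and verb removal and plural handling are left out.

```python
import re

# Hypothetical stand-ins for the real stop-word list and lexical resources.
STOP_WORDS = {"a", "an", "the", "of", "or", "for", "with"}
TO_NOUN = {
    "anatomical": "anatomy",
    "pathological": "pathology",
    "structural": "structure",
}

def preprocess(text):
    """Lowercase, drop stop-words, and map known adjectives to noun forms."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    return " ".join(TO_NOUN.get(w, w) for w in kept)

name = preprocess("Anatomical Structure")
definition = preprocess(
    "a normal or pathological part of the anatomy or structural "
    "organization of an organism"
)
print(name)
print(definition)
```

A table lookup is used here only because it keeps the example self-contained; a real pipeline would rely on a lexicon to derive the noun forms.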
In the following AllMatches algorithm, we use a list L to hold all common strings. We also use the following functions defined for lists:
- Length(): Return the number of elements in the list
- Retrieve(k): Retrieve the kth element of the list
AllMatches algorithm: Find all string matches in the SN.
For (i = 1 to m)
    For (all Tj, 1 ≤ j ≤ m and j ≠ i)
        If (Tj is not the parent or an ancestor of Ti)
        {   L = FindCommonStrings(DEF(Ti), NAME(Tj));
            // write string matches to the output file
            For (k = 1 to L.Length())
            {   S = L.Retrieve(k); // get kth element of the list
                write (Ti; Tj; S) to output file;
            }
        }
The function FindCommonStrings(R1, R2) is used to find all common strings involving a given pair of strings R1 and R2. During a call, R1 is the definition of a semantic type Ti in a string format, and R2 is the name of a semantic type Tj as a string. For each pair (Ti, Tj) that has no direct IS-A relationship or directed path of IS-A relationships between its components, we call FindCommonStrings(DEF(Ti), NAME(Tj)) to get all possible common strings between DEF(Ti) and NAME(Tj). Each such common string is inserted into L. We say that a match M is redundant if its constituent common string S is a substring of another match's common string (again, ignoring case). FindCommonStrings(DEF(Ti), NAME(Tj)) identifies the redundant matches and does not return them. So, L contains no redundant common strings. Finally, all string matches (Ti; Tj; S) are written to the output file. After AllMatches has been executed, we have a file containing all string matches between pairs of semantic types not connected by IS-A relationships in the SN.
As an example, consider Enzyme whose definition is “a complex chemical, usually a protein, that is produced by living cells and which catalyzes specific biochemical reactions.” The AllMatches algorithm finds three string matches:
- (Enzyme, Cell, “cell”)
- (Enzyme, Cell Component, “cell”)
- (Enzyme; Amino Acid, Peptide, or Protein; “protein”)
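The Enzyme example can be traced with a runnable sketch of the two routines. This is one possible reading of the description, not the authors' code. The preprocessed definition of Enzyme is a hand-made approximation, the other definitions are left empty, and treating Chemical as an ancestor of Enzyme is an assumption included to show the ancestor filter. The sketch takes the common string to be the name's words that also occur in the definition, kept in name order, so that shorter selections are redundant by construction.

```python
# Hand-preprocessed names (assumption: preprocessing already applied).
NAME = {
    "Enzyme": "enzyme",
    "Cell": "cell",
    "Cell Component": "cell component",
    "Amino Acid, Peptide, or Protein": "amino acid peptide protein",
    "Chemical": "chemical",
}

# Only Enzyme's definition is filled in; its wording is an approximation.
DEF = {sem_type: "" for sem_type in NAME}
DEF["Enzyme"] = "complex chemical protein cell specific biochemistry reaction"

# Assumed ancestor sets, used to skip pairs already linked by IS-A paths.
ANCESTORS = {"Enzyme": {"Chemical"}}

def find_common_strings(definition, name):
    """Return the maximal common string as a one-element list, or []."""
    def_words = set(definition.split())
    shared = [w for w in name.split() if w in def_words]
    return [" ".join(shared)] if shared else []

def all_matches(names, defs, ancestors):
    """Collect (Ti, Tj, S) for every pair not already related by IS-A."""
    matches = []
    for ti in names:
        for tj in names:
            if tj == ti or tj in ancestors.get(ti, set()):
                continue
            for common in find_common_strings(defs[ti], names[tj]):
                matches.append((ti, tj, common))
    return matches

for match in all_matches(NAME, DEF, ANCESTORS):
    print(match)
```

On this toy input the pair (Enzyme, Chemical) is skipped by the ancestor filter even though the word "chemical" occurs in the definition, and the three remaining matches correspond to the ones listed above.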
In step 3, an expert is called on to review all resulting string matches to find new IS-A links not currently appearing in the SN. These newly discovered IS-A links can then be added to the ESN. As it happens, in the case of the three string matches involving Enzyme, the third match implies the existence of a new IS-A link, because any enzyme must be a kind of protein. Hence, Enzyme IS-A Amino Acid, Peptide, or Protein.
As noted previously in this article, the sensitivity of the string matching approach, when applied to known IS-A links, is 67%. To determine the sensitivity of our method for detecting unknown IS-A links, we established a gold standard by performing a manual review of randomly generated relationship pairs.