
Parallel Knowledge Discovery from Large Complex Databases

1. Research Orientation

This report on grant NAS5-32337 begins by restating the objectives, approach, and arguments from the original proposal "Parallel knowledge discovery from large complex databases". The following section will describe recent progress made during this report period and planned activities for the next period. The last section will describe the educational benefits that have been derived from this effort.

1.1 Parallel Knowledge Discovery

NASA is focusing on grand challenge problems in Earth and space sciences. Within these areas of science, new instrumentation will be providing scientists with unprecedented amounts of unprocessed data. Our goal is to design and implement a system that takes raw data as input and efficiently discovers interesting concepts that can target areas for further investigation and can be used to compress the data. Our approach will provide an intelligent parallel data analysis system.

This effort will build upon two existing data discovery systems: the Subdue system used at the University of Texas at Arlington, and NASA's AutoClass system. AutoClass has been used to discover concepts in several large databases containing real or discrete valued data. Subdue, on the other hand, has been used to find interesting and repetitive structure in the data. Although both systems have been successful in a variety of domains, they are hampered by the computational complexity of the discovery task. The size and complexity of the databases expected from the Earth and space sciences (ESS) program will demand processing capabilities found in parallel machines.

This effort focuses on combining the two approaches to concept discovery and speeding up the discovery process by developing a parallel implementation of the systems. The two discovery systems will be combined by using Subdue to compress the data fed to AutoClass, and letting Subdue evaluate the interesting structures in the classes generated by AutoClass. The parallel implementation of the resulting AutoClass/Subdue system will be run on a massively-parallel machine (nCUBE), and will be tested for speedup on a number of large databases used by the Earth and space science program.

1.2 Goals Stated in the Proposal

A general goal of the proposed research was to develop a combined discovery system (Subdue + AutoClass) to be applied to a variety of earth and space science databases. Our objective was to improve overall system speed, discovery power, and generality, by

  1. Improving the existing Subdue discovery system for application to structural earth and space databases;
  2. Using the Subdue system as a pre- and post-processor for NASA's AutoClass discovery system; and
  3. Speeding up the combined discovery process through improved algorithms and parallel implementations on MIMD machines such as the nCUBE and Connection Machine 5.

2. Recent Work and Planned Work

This year's work has focused on completing the parallelization of AutoClass and Subdue, starting on the distributed version of AutoClass / Subdue, and integrating Subdue and AutoClass.

2.1 Parallelization of AutoClass and Subdue

The first step in parallelizing the Subdue / AutoClass discovery system was to ensure that both systems use the same programming language, and that the chosen language be supported by a majority of parallel systems. Joe Potts, one of our students who worked with the NASA Ames group last summer, completed a port of AutoClass from Lisp to C. That port is now available on our anonymous ftp site. The NASA Ames group will advertise this distribution and is quite pleased with the results and with the availability of the code through our site, since they cannot distribute code from NASA directly. We have already received several requests for the C version of AutoClass from Lockheed, NASA Ames, and the Jet Propulsion Laboratory.

The SIMD parallelization of AutoClass on the Connection Machine 5 is complete. Data points are distributed among contributing processors, and class evaluation consists of local computation and machine-wide reductions. Joe Potts will soon defend his Master's thesis on the design and evaluation of AutoClass's parallelization. The MIMD version of AutoClass gives each processor the same set of data points and distributes the search for the optimal number of classes. Both versions yield good speedup results in empirical tests.
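The local-computation-plus-reduction scheme described above can be sketched as follows. This is a minimal illustration with invented names: the real AutoClass evaluates Bayesian class posteriors rather than plain per-class means, and the reduction would run as a machine-wide operation rather than a Python loop.

```python
def local_partial_sums(chunk, num_classes, assign):
    """Per-processor step: sum the points in this chunk assigned to each class."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for x in chunk:
        c = assign(x)
        sums[c] += x
        counts[c] += 1
    return sums, counts

def reduce_class_means(partials, num_classes):
    """Machine-wide reduction: combine per-processor sums into global means."""
    total_sum = [0.0] * num_classes
    total_cnt = [0] * num_classes
    for sums, counts in partials:
        for c in range(num_classes):
            total_sum[c] += sums[c]
            total_cnt[c] += counts[c]
    return [total_sum[c] / total_cnt[c] if total_cnt[c] else 0.0
            for c in range(num_classes)]

# Example: two "processors", two classes split at 10 (a toy assignment rule).
assign = lambda x: 0 if x < 10 else 1
chunks = [[1.0, 2.0, 12.0], [3.0, 14.0, 16.0]]
partials = [local_partial_sums(c, 2, assign) for c in chunks]
means = reduce_class_means(partials, 2)
```

The key property is that each processor touches only its own chunk; only the small per-class statistics cross processor boundaries during the reduction.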

Discovery systems have been shown to be effective at discovering concepts in a number of databases, but do not scale effectively to large databases. To allow Subdue to scale beyond the capabilities of the serial implementation, we are designing a MIMD parallel version of Subdue that distributes the search for candidate substructures among individual processors. Each processor receives a copy of the same input graph, but generates and evaluates a unique set of substructure candidates.

When Parallel-Subdue (P-Subdue) is run on a graph with M unique vertex labels using P processors, processor i begins processing a candidate substructure corresponding to the ith unique vertex label in the graph (for i < M). If P < M, the remaining vertex labels are assigned to the last processor. On the other hand, if P > M, some processors will initially be idle. Each processor receives a copy of the entire input graph and begins processing its assigned candidate substructures.
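The initial label-to-processor assignment above can be sketched in a few lines (hypothetical helper name; the actual P-Subdue bookkeeping is internal to the system):

```python
def initial_assignment(num_labels, num_procs):
    """Map each unique vertex label index to a processor id: label i goes to
    processor i, and when there are more labels than processors, the
    remainder is assigned to the last processor, as described above."""
    return {label: min(label, num_procs - 1) for label in range(num_labels)}
```

When the number of processors exceeds the number of labels, processors with ids at or beyond the label count simply receive no entry and start idle.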

Each processor continues expanding and processing a portion of the possible substructures until the combined number of processed substructures exceeds a given limit or until all processors are idle. Note that there exists a danger of replicating work across multiple processors. To prevent this duplication of effort, P-Subdue constrains processors expanding a substructure to include only vertices with a label index greater than the processor ID. Under this constraint, a substructure whose smallest label index is i can be generated only by processor i. Although this mechanism reduces wasted work, it can lead to a load imbalance among the processors.

Quality control is also imposed on the processors in the P-Subdue system. One processor is designated as the master, and this master regularly receives status information from each processor. This information includes whether the processor is active or idle, the number of processed substructures, and the average value of its candidate substructures. Using this information, at regular intervals the master computes the overall average substructure value and sends it to all processors. Each processor then prunes from its list all substructure candidates whose value is less than the global average. In this way, no processor works on a poor substructure candidate that is unlikely to affect the results of the system. The master processor also collects and sorts the final list of best substructures found by each processor.
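The master's pruning step can be sketched as follows. This is an illustrative simplification with invented names: here the global average is computed directly over all candidates, whereas in P-Subdue the master works from periodic status reports and the pruning happens locally on each processor.

```python
def global_prune(candidate_lists):
    """Master-style quality control: compute the global average candidate
    value, then drop every candidate below it.
    candidate_lists maps processor id -> list of (substructure, value)."""
    values = [v for cands in candidate_lists.values() for _, v in cands]
    global_avg = sum(values) / len(values)
    pruned = {pid: [(s, v) for s, v in cands if v >= global_avg]
              for pid, cands in candidate_lists.items()}
    return pruned, global_avg
```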

One hindrance to good performance from a parallel implementation is excessive idling of processors. Processor idling in P-Subdue stems from three main sources. First, if there are more processors P than unique labels M in the input graph, P - M processors will be idle at the onset of computation. Second, substructure expansion will finish earlier for some substructure candidates than for others, causing some processors to finish computation before others. Third, some candidate substructures will be pruned, and the corresponding processors may become idle as a result.

When a processor runs out of work, it requests work from a neighboring processor. If the neighbor has a sufficient number of substructure candidates left (greater than a specified threshold), the neighbor passes the highest-valued substructure and corresponding instances to the requesting processor. If no work can be shared, the requesting processor continues to ask for work until work can be shared or P-Subdue finishes computation.
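The work-sharing decision above can be sketched as a single function (hypothetical name and a representative threshold; the actual threshold is a P-Subdue parameter):

```python
def request_work(neighbor_queue, threshold=2):
    """If the neighbor holds more candidates than the threshold, it hands
    over its highest-valued substructure; otherwise the request fails and
    the idle processor must try again. neighbor_queue holds
    (substructure, value) pairs and is mutated when work is shared."""
    if len(neighbor_queue) > threshold:
        best = max(neighbor_queue, key=lambda cand: cand[1])
        neighbor_queue.remove(best)
        return best
    return None  # no work shared; requester keeps asking or P-Subdue ends
```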

Early results from the P-Subdue system indicate that good speedup can be achieved by distributing the substructure expansion and evaluation. P-Subdue benefits from a large number of processors when there are more unique labels in the input graph. With fewer unique input labels, additional overhead results from increased load balancing.

2.2 Distributed Subdue/AutoClass

Although parallel implementations of discovery systems do improve overall runtime performance and therefore scalability of the systems, users of the parallel algorithms face two limitations. First, parallel machines are not easily accessible and the corresponding systems are often architecture-dependent. Second, applying discovery systems to large databases implies a substantial memory requirement as well as processing time requirement. A number of parallel systems cannot cope with these extensive memory demands.

To address these issues, we are currently designing distributed versions of Subdue and AutoClass. The corresponding algorithms will distribute the database information as well as computation. Processors will perform local computation on their portion of the database, and incrementally combine results with other processors.

Distributing the database for use by Subdue provides some unique challenges. Since graph partitioning is an NP-complete problem, approximation schemes will be necessary to divide the graph among different processors in a way that promotes discovery. Substructures discovered locally will have to be evaluated for global use. One way to execute a global evaluation is to swap locally promising substructures between processors and keep those that perform best across multiple portions of the database. As one of the test cases for the distributed hybrid system, we are using a 1024x1024 NASA LandSat image of Kansas with 7 bands of information used by the AutoClass group at NASA Ames. The graph representation of this image utilizes over 8 million vertices and over 9 million edges, taking up 300 Mbytes of disk space. As this database is impractical for a single machine, we will allow distributed Subdue / AutoClass to work on the data and report the results.

2.3 Adding Domain-Dependent Background Knowledge to Subdue

The Subdue discovery system was initially developed using only domain independent heuristics to evaluate potential substructures. As a result, some of the discovered substructures may not be useful and relevant to specific domains of interest. For instance, in a programming domain, the BEGIN and END statements may appear repetitively within a program; however, they do not perform any meaningful function on their own; hence they exhibit limited usefulness. Similarly, in the CAD circuit domain, some subcircuits or substructures may appear repetitively within the data; however, they may not perform meaningful functions within the domain of usage.

To make Subdue's discovered substructures more interesting and useful across a wide variety of domains, domain knowledge is added to guide the discovery process. Domain knowledge takes two forms: substructure models, which are abstractions of substructures for which Subdue should search in the input database, and graph match rules, which bias the search for substructure instances by allowing some types of instance deviations and disallowing others.

2.3.1 Model/Structure knowledge

Model/structure knowledge provides the discovery system with specific types of structures that are likely to exist in a database and that are of particular interest to a scientist using the system. The model knowledge is organized in a hierarchy that specifies the connections between individual structures. Vertices of the hierarchical graph are classified as either primitive (nondecomposable) or nonprimitive. The primitive vertices reside in the lowest level, i.e., the leaves, and all nonprimitive vertices reside in the higher levels of the hierarchy. The primitive vertices represent basic elements of the domain, whereas the nonprimitive vertices represent models or structures consisting of a conglomeration of primitive vertices and/or lower-level nonprimitive vertices. The higher a vertex's level, the more complex the structure it represents. The hierarchy for a particular domain is supplied by a domain expert; the structures in the hierarchy and their functionalities are well known in the context of that domain. This knowledge is formed in a bottom-up fashion, and users can extend the hierarchy by adding new models.

For example, in the programming domain, special symbols and reserved words are represented by primitive vertices, and functional subroutines (e.g., swap, sort, increment) are represented by nonprimitive vertices. In the CAD circuit domain, basic components of a circuit (e.g., resistor, transistor) are represented by primitive vertices, and functional subcircuits such as operational amplifiers and filters are represented by nonprimitive vertices. This hierarchical representation allows examination of the structure knowledge at various levels of abstraction, focusing the search and reducing the search space.

2.3.1.1 Using model/structure knowledge to guide the discovery
Although the minimum description length principle still drives the discovery process, domain knowledge is used to bias the search toward certain types of substructures. First, the modified version of Subdue can be biased to look specifically for structures of the types specified in the model hierarchy. The discovery process begins by matching single vertices in the input graph to primitive vertices of the model knowledge hierarchy. If no primitive vertex matches an input vertex, higher-level vertices of the hierarchy are pursued. The models in the hierarchy pointed to by the matched model vertex are selected as candidate models to be matched with the input substructure. In each iteration, Subdue selects a substructure from the input graph that best matches one of the selected models and can be used to compress the input graph. The match can be either a subgraph match or a whole-graph match. If the match is a subgraph match, Subdue expands the instances of the best substructure by one neighboring edge in all possible ways, and the newly generated substructures become candidates for the next iteration. If the match is a whole-graph match, the process has found the desired substructure, and the chosen substructure is used to compress the entire input graph. The process continues to expand substructures until either a desired substructure has been found or all possible substructures have been considered.

Representing an input graph using a discovered substructure from the model hierarchy involves additional overhead: each of the substructure's instances is replaced with a pointer to the model hierarchy. In addition to this overhead, some domains involve extra parameters. Consider an example in the programming domain where a substructure of the model hierarchy (e.g., Sort(a,b), where a and b are dummy variables) is discovered in a program. Subdue replaces each of the discovered substructure's instances with Sort(a_i,b_i), where Sort is a pointer to the model hierarchy, and a_i and b_i are the parameters of the ith instance.
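The Sort(a,b) replacement step above can be sketched as follows (a hypothetical helper; in Subdue the "pointer" is an internal reference into the model hierarchy rather than a string):

```python
def compress_instances(instances, model_name):
    """Rewrite each discovered instance as a pointer to the model hierarchy
    plus that instance's bound parameters, as in the Sort(a, b) example.
    instances: list of parameter bindings, e.g. [("x", "y"), ("p", "q")]."""
    return ["%s(%s)" % (model_name, ", ".join(args)) for args in instances]
```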

2.3.1.2 Combining substructure discovery with and without model knowledge
In order not to overly bias the discovery process toward certain types of substructures based on the model knowledge, the discovery process can combine discovery without using model knowledge with discovery using model knowledge.

In each iteration of the algorithm, the Subdue system discovers at most two best substructures: one discovered using only the MDL principle (without domain knowledge), and the other discovered using the MDL principle and model knowledge (with domain knowledge). Each substructure is used to compress the input graph. Subdue selects the compressed graph yielding the greater amount of compression as the input graph for the next iteration of the discovery process; the compressed graph that was not selected is placed on the unprocessed list. If, after further iterations, Subdue obtains a compressed graph whose amount of compression is smaller than that of some compressed graph on the unprocessed list, this compressed graph is not processed immediately, but is inserted into the unprocessed list according to its compression value. Subdue then resumes the discovery process using the compressed graph from the unprocessed list with the maximum amount of compression. This process is repeated until the unprocessed list is exhausted.
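The unprocessed list described above is essentially a priority queue keyed on compression. A minimal sketch (invented class name; a max-heap is simulated by negating the keys):

```python
import heapq

class UnprocessedList:
    """Holds compressed graphs in compression order so that discovery always
    resumes with the graph offering the greatest compression."""
    def __init__(self):
        self._heap = []
    def insert(self, compression, graph):
        # heapq is a min-heap, so negate the key to pop the maximum first.
        heapq.heappush(self._heap, (-compression, graph))
    def pop_best(self):
        return heapq.heappop(self._heap)[1]
    def __len__(self):
        return len(self._heap)
```

Whenever the current graph's compression falls below the best entry here, the current graph is inserted and `pop_best` supplies the next graph to process, giving the best-first behavior described above.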

2.3.2 Graph match rules

At the heart of the Subdue system lies an inexact graph match algorithm that finds instances of a substructure definition. The graph match is used to identify isomorphic substructures in the input graph. Since many of those substructures could show up in a slightly different form throughout the data, and each of these differences is described in terms of basic transformations performed by the graph match, we can use graph match rules to assign each transformation a cost based on the domain of usage. This type of domain-specific information is represented using if-then rules such as the following:

IF (domain = x) and (perform graph match transformation y) THEN (graph match cost = z)

To illustrate this rule, consider an example in the programming domain. We allow a vertex representing a variable to be substituted by another variable vertex, and do not allow a vertex representing an operator which is a special symbol, a reserved word, or a function call, to be substituted by another vertex. These rules can then be represented as the following:

IF (domain = programming) and (substitute variable vertex) THEN graph match cost = 0.0;

IF (domain = programming) and (substitute operator vertex) THEN graph match cost = 2.0;

The graph match rules allow a specification of the amount of acceptable generality between a substructure definition and its instances, or between a model definition and its instances in the domain graph. Given g1, g2, and a set of distortion costs, the actual computation of matchcost(g1,g2) can be performed using a tree search procedure. As long as matchcost(g1,g2) does not exceed the threshold set by the user, the two graphs g1 and g2 are considered to be isomorphic.
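A rule table of this form, and the threshold test on matchcost, can be sketched as follows. The costs are the illustrative programming-domain values from the rules above; the real Subdue matcher searches over transformation sequences with a tree search rather than summing a fixed list.

```python
# Domain-specific graph match rules: (domain, transformation) -> cost.
MATCH_RULES = {
    ("programming", "substitute_variable_vertex"): 0.0,
    ("programming", "substitute_operator_vertex"): 2.0,
}

def match_cost(domain, transformations, default=1.0):
    """Total cost of the transformations needed to map g1 onto g2.
    Transformations without a rule fall back to a default cost."""
    return sum(MATCH_RULES.get((domain, t), default) for t in transformations)

def is_isomorphic(domain, transformations, threshold):
    """g1 and g2 count as isomorphic when matchcost stays within the
    user-supplied threshold."""
    return match_cost(domain, transformations) <= threshold
```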

We have developed this method for integrating domain-independent and domain-dependent substructure discovery based on the minimum description length principle. The method is generally applicable to many structural databases, such as computer-aided design (CAD) circuit data, computer programs, and chemical compound data. Experiments with several real and artificial domains indicate that this integration improves Subdue's ability both to compress an input graph and to discover substructures relevant to the domain of study. Results also show that the time complexity of the discovery process depends on the amount of domain knowledge used and the size of the substructures found. For each of these domains, we showed the results to human experts, who rated the discovered substructures for their functionality. Although substructures found without domain knowledge do occasionally receive a high rating, on average incorporating domain knowledge into the search results in more functional substructures.

Since many rules about a domain are incomplete, and uncertainty can arise from incompleteness and incorrectness in the domain expert's understanding of the environment, probabilistic models can also be used in the system.

2.4 Subdue / AutoClass Combined Algorithm

The integration of the structural discovery process in Subdue and the attribute-value clustering process in AutoClass will yield a system capable of discovering previously undiscoverable patterns using a combination of the structural and non-structural features of the data. Our initial design for the integration was to use Subdue as a pre-processor of the structural component of the data in order to construct new non-structural attributes for addition to the set of existing non-structural attributes. The data, augmented with this non-structural information about the structural regularities, can then be passed to the AutoClass system. The structural information will bias the classifications preferred by AutoClass toward those consistent with the structural regularities in the data.

As a test of the integration, we fed a database into AutoClass containing a random collection of square outlines whose edges were either red-red-green-green, green-green-blue-blue, or blue-blue-red-red. AutoClass represents this database using one entry for each line. The entry describes the line using the X and Y endpoint values, the line angle, length, and color. AutoClass alone finds 10 classes containing descriptions such as "all horizontal blue lines", "all horizontal red lines", "all horizontal lines with length approximately equal to 47", etc. No notion of colored squares is contained in this output. Even when structural information is integrated by numbering the lines and adding to each entry's description the values of the lines it intersects, AutoClass does not discover the squares.

Feeding the data to Subdue, we include features of the lines given to AutoClass and connect lines that intersect using edges in the graph representation. With this structural data, Subdue finds the square concept but does not recognize the squares' color sequence. Using Subdue's output we replace individual entries in the AutoClass database by a description of the entire Subdue-discovered substructure to which the line belongs. With this preprocessing, AutoClass successfully discovers the red-red-green-green, green-green-blue-blue, and blue-blue-red-red squares.

An alternative integration scheme is to use Subdue as a post-processor for the classes found by AutoClass, to further evaluate the classes based on the structural trends within and between classes. Since the evaluation measure used by AutoClass relates to a measure of how well a classification can compress the data, Subdue can continue this evaluation within the structural component of the data. First, a classification found by AutoClass can be used to partition the data. This partition will also serve to partition the structural information associated with the data. Second, Subdue can be run separately on the data within each partition to find underlying substructures. These substructures can be used to measure the amount of structural compression (similar to the MDL version of Subdue) possible within each partition. The better the classification found by AutoClass, the better the ability of the substructure found by Subdue to compress the structural data. Therefore, the integrated AutoClass / Subdue approach can further evaluate the candidate classifications found by AutoClass based on the substructures found by Subdue within the corresponding structural data.
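The two post-processing steps above (partition by class, then score each partition by structural compression) can be sketched as follows. The interfaces are hypothetical: `classify` stands in for an AutoClass-style labeling, and `subdue_compression` stands in for Subdue's structural compression measure on one partition.

```python
def evaluate_classification(data, classify, subdue_compression):
    """Score an AutoClass-style classification by the structural compression
    Subdue achieves within each induced partition: better classifications
    should yield partitions whose structure compresses better."""
    partitions = {}
    for item in data:
        partitions.setdefault(classify(item), []).append(item)
    # Average the per-partition structural compression into one score.
    scores = [subdue_compression(items) for items in partitions.values()]
    return sum(scores) / len(scores)
```

Candidate classifications from AutoClass could then be ranked by this score, realizing the combined evaluation described above.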

3. Educational Benefits of This Work

Two Ph.D. students and five Master's students have been directly supported by this NASA grant. Surnjani Djoko received her Ph.D. in August of 1995 on the topic of the integration of domain knowledge into substructure discovery. Steven Poe completed his Master's thesis in August of 1995 on the topic of using Subdue for discovering patterns in structural information extracted from image data.

A third student, Joe Potts, used a portion of the NASA money to visit NASA Ames in the summer of 1994. Mr. Potts has completed the conversion of AutoClass from Lisp to C and the parallelization of AutoClass on the CM-5. Mr. Potts will defend his Master's thesis in August 1996 and will continue investigating related issues for his doctoral work.

Gehad Galal is currently working on his Master's degree; his thesis topic is parallel and distributed designs for Subdue. Two other Master's students are developing graphical user interfaces for Subdue and applying the system to NASA databases.

The research sponsored by this NASA grant has contributed to the course Dr. Cook is teaching on Parallel Algorithms for Artificial Intelligence. The results of the parallel discovery algorithms will be incorporated into the new syllabus so that additional students can benefit from ongoing research and contribute to the state-of-the-art in efficient machine discovery methods. One of the students in the class, Billy Harris, has been working on the parallelization of Subdue.

Many of the issues involved in discovering patterns in NASA-related data sets have been included in Dr. Holder's course on Machine Learning. These issues have been incorporated into the projects currently underway by the students in the class. Knowledge of these issues has given the students a better appreciation for the difficulties and potential benefits of machine discovery in domains containing both structural and non-structural information. One of the students mentioned above, Steven Poe, became interested in this project through Dr. Holder's Machine Learning course.

4. Publications

4.1 Publications Directly Supported by this NASA Grant

D. J. Cook, L. B. Holder, and S. Djoko, "Scalable Discovery of Informative Structural Concepts Using Domain Knowledge", to appear in IEEE Expert.

S. Djoko, D. J. Cook, and L. Holder, "An Empirical Study of Domain Knowledge and Its Benefits to Substructure Discovery", submitted to the IEEE Transactions on Knowledge and Data Engineering.

D. J. Cook and L. B. Holder, "Knowledge Discovery from Structural Data", to appear in the Journal of Intelligent Information Systems, 1995.

S. Djoko, D. J. Cook and L. B. Holder. "Analyzing the Benefits of Domain Knowledge in Substructure Discovery", Proceedings of the Conference on Knowledge Discovery in Databases, pages 75-80, 1995.

D. J. Cook and L. B. Holder, "Efficient Knowledge Discovery Applied to Structural Data", Journal of Artificial Intelligence Research, 1, pages 231-255, 1994.

S. Djoko, "Guiding Substructure Discovery with Minimum Description Length and Background Knowledge", Proceedings of the Twelfth National Conference on Artificial Intelligence, page 1442, 1994.

L. Holder, D. J. Cook, and S. Djoko, "Substructure Discovery in the Subdue System", Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, pages 169-180, 1994.

4.2 Related Publications

L. Holder and D. J. Cook, "Discovery of Inexact Concepts from Structural Data", IEEE Transactions on Knowledge and Data Engineering, 5(6), pages 992-994, 1993.

L. B. Holder, D. J. Cook, and H. Bunke, "Fuzzy Substructure Discovery", Proceedings of the Ninth International Machine Learning Conference, Aberdeen, Scotland, pages 218-223, 1992.

L. B. Holder, "Empirical Substructure Discovery", Proceedings of the Sixth International Workshop on Machine Learning, pages 133-136, 1989.

Points of Contact

Diane J. Cook and Lawrence B. Holder
Department of Computer Science and Engineering
University of Texas at Arlington
cook@cse.uta.edu, holder@cse.uta.edu

