NERSC’s Science-Driven Analytics program provides the architectural and systems enhancements and services required to integrate NERSC’s powerful computational and storage resources, giving scientists new tools to effectively manipulate, visualize, and analyze the huge data sets derived from simulations and experiments.

Building the analytics infrastructure

In its first full year of activity, NERSC’s Analytics Team has laid the foundation for an analytics infrastructure that combines hardware (DaVinci, an SGI Altix 350 that went into full production use in 2005), software, and the development and application of analytics technologies in collaboration with NERSC users. The key technologies that contribute to the analytics program are data management; data analysis and data mining; visual data exploration; and workflow management. The mission of the Analytics Team is to deploy and adapt these technologies—by combining them or extending them—to help NERSC users spend more time doing research and less time managing data and wrangling analytics software.

Scientific data management (SDM) refers to storage and retrieval of scientific data from various storage sources such as main memory, disk, and tape. SDM covers integration of data formats; data description, organization, and metadata management; efficient indexing and querying; and file transfer, remote access, and distributed data management across networks. Managing experimental data and files of simulation output, as well as converting files from one format to another, continue to be time-consuming aspects of high performance computing. Several groups at NERSC, as well as the Analytics Team, are working with LBNL’s Scientific Data Management Center to deploy the Storage Resource Manager (SRM) software NERSC-wide in order to provide transparent distributed data management. SRM provides a fault-tolerant mechanism for transferring files from one location to another, as well as uniform access to heterogeneous storage (e.g., disk, tape). In addition, members of the Analytics Team have been working with collaborators at the Paul Scherrer Institute in Switzerland on H5Part, a storage model and high-performance, parallel data I/O application programming interface (based on the hierarchical data format HDF5) that simplifies data exchange within the accelerator modeling community.
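
To make the H5Part storage model concrete, the sketch below writes a single time step of particle data using the generic h5py Python bindings rather than the H5Part API itself. The step-per-group layout (a "Step#0" group containing one flat dataset per particle attribute) mirrors the H5Part convention, but the field names and sizes here are purely illustrative.

    import h5py
    import numpy as np

    # A minimal sketch of an H5Part-style file written with plain HDF5 (h5py):
    # one group per time step, one 1D dataset per particle attribute.
    n_particles = 1000
    with h5py.File("particles.h5part", "w") as f:
        step = f.create_group("Step#0")                    # one group per time step
        step.create_dataset("x", data=np.random.rand(n_particles))     # positions
        step.create_dataset("px", data=np.random.randn(n_particles))   # momenta
        step.attrs["time"] = 0.0                           # per-step metadata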

Data analysis and data mining are broad categories that include many techniques. Data analysis techniques include simple post-processing of experimental data or simulation output (e.g., removing noise or corrupt data, merging data from different sources, selecting data subsets), as well as using mathematical methods (applying statistical tests or optimization methods, filtering data, etc.). Data mining usually refers to the application of more advanced mathematical techniques such as dimensionality reduction, classification, clustering, time series analysis, pattern recognition, and outlier detection in order to find features or trends in data.
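
As a deliberately simple illustration of one technique from this list, the sketch below applies dimensionality reduction (principal component analysis via scikit-learn) to synthetic data; it is not code used by the Analytics Team, just an example of the category.

    import numpy as np
    from sklearn.decomposition import PCA

    # Project synthetic 10-dimensional samples onto their two leading
    # principal components, a common first step when looking for clusters,
    # trends, or outliers in high-dimensional data.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 10))          # placeholder data
    reduced = PCA(n_components=2).fit_transform(data)
    print(reduced.shape)                       # (500, 2)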

image Figure 1. Application of the MTM to long-time simulations of surface air temperature. The black line shows the simulation results for changes in surface air temperature (°C) vs. time. The red line shows the trend line determined by the MTM method. (Figure provided by Raquel Romano, LBNL)

The Analytics Team has applied these analysis techniques in areas such as climate modeling, astrophysics, and network security. In climate science, a member of the Analytics Team tested two analysis methods for climate modeling. The first test applied the blind source separation (BSS) method, a statistical technique for detecting unusual signals, to the output of a general circulation model to see whether BSS could detect the combination of simulation parameters and range of values that correspond to tropical storms without defining them a priori. BSS performed well for this application, and the features it detected as anomalies were variations on rotating low-pressure systems, a finding consistent with conventional definitions of tropical storms. The second test applied the multi-taper method (MTM), which combines spectral frequency analysis with a statistical model, to detect trends caused by model-induced drift in long-term simulations of surface air temperature. To determine spatial-temporal patterns of interest in the simulations, it is necessary to separate trends resulting from model drift from periodic oscillations due to natural variability. The MTM was able to determine the background trends, on both global and smaller spatial scales, so that the data from the surface air temperature simulations could be de-trended (Figure 1).
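
The spectral core of the MTM can be sketched in a few lines: the series is tapered by a family of orthogonal Slepian (DPSS) windows and the resulting eigenspectra are averaged. The code below, using SciPy’s DPSS windows, shows only that estimation step, not the statistical trend test the team applied to separate drift from natural variability; parameter values are illustrative.

    import numpy as np
    from scipy.signal import windows

    def multitaper_psd(x, fs=1.0, nw=4, k=7):
        """Average the periodograms of k orthogonal DPSS (Slepian) tapers,
        the spectral-estimation core of the multi-taper method."""
        x = np.asarray(x, dtype=float)
        x = x - x.mean()
        tapers = windows.dpss(len(x), nw, Kmax=k)        # shape (k, len(x))
        eigenspectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
        return freqs, eigenspectra.mean(axis=0) / fs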

image Figure 2. Relative flux from observations of two supernovae (top figure) and analysis of model results (bottom figure). (Figure by Peter Nugent, LBNL)

Analysis of simulation results also played a role in confirming that the physics in the FLASH model explains the transition from deflagration to detonation in thermonuclear supernovae. A member of the NERSC Analytics Team developed software to generate synthetic spectra from the simulation output so that the model could be compared directly with observational data from supernovae. The synthetic spectra capture the characteristics of the observed spectra, as shown in Figure 2.

In another collaboration with NERSC users, analysis and visualization of supernova data and comparison with model data led to the detection of a new kind of Type Ia supernova whose white dwarf progenitor has a mass greater than the Chandrasekhar limit of 1.4 solar masses [1].

Working in collaboration with the Nearby Supernova Factory (described in more detail elsewhere in this report), members of the Analytics Team developed improved algorithms for processing images of supernova candidates [2] and applied machine learning algorithms [3] to discover and classify supernovae more efficiently and accurately. Development and implementation of these analysis methods, together with workflow management (described below), have led to significant performance enhancements and savings in time and effort.

The success of astronomical projects that look for transient objects (e.g., supernova candidates in the case of the Nearby Supernova Factory) depends on having high-quality reference images for comparison with images of candidate objects. A member of the Analytics Team has developed a processing pipeline to co-add 2°-by-2° sections of one million existing images from sky surveys. The pipeline consists of a serial stage, which determines the relative depth of each image with respect to the other images in the set along with the physical scale of each image and its location on the sky, and a parallel stage, which co-adds hundreds of images to produce a single 2°-by-2° reference image. The outcome of this project, which is expected to take years to complete, will be a library of high-quality reference images available to the astronomical community.
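
The parallel co-addition stage can be thought of as a weighted stack of registered images. The sketch below conveys that idea only; the actual pipeline’s astrometric resampling, depth weighting, and artifact rejection are not reproduced here, and the function and argument names are hypothetical.

    import numpy as np

    def coadd(aligned_images, weights=None):
        """Combine registered image cutouts (n_images x ny x nx), already
        resampled onto a common pixel grid, into one deep reference image."""
        stack = np.asarray(aligned_images, dtype=float)
        if weights is None:
            weights = np.ones(len(stack))
        w = np.asarray(weights, dtype=float)[:, None, None]
        # Weighted mean; a sigma-clipped mean or median would additionally
        # reject transient artifacts such as cosmic rays or satellite trails.
        return (w * stack).sum(axis=0) / w.sum()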

image Figure 3. Visualization of electric field and high-energy electrons in a laser wakefield accelerator (LWFA). Experiments with LWFAs have demonstrated accelerating gradients thousands of times greater than those obtained in conventional particle accelerators. This image shows a horizontal slice through the electric field in LWFA. The electrons, displayed as spheres, are colored by the magnitude of the momentum. This example shows how several key types of data can be combined in one image to show their spatial relationship. (PI: Cameron Geddes, LBNL. Visualization by Cristina Siegerist, LBNL.)

Visualization of data is the transformation of data—experimental data or simulation results—into images. Visualizing data is an invaluable tool for data exploration because it provides multiple views of the data (e.g., isosurfaces, volume rendering, streamlines) and facilitates searching for features or regions of interest. In addition, when presented as a series of images representing a change in time, parameter values, etc., data visualization provides a medium for sharing and communicating results. The NERSC Analytics Team includes members of the Berkeley Lab Visualization Group, whose mission is to assist researchers in achieving their scientific goals more quickly through creative and inspired visualization of their data. The NERSC Analytics Team/Visualization Group provides the expertise to help users focus on visualizing their data without having to invest the significant amount of time required to learn about visualization tools and techniques. Team members also work with users to develop new capabilities, such as data readers, to facilitate importing simulation or experimental data into visualization applications.

image Figure 4. The cover of the March 2006 Journal of Physical Chemistry/Chemical Physics shows a plot of the wide range of attainable molecular hydrogen binding affinities with simple ligands and metal complexes, together with a contour plot of the H2 electrostatic potential in the background. (PI: Martin Head-Gordon, UCB/LBNL. Visualization by Rohini Lochan and Cristina Siegerist.)

Recent collaborative visualization work with NERSC users includes visualization of results from accelerator physics, fusion energy, hydrodynamic, and molecular dynamics simulations. Several examples are shown in Figures 3–5.

The Analytics Team also collaborates with other NERSC groups to provide support for DOE computer science research programs. The resources that NERSC provides for such research—both in terms of hardware and expertise—are unique. Working together, staff from NERSC network and security, ESnet, LBNL’s Scientific Data Management Center, and the NERSC Analytics Team developed a network traffic analysis application to detect the occurrence of “distributed scan” attacks. The application combines visual analytics with state-of-the-art SDM technology for indexing and querying data. A “hero-sized” data set consisting of 42 weeks of network traffic connection data was analyzed in order to discover and characterize a sophisticated distributed network scan attack and to identify the set of hosts perpetrating the attack. For details of the team’s analysis techniques and data mining strategy, see the following section, “Query-Driven Visualization.”

image Figure 5. The cover of the July 2006 Journal of Microscopy shows a slice from a volume-rendered tomographic reconstruction of the bacterium Deinococcus radiodurans showing similar high-density bodies as those seen in D. grandis. (PI: Luis Comolli, LBNL. Visualization by Luis Comolli and Cristina Siegerist.)

Workflow management is the last key technology that plays a role in the NERSC Analytics Program. The goal of workflow management is to automate specific sets of tasks that are repeated many times, thus simplifying execution and avoiding human errors that often occur when performing repetitive tasks. A large factor in the success of the Nearby Supernova Factory has been the application of a workflow management strategy by members of the Analytics Team to create a new pipeline, the Supernova Factory Assembly Line (SUNFALL), for managing data and processing images of supernova candidates. Major components of SUNFALL are improved image processing and classification methods, a Web-based workflow status monitor, data management services, and the Supernova Warehouse, a visual analytics system for examining supernovae and supernova candidates (Figure 6). The Supernova Warehouse provides convenient Web-based access to project data and information, easy-to-use data annotation tools, and improved context awareness. SUNFALL, which went into production in fall 2006, has automated moving images and data through the analysis process and has significantly improved not only the speed with which supernova candidates are classified, but also the accuracy of the classifications. Thus, the new pipeline has significantly reduced both the time and labor involved in identifying supernova candidates and, as a result, has increased scientific productivity.

image Figure 6. Supernova Warehouse screenshot. The Supernova Warehouse is a Web-based visual analytics application that provides tools for data management, workflow visualization, and collaborative science. This screenshot shows several supernova or supernova candidate events, including processed images, spectra from the SuperNova Integral Field Spectrograph, and other observational data and information.

In addition, the Analytics Team is assessing use of the Kepler workflow system. Kepler is a grid-based scientific workflow system that can orchestrate complex workflows such as the pre-processing of simulations (preparation of input files and checking of simulation parameters), checking of intermediate results, and launching post-processing analyses. Kepler is supported in part by the SciDAC Scientific Data Management Center. Though Kepler may prove useful for other domains as well, the Analytics Team’s current focus is on applying Kepler to projects in the accelerator modeling community.

Members of the Analytics Team in 2006—Cecilia Aragon, Wes Bethel, Peter Nugent, Raquel Romano, and Kurt Stockinger—worked with and provided support for NERSC users working in a dozen science domains (including accelerator physics, fusion energy, astrophysics, climate sciences, and life sciences) and at both national labs and universities. Collaborations between the Analytics Team and NERSC users have led to increased productivity for researchers, allowing them to focus more on science and less on managing data and simulation and analysis infrastructure, as well as development of methods and technologies that can be deployed at NERSC and elsewhere.

More information about the technologies described above, additional case studies, and links to documentation on software applications installed on NERSC machines are available on the NERSC Analytics Resources Web pages at www.nersc.gov/nusers/analytics.

Query-Driven Visualization: A Case Study

NERSC and the NERSC Analytics Program offer resources uniquely suited to support research and development activities in the areas of scientific visualization, visual analytics, data analysis and other similar data-intensive computing endeavors. These resources offer capabilities impossible to replicate on desktop platforms and, in some cases, capabilities that cannot be met with “departmental clusters.” The centerpiece of these resources is DaVinci, an SGI Altix with 32 1.4 GHz Itanium 2 processors and 192 GB of RAM in an SMP architecture.

These resources were used effectively in 2006 in support of ASCR-sponsored high performance visualization research efforts. One such effort focused on performing analysis, data mining, and visualization of a “hero-sized” network traffic connection dataset. This work and the results, which include papers presented at IEEE Visualization 2006 [4] and SC2006 [5], were made possible only by the unique resources at NERSC. The team consisted of staff from ESnet and NERSC’s network and security group, visualization researchers from LBNL, and members of the NERSC Analytics Team.

A typical day of network traffic at an average government research laboratory may involve tens of millions of connections comprising multiple gigabytes of connection records. These connection records can be thought of as conversations between two hosts on a network. They are generated by routers, traffic analyzers, or security systems, and contain information such as source and destination IP address, source and destination ports, duration of the connection, number of bytes exchanged, and date/time of the connection. A year’s worth of such data currently requires tens of terabytes or more of storage. According to ESnet General Manager Joe Burrescia, the traffic volume on ESnet, DOE’s scientific production network serving researchers at national laboratories and universities, has been increasing by an order of magnitude every 46 months since 1990 [6]. This trend is expected to continue into the foreseeable future.
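
A connection record of this kind might be represented as follows; the field names are illustrative and do not reflect the exact schema produced by any particular router or traffic analyzer.

    from dataclasses import dataclass

    @dataclass
    class ConnectionRecord:
        """One host-to-host conversation, as summarized by a router,
        traffic analyzer, or security system. Field names are illustrative."""
        timestamp: int          # epoch seconds when the connection started
        src_ip: str             # source IP address
        src_port: int
        dst_ip: str             # destination IP address
        dst_port: int
        duration: float         # seconds
        bytes_exchanged: int
        successful: bool        # whether the connection was established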

The steady growth in network traffic volume makes forensic cybersecurity and network performance analysis increasingly difficult. Current network traffic analysis toolsets rely on simple utilities like grep, awk, sed, and gnuplot. Though sufficient for analyzing hours of network traffic data, these utilities neither scale nor perform at the level needed for analyzing current and future volumes of network traffic.

To address the need for rapid forensic analysis capabilities, the NERSC Analytics Team combines two complementary technologies in a network traffic case study. The first technology is FastBit, a state-of-the-art scientific data management technology for data indexing and querying developed by the SciDAC SDM Center [7]. Data mining and knowledge discovery depend on quickly finding and analyzing “interesting” subsets of the data, so the performance of the underlying indexing and querying technology largely determines how fast the analysis can proceed. The second technology is a query-driven visualization and analytics research application for formulating multi-resolution queries—specifically, multi-resolution in the temporal dimension—and for computing and displaying multidimensional histograms. A key concept in this work is that the team computes and displays histograms of the data and never needs to access the raw records directly.
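
The essential operation is to count the records that satisfy a set of range conditions, binned along one or more dimensions, without ever materializing the matching raw records for display. FastBit does this with compressed bitmap indices; the hedged sketch below reproduces only the semantics of such a query using pandas (the column names are assumptions), so its performance characteristics are entirely different.

    import pandas as pd

    def failed_attempts_histogram(conn, dst_port, t0, t1, bin_seconds):
        """Count unsuccessful connection attempts to dst_port in [t0, t1),
        binned at bin_seconds resolution. conn is a DataFrame with columns
        'timestamp', 'dst_port', and 'successful' (illustrative names)."""
        sel = conn[(conn.dst_port == dst_port)
                   & (~conn.successful)
                   & (conn.timestamp >= t0) & (conn.timestamp < t1)]
        bins = ((sel.timestamp - t0) // bin_seconds).astype(int)
        return bins.value_counts().sort_index()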

image Figure 7. Unsuccessful connection attempts to port 5554 over a 42-week period shown as a one-dimensional histogram. While the source data are sampled at per-second resolution, the team created the histogram by requesting the number of connection attempts on a per-day basis, i.e., each histogram bin is one day in width. While the largest spike occurs on day 290, the range of elevated activity around day 247 is more interesting as it indicates what may be temporally coordinated activity. Each histogram bar is color-coded according to the number of standard deviations from the mean. Green bars are close to the mean number of unsuccessful connection attempts, while red bars indicate bins whose counts are three or more standard deviations away from the mean per-bin count.

Attackers often use sets of previously compromised hosts to collectively scan a target network. This form of attack, known as a distributed scan, is typically accomplished by dividing the target address space among a group of “zombie” machines (systems that have been enslaved for the purpose of carrying out some action, usually malicious) and directing each zombie to scan a portion of the target network. The scanning results from each zombie are aggregated by a master host to create a complete picture for the attacker. An example attack would be a search for unsecured network services on a given port or range of ports. Identifying sets of hostile hosts under common control is helpful in that the group of hosts can be blocked from access to critical infrastructure, and the hosts can be reported to the larger community for further analysis or action.

To detect a distributed scan and identify the set of remote hosts participating in the scan, the team carried out the following sequence of data mining steps. The initial assumption, based on an intrusion detection system (IDS) alert, is that some type of suspicious activity is occurring on destination port 5554. The first step is to obtain a global view—how many unsuccessful connection attempts occurred on port 5554 over the 42-week period? Answering this question (Figure 7) helps the team begin to understand the characteristics of a potential distributed scan attack.
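
The color coding described in Figure 7 corresponds to a simple calculation on the per-day counts: how many standard deviations each bin sits away from the mean bin count. A minimal sketch, assuming the per-day counts have already been computed:

    import numpy as np

    def sigma_from_mean(daily_counts):
        """Standard deviations from the mean for each one-day bin; bins at
        three sigma or more are the 'red' bars described in Figure 7."""
        counts = np.asarray(daily_counts, dtype=float)
        return (counts - counts.mean()) / counts.std()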

image Figure 8. Histogram of unsuccessful connection attempts at one-hour resolution over a four-week period. This histogram shows that the suspicious activity is temporally periodic, with a period of approximately one day and repeating over a twenty-one day window.

The large counts on days 247 and 290 are visible in Figure 7 as tall red spikes. Note that around day 247 there is a consistent increase in daily connection counts. The activity on day 290 looks different, since there appears to be a large increase in scanning activity, possibly indicating a new scanning tool or technique. The activity around day 247 appears at first glance to be careful work over time by a set of hosts, whereas the day 290 activity could easily be a single host scanning at a very high rate with the intent to “get in and get out” before its presence can be detected and its access blocked. Such high-speed scans are quite common and are often part of the reconnaissance phase of a larger attack mechanism, or of a combined scan-and-attack tool.

image Figure 9. Histogram of unsuccessful connection attempts to port 5554 over a five-day period of time sampled at one-minute granularity. The histogram indicates a repeating pattern of unsuccessful connection attempts that occur on a regular twenty-four hour interval. Each primary spike is followed by a secondary, smaller spike fifty minutes later.

The team drills into the data by computing a histogram containing the number of unsuccessful connection attempts over a four-week period at one-hour resolution (Figure 8). There is no need to increase the resolution at the global level, since the team is interested only in events within a smaller time window. This is where the power of query-driven visualization and analytics comes into play—focusing visualization and analysis efforts only on the interesting data.

The next step is to drill into the data at a finer temporal resolution by posing a query that requests the counts of per-minute unsuccessful connection attempts over a five-day period within the four-week window (Figure 9). The purpose of this query is to further refine the team’s understanding of the temporal characteristics of the potential attack.
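
In terms of the hypothetical query helper sketched earlier, each drill-down step is simply the same call with a narrower time window and a smaller bin width; the usage sketch below assumes the DataFrame conn from that example, the data-set start time t0, and illustrative window offsets.

    # Successive drill-downs (cf. Figures 7-9): 42 weeks at one-day bins,
    # a four-week window at one-hour bins, a five-day window at one-minute bins.
    DAY, HOUR, MINUTE = 86400, 3600, 60
    w0 = t0 + 240 * DAY                   # illustrative offsets, not real dates
    d0 = w0 + 10 * DAY
    global_view = failed_attempts_histogram(conn, 5554, t0, t0 + 42 * 7 * DAY, DAY)
    month_view = failed_attempts_histogram(conn, 5554, w0, w0 + 28 * DAY, HOUR)
    week_view = failed_attempts_histogram(conn, 5554, d0, d0 + 5 * DAY, MINUTE)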

image Figure 10. Histogram of unsuccessful connection attempts that occur at one-second temporal resolution within a seven-minute window. This seven-minute window corresponds to a primary daily activity spike within the four-week period of interest. The figure shows a ramp-up in activity that then declines and drops off.

With the one-minute view shown in Figure 9, a distinct pattern of increased unsuccessful connection attempts at precise 24-hour intervals can be seen. Decoding the UNIX timestamps reveals that the event occurs daily at 21:15 local time. Each such spike is followed by a secondary spike that occurs about 50 minutes later.
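
Decoding an epoch timestamp into local wall-clock time is a one-liner; the UTC offset below is an assumption for illustration, since the report does not state which local time zone applies.

    from datetime import datetime, timezone, timedelta

    LOCAL = timezone(timedelta(hours=-8))    # assumed local offset

    def local_wall_clock(epoch_seconds):
        """Render a connection record's epoch timestamp as local time of day."""
        return datetime.fromtimestamp(epoch_seconds, tz=LOCAL).strftime("%H:%M")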

Drilling yet deeper into the data, the team constructs a histogram showing the number of connection attempts over a seven-minute window at one-second temporal resolution (Figure 10). The seven-minute window is chosen to completely contain the primary daily activity spike that occurs on one day within the overall temporal region of interest.

image Figure 11. 3D histogram showing the coverage of the potential attack in the destination network. Two of the axes are the destination C and D address octets, and the third (vertical) axis is time. Here two sheet-like structures can be seen that correspond to the primary and secondary daily spikes in suspicious activity. The sheet structure indicates that the suspicious activity is occurring across the entire range of destination C and D addresses — such behavior is indicative of a scanning attack.

At this point, the temporal characteristics of the suspicious activity can be determined—an organized scanning event is occurring daily at 21:15 local time within a four-week window of the 42-week data set. While there also appears to be a secondary event that occurs 50 minutes later within the same four-week period, to simplify the rest of this analysis, this case study focuses only on the daily event occurring at 21:15.

So far, the analysis has focused on understanding the temporal characteristics of the attack. Next, the team wishes to establish that the attack is a scan, and then to identify all the hosts perpetrating it in order to determine whether the attack comes from a single host or from multiple hosts coordinating their effort. The next step, then, is to look at the destination addresses covered by the attack.

This phase of the analysis begins by constructing a 3D histogram to understand the extent of coverage in the destination address space (Figure 11). Two of the axes represent the destination C and D address octets; the third axis represents time. The time axis covers a two-hour window sampled at one-minute granularity.
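
A 3D histogram of this kind can be expressed as a single call to a multidimensional binning routine. The sketch below uses NumPy’s histogramdd over arrays of destination C octets, destination D octets, and timestamps taken from the records that match the port and failure conditions; the array names are illustrative.

    import numpy as np

    def coverage_histogram(dst_c, dst_d, timestamps, t0, t1, bin_seconds=60):
        """Bin matching records into a destination-C x destination-D x time
        cube (cf. Figure 11). Inputs are 1D arrays of equal length."""
        n_t = int(np.ceil((t1 - t0) / bin_seconds))
        counts, edges = np.histogramdd(
            np.column_stack([dst_c, dst_d, timestamps]),
            bins=(256, 256, n_t),
            range=((0, 256), (0, 256), (t0, t1)),
        )
        return counts, edges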

image Figure 12. Different views of a 3D histogram showing the number of unsuccessful connection attempts to all destination C addresses from all source A addresses within a seven-minute window at one-second temporal resolution.

At this point, the second of the three analysis questions has been answered—the suspicious activity appears to be a scanning attack, as all C and D address octets within the destination network are being probed. The next step is to discover the set of remote hosts perpetrating the attack. To do so, the team performs a series of steps to identify the A, B, C, and D address octets of the attacking hosts.

A set of histograms (Figures 12 and 13) is constructed to identify the A address octet of the attacking hosts. Figure 12 shows two different views of a 3D histogram in which two of the axes are the destination C and D addresses, and the third axis is time. The time axis covers a seven-minute window at one-second resolution, chosen to contain one of the daily primary activity spikes.

image Figure 13. Histogram of unsuccessful connection attempts from each of the 255 addresses within the source A address octet within the same seven-minute window of time.

The 3D histogram shows structures that are indicative of scanning activity. The slanted lines in the image on the left in Figure 12 show that different destination C addresses are being scanned over a relatively narrow range of time. The image on the right in Figure 12 shows that such activity is confined to a fairly narrow range of source A addresses.

The histogram in Figure 13 shows exactly which source A addresses are participating in the attack, with the source A address of 220 showing the greatest level of activity. The team’s visual analytics application reports precisely the top (i.e., highest-frequency) bins in a histogram (not shown here). In this case, a total of seven unique source A addresses are identified as participating in the scan. Because all hosts in the seven A addresses are engaging in probing activity at about the same time, the team assumes that they are part of a coordinated distributed scan.
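
Identifying the participating source A octets amounts to ranking the bins of a 256-bin histogram. A minimal sketch, where the input array holds the source A octet of each matching record and the cutoff of seven comes from the analysis described above:

    import numpy as np

    def top_source_a_octets(src_a_octets, k=7):
        """Return the k most active source A octets and their counts within
        the window of interest -- the 'top bins' step described above."""
        counts = np.bincount(np.asarray(src_a_octets), minlength=256)
        order = np.argsort(counts)[::-1][:k]
        return [(int(a), int(counts[a])) for a in order if counts[a] > 0]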

image Figure 14. Different views of a 3D histogram showing the number of connection attempts to all destination C addresses from all source B addresses within a seven-minute window sampled at one-second temporal resolution.

The analysis is repeated, looking at source B addresses. The query iterates over the seven source A addresses identified in the previous step. The results are shown as two different views of a 3D histogram (Figure 14) and a one-dimensional histogram (Figure 15). In these histograms, the dots (Figure 14) and bars (Figure 15) are color-coded according to the source A address of the attacking host. Figures 14 and 15 show that different hosts are attacking different portions of the destination addresses. This type of behavior is indicative of a distributed scan, where the destination address space is divided among a group of zombie hosts.

image Figure 15. Histogram of unsuccessful connection attempts from addresses within the source B address octet within a seven-minute window of time.

This analysis step is repeated to identify the unique C and D addresses of the attacking hosts. As shown in Figure 16, the four analysis steps reveal that a total of twenty different hosts are participating in the distributed scan.

The data mining example above would not have been possible without the ability to quickly interrogate data to produce histograms. Generally speaking, the amount of time required to perform the queries varies according to the size of the source data (more precisely, the size of the data indices), the complexity of the query, and the number of items returned by the query. The queries used in the analysis presented here take on the order of a few minutes each, using an iterative approach implemented in serial fashion on the NERSC analytics machine, DaVinci.

image Figure 16. 3D histogram of coverage in the destination C and D addresses by all twenty hosts participating in the distributed scan over a seven-minute window sampled at one-second granularity. The visualization is color-coded by unique source address to show how each source host is attacking a different part of the destination address space.

Several unique characteristics of this system enabled rapid progress on this research. First, by using an SMP with a large amount of memory, the team was able to quickly prototype and benchmark the performance of a family of parallel histogramming routines. Those results are described in more detail in the SC2006 paper [5]. Second, the large amount of memory in DaVinci enabled rapid prototyping and performance benchmarking of serial versions of the visual analytics interface early in the research process. These results, which include performance comparisons with other serial tools for indexing and querying, are described in more detail in the IEEE Visualization 2006 paper [4]. None of these results would have been possible without a large-memory SMP machine. Third, the architectural balance between fast I/O and large amounts of memory is especially well suited for data-intensive computing projects like this. The NERSC analytics machine has a large amount of scratch secondary storage (~25 TB) providing on the order of 1 GB/s of I/O bandwidth—a capability that also was crucial to the success of this work.

As data size and problem complexity continue to increase, there are corresponding increases in the level of difficulty of managing, mining, and understanding data. Query-driven visualization represents a promising approach for gaining traction on the data mining and knowledge discovery challenges facing science, engineering, finance, security, and medical applications. The Analytics Team’s approach to query-driven visualization combines state-of-the-art scientific data management technology for indexing and querying data, with visual analytics applications to support rapid drill-down, hypothesis testing, and displaying results.

This work is novel in several respects. First, it shows that it is possible to quickly analyze a large collection of network connection data to detect and characterize a complex attack. Previous work in network analysis has focused on a few hours’ or days’ worth of data—this case study involved 42 weeks’ worth. Second, it shows how a visual analytics application that combines state-of-the-art scientific data management technology with full-featured and straightforward visualization techniques can be brought to bear on a challenging data analysis problem. Third, the application shows how complex queries can be formulated through an iterative, guided process that relies on statistics, rapid data access, and interactive visualization.

[1] D. Andrew Howell et al., “The type Ia supernova SNLS-03D3bb from a super-Chandrasekhar-mass white dwarf star,” Nature 443, 308 (2006).

[2] Cecilia R. Aragon and David Bradburn Aragon, “A fast contour descriptor algorithm for supernova image classification,” Lawrence Berkeley National Laboratory technical report LBNL-61182 (2007).

[3] Raquel A. Romano, Cecilia R. Aragon, and Chris Ding, “Supernova recognition using support vector machines,” Lawrence Berkeley National Laboratory technical report LBNL-61192 (2006).

[4] E. Wes Bethel, Scott Campbell, Eli Dart, Kurt Stockinger, and Kesheng Wu, “Accelerating network traffic analysis using query-driven visualization,” Proceedings of the 2006 IEEE Symposium on Visual Analytics Science and Technology, Baltimore, MD, November 2006, pp. 115-122; Lawrence Berkeley National Laboratory technical report LBNL-59819 (2006).

[5] Kurt Stockinger, E. Wes Bethel, Scott Campbell, Eli Dart, and Kesheng Wu, “Detecting distributed scans using high-performance query-driven visualization,” Proceedings of SC06; Lawrence Berkeley National Laboratory technical report LBNL-60053 (2006).

[6] Joseph Burrescia and William E. Johnston, “ESnet status update,” Internet2 International Meeting, September 19, 2005.

[7] K. Wu, E. Otoo, and A. Shoshani, “Optimizing bitmap indices with efficient compression,” ACM Transactions on Database Systems 31, 1 (2006).