Preview, by John Lu

1. Data assimilation and Bayesian design for physical, complex, high-dimensional space-time processes

Data assimilation and model-based Bayesian design have been proposed as a general approach to complex design and dynamic control problems for future factory plants, which must satisfy environmental regulations, support automated data measurement and process control, minimize costs, and maximize economic benefit. These problems require optimizing multiple, often conflicting design goals, including economic and environmental factors, for large-scale systems that may be complex and nonlinear, and doing so automatically within a reasonable computation time. Similar problems occur in other areas such as virtual measurement, materials science, and Internet traffic control. Common to these diverse problems is the need to model and measure dynamic, high-dimensional processes for which only meager observations are available and for which physical models and prior information must be incorporated. The Bayesian approach offers the most natural and flexible solution to such problems: it yields a robust design strategy that accounts for uncertainty in inputs, models, and noisy environments. The key is to develop computational algorithms fast enough for complex, nonlinear, large-scale problems.

Data assimilation has been studied for a long time in the geosciences, especially in the atmospheric and oceanic sciences. The challenge of producing timely weather forecasts with data assimilation and numerical forecast models has forced meteorologists to develop computational tools for large-scale data assimilation and real-time implementation. The recent techniques of targeted observation and ensemble forecasting are particularly noteworthy: the former is an economical way of dynamically collecting the most critical data to improve intermediate-range forecasts, and the latter is an efficient sampling method for high-dimensional nonlinear systems that yields uncertainty measures for nonlinear forecasts and is potentially useful for operational probabilistic weather forecasting.

Our goals within the Bayesian project are to formulate data assimilation and Bayesian design problems in the context of metrology and to leverage the knowledge of data assimilation and Bayesian design gained from geoscience problems (e.g., Berliner, Lu, Snyder). We will develop realistic physical, dynamic, and stochastic models in each context and derive Bayesian solutions. The milestone for this subproject is realistic data assimilation and Bayesian design algorithms that can be applied to real-world dynamic systems and are potentially useful in real-time environments. We envision that sequential processing and updating algorithms, greedy search algorithms, hierarchical and hidden-process models, spatial and dynamic modeling, and problem-specific simplifications and approximations will play important roles.
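To make the sequential ensemble updating strategy concrete, the sketch below (Python with NumPy) implements one analysis step of a stochastic ensemble Kalman filter, the workhorse behind ensemble forecasting and data assimilation. It is a minimal illustration only: the state dimension, observation operator, noise levels, and toy data are assumptions chosen for the example and are not tied to any particular system in this proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

def enkf_analysis(ensemble, y_obs, H, R):
    """One stochastic ensemble Kalman filter analysis step.

    ensemble : (n_members, n_state) forecast ensemble
    y_obs    : (n_obs,) observation vector
    H        : (n_obs, n_state) linear observation operator
    R        : (n_obs, n_obs) observation-error covariance
    """
    n_members = ensemble.shape[0]
    X = ensemble - ensemble.mean(axis=0)            # state anomalies
    Y = X @ H.T                                     # anomalies mapped to observation space
    P_yy = Y.T @ Y / (n_members - 1) + R            # innovation covariance
    P_xy = X.T @ Y / (n_members - 1)                # state-observation cross covariance
    K = np.linalg.solve(P_yy, P_xy.T).T             # Kalman gain
    # Perturbed observations give the analysis ensemble the correct spread.
    perturbed = y_obs + rng.multivariate_normal(np.zeros(len(y_obs)), R, size=n_members)
    return ensemble + (perturbed - ensemble @ H.T) @ K.T

# Toy example: 40-dimensional state, 10 noisy observations of every 4th component.
n_state, n_obs, n_members = 40, 10, 25
truth = rng.normal(size=n_state)
H = np.zeros((n_obs, n_state))
H[np.arange(n_obs), np.arange(0, n_state, 4)] = 1.0
R = 0.25 * np.eye(n_obs)
y = H @ truth + rng.multivariate_normal(np.zeros(n_obs), R)
prior = truth + rng.normal(scale=1.0, size=(n_members, n_state))   # forecast ensemble
posterior = enkf_analysis(prior, y, H, R)
print("prior mean RMSE:    ", np.sqrt(np.mean((prior.mean(axis=0) - truth) ** 2)))
print("posterior mean RMSE:", np.sqrt(np.mean((posterior.mean(axis=0) - truth) ** 2)))
```

The same update can be run repeatedly in a forecast-analysis cycle; localization and covariance inflation would be added for realistic high-dimensional systems.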
2. Bayesian analysis and consensus guidance for measurement curves and images from interlaboratory studies (motivated by a consensus curve problem from James Yen and Bob Zerr)

a. Consensus curves

The general conceptual setup assumes that measurements from each laboratory consist of a common functional curve, a laboratory-specific bias, effects due to experimental conditions and time, potential interaction effects, and individual measurement errors. We propose a Bayesian formulation in which block-based Gibbs sampling separates the laboratory effects from the modeling of the common curve, and the MCMC samples also yield uncertainty statements for the reconstructed curve as well as for the laboratory effects (a minimal sketch of such a sampler is given after section 3 below). The functional data analysis framework accommodates irregular sampling points of the input variable and different data formats (such as missing data) from each laboratory. Milestone: consensus and computational algorithms for nonparametric regression and a Bayesian solution for functional data analysis.

b. Analysis of image maps

Advances in instrumentation enable high-throughput measurements such as images. Analyzing and assessing data quality from image maps, such as those from DNA chips or fMRI, presents challenging issues for future metrology. We envision that functional data analysis will provide useful lessons for image analysis, and we will work on image analysis problems in the context of real applications.

3. Statistical design and standardization of microarray experiments and data analysis

Research activity in microarray technologies has exploded in the last few years, and with good reason: these technologies greatly improve productivity in gene mapping and allow simultaneous measurement of thousands of genes under different experimental conditions. The number of runs and the number of genes that can be accommodated in one experiment will continue to increase rapidly. This presents a major opportunity for statistical and metrological research, such as finding signals in mountains of noisy data and standardizing or calibrating array measuring devices. Statistical issues include data cleaning, normalization, image analysis, modeling and analysis of variation due to different factors (main effects as well as interactions), and experimental design. Statistical experimental design, developed originally for agricultural experiments (Fisher, Cochran, Cox, Bose) and later in chemistry (Box, Youden, Cameron, Mandel, and others), is extremely relevant here. Many methods, such as comparison experiments, calibration and replication, factorial designs, and balanced or partially balanced incomplete block designs, are directly applicable (Kerr and Churchill 2000); a small design sketch follows below. Another, more fundamental issue is improving data quality and information content by adding reference spots and calibration experiments. Bayesian analysis has been used for hierarchical models of gene expression in which prior information is generated from calibration experiments (Tseng, Oh, Rohlin, Liao, and Wong 2001). Our goals are to investigate the crucial standardization and experimental design issues in microarray experimentation, keeping pace with the rapid advances in bioinformatics and the growing interest from the statistics community, and to collaborate actively with scientists from NIST and other agencies and institutions.
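As one illustration of how classical block designs carry over to two-color microarray experiments (in the spirit of Kerr and Churchill 2000), the sketch below builds a loop design with dye swaps and checks dye balance. The sample names and loop structure are hypothetical and serve only as an example.

```python
def loop_design(samples, dye_swap=True):
    """Two-color microarray loop design: array k hybridizes sample k (Cy3)
    against sample k+1 (Cy5), closing the loop; an optional dye swap
    repeats each comparison with the dyes reversed."""
    n = len(samples)
    arrays = [(samples[k], samples[(k + 1) % n]) for k in range(n)]
    if dye_swap:
        arrays += [(cy5, cy3) for (cy3, cy5) in arrays]
    return arrays  # list of (Cy3 sample, Cy5 sample) pairs

def dye_balance(arrays):
    """Count how often each sample is labeled with each dye."""
    counts = {}
    for cy3, cy5 in arrays:
        counts.setdefault(cy3, [0, 0])[0] += 1
        counts.setdefault(cy5, [0, 0])[1] += 1
    return counts

samples = ["control", "treatA", "treatB", "treatC"]  # hypothetical RNA samples
design = loop_design(samples)
for i, (cy3, cy5) in enumerate(design, start=1):
    print(f"array {i:2d}: Cy3={cy3:8s} Cy5={cy5}")
print("dye balance (Cy3 count, Cy5 count):", dye_balance(design))
```

With the dye swap, each sample is measured equally often in each dye and every adjacent pair of samples is compared directly, which is the balance property that makes such designs efficient for estimating relative expression.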
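Returning to the interlaboratory consensus problem of section 2a, the following sketch shows the flavor of the Gibbs sampler we have in mind, in a deliberately simplified form: the common curve is collapsed to a single consensus mean, and the laboratory data, priors, and hyperparameters are all invented for illustration. In the functional version, the scalar mean would be replaced by a block of basis-function coefficients for the common curve, updated jointly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simplified interlaboratory model: y_ij = mu + b_i + e_ij, with
# b_i ~ N(0, tau2) the bias of laboratory i and e_ij ~ N(0, sigma2) the
# within-lab measurement error.  Conjugate priors: mu ~ N(0, v0),
# tau2 ~ Inv-Gamma(a0, b0), sigma2 ~ Inv-Gamma(a0, b0).

def gibbs_consensus(labs, n_iter=5000, a0=1.0, b0=1.0, v0=100.0):
    n_labs = len(labs)
    counts = np.array([len(y) for y in labs])
    mu, b, tau2, sigma2 = 0.0, np.zeros(n_labs), 1.0, 1.0
    mu_draws = []
    for _ in range(n_iter):
        # (1) laboratory effects b_i | rest: normal full conditionals
        for i, y in enumerate(labs):
            prec = counts[i] / sigma2 + 1.0 / tau2
            b[i] = rng.normal(np.sum(y - mu) / sigma2 / prec, np.sqrt(1.0 / prec))
        # (2) consensus mean mu | rest: normal full conditional
        prec = counts.sum() / sigma2 + 1.0 / v0
        mean = sum(np.sum(y - b[i]) for i, y in enumerate(labs)) / sigma2 / prec
        mu = rng.normal(mean, np.sqrt(1.0 / prec))
        # (3) variance components | rest: inverse-gamma full conditionals
        tau2 = 1.0 / rng.gamma(a0 + n_labs / 2, 1.0 / (b0 + 0.5 * np.sum(b ** 2)))
        sse = sum(np.sum((y - mu - b[i]) ** 2) for i, y in enumerate(labs))
        sigma2 = 1.0 / rng.gamma(a0 + counts.sum() / 2, 1.0 / (b0 + 0.5 * sse))
        mu_draws.append(mu)
    return np.array(mu_draws)[n_iter // 2:]  # discard the first half as burn-in

# Hypothetical measurements of the same quantity from three laboratories.
labs = [rng.normal(10.2, 0.3, size=8),
        rng.normal(9.8, 0.2, size=12),
        rng.normal(10.0, 0.4, size=6)]
mu_post = gibbs_consensus(labs)
print("consensus estimate:", mu_post.mean(), "+/-", mu_post.std())
```

The posterior draws give both the consensus value and its uncertainty, with the between-laboratory variation entering through tau2 rather than being averaged away.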
4. High-dimensional modeling

a. Classification and prediction

Bioinformatics and data mining in information technology have revived research on clustering and classification (pattern recognition), a traditional area of multivariate analysis. Among the techniques studied, the classification and prediction method proposed by V. N. Vapnik, the support vector machine (SVM), is an especially powerful and effective approach. It is based on statistical learning theory, has good predictive properties, and avoids the usual ad hoc model selection step. Through constraints and penalization on the model coefficients, the technique solves a Bayesian classification problem and is very close to the penalized likelihood method when the data are noisy. Our goals have three parts: to further understand SVMs and develop software support for their automated tuning in both classification and prediction problems (a minimal tuning sketch appears at the end of this section); to develop uncertainty measures and theory for SVM estimation and prediction using Bayesian methods; and to apply the method to more complicated metrology problems, such as functional data analysis and image analysis.

b. Hierarchical and space-time models

Hidden Markov models (HMMs) are the most successful statistical and probabilistic models proposed for DNA and protein sequence analysis (a small forward-algorithm sketch is given below). HMMs belong to the general class of Markov-switching or hierarchical models, for which the Bayesian method is the most effective approach. Hierarchical dynamic models have been used in many other contexts, such as hydrology (e.g., Lu and Berliner 1999) and economics. Recently, Internet traffic modeling has presented new challenges in modeling heterogeneous, multiscale processes, potentially in continuous time or space-time. The aim is to develop realistic and flexible models that take into account the underlying physical process as well as multiscale structure and various subprocesses, so that the models have both realism and predictive power. Applications include prediction and intelligent traffic monitoring and control.
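As a concrete instance of the HMM machinery just mentioned, the sketch below computes the log-likelihood of a short DNA sequence with the forward algorithm, done in log space for numerical stability. The two-state ("AT-rich" versus "GC-rich") model and its transition and emission probabilities are invented for illustration.

```python
import numpy as np

def forward_loglik(obs, log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under an HMM via the
    forward algorithm in log space.

    obs    : sequence of symbol indices
    log_pi : (n_states,) log initial-state probabilities
    log_A  : (n_states, n_states) log transition matrix
    log_B  : (n_states, n_symbols) log emission matrix
    """
    alpha = log_pi + log_B[:, obs[0]]
    for t in obs[1:]:
        # alpha_j <- logsumexp_i(alpha_i + log A_ij) + log B_j(o_t)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, t]
    return np.logaddexp.reduce(alpha)

# Toy two-state model (AT-rich vs. GC-rich regions) with invented parameters.
symbols = {"A": 0, "C": 1, "G": 2, "T": 3}
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
B = np.array([[0.35, 0.15, 0.15, 0.35],   # emissions in the AT-rich state
              [0.15, 0.35, 0.35, 0.15]])  # emissions in the GC-rich state
seq = [symbols[c] for c in "ATATGCGCGCATAT"]
print("log-likelihood:", forward_loglik(seq, np.log(pi), np.log(A), np.log(B)))
```

The same recursion, together with the backward pass, is what Bayesian and EM-based fitting of HMMs builds on.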
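Returning to the automated tuning goal in part (a), the sketch below shows the kind of tuning loop we have in mind, assuming scikit-learn is available: a cross-validated grid search over the regularization parameter C and the RBF kernel width gamma. The synthetic data set and the parameter grid are placeholders chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for a real metrology classification problem.
X, y = make_classification(n_samples=400, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Automated tuning: 5-fold cross-validated grid search over C and gamma.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```

Uncertainty measures for the resulting classifier would come on top of this, for example from a Bayesian treatment of the penalized formulation rather than from the grid search itself.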