Chapter Three:

Experiments:

What They Are and What They’re Good For

 

 

 

 

 

 

1. Introduction

One requirement for a philosophical account of scientific experimentation is that it should be possible to apply it to particular cases in a coherent manner, in such a way as to produce answers to important questions about that particular experiment and how it is employed in evidential reasoning. In this chapter I will present the philosophical perspective that I will employ in discussing various aspects of the CDF top search throughout the remainder of this work. My aim in this chapter is not to present arguments that will establish this perspective as the one true path toward understanding experimentation, but to explain what this model of experimental activity and reasoning is, and to show that by adopting this perspective one can make sense of this extremely complex experiment. The model itself is not my own. I will principally employ ideas that have been laid down by Deborah Mayo and Patrick Suppes. I will, however, focus on aspects of this model that have not received a great deal of attention thus far.

In particular, in this chapter, I wish to explore the relationship between two perspectives on experimentation that might appear to be far removed from one another: experimentation as a product of human agency, and experimentation as a means for producing formal models leading to conclusions regarding evidential relationships.

In what follows I will distinguish a ‘procedural’ notion of experimentation from a ‘formal’ notion, so as to better approach the question of what an experiment is. In section two I will explore these two perspectives. I do not claim to provide a criterion for determining what actions are or are not part of a given experiment, but will instead focus on explicating the "hierarchy of models" approach to describing experiments, so that, in section three, I can explore the relationship between this formal means of representing experiments and the procedures that it is meant to represent. In section four I will present a somewhat simplified formal model of the top quark counting experiment. Section five will consist of a discussion of how such a model is to be interpreted for purposes of evidential reasoning, in which I employ Deborah Mayo’s "error statistical" model of statistical reasoning and clarify the nature of the conclusions that this model underwrites. My aim here is not to give a general argument in favor of this model, which Mayo has done herself in numerous publications over the years (see especially her (1985 and 1996)), but to explain it in sufficient detail so as to be able to apply it in discussing the experimental episode at hand. In section six I will examine some inferential practices at CDF and some of the epistemic concerns that CDF physicists had in light of the error statistical model, and show that the error statistical model provides a convincing rationale for those concerns and practices. I will conclude in section seven.

 

2. Two Perspectives on Experiment

Experimentation is an activity, or a collection of activities. Consider the kinds of activities engaged in by the physicists at CDF: stringing cables, calibrating phototubes, scheduling meetings, mapping magnetic fields, ordering parts, writing tagging algorithms, creating plots, running Monte Carlo programs, sitting up all night with the detector as, for example, particles become pulses of light in scintillator, which become electrical charges, which become data. . . . Which of these activities are to be thought of as part of the experiment being performed? What do they contribute to that experiment? This is one way of understanding what an experiment is—the procedural aspect of experimentation.

But the kind of experiment we are examining may also be considered, not as an activity, but as a structure with certain relationships between its parts. There is a hypothesis to be tested, there is data that is to serve as the means for that test, there are relationships between inputs and outputs of the instruments used, probability distributions for possible outcomes, and so on. All of these aspects of the experiment can, it might be thought, be presented formally, and apart from the activities experimenters actually engage in, and so one might call this the formal aspect of experimentation.

What is the relationship between these two perspectives?

 

Procedure

First I will consider the work of David Gooding. In his (1990), he puts forward what he calls a "procedural" philosophy of experiment that focuses on "human agency" and "practice." One of his theses is that "human agency is essential to both exploratory observation and experimental testing" (Gooding 1990, 10). He regards this view as opposed to the "received" view of analytic philosophy, which, he claims, "views the relationship between theory and experiment as a logical relationship between propositions" (Gooding 1990, 9).

Gooding’s book is not an exemplary exercise in clarity and precision. Nor, I will argue, is he correct in his characterization of what analytic philosophers believe about experimentation, as will become clearer in my discussion of Suppes. Nevertheless, his emphasis on the procedural aspect of experimentation does serve as a useful corrective to the relative neglect of this aspect of experimentation by philosophers of science. In particular, his contribution of a means of notation for representing experimental activity may prove useful to others (although the case of CDF’s top experiments is much too complex to apply this notation).

But what is most relevant in Gooding’s work for present purposes is the recognition that an important aspect of experimentation can be lost by focusing only on the formal aspects alluded to above. If one simply looks at the data as a finished product, waiting to be related to the hypothesis that is to be tested, one might lose sight of an important fact: the data, and the way in which the data is related by experimenters to the hypothesis being tested, might have turned out to be quite different. These aspects of the experiment might have come out differently if the experimenter had reacted differently to events occurring earlier, including events involving data, instruments, and other experimenters. This, I think, may be the best way to understand Gooding’s claim that "Observers’ experience of the world is construed, that is, mediated by their exploratory behaviour, their instruments and by their interactions with other observers" (Gooding 1990, 76). Prior to and during the collection of data relevant to the top results, CDF members were making many decisions that could affect what the data turned out to be, and how the data should be analyzed in order to be related meaningfully to the top hypothesis. As described in the previous chapter, even after the data had been collected, lengthy and sometimes heated discussions continued regarding the proper means of analyzing the data. Some measurements, such as the energies of jets measured by the hadron calorimeters, were known to be systematically in error and had to be "corrected." Decisions also had to be made, while data collection was in progress, regarding whether the SVX had suffered too much radiation damage to be producing usable data. Both during and after data collection there were disagreements as to which algorithm should be used to identify events with secondary vertices due to b decays. Decisions such as these, along with thousands of day-to-day decisions made while monitoring the incredibly complex CDF detector, had effects—some small, some profound—on the data produced and the way it was used in testing the top hypothesis.

What Gooding seems to neglect, however, is the extent to which the formal modeling of the experiment exists precisely to address the relevance of such issues, that these models are employed by experimenters as tools to help them avoid making various kinds of errors. That is, one of the purposes of models, particularly statistical models, is to help experimenters address the question: what might have happened that didn’t? (The importance of this question, and the role of statistical models in addressing it, is not only neglected but explicitly rejected by Bayesians, a point to which I will return in chapter five.)

 

Formalism

This brings me to the work of Patrick Suppes. In 1960, at the very first International Congress for Logic, Methodology and Philosophy of Science, Suppes presented a paper titled "Models of Data" (Suppes 1962). In this paper Suppes proposes that the relationship between experimental events and the theories that experiments are conducted to test be understood by means of a "hierarchy of models." He presents this idea in a very brief exposition, which more recently has been taken up and elaborated by Deborah Mayo (Mayo 1983, Mayo 1996). In what follows I will draw on the expositions of both Mayo and Suppes.

Suppes begins with the notion of a model of a theory as "a possible realization in which all valid sentences of the theory are satisfied," where a possible realization is "an entity of the appropriate set-theoretical structure" (Suppes 1962, 252). Suppes contends that "exact analysis of the relation between empirical theories and relevant data calls for a hierarchy of models of different logical type" (Suppes 1962, 253).

I do not take any particular stand here on Suppes’ contention that "this notion of model is the fundamental one for the empirical sciences as well as mathematics" (Suppes 1962, 252). As I use the term ‘model,’ I do not mean to restrict it to a set-theoretical concept, and it will soon be apparent that I at times use it in a looser sense. I leave the possibility of a completely set-theoretical explication of the issues here an open question, focusing instead on two aspects of Suppes’s discussion: (1) the importance of formal representations of the elements of experimental inquiry, and (2) the hierarchy of such representations that mediates the relationship between experimentally produced data and the theories that experiments (often) test.

First I will give a brief general description of the hierarchy, and then illustrate and clarify that description with an example.

At one end of this hierarchy there is a model of what Deborah Mayo calls the "primary hypothesis," i.e., the hypothesis that the experiment is meant to test. One might think of this as a structure representing certain parameters of a population, for example a probability distribution or class of probability distributions for a measurable characteristic of members of some species, or a region or class of regions of some phase space representing the states of a physical system considered possible under a given hypothesis. Below this is the model of the experiment, which "provides a kind of experimental analog of the salient features of the primary model," and specifies "analytical techniques for linking experimental data to the questions of the experimental model" (Mayo 1996, 133–34). That is, the theory of the experiment specifies possible realizations of the theory that are "the first step down from the abstract level of the . . . theory itself" (Suppes 1962, 255). Specifications of the particular sample to be collected in the experiment are used as the basis for a structure that is the analog of the model of the primary hypothesis for that sample. A testing rule may be specified as part of the model of the experiment that maps possible outcomes in the model of the data to regions in the parameter space employed in the model of the hypothesis. At the third level of the hierarchy is the model of the data, which puts the information contained in the data into a canonical form that can be meaningfully compared, via a statistical measure, to the model of the experiment (possibly as an input to the mapping function specified in the testing rule specified at the previous level). The model of the data is "designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory" (Suppes 1962, 258). Below this level is the model of the theory of experimental design, which incorporates "all of the considerations in the data generation that relate explicitly to . . . some feature of the data models" (Mayo 1996, 139), such as the calibration of instruments, the randomization procedures necessary to select a sample, etc. Finally there are the ceteris paribus conditions, including "every intuitive consideration of experimental design that involves no formal statistics" (Suppes 1962, 258).

To illustrate and clarify this framework, consider the following very simple type of experiment: A manufacturer of bowling balls wishes to test a new ball that is larger in diameter but less dense than conventional bowling balls, so that a bowler can hurl a fatter ball at the pins without having to manage a heavier ball. The company wants to know whether bowlers might perform better using this new kind of bowling ball (dubbed "Hume" by the grad school dropout who designed this rotund demolisher of standing orders). How might an experiment be performed to decide whether this is the case?

The hypothesis in question needs to be put into the form of a statistical model. The relevant population here is the class of bowlers using "Hume," and the quantity of interest is a bowler’s score, X. This quantity X can be regarded as having a probability distribution P, which can be characterized in terms of certain population parameters, such as the average value of X, denoted by θ, and the variance, σ². The set of possible values of θ is known as the parameter space, Ω, some subset of which is associated with each statistical hypothesis. The statistical hypotheses can be represented by a model M(θ), where M(θ) = [P(X|θ), θ ∈ Ω]. We can then think of the hypothesis of interest as a hypothesis about which distribution P(X|θ) in M(θ) is the distribution of the population in question. Specifically, supposing that we know that the average score for bowlers using conventional bowling balls is θ0, the hypothesis we are investigating, H’, will state that the distribution of X for the population of bowlers is to be described by a function P(X|θ), where θ > θ0, and the null hypothesis, H0, will state that the distribution of scores among bowlers is to be described by the function P(X|θ0).

Suppes notes "two obvious respects in which a possible realization of the theory cannot be a possible realization of experimental data." For one thing, a possible realization of the theory H’ will include an infinite sequence of pairings between bowlers and scores, which no actual experiment can include. Also, the probability distributions in the model of the hypotheses are continuous functions, where the data are discrete, and the parameter q that characterizes that distribution in the population is not itself something that can ever be experimentally observed, but only estimated. To be able to relate the experimental data to the theory, an intermediate level needs to be introduced.

At the next level of the hierarchy is the model of the experimental test. For the purposes of this experiment, the theory of the experiment specifies possible realizations of the theory that are "the first step down from the abstract level of the . . . theory itself" (Suppes 1962, 255). Suppose we set out to test our hypothesis by making use of the members of a particular bowling league, with, say, 200 members. Each member of the league is given a "Hume" ball of the same weight as the conventional ball that bowler usually uses, and with identically drilled holes. Each bowler is then asked to bowl three games with that ball, and a score s is recorded for each game played by each bowler. This amounts to recording values for 600 independent random variables S1, S2, . . ., S600. So an experimental outcome will be represented as a 600-tuple of values for those 600 random variables: s1, s2, . . ., s600, which can be abbreviated as <s600>. (Alternatively, one could represent it as a 200-tuple of triples. In fact, it might be wise to do so, for reasons that I will explain momentarily.) Just as at the level of the hypothesis a parameter space was specified, here a sample space, J, is given, which is the set of possible values of <s600>. As Deborah Mayo notes, "On the basis of the population distribution M(θ) it is possible to derive, by means of probability theory, the probability of the possible experimental outcomes," here P(<s600>|θ), for any <s600> ∈ J, and any θ ∈ Ω. Also at this level one may specify a rule T that maps particular outcomes <s600> in J onto particular members of the set of probability distributions (statistical hypotheses) in M(θ). This yields an experimental testing model characterized as ET(J) = [J, P(<s600>|θ), T].

In our example, the mapping rule T makes use of a test statistic Y, which is the mean of the scores in <s600>. This statistic Y is a function of J, and it is possible, for a given value of q, to calculate its probability distribution. The mapping rule serves to stipulate which subset of J is to be associated with H’, and which with H0.

At the level of the data, the observation of the bowlers’ performance is modelled as an element of J. That is to say, it is put into the form of a 600-tuple of individual scores. Although the data in this form can be further modelled in many ways, in this instance it is modelled by means of the averaging function. This data model can then be represented as D = [<s600>, Y(<s600>)].
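To make this structure concrete, here is a minimal sketch in Python of the data model just described. The scores are simulated placeholders (no such numbers appear in the example itself); the point is simply that the data model pairs a 600-tuple of scores with the test statistic Y.

```python
# Minimal sketch of the bowling data model: an outcome is a 600-tuple of
# game scores, and the data model pairs that tuple with the test statistic
# Y, the mean score. The scores here are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=150, scale=20, size=600)  # hypothetical <s600>

Y = scores.mean()          # test statistic: mean of the 600 scores
data_model = (scores, Y)   # D = [<s600>, Y(<s600>)]
print(f"Y(<s600>) = {Y:.1f}")
```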

In short, the levels of experimental inquiry work as follows. The data must be modelled to yield a test statistic, to which a testing rule can be applied that, based on the probability distributions given by the model of the experiment, associates values of the test statistic with the probability distributions denoted by the hypotheses in question.

Thus far I have not said much about the two remaining levels of Suppes’s hierarchy, the levels of experimental design and ceteris paribus conditions. Suppes also has relatively little to say on this point, but does note that "[t]he analysis of the relation between theory and experiment must proceed at every level of the hierarchy. . . . Difficulties encountered at all but the top level reflect weaknesses in the experiment, not in the . . . theory" (Suppes 1962, 259).

Suppes notes that considerations that enter at the level of experimental design "can be formalized, and their relation to models of the data . . . can be made explicit" (Suppes 1962, 258–59). Here one is meant to think of questions such as randomization that are often relevant to the proper conduct of an experiment. We might ask, at this level, whether the bowling league we chose for our study was itself a source of error, such that the statistic Y could not be related in the way we thought to the experimental model (perhaps it was a league with an inordinately high number either of very good or of very poor bowlers), or a league which served more as an excuse to drink great quantities of beer than as a genuine sporting institution. Perhaps the choice of the number of games to be played was too small, as it would take more time for the bowlers to become accustomed to the new ball. (Whether this is in fact the case can be assessed in part by remodelling the data in the manner suggested above, as a 200-tuple of triples. One can then take the averages of the first, second, and third members of the triples separately: Y1, Y2, Y3. If these averages show increases from one to the next that the probability distribution deems unlikely to occur by chance, then one might either wish to conduct the experiment differently or analyze the data differently—by ignoring the data from the first two games, for example. Alternatively, one might choose to continue the experiment until average scores from one game to the next stabilize.)
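The design-level check just described, remodelling the outcome as a 200-tuple of triples and comparing per-game averages, might be sketched as follows. The scores are again simulated placeholders, and the arrangement assumes the 600 values are stored bowler by bowler (three consecutive scores per bowler).

```python
# Sketch of the design-level check: remodel the 600 scores as 200 triples
# (one per bowler) and compare the average score in each bowler's first,
# second, and third game. A marked upward trend from Y1 to Y3 would suggest
# a learning effect with the new ball.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=150, scale=20, size=600)  # hypothetical data, as above

triples = scores.reshape(200, 3)     # 200-tuple of triples
game_means = triples.mean(axis=0)    # Y1, Y2, Y3
game_sems = triples.std(axis=0, ddof=1) / np.sqrt(200)
for i, (m, se) in enumerate(zip(game_means, game_sems), start=1):
    print(f"Y{i} = {m:.1f} +/- {se:.1f}")
```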

Suppes readily admits that considerations at the level of ceteris paribus conditions will be generally resistant to formalization—he refers to the "seemingly endless number of unstated ceteris paribus conditions," which cover "every intuitive consideration of experimental design that involves no formal statistics" (Suppes 1962, 258–59). At this level we might ask whether the bowlers in question were bowling under unusual conditions of lighting, noise, etc., whether unusual environmental factors (famine, pestilence, a new law against serving beer in bowling alleys) might be impairing or enhancing their performance, and so on.

Mayo combines the levels of experimental design and ceteris paribus conditions into one, noting that "even those features of data generation that do relate explicitly to data models need not always be checked by formal statistics," and that "features assumed to be irrelevant or controlled may at a later stage turn out to require explicit scrutiny" (Mayo 1996, 139).

Whether these last two levels are combined into one or not does not particularly concern me, but the point just mentioned by Mayo is a very important one for making sense of the relationship between experimental procedures and formal models of experiments, to which I turn next.

 

3. Reconciling the Two Perspectives

What does this framework of formal models do for our understanding of experimentation? I believe it conceptualizes the ways in which arguments are drawn out of experimental activities via abstraction. On the surface, then, it would seem to be open to Gooding’s charge against philosophical studies of experimentation, that they abstract out the procedural aspect of experimentation and the fact of "agency," i.e., the fact that human beings actually set out to do things in experimental activities, and that experimental outcomes are contingent upon such actions as are actually performed during an experiment. Certainly, there is no explicitly defined "agency" element in any of the formal models in this framework. However, such a criticism could only be sustained by a superficial reading of Suppes. According to Gooding, the "received view" among philosophers depicts "the relationship between theory and experiment as a logical relationship between propositions" (Gooding 1990, 9). It is not clear who is meant to hold this view, but it is a mistaken description even of Suppes, an arch-formalist.

Suppes certainly is interested in depicting the relationships between models of the theory, the experiment, and the data, as logical relationships (although in set-theoretical, not propositional, terms), and one might be reluctant to agree with his position that all evidential relationships can be understood entirely by formal (and set-theoretic) explications. But Suppes’s description of the hierarchy of models as an account of the relationship between theories and experimental data agrees with Gooding’s description of the received view only if we understand Suppes to be claiming that the hierarchy of models is or exhaustively describes the experiment, and nowhere does he make any such claim. The purpose of the hierarchy of models is to model the experiment and the theory in such a way that the two can be related to one another in an argument leading to a conclusion about the theory, premised on what is learned in the experiment. The logical relationships are within, and with, the model of the experiment, which is necessarily abstracted from the actual performance of the experiment. He does not set out to describe a logical relationship between the theory and the experiment itself, in the sense of a set of procedures carried out by human agents. Indeed, it is hard to understand what the latter kind of relationship could be.

Perhaps what Gooding correctly notices in the work of philosophers like Suppes and Mayo is a kind of deemphasis of the very features that interest Gooding. It is certainly true that Suppes, in detailing the formal relationships between the parts of experimental inquiry, does not say very much about what experimenters do. In part this is because of his emphasis on the first three levels of the hierarchy. While the procedures of the experiment, the "agency" behind the experiment, are reflected at every level of the hierarchy except the model of the hypothesis, they are most apparent at the levels of experimental design and ceteris paribus conditions, about which Suppes has less to say.

More important, however, is an apparent difference in the questions that these two authors are trying to address. Gooding seems to want to develop a framework for describing experimental activity as a phenomenon of interest for its own sake, whereas Suppes, like most experimentalists, sees the importance of drawing out from that activity formal structures in order to arrive at results, and it is that aspect of experimentation that he addresses in "Models of Data." This is especially clear in Mayo’s reformulation of Suppes’s model. For philosophers like Suppes and Mayo, the procedures of experimentation are relevant to the extent that they are relevant to the conclusions that one can draw based on the experiment. That is to say, it is the epistemic role of experimental procedures that is of primary interest. Procedures such as the selection of subjects, the recording of data, the control of complicating factors, can be modelled within the hierarchy. But to a large extent the design of a good experiment is a problem of making procedures (and agency) irrelevant to the conclusions to be drawn from the experiment. One wishes to avoid making relevant things that are difficult to model, because then it would be difficult to assess their effect on the reliability of the argument being presented. To be sure, it will generally take a great deal of purposive planning, practical acquisition of skills, and sensitive responding to the vicissitudes of life in the laboratory (and hence, a great deal of "agency") to be able to produce an experiment in which all of those facts about who did what and why can be left out of the list of relevant factors in the modelling of the experiment, as much of the recent work on experimental practice shows. But when such measures are carried out successfully, the arguments produced will be able to safely (i.e., without becoming less reliable) ignore such features of the experiment.

Another way of understanding this last point is this: Typically, the fact that it was such-and-such a person with a particular set of interests and abilities who performed a certain action during the experiment is deemed irrelevant because of considerations at the level of ceteris paribus conditions. That is, one intuitively understands that in a good experiment one does not, for example, set people to tasks that they are incompetent to carry out, or, in a case where a person has a strong reason to favor one outcome over another, tasks the outcomes of which are sensitive to manipulation. However, sometimes it becomes clear that such facts about "agency" cannot be ignored in evaluating the results of an experimental test. In such a case, the consideration becomes explicit and has to be addressed by means of either formal or informal statistical considerations, typically at the level of experimental design or of ceteris paribus conditions in the hierarchy of models. (Such a situation in the context of CDF's top search will be closely examined in chapter five.)

This brings us back to Mayo’s point that "features assumed to be irrelevant . . . may at a later stage turn out to require explicit scrutiny." In terms of a model of an experiment, the top three levels—the model of the theory, the model of the experiment, and the model of the data—seem to be explicitly identified. Yet at the levels (or level) of experimental design and ceteris paribus conditions, what is included in the model seems to be open-ended. A consideration that might not seem important at first may turn out to require careful study later. This seems to potentially complicate any attempt to provide a criterion for demarcating what is and is not part of any particular experiment. Perhaps we can say that the model of the experiment includes every formal or informal model that receives attention or consideration in the process of making inferences from the experiment. A procedural parallel would seem to be that every procedure that gets modeled either formally or informally in the process of making such inferences is part of the experiment. But note that there is no way to say in advance which particular procedures will require such scrutiny, so that if one were to adopt this position, one would not necessarily be able to say definitively which actions that were performed were part of the experimental procedure at the time one was performing those actions.

Thus, for example, stringing cable for a hadron calorimeter (an activity so mindless and undemanding that even the author has done it) would not normally be considered part of actually performing an experiment, but rather as an activity prefatory to the actual experiment. However, on the view suggested above, if one came to suspect that the cables carrying data from the hadron calorimeter had been systematically cut to the wrong length, causing the trigger not to function in its intended manner, then one would need to reconsider the model of the data that had been produced. In some cases, such errors might be corrigible simply by a remodeling of the data, in other cases, the data might be rendered useless. But either way, the cable stringing process would now have become modeled at least informally for the purposes of drawing inferences from the experiment. By this criterion, then, stringing cables was part of the experiment (performed rather badly, as it turns out) although this could not have been known by those performing that act while they were doing it.

This result might seem counterintuitive, and yet I know of no other principled way of distinguishing those procedures that are part of an experiment from those that are not. What this suggests to me is that the project of demarcating what is and is not to be included as part of a particular experiment is the wrong project to be pursuing. The potential relevance of such mundane activities as cable-stringing arises because of the function that experimentation is supposed to serve, which is the reliable inference from experimentally-produced data to conclusions regarding the likely truth or falsehood of hypotheses. It is such inferences that the representation of experimental activities in terms of the hierarchy of models serves. I turn now to the problem of how such inferences are to be drawn and interpreted.

 

4. A Model of the Counting Experiment as a Statistical Test

What kind of test is the counting experiment, and what conclusions does it warrant? A model of the statistical procedure employed may help to clarify this. To facilitate this, consider a somewhat fanciful scenario, which I will employ for illustrative purposes at several points in this work: Suppose that we wish to know whether there are any black swans. That is, like the searchers for the top quark, we are interested in testing an existence claim. However, our attempt to settle this question is complicated by the following fact: A genetic disorder with a known prevalence in the population of ravens causes them to give birth, at a known rate, to offspring that look just like swans, except for having completely black feathers. Let us call these anomalous swan-like ravens "swan-ravens." However, one difference has been noticed between swan-ravens and ordinary swans. Swan-ravens have, on average, somewhat shorter necks than swans. On the assumption that black swans, if they exist, would have necks of the same length as their white conspecifics, we might attempt to find black swans by looking for an excess of long-necked black swan-like birds.

It would be simple to formulate a test of the following sort. From amongst the population of black swan-like birds, a sample of n specimens is gathered (we will ignore for the time being how this is to be done, a feature of the experiment belonging to the level of experimental design). We are interested in the lengths of their necks, a quantity to be designated X. It is known that amongst swan-ravens, the average neck length is θ0, and that the distribution about that mean in the population is normal, with a standard deviation σ. We designate a normal distribution with mean θ and standard deviation σ by N(θ, σ). However, we do not know the actual average for X of the population from which our n birds were taken. The hypothesis that they were drawn from a population of swan-ravens without any admixture of black swans amounts to the hypothesis that they were drawn from a population with an average neck length of θ0. So at the level of the model of the theory we have a set of distributions [N(X|θ), θ ∈ Ω], where Ω denotes the parameter space. Our null hypothesis is

H: θ = θ0 in N(θ, σ).

Thus, the model of the primary hypothesis is the probability distribution N(θ0, σ). And, since we are interested in learning whether these birds have longer (rather than shorter) necks than they would have if they came from a population of swan-ravens only, the alternative hypothesis to be entertained is

J: θ > θ0.

Thus our alternative hypothesis states that the population from which the sample is drawn is described by some normal distribution N(θ, σ), where θ > θ0.

Corresponding to the n birds in our sample we have n independent random variables X1, X2, . . . Xn. Each possible experimental outcome will be represented by an n-tuple <Xn> of values for these variables, and the sample space J will be defined as the n-tuples representing possible outcomes for this experiment.

We assume that each random variable Xi would be distributed according to N(θ, σ). We can then define a test statistic S as the average

S = (X1 + X2 + . . . + Xn)/n.

Thus the model of the data will include the n-tuple <Xn> of actual measurements and the average of those measurements S(<Xn>), the test statistic. If the null hypothesis is true, this test statistic will also be distributed normally about θ0, but with a standard deviation σ_S = σ/√n.

Now we can choose a test rule that will allow us to use the test statistic S (the sample mean) to make a decision regarding the null hypothesis. In this case, where we have an interest in cases in which the mean of the observed sample exceeds the hypothetical mean θ0, but not in cases where the observed mean is less than the hypothetical mean (it is a one-sided test), we define a test rule T+, which maps from outcomes in the sample space to claims about the statistical hypothesis. Thus our test rule will take on the following form:

T+: Reject H: θ = θ0 iff S ≥ θ0 + d_α σ_S.

The choice of a value for d_α determines the value of α, known as the size or significance level of the test. The aim is to define a test that will reject the null hypothesis when it is true no more than (100α)% of the time, which is to say we seek to choose d_α such that

P(S ≥ θ0 + d_α σ_S | H) ≤ α.

In principle, we can say then that the experimental model includes (i) the sample space J, (ii) the testing rule T, and (iii) the probability distributions for any <Xn> ∈ J, given any θ ∈ Ω. For practical purposes, it is the probability distribution of S, given θ = θ0, that is of interest, particularly in guiding the choice of T.
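A minimal sketch of such a test rule, under the assumptions above (known σ, one-sided alternative), might look as follows. The numerical values for θ0, σ, α, n, and the sample are hypothetical placeholders, not quantities drawn from the swan-raven story.

```python
# Sketch of the one-sided test rule T+: compute S, then compare it to
# theta0 + d_alpha * sigma_S, with d_alpha chosen from the normal quantile
# so that the type I error rate does not exceed alpha.
import numpy as np
from scipy import stats

theta0 = 60.0   # hypothetical mean neck length of swan-ravens
sigma = 5.0     # hypothetical population standard deviation
alpha = 0.05    # chosen size of the test
n = 50          # sample size

rng = np.random.default_rng(1)
necks = rng.normal(loc=theta0, scale=sigma, size=n)  # hypothetical <Xn>

S = necks.mean()                      # test statistic: sample mean
sigma_S = sigma / np.sqrt(n)          # standard deviation of S under H
d_alpha = stats.norm.ppf(1 - alpha)   # so that P(S >= theta0 + d_alpha*sigma_S | H) <= alpha

if S >= theta0 + d_alpha * sigma_S:
    print(f"S = {S:.2f}: reject H (theta = theta0) at size {alpha}")
else:
    print(f"S = {S:.2f}: do not reject H at size {alpha}")
```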

So to find an excess of long-necked black swan-like birds indicative of the existence of black swans, we could, if we knew the mean length of the necks of swan-ravens, collect a sample of black swan-like birds, measure their necks, thus fixing an n-tuple representing the outcome, calculate the mean of that quantity, and then apply our rule, "rejecting" or "accepting" the null hypothesis according to whether or not the mean neck length in our sample sufficiently exceeds the hypothesized mean neck length for the population.

This would be a very orthodox application of the statistical testing theory known as Neyman-Pearson theory (NPT). As that theory of experimental testing is usually taught, one chooses one’s test statistic, one’s sample space, and one’s testing rule in advance. Consequently one determines in advance what outcomes will cause one to "accept" the null hypothesis and what outcomes will cause one to "reject" it. But CDF did not strictly adhere to these rules.

The choice of a test statistic was a complex matter for CDF in many ways. Top events could not be distinguished from background simply by choosing one parameter, and then looking for a difference in the mean for one's sample as opposed to the expectation for background. To distinguish the signal from the background, CDF had to make use of differences involving many parameters. They defined a set of cuts that would distinguish "candidate" events from everything else. This simplified the statistical task insofar as, rather than conducting tests involving many different test statistics simultaneously and then trying somehow to combine those results (as was suggested by proponents of various "likelihood" techniques), the physicists could define just one test statistic, "number of candidate events," and use a very simple statistical technique. High energy physicists trust their ability to use complicated instruments to distinguish one kind of event from another better than their ability to use sophisticated statistical methods, as is evidenced in the fate of the likelihood approaches to finding the top. (Recall Dan Amidei’s observation, noted in chapter two: "People are willing to devote great parts of their life to building hardware to count things, but they’re not willing to devote excess neural capacity to understanding a new mathematical formalism" (Amidei 1995), as well as the great skepticism brought to bear on the assumptions made by the likelihood analyses.) Thus the statistical method employed by CDF is quite simple. It is as if, rather than measuring the mean of the neck length in a sample of black swan-like birds, one were to simply count how many of those birds had necks longer than a certain amount, and then compare just that number with what one would expect that number to be if the sample consisted only of swan-ravens. For the case of measuring one quantity, there may not be any apparent gain in the adoption of such a method, but for the complex kind of discrimination involved in distinguishing top quark decays from other particle processes, the difference is great.

The model for such a test will resemble that described above, but with some important formal differences. Rather than characterizing the sample by a series of random variables, the mean value of which in the sample is used as the test statistic, the counting experiment test regards each of the N events in the sample as the outcome of an independent process with some probability p of "success." This yields an expectation value for the number of successes in N events of Np = λ0. Thus the model of the null hypothesis is a Poisson distribution with expectation value λ0 and standard deviation σ, i.e.:

H: λ = λ0 in P(λ, σ).

The alternative hypothesis is that the expectation value for the population from which this sample was drawn exceeds the value λ0, and thus the model of this hypothesis is a composite of distributions with values for λ such that

J: λ > λ0.

Thus each hypothesis is associated with a region of the parameter space Ω.

In other words, the statistical test being employed seeks to test the hypothesis that the sample of events has been drawn from a population in which the expectation for the number of candidate events equals a certain number (which has been determined to be characteristic of background processes), as opposed to having been drawn from a population in which the number of candidate events exceeds that number.

Thus far, the account I have given of the test of the top hypothesis is perfectly consistent with "orthodox" Neyman-Pearson testing. But one feature of orthodox uses of NPT is missing: the dichotomization of the test prior to its implementation. In an orthodox NPT test a testing rule is specified, prior to the determination of the value of the test statistic S in the sample, that maps from the sample space J to two possible outcomes: "reject H (and accept J)" and "accept H (and reject J)." The region of the sample space that is mapped to the outcome "reject H" is known as the critical region (CR). That is, the testing rule T consists of a map from J to {reject H, accept H} such that all instances of {S ∈ CR} map to {reject H}, and all instances that lie in the complement J − CR map to {accept H}. The properties of size (α) and power (1 − β) in an orthodox test are defined in terms of the error rates for these two outcomes, such that

 

P(T rejects H | H is true) = P(S ∈ CR | λ ∈ Ω_H) ≤ α

P(T accepts H | J is true) = 1 − P(S ∈ CR | λ ∈ Ω_J) ≤ β.

The first type of error is designated "type I," and the second is designated "type II."
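For the Poisson counting model, the size and power of a test with critical region CR = {S ≥ c} can be computed directly from the relevant distributions. The following sketch uses hypothetical values for the background expectation λ0, a specific alternative λ1, and the critical value c; none of these numbers comes from the text.

```python
# Sketch: size and power of a counting test with critical region {S >= c}.
# lambda0, lambda1, and c are hypothetical inputs.
from scipy import stats

lambda0 = 6.0   # hypothetical background expectation under H
lambda1 = 12.0  # hypothetical expectation under one alternative in J
c = 13          # critical value defining CR = {S >= c}

size = stats.poisson.sf(c - 1, lambda0)   # P(S >= c | lambda0): type I error rate
power = stats.poisson.sf(c - 1, lambda1)  # P(S >= c | lambda1)
type_II = 1 - power                       # P(accept H | lambda1)
print(f"size alpha = {size:.4f}, type II error beta = {type_II:.4f}")
```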

The sense in which probability is used here is a strictly frequentist conception. That is, the probability limits designated by the size and power of a test are upper limits on the relative frequency in the long run of certain outcomes in a series of applications of such an experimental test. It is this control of the relative frequency of errors that is extolled by proponents of NPT testing. Because this feature is retained even when the dichotomization of the sample space into "reject" and "accept" regions is not carried out prior to the measurement of S in the sample O, it is reasonable to regard the statistical test conducted by CDF in the counting experiments as part of the NPT tradition, as I will explain next.

Rather than specify a size for the statistical test in advance, with the accompanying commitment to a rejection/acceptance rule, CDF cites what might be called an "observed" significance level. Having found 15 "counts" in their data sample, they cite 0.0026 as the significance level of this result. They do not use the words "reject" or "accept" with respect to the background hypothesis, but, by combining this significance level with a broader argument about the features of the data, argue that the data provides "evidence" for the existence of the top quark. Let s_o designate the observed value for the test statistic S, and let α_o(λ) be the observed significance under the value λ of the parameter of interest in the population. Then we can say that

α_o(λ0) = P(S ≥ s_o | λ0).

This simply specifies that α_o(λ0) represents the probability, under the Poisson distribution for S with mean λ0, of obtaining a value at or above s_o.

Although in citing their observed significance CDF does not use the vocabulary of acceptance and rejection, it is implicit that were one to map all outcomes in repetitions of this experiment that had outcomes of S ≥ s_o to the event {reject H}, one would reject H when it is true (commit a type I error) no more than 0.26% of the time, assuming that CDF has carried out the significance calculation correctly and on the basis of true assumptions. The situation would be no different (except for the feeling one would have that a remarkable coincidence had occurred) if it had been specified in advance that one was going to reject H just in case 15 or more counts were found, in order to have a test of size 0.0026.
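The observed significance itself is just a Poisson tail probability, as the following sketch illustrates. The observed count s_o = 15 is taken from the text; the background expectation λ0 is not stated in this passage, so the value passed in the example call is a placeholder and the output is illustrative rather than a reproduction of the 0.0026 figure.

```python
# Sketch of the "observed" significance: alpha_o(lambda0) = P(S >= s_o | lambda0),
# the upper tail of a Poisson distribution with mean lambda0 at or above the
# observed count.
from scipy import stats

def observed_significance(s_o: int, lambda0: float) -> float:
    """P(S >= s_o) under a Poisson distribution with mean lambda0."""
    return stats.poisson.sf(s_o - 1, lambda0)

s_o = 15  # observed number of candidate events (from the text)
print(observed_significance(s_o, lambda0=6.0))  # lambda0 here is a placeholder
```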

 

5. The Error Statistical Model of NPT Tests

Thus far I have described how the hierarchy of models can be used to represent those features of experimental procedure that are relevant to one’s interest in drawing scientific conclusions from the experimental testing of hypotheses. This leaves unanswered, however, the question of just what the nature of such conclusions is, and how they are to be related to experimental procedures and the models that represent those procedures.

Deborah Mayo has analyzed the various approaches to interpreting NPT tests in terms of the criteria of a "good test" that different interpretive models would put forth. She has advocated her own model, called an "error statistical model" (originally termed a "learning model") for interpreting NPT tests that is intended to avoid the difficulties of the alternatives. In what follows I will examine the relationship between Mayo’s error statistical model and the alternatives to which it is opposed, explaining how it avoids problems that the alternatives face. I will argue that Mayo’s model does not constitute a complete rejection of an evidential interpretation of statistical tests (as one might think from the way her position is presented). Rather, it constitutes a rejection of the literal interpretation of the accept/reject terminology, along with the idea that significance levels can be translated directly into measures of evidential strength. In so doing it offers a means of evaluating evidential claims based on tests of hypotheses that is objective and firmly grounded in some of the most important aims of scientific research: avoiding erroneous conclusions, and distinguishing real from spurious effects.

 

A. The Error Statistical Model as an Alternative to Other Views

As Neyman-Pearson theory was originally presented by Jerzy Neyman and Egon Pearson in two seminal articles published in 1928 and 1933, the rationale for using NPT testing methods is clearly based on the fact that such methods enable one to limit the rate of one's errors in making decisions. Any notion of relating outcomes of such tests to assessments of evidential strength is flatly rejected:

 

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. . . . Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, 290–91)

In following the NPT method, one selects an accept/reject rule to meet one's requirements regarding the rate at which one is willing to make errors, but

 

[s]uch a rule tells us nothing as to whether in a particular case H is true . . . or false. . . . But it may often be proved that if we behave according to such a rule, then in the long run we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false. (Neyman and Pearson 1933, 291)

This approach may be termed the behaviorist interpretation of statistical tests. Mayo states the criterion for a good test on the behaviorist model as follows:

 

A good test is one that has an appropriately small frequency of rejecting H erroneously, and at the same time erroneously accepts H sufficiently infrequently (in a given sequence of applications of the rule). (Mayo 1985, 501)

The behaviorist interpretation provides a convincing rationale for the use of such methods to, for example, reduce the frequency with which a manufacturer ships out defective merchandise. A scientific researcher might find this way of thinking about statistical tests to be rather remote from her concerns, however. She wants to know, does the data at hand provide support for a certain scientific hypothesis, and if so, how much support? Consequently, one might also seek to derive from the outcome of a statistical test a measure of the strength of the evidence for a hypothesis that is contained in a given body of data. Many criticisms of NPT testing have assumed that statistical tests ought to provide such a measure, and have charged NPT tests with failing to provide it. In 1977, Allan Birnbaum proposed an interpretation of NPT that would support conclusions from the outcome of particular statistical tests regarding the strength of the evidence in support of a given hypothesis (Birnbaum 1977). This approach might be called evidentialist, and would yield the following criterion for a good test, according to Mayo:

 

A good test rejects H (accepts J) iff observed data . . . provides appropriately strong evidence against H and in favor of alternative J (i.e., weak evidence in favor of H as against J). (Mayo 1985, 503)

Critics of NPT have been quick to point out difficulties in formulating a measure of evidential strength based on the outcome of NPT tests. They note that, as pointed out originally by Dennis Lindley (Lindley 1957), for any given level of significance, it is possible, given a sufficiently large sample, for a test to reject a hypothesis at that level, when intuitively we would say that the evidence favors that hypothesis over the alternative.

To see the problem, consider the following example put forth by Howson and Urbach (Howson and Urbach 1993). A purchaser of tulip bulbs is unable to recall whether he ordered a shipment of forty percent red bulbs and sixty percent yellow bulbs (H), or a shipment of sixty percent red and forty percent yellow (J). One way to decide the issue would be to select a sample at random from the shipment and plant them to see what blooms. Suppose that a sample of size ten were chosen, and one intended to reject the null hypothesis (forty percent red, sixty percent yellow) only if the results passed a test with a significance level of 0.05. In that case one would reject the null hypothesis only if seven or more of the bulbs bloomed red. What Lindley, Howson and Urbach, and other critics of NPT point out is that as the sample size increases, the proportion of bulbs needed to reject the null hypothesis at a significance level of 0.05 decreases, gradually approaching 0.4. For a sample size of 100,000, only 40.26% of the bulbs need bloom red for the null hypothesis to be rejected. Given that only two hypotheses are considered possible, however, and that the alternate hypothesis is that 60% of the bulbs are red, a result such as finding that 40.26% of the bulbs are red would seem to be much stronger evidence for the null hypothesis than for the alternate hypothesis. Howson and Urbach conclude, "The thesis implicit in the current [NPT] approach, that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases, is not borne out in the example, which signals the reverse trend" (Howson and Urbach 1993, 209).
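The trend Lindley, Howson, and Urbach point to can be checked with a short calculation: for a fixed significance level, the smallest proportion of red blooms that triggers rejection of H falls toward 40% as the sample grows. The sketch below assumes the binomial model of the example; the sample sizes are illustrative, and for the largest one the critical proportion comes out at roughly the 40.26% figure cited above.

```python
# Critical proportion of red blooms needed to reject H (40% red) at a fixed
# significance level, for increasing sample sizes. Illustrates why a fixed
# level becomes a weaker and weaker standard as n grows.
from scipy import stats

p_null = 0.4
alpha = 0.05

def critical_count(n: int) -> int:
    """Smallest k with P(X >= k) <= alpha when X ~ Binomial(n, p_null)."""
    k = int(n * p_null)  # start near the mean and move upward
    while stats.binom.sf(k - 1, n, p_null) > alpha:
        k += 1
    return k

for n in (100, 1_000, 10_000, 100_000):
    k = critical_count(n)
    print(f"n = {n:>6}: reject H if at least {k} red blooms ({k / n:.2%})")
```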

Mayo argues that both the behaviorist and evidentialist models, and the criticisms directed at NPT based on them, are misguided, and proposes an alternative approach to interpreting statistical tests, the error statistical approach. Quoting some suggestive comments made by Egon Pearson, she puts forward the idea that "error frequencies are important, not because one is concerned simply with low error-rates in the long run; but because they provide ‘that clarity of view needed for sound judgment.’ . . . Tests accomplish this learning function by providing tools for detecting certain discrepancies between the (approximately) correct parameter values (within a statistically modeled problem) and the hypothesized ones" (Mayo 1985, 507, emphasis in original).

Consider again the model of CDF’s counting experiment analysis that I provided above. The value of the parameter λ under the null hypothesis is λ0. We might also consider the question of whether λ exceeds some value λ', where λ' > λ0. We can define the observed difference D_obs to be the difference between the observed value of the test statistic s_o and the hypothesized value λ0 of the parameter: D_obs = s_o − λ0. Mayo claims that the question, what has been learned from this test? is answered by means of the following principle:

IND: (i) D_obs is a good indicator that λ exceeds λ' only if (and only to the extent that) λ' infrequently gives rise to so large a difference.

(ii) D_obs is a poor indicator that λ exceeds λ' to the extent that λ' frequently gives rise to such a large difference. (Mayo 1985, 510)

Critics of NPT methods point to examples such as the tulip case cited above. From such examples they conclude that the outcome of a test might be such that, while the test rejects a hypothesis with a statistical significance that can be cited at a very low number, indicating that such a result would infrequently occur supposing the hypothesis were true, the data might provide stronger evidence for the rejected hypothesis than for the alternative.

Mayo argues for distinguishing the statistical conclusion (i.e., the accept/reject output of the test) from the scientific conclusion that one draws based on one's knowledge of the characteristics of the statistical testing situation. In light of this distinction, she proposes two distinct criteria for a good test:

 

LM: (i) A statistical testing procedure is good iff one is able to objectively evaluate what has and has not been learned from a statistical conclusion (reject or accept H).

(ii) A statistical test conclusion . . . is [poor] good for learning about a given discrepancy between λ and λ' to the extent that it is a [poor] good indicator that λ exceeds λ' [in the sense of IND (i) above]. (Mayo 1985, 512)

What Mayo proposes is that evaluating the outcome of a statistical test requires that one determine objectively the frequencies with which the test result occurs on the basis of a variety of values for λ. Thus, in the tulip example above, supposing λ represents the percentage of red tulips in the shipment, one would not be satisfied with simply rejecting the null hypothesis at a given significance level regardless of sample size. One would want to consider the frequency with which the observed proportion of red bulbs in the sample would occur not only on the basis of the null hypothesis value of 40% red bulbs, but also on other possible proportions of red bulbs, particularly the possibility of there being 60% red bulbs. Naturally, one would find that 40,260 red plants out of 100,000 would be even less likely given a population of 60% red plants than given a population of 40% red plants. In terms of Mayo’s framework, if we take λ' to be a value just slightly higher than 40% (such as 40.26%), then the observed difference between the value of the test statistic and the value hypothesized by the null hypothesis would in fact arise rather frequently. Consequently, for λ' = 40.26% the test conclusion "reject H at a significance level of 0.05" would be a poor indicator that the value of λ exceeds λ' (and an even worse indicator that λ = 60%). Hence this test conclusion would be poor for learning about such a discrepancy, and the testing procedure would not be good, as it would not enable one to evaluate objectively what had been learned from the conclusion of the test.
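On this reading, the relevant question is how frequently a count at least as large as the one observed would arise under various candidate proportions. A short sketch, using the figures from the example (40,260 red blooms out of 100,000), makes the point:

```python
# How often would a count at least as large as the observed 40,260 red
# blooms (out of 100,000) arise under different candidate proportions?
from scipy import stats

n, x_obs = 100_000, 40_260
for p in (0.40, 0.4026, 0.60):
    tail = stats.binom.sf(x_obs - 1, n, p)  # P(X >= x_obs | p)
    print(f"p = {p:.4f}: P(X >= {x_obs:,}) = {tail:.3g}")
```

Under 40% red the tail probability is small (roughly the 0.05 level); under 40.26% it is about one half; and under 60% it is essentially one. That is the sense in which the result is a poor indicator that λ exceeds 40.26%, and a still worse indicator that λ is as large as 60%.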

This reflects a point with which users of statistical procedures such as the CDF physicists are quite familiar: as the size of your sample increases, the significance level at which the null hypothesis is rejected should decrease. In using such tests, one does not consider the outcome of a statistical test as a static, inert result, but as an element in the dynamic process of developing evidence for or against a hypothesis. Good statistical method requires that one observe how the outcome of the test varies as one changes parts of the analysis, or as one collects more data. If, for example, adding the data collected in the first few months of run Ib had served only to maintain the same significance level that had been calculated for the run Ia data, this would have been regarded as a heavy blow against CDF’s claim to have found evidence of the top quark. If the effect they had detected during run Ia was real, and not a spurious statistical fluctuation, the experimenters expected to (and in fact did) see the effect become more significant as more data was added to the sample. That the distance between the observed number of candidate events and the background prediction grew from 2.8 standard deviations to 4.8 when they included another 48 inverse picobarns of data was regarded as a good indication that the effect was not spurious. If the effect had not become more significant with the addition of more data, CDF would have been forced to admit that they did not understand the nature of their findings.

The thesis that Howson and Urbach claim is "implicit" in the NPT approach to statistical testing is explicitly rejected on the error statistical model: one does not assume that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases. Rather, one regards a statistical test as "a standard tool for detecting discrepancies between a hypothesized parameter or model, and the parameters or models indicated by the experimental observation. . . . The ability to guarantee or control error frequencies is what enables an orthodox test to function as a nonsubjective tool for understanding what observed test results indicate about their source" (Mayo 1985, 513, emphasis in original).

 

B. The Error Statistical Model and Judgments of Evidential Strength

Although Mayo’s error statistical model is meant to serve as an alternative to an evidentialist interpretation of statistical testing such as Birnbaum’s, it clearly does not entirely separate statistical tests from judgments regarding how good the evidence from a particular experiment is for or against a particular hypothesis. On the contrary, she explicitly attempts to provide a means for addressing just such questions.

The notion that lies at the center of Mayo’s approach to this problem is that of a good (or poor) indicator. According to Mayo’s error statistical model, a statistical difference Dobs is a "good indicator that λ exceeds λ′ only if (and only to the extent that) λ′ infrequently gives rise to so large a difference" (Mayo 1985, 510). And a statistical test conclusion is good for learning about a given discrepancy between λ and λ′ "to the extent that it is a good indicator" that the former exceeds the latter. This requirement amounts to the severity requirement (to be discussed in more detail in chapter five), according to which, in one formulation, "Passing a test T (with e) counts as a good test of or good evidence for H just to the extent that H fits e and T is a severe test of H" (Mayo 1996, 180). The severity criterion may, for the kind of test mentioned in the above description of a good indicator, be formulated as follows: there is a very low probability (it would infrequently happen) that so large a difference as the one observed (the difference leading to the test conclusion in question) would occur, given that the hypothesis that λ exceeds λ′ is false.
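
Put compactly (this is my paraphrase of the passages just quoted, restricted to the one-sided tests under discussion, and not a formula Mayo gives in just this form):

    Dobs is a good indicator that λ exceeds λ′ just to the extent that Pr(D ≥ Dobs; λ = λ′) is small,

that is, just to the extent that so large a difference would rarely have arisen if λ were equal to (or less than) λ′.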

Mayo’s rejection of evidentialist interpretations is a rejection of the notion that any judgment regarding strength of evidence can simply be "read off" of the outcome of a single statistical test, in the sense that a hypothesis passing a test with given error characteristics is confirmed just as much as any other hypothesis passing a test with identical error characteristics. Nevertheless, one can, on the basis of a more sophisticated use of statistical testing, draw comparative conclusions regarding the strength of the evidence that an experiment gives for or against a certain hypothesis. How can this be done?

Consider how two tests with the same significance level but different sample sizes can be compared. In an example provided by Mayo, an investigator seeks to test the null hypothesis (H) that the average length, q, of a certain population of fish equals 12 inches (the hypothesized value qH) and is distributed normally, against the alternative hypothesis (J) that the length of the fish is distributed normally, but with an average length exceeding 12 inches. The test statistic in question is the average length of the fish collected in the sample, Xobs. The distance referred to in IND is then defined as Dobs = Xobs − qH. We might wish to consider two testing rules. One rule, T+-1600, tells us to collect a sample of 1600 fish, calculate their average length, and then reject the null hypothesis H at level 0.02 iff Xobs is so large that a value at least that large would occur no more than two percent of the time if H were true. The other rule, T+-25, tells us to collect a sample of 25 fish, and then reject H at level 0.02 iff Xobs is so large that a value at least that large would occur no more than two percent of the time if H were true. The only difference between the two tests, then, is their sample size.

Now consider test results that reject H at level 0.02, and the different facts they might indicate, in light of IND. Mayo asks us to consider a specific value of "infrequently": fifteen percent of the time. On the error statistical model, we can consider different possible values of q’. Mayo notes that the "values of q that give rise to .02-significant difference (no more than) 15 percent of the time are those that exceed . . . 12 by (no more than) [one standard deviation]." But since the standard deviation of the sample mean depends on the sample size, this yields a different value of q for each of the two tests. So if we consider rejections of H at the level of 0.02 from the two tests we find that "the result from T+-25 is as good an indication that q exceeds 12.4 inches, as is the result from T+-1600 that q exceeds 12.05 inches" (Mayo 1985, 511n). It may be that we have no scientific interest in a deviation from the hypothesized value of q as small as 0.05 inches. In that case, while both tests may reject the null hypothesis, it is from the test using a smaller sample that we learn more about the population as a whole. That is, while both test results have the same statistical significance, they do not both have the same scientific significance.
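
The figures of 12.4 and 12.05 inches can be recovered with a short calculation. The sketch below assumes a known population standard deviation of 2 inches, which is what those figures imply; for each sample size it locates the largest value of q under which a .02-level rejection would arise no more than fifteen percent of the time (Mayo's "one standard deviation" rounds the multiple of roughly 1.02 standard deviations that the calculation yields).

```python
# Sketch: what discrepancy from 12 inches does a .02-level rejection
# indicate, for the two sample sizes?  Assumes a known population
# standard deviation of 2 inches.
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
sigma = 2.0                            # assumed population standard deviation (inches)
z_cut = std_normal.inv_cdf(1 - 0.02)   # threshold for a .02-level rejection
z_freq = std_normal.inv_cdf(1 - 0.15)  # "no more than 15 percent of the time"

for n in (25, 1600):
    se = sigma / sqrt(n)          # standard deviation of the sample mean
    x_cut = 12.0 + z_cut * se     # smallest sample mean that rejects H at level .02
    q_star = x_cut - z_freq * se  # largest q for which such a rejection would
                                  # arise no more than 15 percent of the time
    print(f"n = {n:>4}: a .02-level rejection indicates q > {q_star:.2f} inches")
```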

It is this comparative evaluation of confirmation that also allows us to resolve the tulip case. Finding seven out of a sample of ten tulips to be red and finding 40,260 out of 100,000 tulips to be red both led us to reject at level 0.05 the hypothesis that 40% of the tulips were red. But if we consider whether these same findings indicate that the proportion of tulips that are red exceeds 40.26%, we find that, while the rejection of the null hypothesis on the basis of seven red tulips in a sample of ten remains a good indicator of such an excess (because a proportion of 40.26% red would very infrequently give rise to so great a difference), the rejection of H on the basis of 40,260 red out of 100,000 is a poor indicator of such an excess (because a proportion of 40.26% red would frequently give rise to so great a difference). In other words, the former result is better evidence that λ exceeds 0.4026 than the latter result. And since the latter result is very poor evidence that λ exceeds 0.4026, it is even worse evidence that λ equals 0.6.
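
The same style of calculation as before, applied at λ′ = 0.4026 to both sample sizes, reproduces this contrast. It again relies on the normal approximation, so the exact binomial figure for the ten-tulip sample would differ somewhat, but the comparison is unaffected.

```python
# Illustrative sketch (normal approximation to the binomial): how often
# would each of the two "rejecting" observations arise if the true
# proportion of red were only 0.4026?
from math import sqrt, erfc

def upper_tail(n, observed, p):
    """P(count >= observed) when the true proportion is p."""
    z = (observed - n * p) / sqrt(n * p * (1.0 - p))
    return 0.5 * erfc(z / sqrt(2.0))

print(upper_tail(10, 7, 0.4026))            # about 0.03: infrequent, a good indicator
print(upper_tail(100_000, 40_260, 0.4026))  # about 0.5: frequent, a poor indicator
```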

So the error statistical model proposed by Mayo in fact serves as a useful and reliable guide to deciding comparative questions of confirmational strength, for some cases in which we might wish to compare the extent to which e supports h to the extent to which e' supports h'. But note that it is not possible, on this model, to make such a comparative evaluation for every possible combination of evidence statement e and hypothesis h. In the fish-measuring example, we can say that the rejection of H by T+-25 is better evidence for the alternative hypothesis J (q > 12) than is the rejection of H by T+-1600, although both have the same significance level, because we can compare their frequency characteristics, not only under the supposition of the hypothesis that q = 12, but also under the supposition of other values of q in which we are interested. That is to say, the error statistical model becomes informative with respect to the comparative confirmation question only when we can consider values of q' that are of interest and the frequency with which the outcomes (values of Dobs) in question would occur under those values of q'. Where we cannot make such a comparison, the error statistical model will not in general allow us to compare the strength of the confirmation lent to hypotheses by evidence statements.

One type of situation in which such comparisons will not be possible will be when we are asked to compare test outcomes for tests directed at answering completely unrelated questions. Suppose the owner of a plant nursery finds that on a particular property 40% of her hydrangeas produce blue blossoms and 60% produce pink blossoms, and she wants to determine whether a certain fertilizer additive will increase the proportion of blue hydrangeas. Sectioning off a parcel of the property containing 100 plants, the nursery owner finds that after use of the additive 48 of the plants produce blue blossoms, thus leading to a rejection of the null hypothesis (that the percentage of blue hydrangeas with the use of the additive remains 40%) at a significance level of 0.05.

Suppose you were asked to compare the evidence provided by this experiment for the hypothesis that the additive increases the ratio of blue to pink hydrangeas to the evidence provided by the outcome of T+-25 for the hypothesis that the fish in the population average more than twelve inches in length. While it is true that we can compare the frequency characteristics of the tests by simply looking at the significance levels, we cannot compare the strength of the evidence provided by those tests for the alternative hypotheses in question without being able to compare their ability to detect the kinds of discrepancies in which we are interested. And since these tests concern themselves with different questions, there can be no direct comparison of their ability to do so. We are presumably interested in different discrepancies in the two cases, and so there is no common standard for comparing the evidential strength of these two outcomes on the error statistical model. (Incidentally, the nursery owner may be quite unimpressed by her findings, knowing, as she surely would, that adding fertilizer with ammonium sulfate will turn all of her hydrangeas blue.)

So while we may use the error statistical model to make decisions employing a comparative concept of confirmation, this concept will be applicable only locally. That is, for a given hypothesis h, we can compare the confirmation afforded h by evidence statements e and e’, provided that we can compare the frequencies with which e and e’ would occur under the supposition of various other hypotheses h’, h’’, etc., describing other possible alternatives of interest in considering h. This, of course, requires also that e and e’ not be produced in just any haphazard manner, but that they be produced by a procedure with a determinate sampling distribution.

The moral of the error statistical interpretation of hypothesis testing is that the methods employed in Neyman-Pearson testing are not meant to provide a measure of the strength of the evidence produced by an experiment for a given hypothesis according to some universal scale of confirmational strength. Nor do they aim to determine the "probability" (in whatever sense) of a hypothesis in light of all the evidence at hand. Hence it is a mistake to criticize these methods as unjustified for their failure to provide such information. Instead, these methods are to be judged for their ability to indicate, in Mayo’s phrase, "what has been learned" from a given experiment. In very many cases, as in the top search at CDF, the aim is to determine whether a given effect, a given discrepancy from the expectation based on the null hypothesis, is in fact real or not. And for answering such questions, Neyman-Pearson methods of hypothesis testing are quite useful indeed.

 

6. The Errors of Their Ways: Error Statistical Reasoning and Inferential Practices at CDF

One feature of the error statistical interpretation of hypothesis testing that is worth noting, and coheres well with the attitudes of CDF physicists in their comments on their own statistical procedures, is that the accept/reject terminology is itself drained of epistemic importance. On this view the relevant question becomes, not "does this test reject the null hypothesis at a specified significance level?" but, "given that this test does (does not) reject the null hypothesis at a specified significance level, what does that tell me about the phenomenon I am investigating? Does the data in hand give me good reason to believe the alternative hypothesis or not?" And this question might not easily be answered without considering additional facts about the data at hand and performing additional tests on it.

By effectively and carefully using Neyman-Pearson testing methods one can, to a great degree, isolate what is learned from the data at hand from whatever other opinions or beliefs one might have regarding the phenomena being studied. Methods that seek to provide a general theory of inductive inference typically aim to provide a means for determining the probability or degree of confirmation of a hypothesis based on all of the available evidence. However, the importance for experimentalists of being able to neglect all (or at least most) of the available evidence except for that produced in one’s own experiment can hardly be overestimated. The physicists of CDF expressed a wide variety of opinions about what could be expected of the top quark and its mass, based on considerations outside of their own experiment, such as the previous top searches and the calculations from electroweak measurements at LEP. But they were all in agreement about the need to maximize the extent to which the assessment of their own results was kept independent of these other opinions. The evidential impact of those other opinions on CDF’s findings could not be objectively evaluated. But by carefully devising a test, the frequency characteristics of which could be studied, the evidential weight of CDF’s own data could be objectively evaluated. That is to say, they could evaluate to what extent that data gave them reason to think that they were detecting a real, rather than a spurious, effect. The significance level simply reflects one aspect of that evaluation. It does not of itself indicate the strength of the rejection of the null hypothesis, or the degree of confirmation of the alternative hypothesis.

In citing a significance level one is simply making a statement about the frequency of a certain type of result given the truth of some hypothesis. In this case the "type of result" in question is "the observation of S ≥ 15 in a sample of this size of data collected in this way" and the hypothesis is "for a sample of this size drawn from this population, the mean value of S is 5.96 +0.49/–0.44." This simply tells one how good an indicator this test is of the effect one is examining. One can design a test that accepts or rejects erroneously with certain frequencies, but believing or not believing the hypothesis in question is a different, though related, matter. In other words, to say that a test accepts or rejects a hypothesis is very much like saying that a seismometer detects or fails to detect an earth tremor. When a seismometer detects a tremor, one believes that a tremor really happened only if one believes that the seismometer is a reliable and properly functioning instrument, the output of which is understood. When a statistical test accepts a hypothesis, one only uses that as a reason to believe or accept that hypothesis if one believes that the test is a good indicator of the truth of that hypothesis, and that the way in which the test works is understood. On this view a statistical test is a kind of tool for helping one to learn which effects are real and which are not. Scientists believe or disbelieve hypotheses on the basis of the output of tests, which is related in a certain way to the input. Some statistical tests have a binary output which we happen to designate "accept H" and "reject H." Others may stop at an earlier stage, yielding an observed significance level. In either case, the scientist using the test is on her own in deciding what to do with that output. If she is wise, she will in many cases withhold judgment pending the outcomes of other tests.
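
For the counting numbers quoted here, the sort of frequency statement at issue is easy to illustrate. The sketch below computes the simple Poisson probability of observing 15 or more events when 5.96 are expected; it ignores the quoted uncertainty on the background and everything about how the sample was selected, so it is not the figure CDF reported, only an illustration of what a statement of this kind refers to.

```python
# Illustrative sketch: Poisson probability of seeing at least 15 events
# when the background-only expectation is 5.96.  Ignores the uncertainty
# on the background estimate, so this is not CDF's quoted number.
from math import exp, lgamma, log

def poisson_tail(k, mu):
    """P(N >= k) for a Poisson variable with mean mu."""
    return 1.0 - sum(exp(i * log(mu) - mu - lgamma(i + 1)) for i in range(k))

print(poisson_tail(15, 5.96))   # on the order of 10^-3
```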

The physicists of CDF frequently referred with pride to the many cross-checks they performed on their data, particularly in preparing the Evidence paper. Most of these cross-checks were carried out without using formal statistical methods, and most of them were not included in the Evidence paper itself. The cross-checks that yielded troubling results, such as the search for SVX-tagged Z events, which turned up more tags than one would expect if the SVX algorithm were identifying only real b-containing events, were included in the paper. A major purpose of the cross-checks is to answer the question: is the discrepancy between the observed value of our test statistic (number of candidate events) and the expectation based on the null hypothesis (the background) the result of a real physical effect of the sort that our experiment is designed to detect, or is it a spurious statistical fluctuation?

Having evidence for a genuine effect, however, is not generally the end of the story. CDF maintained that they had evidence, not only of a genuine excess of candidate events that would only rarely occur by chance, but of the existence of the top quark. That is to say, they considered the value of the test statistic "number of candidate events" based on their data to constitute evidence that top quarks had been produced, and then decayed, in their detector. To make this claim they had not only to substantiate a statistical excess, but also to verify that the test results indicated the top quark hypothesis specifically, and not some other non-null hypothesis that might also explain those results. Reasoning at this level, too, involves considering the outcomes of some of the cross-checks that were performed.

For example, one of the important features of the candidate events in the SVX and SLT searches is that they are events in which b quarks were produced. If the features of those events that caused them to be identified as having b content were the result of something besides the production of b quarks, then the class of possible alternate hypotheses to be considered would be larger. Numerous cross-checks are therefore performed to ensure that the SVX b-tagging algorithm really does tag b-containing events. The results of one such cross-check are shown in figure 2.4, in which the effective proper decay length (cτeff) is measured in a sample of SVX-tagged events from a control sample, and compared to the same quantity measured in a sample of b-containing events produced by a Monte Carlo using the world average b-hadron lifetime.

Here is a cross-check for which no formal statistical analysis is presented, but on the basis of which one can assert informally, and with confidence, that so close a fit between the results of tagging on real events and on b-containing Monte Carlo events would be very improbable if the tagging algorithm were not picking out events with real b content.
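
If one did want to attach a rough number to such a comparison, one standard (though by no means obligatory) option would be a two-sample Kolmogorov–Smirnov test on the two cτeff distributions. The sketch below is hypothetical through and through: the arrays merely stand in for the measured and simulated decay lengths, which are not reproduced here, and nothing in the Evidence paper commits CDF to this particular procedure.

```python
# Hypothetical sketch: quantifying the agreement between a measured
# c*tau_eff distribution and a Monte Carlo expectation with a
# two-sample Kolmogorov-Smirnov test.  The arrays are invented
# placeholders, not CDF's data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
ctau_data = rng.exponential(scale=0.047, size=200)   # stand-in for tagged data (cm)
ctau_mc = rng.exponential(scale=0.047, size=5000)    # stand-in for b-hadron Monte Carlo

result = ks_2samp(ctau_data, ctau_mc)
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.2f}")
# A large p-value indicates that the two samples are consistent with
# having been drawn from the same underlying distribution.
```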

With reassurances such as these, when CDF presents potential alternate hypotheses besides the top quark hypothesis, they can restrict their attention to those hypotheses under which one would expect an excess of b-containing events. In general, they can assume, on the basis of their many cross-checks, that the candidate events do have the features that the cuts used were meant to select, such as, in the SVX search, a secondary vertex due to a b hadron, three or more energetic jets, and high missing transverse energy due to an undetected neutrino. When considering alternate hypotheses, they ask what other processes besides top quark production might result in an excess of events such as these. They consider, for example, the possibility that they are detecting a fourth generation quark, b’. Depending on how such a quark decayed, they conclude that they would either find no signal (neutral current decay), hence no excess, in the decay channels that they considered, or they would arrive at strongly conflicting production cross section estimates from the excess candidate events and from the mass estimate (charged current decay) (Abe et al. 1994b, 3008).

Note how naturally this kind of reasoning fits with the error statistical interpretation of tests, and with the logic of a severe test. In essence, the argument asks one to think of the same statistical test as a test in which the null (background) hypothesis remains the same, but the alternate hypothesis is now the hypothesis of a fourth generation quark, the b’. The sampling distribution for the null hypothesis remains the same, as does the value of the test statistic, but is the observed excess good evidence for an effect of the sort that we would expect if this new alternate hypothesis were true? Without using formal statistics, the CDF physicists answer that if the b’ tended to decay by charged current decays, then one might expect to see an excess of this sort, but then the cross-check of comparing the cross section estimated from the excess candidate events to that obtained from the mass estimate would not be favorable. On the other hand, if the b’ favored neutral current decays, the expected distribution of candidate events would be no different from that given by the null hypothesis, so that the data at hand is at least as good a reason to reject the neutral-current-decaying b’ hypothesis as it is a reason to reject the null hypothesis.

These and other similar arguments given in the "Alternate hypotheses" section of the Evidence paper (Abe et al. 1994b, 3008–9) essentially serve to show that, while the statistical test and assorted cross-checks presented in the Evidence paper constitute a severe test of the top quark hypothesis, which that hypothesis passes, other hypotheses of potential interest are either not tested severely by this test or are tested severely, but do not pass the test.

Thus the error statistical model of NPT testing not only explicates how one might give a reasonable interpretation to the outcome of such tests; it also shows a high degree of coherence with the way in which statistical tests were employed in this case. It seems unlikely that CDF physicists were being idiosyncratic in employing the modes of reasoning described above. Nevertheless, it would be worthwhile to examine statistical reasoning as conducted in other instances, and in other areas of scientific investigation, to see if the error statistical model does in fact reflect the typical use of statistical testing in scientific practice.

 

7. Conclusion

What I have attempted to do in this chapter is simply to lay the groundwork for future chapters. Some of the questions I have posed remain unanswered. I do not claim to have provided a completely satisfactory means for deciding which actions by experimenters are part of a given experiment and which are not. Such matters might seem important for addressing methodological claims that attach significance to when an experiment is performed, such as the idea that the hypothesis to be tested, and the prediction it entails, ought to be framed before the experiment is performed (an idea associated with hypothetico-deductivism, to be addressed in chapter four), or that the fact discovered in the experiment ought to be a new one, not known before the experiment was performed (a version of the so-called novelty requirement, to be addressed in chapter five). Suffice it to say for now that the models of experiments and of statistical reasoning that I have described in this chapter turn out to be sufficiently powerful to address such concerns without having to appeal to a demarcation between those actions that are "really" part of the experiment and those that are not.