

Principal Components Analysis (PCA)

MAIA land use example - Principal components analysis

Description: PCA takes a set of variables and defines new variables that are linear combinations of the initial variables. PCA expects the variables you enter to be correlated. PCA returns new variables, the principal components or axes, that summarize the information contained in the original full set of variables. PCA does not test any hypotheses or predict values for dependent variables; in that sense, it is more of an exploratory technique.

Simple example: Suppose you have many measurements of the physical shape of stream sites, {width, depth, flow, gradient, percent fines} and you would like to combine them into a single variable. PCA gives you back principal components, or axes, that summarize the information in your original data using many fewer variables. The model also tells you which of the original variables were most strongly correlated with the new variables, or axes. The results of the analysis are typically plotted using the first and second axes. If the cases fall into groups or clusters on the plots, you must determine what the cases have in common.
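As a rough illustration of the simple example, the sketch below standardizes five stream-shape variables, finds the first axis, and reports how strongly each original variable correlates with that axis. The site values are invented for illustration, and the helper names (standardize, pearson, first_axis) are hypothetical. A statistics package would normally do this work; the sketch uses only the Python standard library and a simple power iteration so that it runs on its own.

```python
import math

# Invented values for 7 hypothetical stream sites (not from any real survey).
# Columns: width (m), depth (m), flow (m^3/s), gradient (%), percent fines.
names = ["width", "depth", "flow", "gradient", "percent_fines"]
sites = [
    [2.0, 0.3, 0.5, 4.0, 10.0],
    [3.5, 0.5, 1.1, 3.1, 18.0],
    [5.0, 0.8, 2.0, 2.2, 25.0],
    [6.5, 0.9, 2.8, 1.5, 40.0],
    [8.0, 1.4, 4.1, 0.8, 55.0],
    [4.0, 0.7, 1.4, 2.9, 22.0],
    [7.0, 1.1, 3.0, 1.0, 47.0],
]

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def first_axis(z, iterations=500):
    """First principal axis of standardized data, found by power iteration."""
    n, p = len(z), len(z[0])
    # Correlation matrix of the standardized columns.
    corr = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]
    vec = [1.0] + [0.0] * (p - 1)  # arbitrary start vector
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec

z = standardize(sites)
axis = first_axis(z)

# Score of each site on the first axis: a linear combination of its variables.
scores = [sum(x * a for x, a in zip(row, axis)) for row in z]

# Correlation of each original variable with the new axis.
loadings = {name: pearson([row[i] for row in z], scores)
            for i, name in enumerate(names)}
for name, r in loadings.items():
    print(f"{name:>14}: {r:+.2f}")
```

The sign of an axis is arbitrary, so only the relative signs of the correlations carry meaning: here, variables that rise together share a sign, and a variable that falls as the others rise takes the opposite sign.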

MAIA example: Bryce et al. (2000) developed a risk index that summarized the intensity of human disturbance in the watershed upstream of sampled reaches. The risk index integrated information from the regional, watershed and reach scale. Each watershed was scored from 1 to 5 representing minimal to high risk of impairment.

The authors evaluated the association between the risk index and water chemistry measures known to be associated with human disturbance. They divided the water chemistry variables into 2 sets, 1 related to ion concentration and 1 related to nutrients. Within each set, the variables were highly correlated with each other; therefore, they used PCA to reduce the number of variables to a single axis, or principal component. They plotted the risk index against the first component from each set of variables.

High values of ionic strength and nutrient enrichment are related to human activities such as agriculture and urbanization. All 3 measures (ionic strength, nutrient enrichment, and risk index) increased together for both the ridges and the plateaus. Streams draining ridges are naturally lower in nutrients and ionic strength, as indicated by the higher proportion of sites in the lower left quadrant.


Figure. Risk index values for stream sites were derived from measures of human disturbance and indicate increasing disturbance from 1 to 5. Index values are plotted against the first axes from 2 PCAs, 1 related to ionic strength and 1 related to nutrient enrichment.

Diatom example: PCA of human disturbance measures: Within each watershed there were many different types of human disturbance. Principal components analysis was used to combine multiple measures into a single variable.

For the 1993-94 data, Bryce et al. (1999) developed an index to summarize the risk of human disturbance in a watershed. Metrics that were significantly correlated with the disturbance index were included in a multimetric diatom index.

Rather than test the diatom index with the data from sites sampled in 1993-94, independent data from 1995 was used. Unfortunately, the disturbance index was not calculated in 1995.


The disturbance index was approximated using PCA. Different combinations of variables were tested, and the set that best approximated the disturbance index was selected.

PCA based on chloride, total N, riparian condition measures, road density, % urban, forest, agriculture, and mine cover yielded an axis that was most highly correlated with the original disturbance index.
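The selection step described above can be sketched roughly as follows. This is not the authors' procedure: the text states only that combinations were tested and compared with the original index, so the exhaustive search over subsets, the use of standardized variables, the helper names (zscores, pearson, first_axis_scores), and every data value below are assumptions made for illustration. The sketch uses only the Python standard library.

```python
import math
from itertools import combinations

# Invented calibration sites where both the disturbance index (1 to 5)
# and the candidate measures are assumed to be known.
index = [1, 2, 2, 3, 4, 5, 3, 1]
candidates = {
    "chloride":     [1.2, 2.5, 3.0, 4.8, 7.5, 9.9, 5.1, 0.9],
    "total_n":      [0.2, 0.4, 0.5, 0.9, 1.6, 2.1, 0.8, 0.3],
    "road_density": [0.5, 1.1, 0.9, 1.8, 2.6, 3.4, 2.0, 0.4],
    "pct_urban":    [1.0, 3.0, 2.0, 8.0, 15.0, 30.0, 9.0, 0.5],
    "pct_forest":   [95.0, 80.0, 85.0, 60.0, 35.0, 20.0, 55.0, 90.0],
}

def zscores(values):
    """Rescale one variable to mean 0 and standard deviation 1."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def first_axis_scores(columns, iterations=500):
    """Site scores on the first principal axis of the given variables."""
    z = [zscores(c) for c in columns]
    p, n = len(z), len(z[0])
    corr = [[sum(z[i][k] * z[j][k] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]
    vec = [1.0] + [0.0] * (p - 1)  # arbitrary start vector
    for _ in range(iterations):    # power iteration
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return [sum(vec[i] * z[i][k] for i in range(p)) for k in range(n)]

# Try every combination of 2 or more variables and record how well the
# first axis tracks the index. The absolute value is used because the
# sign of a principal axis is arbitrary.
results = []
for size in range(2, len(candidates) + 1):
    for subset in combinations(candidates, size):
        scores = first_axis_scores([candidates[name] for name in subset])
        results.append((abs(pearson(scores, index)), subset))

best_r, best_subset = max(results)
print("best subset:", best_subset)
print("correlation with index:", round(best_r, 3))
```

With many candidate variables the number of subsets grows quickly, so in practice the search would likely be limited to combinations that make sense for the watershed being studied.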

How the method works: The data entered represent a cloud of points in n-dimensional space, where n is the number of variables. If you have 3 variables, then the cloud of points exists in regular 3-dimensional space, where each point has a value for x, y, and z. For example, each stream site has a value for width, depth, and flow, which defines 1 point in the cloud. The cloud is probably longer in one direction than another, and that longest dimension is where the points differ the most; that is where PCA draws a line called the first principal component. That line is guaranteed to be the line along which your sample points are spread the farthest apart; in that way, PCA "extracts the most variance" from your data. This process is repeated to get multiple components, or axes: each new axis is perpendicular to the earlier ones and captures as much of the remaining variance as possible.
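The geometric picture can be sketched in a few lines for the 3-variable case. The site values are invented, and the helper names (center, covariance_matrix, leading_eigenpair, deflate) are hypothetical. The sketch finds the longest direction of the cloud by power iteration on the covariance matrix, which is one simple way to do it and not necessarily what a statistics package does internally.

```python
import math

# Invented values for 6 hypothetical stream sites:
# (width in m, depth in m, flow in m^3/s).
sites = [
    (2.0, 0.3, 0.5),
    (3.5, 0.5, 1.1),
    (5.0, 0.8, 2.0),
    (6.5, 0.9, 2.8),
    (8.0, 1.4, 4.1),
    (4.0, 0.7, 1.4),
]

def center(rows):
    """Shift the cloud so that its centroid sits at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    vec = [1.0] + [0.0] * (p - 1)  # arbitrary start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p))
                for i in range(p))
    return value, vec

def deflate(matrix, value, vec):
    """Remove the variance already captured along vec."""
    p = len(matrix)
    return [[matrix[i][j] - value * vec[i] * vec[j] for j in range(p)]
            for i in range(p)]

centered = center(sites)
cov = covariance_matrix(centered)
total_variance = sum(cov[i][i] for i in range(len(cov)))

# First axis: the direction along which the points are most spread out.
var1, axis1 = leading_eigenpair(cov)
# Second axis: repeat the process on what is left after removing the first.
var2, axis2 = leading_eigenpair(deflate(cov, var1, axis1))

# Position of each site along the first axis.
scores1 = [sum(x * a for x, a in zip(row, axis1)) for row in centered]

print("first axis direction:", [round(a, 3) for a in axis1])
print("share of variance on axis 1:", round(var1 / total_variance, 3))
print("share of variance on axis 2:", round(var2 / total_variance, 3))
```

Because the raw variables here have different units and spreads, the variable with the largest variance dominates the first axis; standardizing the variables first is a common alternative when that is not wanted.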

Assumptions/limitations: You can't test whether the line selected by PCA is "significant," but you can plot the results and evaluate where the cases fall in relation to each other. It's up to you to decide if the patterns make sense. Thus, PCA is typically considered an exploratory technique.

Although no hypotheses are tested with PCA, the model assumes that the data follow a multivariate normal distribution.
