Watershed Classification System for Tiered Diagnosis of Biological Impairments: A Scalable, Central Plains Focus with National Applicability
The goal of this research was to produce an ecoregionally stratified classification system for estimating biological impairment of watersheds. Such a system can be used for ranking watershed vulnerability to impairment and can contribute to developing recommendations for ecosystem rehabilitation. The overarching hypothesis for the research is that landscape-scale surrogates for watershed condition (i.e., stressor indicators) derived from remotely sensed and geospatial data can be used to predict watershed vulnerability and measures of water quality and biological integrity (i.e., response indicators). Unique aspects of this research are (1) the use of time-series satellite data to characterize the dynamic nature of landscapes; (2) the use of classification and regression tree analysis for prediction and classification; (3) the implementation of the model using a trace map; (4) the inclusion of a broad range of watershed sizes, from first order streams to great rivers; (5) the integration of field data from multiple sources to create a large database of response samples; and (6) the large geographic scale of the study. This work was conducted under EPA Cooperative Agreement RD-83059701.
Data and Methods
The research project required development of a geographic database to characterize watershed landscape conditions linked to a field sample database containing information on in-stream conditions related to water quality, benthic macroinvertebrates, and fish. Spatially, the combined database covers the four-state EPA Region 7 (Iowa, Kansas, Missouri, Nebraska).
The geographic database consists, for the most part, of data readily available from public sources. These include infrequently updated or functionally invariant datasets such as digital elevation models, transportation networks, soils, hydrography, human population, and land cover (static landscape variables). Time-series data from the Advanced Very High Resolution Radiometer (AVHRR) satellite were used to derive a set of vegetation phenology metrics (VPMs) that express the annually dynamic character of the landscape (dynamic landscape variables). To explore the relationship of short- and long-term landscape trends to field data, two sets of 1- through 5-year VPM temporal averages were calculated; one set using values immediately prior to and including the sampling year and one that excluded the sample year. In total, there are 235 layers in the geographic database (15 static, 220 dynamic).
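The two sets of temporal averages described above can be illustrated with a minimal sketch. It assumes one VPM value per year for a watershed. The function name `window_averages`, the dictionary layout, and the NDVI values are hypothetical and are not taken from the project database.

```python
from statistics import mean

def window_averages(vpm_by_year, sample_year, include_sample_year, max_window=5):
    """Return {window_length: mean VPM} for 1- through max_window-year windows.

    Windows end at the sampling year when include_sample_year is True,
    otherwise at the year before it. Windows with missing years are skipped.
    """
    end_year = sample_year if include_sample_year else sample_year - 1
    averages = {}
    for n in range(1, max_window + 1):
        years = range(end_year - n + 1, end_year + 1)
        if all(y in vpm_by_year for y in years):
            averages[n] = mean(vpm_by_year[y] for y in years)
    return averages

# Hypothetical peak-season NDVI values for one watershed (illustrative only).
peak_ndvi = {1995: 0.60, 1996: 0.62, 1997: 0.58, 1998: 0.64, 1999: 0.66, 2000: 0.70}

# Set 1: windows that include the sampling year (here, 2000).
with_year = window_averages(peak_ndvi, 2000, include_sample_year=True)
# Set 2: windows that end the year before the sampling year.
without_year = window_averages(peak_ndvi, 2000, include_sample_year=False)
```

Each call yields five averages, so a single VPM contributes ten temporal-average values per watershed under this scheme.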
The field sample database was compiled largely from existing datasets collected for wadeable and non-wadeable streams by various groups and agencies from the four states that span a 10-year period (1994-2003). Extensive work went into standardizing and merging these separate datasets and in matching reported locality information to georeferenced stream and river segments. The field sample database was supplemented with additional non-wadeable field sites collected under this grant.
The field sample database includes more than 13,000 sampling events representing over 1,300 sites. The original datasets included far more sampling events and sites, but variation in the parameters measured and lack of standardization in field protocols, in addition to uncertainty of sample locations, limited the number that could be included. Measurements of water quality parameters were most numerous. There were far fewer fish and benthic macroinvertebrate samples collected. Unfortunately, very few sample events include all response indicators. After review of available response indicators from each type of sample, one measure was selected from each for analysis: total nitrogen concentration in the water sample (‘TotN’), the proportion of fish species collected in the sample event that are classified as taxa sensitive to water quality impairment (‘PrpSensF’), and the number of macroinvertebrate families collected in the sample event (family-level richness: ‘InvFmRch’).
Initially, we proposed to base model development and watershed classification on use of an Index of Biotic Integrity (IBI) representing a composite of multiple measured parameters, standardized by various criteria. During the course of this project, use of an IBI was rejected in favor of using the individual metrics. First, IBIs are specific to ecoregions (or even smaller geographic areas). Development of a region-wide index useful across multiple ecoregions is problematic. Second, the calculation of an IBI reduces all field variable information into a single value, making modeling results difficult to interpret. It was concluded that such a reduction would result in an unwarranted loss of potentially significant information that could only be gleaned through analysis using the metrics individually.
The analysis of the field sample data in relation to landscape variables was conducted in three stages. The first two stages focused on issues of spatial scale as they relate to implementing the classification system. In the third stage, regression tree analysis was used to examine temporal and additional spatial issues impacting the relationship between potential landscape stressors and environmental response variables.
Results were used to guide development of final classification models. An execution of a model using a trace map concludes the research.
Area of Influence
In the first project, the influence of spatial proximity of landscape stressors to biological and water chemistry response indicators was investigated. Analysis was conducted only on watersheds larger than 1,024 square kilometers. Ten areas of increasing size, or areas of influence (AOIs), immediately upstream of the sample point were delineated, and summary statistics were extracted from the two 1- through 5-year temporal average VPM datasets. Summary statistics were also extracted for the entire watershed. Bivariate correlation analysis was conducted between the VPMs and the three field metrics. For comparison, correlation analysis was also conducted between percent cropland and the three field metrics.
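For readers unfamiliar with the statistic, the R2 of a bivariate correlation is the square of the Pearson correlation coefficient between one predictor and one response. The sketch below computes it in plain Python. The function name `r_squared` and the sample values are assumptions made for illustration, not the software or data used in the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical watershed-level values (illustrative only).
senescence_rate = [1.0, 2.0, 3.0, 4.0]
total_nitrogen = [2.0, 4.0, 6.0, 8.0]   # perfectly linear in the predictor
unrelated = [1.0, -1.0, -1.0, 1.0]      # no linear relationship

r2_linear = r_squared(senescence_rate, total_nitrogen)
r2_none = r_squared(senescence_rate, unrelated)
```

Repeating this calculation for each VPM, each AOI, and each field metric gives the grid of R2 values from which the best-performing predictors can be ranked.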
The results showed that statistics extracted for the entire watershed correlated better with response variables than statistics extracted from the AOIs. Furthermore, the top performing VPMs explained more variation than did cropland fraction, suggesting that VPMs contain relevant dynamic information related to climate and/or land use management. Optimum VPM predictors for ‘TotN’ were rate of senescence and maximum season NDVI (R2 ~ 0.6, entire watershed), for ‘PrpSensF’ the optimum predictors were rate of greenup and average growing season NDVI (R2 ~ 0.38, entire watershed), and for ‘InvFmRch’ the optimum predictors were average growing season NDVI and peak season NDVI (R2 ~ 0.35, entire watershed). Corresponding R2 values for percent cropland and the three target variables were ~ 0.34 for ‘TotN’,
In the second project, the effect of watershed size on the relationship between landscape stressors and biological and water chemistry indicators was investigated in depth. Analysis focused on two questions: (1) do the ecological measurements demonstrate dependence on watershed size, and (2) does watershed size affect the relationship between watershed-level VPM statistics and the target values? Statistics were extracted for the entire watershed for all watersheds, and bivariate correlation analysis was conducted between the VPMs and the three field metrics and between percent cropland and the three field metrics.
For question one, the results showed that each field metric demonstrated some dependence on watershed size. For question two, the results showed that the strength of relationship for total nitrogen and proportion sensitive fish taxa improved with increasing watershed size, while the strength of relationship for invertebrate family-level richness remained relatively constant across watershed sizes. A secondary analysis found that longer temporal window averages of top predicting VPMs performed better than shorter temporal window averages. Optimum VPM predictors for ‘TotN’ were rate of senescence and date of maximum NDVI (R2 ~ 0.5); for ‘PrpSensF’ the optimum predictors were average growing season NDVI and rate of green-up (R2 ~ 0.15); and for ‘InvFmRch’ the optimum predictors were average growing season NDVI and peak season NDVI (R2 ~ 0.3). Corresponding R2 values for percent cropland and the three target variables were ~ 0.4 for ‘TotN’, ~ 0.05 for ‘PrpSensF’, and
Regression Tree Analysis and Classification Model Development
The third project used classification and regression tree analysis (RTA) to (1) identify landscape predictor variables (static and dynamic stressors) across multiple scales (spatial and temporal) that reliably model aquatic response variables, and (2) develop a classification scheme for watershed vulnerability. To identify suitable landscape predictor variables, regression tree (RT) models were developed to examine (a) only static predictors, (b) a combination of both static and dynamic predictors, and (c) only dynamic predictors. In general, model R2 values were higher using the dynamic predictors than the static predictors. Models developed using both sets of predictors performed either comparably to or better than those using the dynamic predictor set alone.
Examining global model results for ‘TotN’, initial splits using the static predictors most often were based on percent cropland and soil K-factor (R2 ~ 0.39); initial splits using the combined predictors most often were percent cropland and rate of senescence (R2 ~ 0.47); and initial splits using the dynamic predictors most often were rate of senescence and NDVI at green-up onset (R2 ~ 0.48). For ‘PrpSensF’, initial splits using the static predictors most often were based on ecoregion and soil K-factor (R2 ~ 0.57); initial splits using the combined predictors most often were also ecoregion and soil K-factor (R2 ~ 0.62); and initial splits using the dynamic predictors most often were longer temporal window averages for NDVI at greenup onset and rate of greenup (R2 ~ 0.56). For ‘InvFmRch’, initial and secondary splits using the static predictors most often were based on ecoregion (R2 ~ 0.38); initial splits using the combined predictors most often were based on ecoregion, 5-year average maximum NDVI value, and 3-year average date of dormancy (R2 ~ 0.59); and initial splits using the dynamic predictors most often were based on various temporal windows for average growing season NDVI, average NDVI at dormancy onset, and average maximum NDVI (R2 ~ 0.56). Upon examination of random data subsets to assess the robustness of the results, tree topology (structure and composition) varied greatly in many cases due to the large number of predictor variables.
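To make the notion of an "initial split" concrete, the sketch below shows how a regression tree typically chooses its first split: it searches every predictor and every candidate threshold for the partition that most reduces the sum of squared errors in the response. This is a generic illustration of the technique, not the software used in the study. The function names, field names, and data values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_first_split(rows, response, predictors):
    """Return (predictor, threshold, r2) for the best single binary split."""
    total = sse([r[response] for r in rows])
    best = (None, None, 0.0)
    if total == 0:
        return best  # constant response: nothing to explain
    for p in predictors:
        values = sorted({r[p] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[response] for r in rows if r[p] <= t]
            right = [r[response] for r in rows if r[p] > t]
            r2 = 1 - (sse(left) + sse(right)) / total
            if r2 > best[2]:
                best = (p, t, r2)
    return best

# Hypothetical watersheds (illustrative values only).
rows = [
    {"pct_cropland": 10, "senescence_rate": 0.2, "tot_n": 1.0},
    {"pct_cropland": 80, "senescence_rate": 0.3, "tot_n": 1.2},
    {"pct_cropland": 20, "senescence_rate": 0.8, "tot_n": 4.0},
    {"pct_cropland": 90, "senescence_rate": 0.9, "tot_n": 4.2},
]
predictor, threshold, r2 = best_first_split(
    rows, "tot_n", ["pct_cropland", "senescence_rate"]
)
```

In this toy data the split on `senescence_rate` wins because it separates low and high nitrogen values almost perfectly. With many correlated predictors, several candidate splits can score nearly the same, which is one reason tree topology can change between data subsets.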
Three ecoregions were selected for RT modeling to examine how model results are affected when applied at a different spatial scale. The ecoregions were the Central Irregular Plains, the Western Corn Belt Plains, and the Ozark Highlands. Results generally supported the use of global models, while at the same time indicating the need for more locally determined models in particular situations.
Regarding model preference for short-term or long-term average VPM values, results showed that longer temporal window averages produced slightly higher R2 than shorter temporal window averages when ‘TotN’ and ‘PrpSensF’ were considered. For ‘InvFmRch’, there was no clear preference for temporal window size.
In an effort to derive a robust model suitable for use in classification, trees were grown using only the top performing dynamic predictors. Using these restricted variable sets, the largest common tree form observed across different data subset evaluations was identified as the final model. In terms of performance, the resultant robust models were 70-80 percent as effective as their unconstrained counterparts (‘TotN’ R2 ~ 0.42, ‘PrpSensF’ R2 ~ 0.39, ‘InvFmRch’ R2 ~ 0.48).
This work concludes with an application of the ‘PrpSensF’ model to all stream points within a watershed in southeastern Kansas. The application provides an example of a stream trace map in which each point in the stream network within the watershed is “predicted” using the final model from the regression tree analysis.
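The idea of a stream trace map can be sketched as applying a fitted regression tree to the landscape statistics associated with each stream point. The tree below is entirely hypothetical: its predictors, thresholds, and leaf values are placeholders and do not reproduce the final ‘PrpSensF’ model. The point identifiers and feature values are likewise invented for illustration.

```python
def predict(node, features):
    """Walk a nested-dict regression tree until a leaf value is reached."""
    while "value" not in node:
        goes_left = features[node["predictor"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["value"]

# Hypothetical tree (placeholder splits and leaf values).
tree = {
    "predictor": "greenup_rate", "threshold": 0.5,
    "left": {"value": 0.10},
    "right": {
        "predictor": "avg_growing_season_ndvi", "threshold": 0.6,
        "left": {"value": 0.25},
        "right": {"value": 0.40},
    },
}

# Hypothetical stream points, each carrying statistics for the area draining to it.
stream_points = [
    {"id": "pt1", "greenup_rate": 0.4, "avg_growing_season_ndvi": 0.7},
    {"id": "pt2", "greenup_rate": 0.6, "avg_growing_season_ndvi": 0.5},
    {"id": "pt3", "greenup_rate": 0.7, "avg_growing_season_ndvi": 0.8},
]

# The "trace map" is the predicted response at every point in the network.
trace_map = {pt["id"]: predict(tree, pt) for pt in stream_points}
```

Mapping the resulting values along the stream network would give a continuous picture of predicted condition rather than predictions at sampled sites only.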
VPMs are dynamic landscape predictors that explained 30-50% of the variance in the three response variables. VPMs are generally better at predicting the response variables than land use/land cover and other commonly used landscape predictors. Although VPMs are correlated with land cover (e.g., percent cropland), they contain additional information such as vegetation condition and cropping practice. It is this information that may contribute to the stronger predictor/response relationships. The predictor/response relationship was stronger when statistics were extracted for the entire watershed rather than for areas of influence near the sample point, and it was often stronger for larger watersheds. Using regression tree modeling, VPMs can be used to rank order (classify) watershed response conditions.