Commissioner's Remarks - Benefits and Limitations of States Benchmarking to International Standards:<br />A Meeting to Assist States in Making Informed Decisions about Participating in International Assessments

Mark Schneider
Commissioner, National Center for Education Statistics

Benefits and Limitations of States Benchmarking to International Standards:
A Meeting to Assist States in Making Informed Decisions about Participating in International Assessments
May 30, 2008

MEETING SUMMARY

Introduction

This document provides a brief overview of the presentations and discussion at the National Center for Education Statistics (NCES) May 30, 2008, symposium on the benefits and limitations of states benchmarking to international standards. Several prominent experts in assessment and standards were asked to present and be available for questions during the discussion with national and state education policymakers (meeting participants). The symposium consisted of two sessions of formal presentations. NCES Commissioner Mark Schneider provided introductory remarks for both sessions. Each session concluded with a discussion among the audience and presenters. Tom Loveless (Brookings Institution) moderated the discussion of the first session. Institute of Education Sciences Director Grover J. "Russ" Whitehurst moderated the second session's discussion.

Synopses of each session's presentations, and the topics of discussion at the end of each session, are presented on the following pages of this document in the order in which they occurred. The formal presentations in each session were as follows:

Session 1: What Do International Assessments Measure?

Trends in International Mathematics and Science Study (TIMSS) and Progress in International Reading Literacy Study (PIRLS): Ina Mullis (International Study Center at Boston College)
Program for International Student Assessment (PISA): Ray Adams (Australian Council for Educational Research)
Comparing assessment frameworks-the National Assessment of Educational Progress (NAEP), PISA, and TIMSS: Eugene Owen (NCES)
Aligning state policies to international assessment standards: Sandy Kress (Akin Gump)
Discussion

Session 2: How Are the Data From International Assessments Used?

Analytic possibilities and data limitations: Larry Hedges (Northwestern University)
Alternatives to Empirical Benchmarking: Gary Phillips (American Institutes for Research)
Comments on the Hedges and Phillips presentations: Jack Buckley (NCES)
Discussion

For further information about the symposium, contact Dan McGrath (Program Director, NCES International Activities Program) at daniel.mcgrath@ed.gov.

Session 1: What Do International Assessments Measure?

Introductory Remarks (Mark Schneider, NCES Commissioner)

To provide context for the session, Dr. Schneider reviewed past efforts to benchmark student performance, at the state level, to international standards. The TIMSS 1995 Benchmarking Study included five states and a consortium of school districts, and was funded by the states and districts involved. In 1999, the study was expanded to include 13 states and 14 districts or consortia of districts, with the additional costs underwritten by NCES and the National Science Foundation. In 2003, one state, and in 2007, two states participated: in these three cases, the states paid for the costs of administration. Some states have expressed to NCES an interest in participating in PISA 2009, although, as of mid-June 2008, none had entered into the contracts needed to undertake the assessment.

Dr. Schneider underscored that what the assessments measure-in terms of the knowledge and skills they emphasize, how they relate to states' curricula, and the ages and grade levels assessed-has important policy implications for states. PISA and TIMSS are very different from each other in what they measure. In turn, policy implications drawn from them may differ—and may differ from policy implications that would be drawn from the states' own assessments. Therefore, it is important for states to ask themselves, "Does this assessment measure something about which I care?"

What TIMSS and PIRLS Measure (Ina Mullis, Co-Project Director)

Benchmarking to International Standards - TIMSS & PIRLS: A Bridge to School Improvement (2,487 KB)

Dr. Mullis described TIMSS and PIRLS.

TIMSS

Population: Grades 4 and 8
Comparison countries: 37 countries (15 member countries of the Organization for Economic Cooperation and Development (OECD)) at grade 4 and 50 countries (11 OECD countries) at grade 8 in 2007
Subject matter assessed: Mathematics and science
- Mathematics subscales: Number, Geometric Shapes, and Data Display at grade 4; Number, Algebra, Geometry, Data, and Chance at grade 8
- Science subscales: Life Science, Physical Science, and Earth Science at grade 4; Biology, Chemistry, Physics, and Earth Science at grade 8
Cognitive domains: Reasoning, Applying, Knowing
Consistency with U.S. curriculum: Developed with participating countries' curricula in mind
Other data: Student background and school context data drawn from surveys of students, teachers, schools, education system (national or state), and parents (in some countries)
Next administration: 2011 (which will coincide with NAEP 4^th- and 8^th-grade math and science assessments).

PIRLS

Population: Grade 4
Comparison countries: 38 countries (19 OECD countries) in 2006
Subject matter assessed: Reading
Subscales: Literary and informational
Consistency with U.S. curriculum: Developed with participating countries' curricula in mind
Other data: Student background and school context data drawn from surveys of students, teachers, schools, education system (national or state), and parents (in some countries)
Next administration: 2011 (which will coincide with the NAEP 4^th-grade reading assessment).

What PISA Measures (Ray Adams, Project Director)

What do international assessments measure: PISA (1,116 KB)

Dr. Adams presented on PISA. Key points included:

Population: 15-year-old students
Comparison countries: 57 countries (30 OECD countries) in 2006
Subject matter assessed: Reading literacy, mathematics literacy, and scientific literacy
Major domain: Reading (2000), mathematics (2003), science (2006), reading (2009)
Competency clusters (science 2006): Identifying scientific issues, explaining phenomena scientifically, drawing conclusions based on evidence
Consistency with U.S. curriculum: intended to measure the application of accumulated knowledge to real-world situations; not curriculum based
Other data: Student background and school context data drawn from surveys of students, schools, education system (national or state), and parents (in some countries)
Next administration: 2009

Comparing TIMSS and PISA to NAEP (Eugene Owen, NCES)

Comparing International Assessments to NAEP (452 KB)

Dr. Owen presented results from recent studies undertaken by NCES to compare TIMSS and PISA to NAEP in terms of their measurement frameworks, their relative emphases across content areas and cognitive skills, and the likelihood of the content/skills they assess being included in curricula in U.S. schools. Key points included:

TIMSS-NAEP 2007 comparisons

Similar content domains in mathematics and science (grade 4 and 8); some differences in distribution across cognitive domains
Vast majority of TIMSS items "likely" in U.S. schools' curricula

PISA-NAEP science comparisons

Content and cognitive differences greater than those found between TIMSS and NAEP (e.g., "rate" as a concept does not appear in the NAEP framework, whereas it is found in many PISA items)
More than half of the PISA science items were rated as "not likely to be a NAEP item"
The closer PISA comes to its objective of assessing scientific literacy, the harder it is to map items to NAEP

Aligning State Policies to International Assessment Standards (Sandy Kress, Akin Gump)

Aligning State Policies to International Assessment Standards (37 KB)

Mr. Kress discussed the relationship between standards and assessments and the importance of thinking strategically about standards, policy and practice, and the use of assessments as benchmarks. Key points included:

Assessments should be driven by standards and the execution of policy and practice toward learning to those standards. They should not be administered in a vacuum.
Standards in the United States are generally vague, lacking in rigor, poorly structured, and not well aligned to clear goals. This can lead to ineffective (even counterproductive) uses of resources, including assessments.
States need to ask themselves, "Are students taught to these standards? Should they be? Would our business and higher education leadership want them to be? Have we analyzed whether they're right, better than the specifications in other assessments? NAEP?"
The real work is getting our standards right: figuring out what students should know and be able to do to be ready for postsecondary success; setting strong graduation standards, benchmarked to the best in our own country and around the world and consistent with postsecondary requirements; back-mapping those standards to earlier grades; then allocating resources and assessing success, by solid tests of student learning, to those standards.

Session 1 Discussion

Moderated by Tom Loveless (Brookings Institution), the discussion at the end of the first session included the following main topics and points:

1. What international organizations are responsible for PISA, TIMSS, and PIRLS, and how are decisions about the assessments made?

The International Association for the Evaluation of Educational Achievement (IEA) sponsors TIMSS and PIRLS.
The Organization for Economic Cooperation and Development (OECD) sponsors PISA.

Questions that were asked about these organizations led to a discussion of the governance arrangements for each of the assessments.

The IEA is an independent, international cooperative of national research institutions and governmental research agencies. Overall policy for TIMSS and PIRLS is set through meetings of the IEA General Assembly with representatives from member countries. Operational decisions are made through meetings of National Research Coordinators and representatives of the IEA and the International Study Center at Boston College, which has the international contract for developing and administering TIMSS and PIRLS. Within the United States, TIMSS and PIRLS are sponsored by NCES. NCES staff act as U.S. National Research Coordinators for the studies and oversee the work of the national data collection contractors, which implement the studies in the United States.
The OECD is an intergovernmental treaty organization composed mainly of industrialized countries. Overall policy for PISA is set through meetings of the PISA Governing Board, which consists of representatives from the 30 OECD countries; non-OECD countries participating in PISA may also send representatives to these meetings as observers. Operational decisions are made through meetings of National Program Managers and representatives of the OECD and the Australian Council for Educational Research (ACER) and other members of the consortium of contractors, which has the international contract for developing and administering PISA. Within the United States, PISA is sponsored by NCES. NCES staff act as the U.S. National Program Manager and oversee the work of the national data collection contractors, which implement PISA in the United States.

2. Separation of data collection/reporting and policy interpretation

Audience members and presenters debated the importance of the separation of data collection and interpretation. A concern raised by Tom Loveless, the chair and discussion moderator for session 1, was the extent to which the international bodies (the IEA and, particularly, the OECD) mix the reporting and policy implications of the results of the international assessments. In the United States, Office of Management and Budget guidelines call for a separation in time and space of the reporting of statistical results and the policy interpretation of those results. Strict separation of data collection/reporting and interpretation for policy implications helps to maintain the credibility of the statistical results. To the extent that the bodies representing international assessments mix the release of results with policy prescriptions drawn from an interpretation of the results, they risk undermining public confidence in the integrity of the data. This is especially true if the prescriptions are not based on rigorous scientific evidence.

3. What is the difference between a curriculum-based assessment, like TIMSS, and an assessment that measures the "yield" of learning, like PISA?

Questions raised about the concept of a "yield," and what exactly PISA tests if it is not curriculum based, led experts to explain that PISA focuses on the application of learned knowledge and skills to situations that students are expected to encounter as young adults. These situations are not tied to specific curricular objectives.

4. Importance of aligning states' standards to those of the top-performing nations

Several people echoed Mr. Kress's statements about the importance of setting rigorous standards. There was considerable discussion about how to encourage the development and implementation of rigorous standards across the states.

5. How is curricular information collected in the United States, and how can we learn more about the curriculum of other countries?

Several people called for collecting information about the curricula of the top-performing countries, and there was discussion about how curricular information is collected in the United States and internationally. The problem with focusing on only data from the top-performing countries was also noted—since low-performing countries may be employing the same practices as top-performing countries. TIMSS and PIRLS each ask countries to report curricular information, and the information is available through the International Study Center at Boston College (timss.bc.edu or pirls.bc.edu). Reporting this information accurately for the United States is challenging. While NCES consults with national education groups (such as the Council for Chief State School Officers) when responding, curricular information is often reported as varying across states and districts.

6. What are the costs of administering international assessments in the states?

Because no states have participated in PISA in the past and relatively few have participated in TIMSS, it is difficult to estimate the costs. However, based on national costs and samples of 1,500 students per state, a broad estimate would be $1 million per state for PISA 2009. Questions were raised about why U.S. costs are apparently so high relative to costs in other countries, and why, in the United States, PISA has been much more expensive per student to implement than TIMSS. Costs of business for survey data collection in the United States are high relative to costs in other countries. Differences in costs between PISA and TIMSS within the United States are driven by the relatively higher recruitment, data collection, and scoring costs for PISA, as well as economies of scale resulting from the historically larger U.S. samples for TIMSS. NAEP is far less expensive per student than PISA or TIMSS, largely because of economies of scale.

Top

Session 2: How Are the Data From International Assessments Used?

Introductory Remarks (Mark Schneider, NCES Commissioner)

Dr. Schneider described large-scale assessments such as TIMSS, PISA, and NAEP as instruments that perform a limited set of functions exceedingly well, but cautioned against putting them to uses for which they were not intended. Dr. Schneider raised three main points to be discussed in more detail by the presenters:

1) The international assessments do not provide usable scores at the student, classroom, school, or, in most cases, district level. Since they do not produce student scores and since they are cross-sectional, these assessments cannot measure individual student gains over time.

2) Cross-national measures of what matters in learning are still in their infancy. While we have made great progress in developing cross-national measures of knowledge and skills, key aspects of classrooms, schools, and education systems remain unmeasured or poorly measured.

3) International assessments are expensive. There may be alternate ways to generate state TIMSS and PISA scores that are less expensive and less burdensome. However, linking studies are based on statistical models that rely on a set of assumptions, and people have to be comfortable with the assumptions built into the model. Furthermore, results obtained from these models do not lend themselves to the kinds of analysis for which many researchers use TIMSS and PISA.

Analytic Possibilities and Data Limitations (Larry Hedges, Northwestern University)

Benchmarking with National and International Assessments (39 KB)

Dr. Hedges discussed the benefits and limitations of state benchmarking with PISA, TIMSS, and PIRLS. Key points included:

Benefits

Benchmarking allows comparisons between student performance in a state and in other nations
PISA may be an easier assessment to compare cross-nationally because it measures real-world skills rather than academic skills
TIMSS and PIRLS are closely tied to academic skills, thus more relevant to curriculum, instruction, and school policies
Grade-based assessments are more relevant for U.S. purposes
Background material may be important for policy discussions, especially if material on instruction is included (although these data are hard to gather).

Limitations

Cross-national development of instruments requires compromises that limit an assessment's relevance for local purposes
Background questionnaires are imperfect for U.S. purposes
There is a temptation to draw policy conclusions from cross-sectional comparisons, which is not warranted
PISA, which is an age-based assessment of 15-year-olds, may not be particularly policy relevant for measuring the output of U.S. schools
Because it is not tied to school curricula, PISA has limited usefulness for monitoring school outputs

Other benchmarking possibilities

NAEP:
- Already a relevant U.S. state assessment in place, involving no international compromises on content specification
- However, its background questionnaire provides limited information
NCES longitudinal studies:
- Better suited than cross-sectional studies for hypothesis generation
- Superior background measurement
- Could provide comparisons over a wide age range
- However, they do not offer international benchmarking

Alternatives to Empirical Benchmarking (Gary Phillips, American Institutes for Research)

Obtaining International Benchmarks for States Through Statistical Linking (83 KB)

Dr. Phillips discussed alternatives to benchmarking that is based on testing large numbers of students in each state. He emphasized methods that would allow states to use NAEP scores to estimate scores on international scales. Key points included:

Statistical linking involves expressing scores on one assessment in the metric of another assessment that is thought to measure the same thing. Phillips has converted states' NAEP mathematics and science scale scores into the metric of TIMSS and estimated rankings for states among participating TIMSS countries.
Statistical linking has the advantages of being inexpensive and not requiring the administration of the international assessment in states.
However, the two assessments to be linked must be administered in the same year (or nearly the same year) and must measure the same knowledge and skills.
TIMSS was designed to be similar to NAEP and appears to be linkable.
PISA and NAEP are sufficiently different in what they measure and in the populations they assess that they are not good candidates for linking.

Comments on the Hedges and Phillips presentations (Jack Buckley, Deputy Commissioner, NCES)

What Use Are International Assessments for States? (339 KB)

Dr. Buckley served as discussant for this session and provided comments on the Hedges and Phillips presentations. He also presented ongoing NCES work on the feasibility of using small area estimation methods for producing state-level estimates from national samples of international assessments. Key points included:

With current national PISA sample sizes (approximately 5,600 students), state-level PISA estimates are possible; however, the confidence intervals around the estimates are very large.
The inclusion of additional information at the school level (for example, schools' rankings on state assessments) might improve the precision of PISA estimates.
Larger national sample sizes would improve precision as well.
Work is continuing on the improvements in precision to be expected from increased sample sizes and additional school-level data.
However, as with Phillips' statistical linking methods, the international estimates are modeled estimates and not empirically derived—hence, to use these estimates, the audience for these results must be comfortable with the assumptions of statistical modeling.

Session 2 Discussion

Moderated by Grover J. "Russ" Whitehurst (Institute of Education Sciences Director), the discussion at the end of the second session included the following main topics:

1. What are states interested in gaining from international assessments, especially as it relates to benchmarking?

Concerns about economic competition fundamentally drive the states' interest in international benchmarking: governors and business leaders are worried about how well our students and workforce measure up to those of our international competitors.

A new international assessment sponsored by the OECD, the Program for International Assessment of Adult Competencies (PIAAC), slated for data collection in 2011, should directly address concerns about the skills of adults. PIAAC will assess adults in literacy, numeracy, and problem solving in a technology-rich environment and provide benchmarks to most of the OECD countries.

2. Efforts needed to raise standards

Members of the audience asked whether state and national leaders would commit themselves to raising standards and agreeing on them. This would require better coordination on the part of the federal government and the states, as well as a discussion about what resources are needed to reach this goal. Many current high school exit exams are thought to test at a low level. There were calls for state leaders to design more appropriate tests and accept higher failure rates.

3. Seven questions for benchmarking internationally

There was considerable discussion around a set of seven questions that an audience member suggested we should be able to answer about the 10 top-performing countries (and perhaps the 10 bottom-performing countries):

What do they want their kids to learn?
What is their curriculum?
How do they deliver it?
How do they assess it?
What is their cut score?
How do they do against it?
How well prepared are those who pass it (based on the standards in their country)?

The symposium concluded with Commissioner Schneider thanking the presenters and participants and inviting further discussion.

Top

NCES Headlines