A Testing Primer: A Researcher Translates for the Rest of Us

By
Mark Dynarski
Advisor, Education
George W. Bush Institute

Mark Dynarski, an education advisor to the George W. Bush Institute and founder and president of Pemberton Research, which focuses on understanding and using research evidence in decision making, walks through key topics and considerations in testing.

His accessible explanations of complex testing topics provide a handy reference for this series.

What is the difference between a diagnostic, summative, and formative test?

Tests have different purposes. A diagnostic test might be used when a teacher sees indications that a student may have a perceptual issue of some kind, such as dyslexia. A formative test typically is administered by a teacher to assess whether students have mastered a skill that has been the focus of instruction and can move on to the next one. It also provides the teacher with useful feedback about whether students are learning the skill and which aspects of it they might be having trouble with.

A summative test is typically broader than diagnostic and formative tests, encompassing the material for a whole subject area, such as reading or math, and for a whole grade level. These are the tests that states use for accountability purposes. Because all students take the same test, summative scores can be compared across schools and across districts.

The three kinds of tests overlap, but not by much. For example, a diagnostic test might indicate that a student has a perceptual issue. That same student might struggle on a formative or summative test because of that issue, but not necessarily: the student might have difficulty reading a passage to identify a theme yet have no difficulty manipulating fractions.

What are the essentials of a high-quality summative standardized test?

Most summative tests are built on principles of psychometrics, which is the science of measuring psychological traits, including knowledge. It is a highly mathematical and statistical field and most of its practitioners have doctoral degrees. Its two central concepts are reliability and validity.

Reliability is a broad concept with applications in many commercial and industrial fields, but, roughly speaking, a test is said to be reliable if it yields a stable and consistent score. If a reliable test is administered this week and again two weeks later, the two scores will be close.

A test is said to be valid when it measures what it is intended to measure. For example, a valid math test yields a score that is about the math skills being tested. If the skill is manipulating fractions, a valid test of that skill will yield higher scores for students who have greater skill in manipulating fractions.

A test can be reliable and not valid, or valid and not reliable. Imagine weighing yourself on a bathroom scale. The scale is reliable if it shows the same weight when your weight doesn’t change. But if it shows you an incorrect weight, it’s not valid. The scale is valid if it shows you the correct weight, but it is not reliable if it varies from day to day though your weight does not. A reliable and valid scale shows you the correct weight every time.
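
For readers who like to see the analogy in numbers, here is a minimal sketch with invented bathroom-scale readings (the figures are purely illustrative): one scale is consistent but off target, the other is on target on average but inconsistent.

```python
# Illustrative only: invented readings for two hypothetical scales weighing a
# person whose true weight is 150 pounds.
from statistics import mean, pstdev

true_weight = 150  # the quantity we want to measure

# Scale A: reliable but not valid -- readings barely vary, but all run about 5 lbs high.
scale_a = [155, 155, 154, 155, 156]

# Scale B: valid but not reliable -- readings average near the truth but swing widely.
scale_b = [142, 158, 149, 161, 140]

for name, readings in [("A (reliable, not valid)", scale_a),
                       ("B (valid, not reliable)", scale_b)]:
    bias = mean(readings) - true_weight   # distance from the truth: a validity problem
    spread = pstdev(readings)             # day-to-day inconsistency: a reliability problem
    print(f"Scale {name}: average error = {bias:+.1f} lbs, spread = {spread:.1f} lbs")
```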

It is hard to find examples of state assessments that are not both valid and reliable. A huge amount of research and development underpins these tests.

How are they built? What is the process like? Who determines the questions?

Summative tests given at the level of a whole state generally are created following a standard process. States first must have a set of standards describing what students in each subject and grade should learn. A state can design its own standards, but differences between standards can create confusion (what a fifth-grade math student is expected to learn in, say, Massachusetts might differ from what a fifth-grade math student is expected to learn in Michigan).

Reducing that confusion was an important reason why in 2009 a group of 48 states created a consortium that prepared one set of standards, which we know today as the “Common Core.” However, since that time, various states have moved back to creating their own standards, often using the Common Core standards as a starting point.

Standards in hand, teachers and other educators then create pilot questions to test knowledge related to the standards. For example, the question designers might come up with 10 questions that test a student’s understanding of a math concept. The pilot questions are reviewed to ensure that they test that concept and that they do not use language that might advantage one type of student over others.

Pilot questions that are deemed the best are then assembled into a pilot test. That test is administered in a sample of schools, and the results are scrutinized to ensure that the test has the desired statistical properties. For example, a test on which many students get no questions right, or get all of them right, is not desirable. Scores at those extremes hide real differences: a student who misses every question may know even less than the lowest score can show, and a student who answers every question correctly may know more than the highest score can show.

Ensuring that students do not cluster at the ceiling is one reason that tests include questions with high levels of difficulty for the grade level. These difficult questions sometimes lead to complaints that students can’t possibly know the answer to them — but that’s a feature of designing a sound test. At the opposite end, tests include easy questions that lots of students get right, but these kinds of questions rarely generate complaints.
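
As a rough illustration of the kind of statistic reviewers examine during piloting, the sketch below uses a small, invented set of pilot responses to compute the share of students answering each question correctly and flags questions where nearly everyone clusters at the top or the bottom.

```python
# Hypothetical pilot results: each row is a student, each column a question
# (1 = answered correctly, 0 = answered incorrectly). Invented for illustration.
responses = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
]

num_students = len(responses)
for item in range(len(responses[0])):
    p_correct = sum(row[item] for row in responses) / num_students
    note = ""
    if p_correct > 0.9:
        note = "  <- nearly everyone answers correctly; reveals little about differences"
    elif p_correct < 0.1:
        note = "  <- nearly no one answers correctly; reveals little about differences"
    print(f"Question {item + 1}: {p_correct:.0%} correct{note}")
```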

How do standards and tests align?

The process described above in principle yields a test that is derived from the standards. Because each question emerges from a standard, the test includes only questions that relate to some standard. Of course, if a teacher does not cover material related to a standard that is the basis for a test question, it’s predictable that the teacher’s students might do poorly on that question. That’s a consequence of the instruction students received, not of the test itself.

What tests best measure student growth?

Student growth in a physical sense is easy to measure. If a student’s height and weight last year are subtracted from their height and weight this year, we learn how much the student physically grew.

For tests, measuring growth is more challenging, but test designers have come up with various procedures that allow scores for the same student in different years to represent their achievement growth. For example, a student might score a 400 on their math test in fourth grade and a 500 on their math test in fifth grade. If the two tests are vertically scaled, the difference between the two scores reflects how much a student has grown in math.
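
Using the article’s illustrative scores, the arithmetic is simple, as the sketch below shows; it assumes the two tests are vertically scaled, that is, reported on one common scale across grades.

```python
# The illustrative scores from the example above, assumed to sit on a common
# (vertically scaled) score scale across grades.
score_grade_4 = 400   # fourth-grade math score
score_grade_5 = 500   # fifth-grade math score

growth = score_grade_5 - score_grade_4
print(f"Achievement growth from grade 4 to grade 5: {growth} scale-score points")

# By contrast, proficiency labels cannot be subtracted:
# "proficient" minus "proficient" has no numerical meaning.
```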

When scores are reported as levels of proficiency, it’s difficult to know how much students are growing. A student who scores “proficient” in fourth grade math and “proficient” in fifth grade math may have grown in their math achievement, but how much is hard to measure because subtracting proficient from proficient is not meaningful.

What things should we worry about in testing that we don’t worry about now? 

Few might worry about educators altering test scores to improve their likelihood of receiving more positive evaluations or cash bonuses, but the criminal charges that emerged from the cheating scandal in the Atlanta Public Schools are a reminder that paper-and-pencil tests can be manipulated. Moving to online testing will mitigate this possibility.

On the other hand, some parents have expressed concerns that education should be about more than filling in bubbles on test answer sheets. The perspective that multiple-choice tests somehow do not adequately test student knowledge has no basis, as the above discussion points out. However, that education is a broader concept than “stuff being tested” is undeniably true. A balance needs to be struck between wanting to know that students are learning and wanting education to be more than practicing and taking tests.

What testing innovations should we be excited about?

Two recent innovations are online tests and adaptive tests (which also can be online). Online tests clearly are where the testing field is heading, because they can be scored quickly and are far less amenable to manipulation. The tests that were designed for states adopting Common Core standards were online. There are concerns that students who do not regularly use computers (possibly because they do not have access at home or in their communities) might have more difficulty navigating online tests, but over time the number of these students is expected to decline.

An adaptive test feeds students increasingly difficult questions as they answer questions correctly. Adaptive tests can accurately identify a student’s knowledge level with only a few questions. They are more complex to design than the “same for each student” type of test, but the time savings may be worth the investment.
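
The sketch below is a toy version of that idea, not how operational adaptive tests select questions (those typically rely on item response theory): each correct answer nudges the next question up in difficulty, each miss nudges it down, and the process settles near a hypothetical student’s level within a handful of questions.

```python
# Toy "staircase" adaptive rule with an invented student and a 1-10 difficulty scale.
import random

random.seed(0)

student_ability = 6   # hypothetical true level on the 1-10 scale
difficulty = 5        # start near the middle of the scale

for question in range(1, 8):
    # Toy response model: the student usually answers correctly when the question
    # is at or below their ability, and usually misses when it is above.
    chance = 0.9 if difficulty <= student_ability else 0.2
    correct = random.random() < chance
    print(f"Q{question}: difficulty {difficulty} -> {'correct' if correct else 'incorrect'}")
    # Move the next question up after a correct answer, down after a miss.
    difficulty = min(10, difficulty + 1) if correct else max(1, difficulty - 1)

print(f"Rough estimate of the student's level: around {difficulty}")
```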

What is the difference between a test that establishes a floor of basic learning and one that looks at higher-order learning?

A test of basic learning is analogous to a written driver’s test. We expect that someone wanting to drive a car should have adequate knowledge about what speed limits mean or what various road signs mean. Passing the written driver’s test signals that an individual has that knowledge. It does not signal that the individual can drive a car at very high speed in pursuit of a lawbreaker, as police are trained to do. That would be a “higher order” of driving skill.

Some states require students to pass a basic learning test before they can graduate from high school. These tests create a floor for what students need to know before they can graduate. They differ from higher-order tests, such as Advanced Placement exams, that require deep subject knowledge.

What are cut scores? Who determines them? How are they determined? How are they adjusted?

Many states use scores from tests to categorize students into levels such as “below basic,” “basic,” and “proficient.”  The categories are defined by “cut scores,” which are the scores at which a student moves from one category to another.

Cut scores are, and need to be, the product of professional judgment. Creating cut scores requires educators to identify, through some process (there is a range of them, of varying sophistication), the score at which a borderline student will fall.

For example, using the so-called “bookmark process,” an educator may conclude that a student who is on the border between “basic” and “proficient” would answer questions one through twelve of a 20-question test correctly, when questions are ranked from easiest to hardest. Another educator might conclude that the borderline student might answer questions one through thirteen correctly. Another might think the student can answer questions one through eleven. Averaging these “bookmarks” yields a cut score. The process needs to be applied every time the test changes.
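
Here is a minimal sketch of the averaging step in that example; the conversion from a bookmark placement to a reported scale score is invented purely for illustration.

```python
from statistics import mean

# Each educator's judgment: the hardest question (1 = easiest, 20 = hardest) that a
# borderline basic/proficient student would still answer correctly.
bookmarks = [12, 13, 11]

average_bookmark = mean(bookmarks)   # 12 in this example
print(f"Average bookmark placement: question {average_bookmark:.0f} of 20")

# Hypothetical conversion from the bookmark placement to the test's reporting
# scale (an invented 100-300 scale here); the published cut score sits on that scale.
def to_scale_score(questions_correct, total_questions=20):
    return 100 + 200 * (questions_correct / total_questions)

print(f"Illustrative cut score on the 100-300 scale: {to_scale_score(average_bookmark):.0f}")
```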

An obvious weakness of applying cut scores is that they can discriminate between otherwise similar students. If a cut score is set to 200, a student scoring a 199 is “basic” and a student scoring a 201 is “proficient,” though the two-point difference is not meaningful.

What about English Language Learners and special education students? Why are tests important for these students?

An important feature of accountability going back to No Child Left Behind in 2002 has been that test performance for English learners and special education students is reported separately from that of other students. This separate reporting revealed that these students generally were doing more poorly than other students (as were ethnic-minority students).

Test designers have grappled with the challenges of designing tests in languages other than English and for special education students, and most states now administer alternate assessments. For example, Texas administers a Spanish-language version of its state assessment and has an Alternate 2 version of its test for students with significant cognitive disabilities.