September 15, 2020

The Evolving Lessons of Testing

Scott Marion, executive director of the Center for Assessment, details how new, richer testing may evolve.

Scott Marion is executive director of the Center for Assessment, a New Hampshire-based organization that focuses on state testing and accountability systems as a way to help students advance academically. As part of his role, the former high school science teacher consults with state education leaders nationwide on their assessments and accountability systems. Before his current position, he was director of assessment and accountability for Wyoming’s Department of Education and received his PhD from the University of Colorado.

Marion, a school board member in New Hampshire, spoke before the 2020-2021 school year began with Anne Wicks, the Ann Kimball Johnson Director of the Education Reform Initiative at the George W. Bush Institute, and William McKenzie, senior editorial advisor at the Bush Institute. He detailed how new, richer testing may evolve. He explained the intricacies of setting cut scores for standardized exams. And he discussed strategies that could help schools assess any learning deficits from the end of the pandemic-stricken 2019-2020 school year.

What role does assessment play in teaching and learning? What role does it play in accountability? And how are those alike and different?

This question gets at the heart of one of the most important issues in testing: Tests are designed and validated for specific purposes.

I think tests should be designed primarily for supporting instruction and learning. The test should follow good instruction and curriculum. And the best tests are intricately linked with the day-to-day curriculum. Teachers can use them to help kids improve if they are tied to the curriculum.

We also need tests for “bigger” purposes, like monitoring. At the state level, we have tests for monitoring, but also to use as metrics in accountability systems. They provide crucial data. And state accountability systems, starting with No Child Left Behind (NCLB), shifted the discussion from input-based systems, where the focus is on whether the school has enough resources, to output-based systems, where you concentrate on whether the kids are learning, for example, math, language arts, and science.

What role should high-quality assessment play – especially as we try to understand how all kids are being served or not – as this new school year unfolds?

Even though I direct the Center for Assessment, I think assessments should take a back seat to ensuring kids are okay emotionally, physically, and mentally. Then, you can start engaging in instruction.

What I worry about most with large-scale testing in the fall is that it’s going to set up a remediation mindset: “You didn’t learn all the stuff you should have learned in fourth grade, so we are putting you at fourth-and-a-half grade and remediating for half a year. And we are going to do that as opposed to trying to move you to fifth grade standards as quickly as possible.”

Good teachers will quickly see that their student is missing an important concept. But they will remediate as they go along, instead of teaching everything the student didn’t learn in the previous year.

What teachers really need are good pre-assessments for the first unit or two that they’re teaching this year. What key knowledge and skills do kids need to succeed in this first or second unit? Then teachers must figure out what they need to build those skills. After that, they should do a pre-assessment for every following unit to find out what they need to shore up.

All that said, if states have money to redirect to districts that are showing considerable gaps, it is fine to give a large-scale short test, like one class period of English language arts, in the beginning of the year. That would give a general picture.

Have you detected any willingness by a state to try that?

Some states are doing that. I wouldn’t do it until the second or third week, when you can afford to miss one class period and kids aren’t greeted with a test when they first get back in school. And only do it if you can do something with the results. Some states have discretionary money through the CARES Act to provide extra professional learning support to help teachers catch their kids up.

Most states are trying to provide guidelines for sensible use of districts’ existing interim assessments, rather than requiring a common statewide assessment. There are tradeoffs with either approach.

What you are describing also requires great teaching and leadership for all kids, and we know that it is uneven. And we know that kids who can least afford gaps are likely to be disproportionately impacted by that unevenness. How can data help states appropriate their discretionary CARES Act money to intervene quickly? How should we think about testing in that case?

Right. If every teacher and principal were great, we wouldn’t need people in my business.

High-quality curriculum, and I mean really high-quality, is the best bang for the buck. It will come with decent embedded assessments that will guide teachers about moving to the next step, whether that is reviewing a certain concept or moving to the next unit. In this day and age, there is no excuse for not having a curriculum that’s available online. It shouldn’t just be a textbook.

We have too much curriculum slop in this country, and I know we love our local control. But states could offer to help buy great curriculum. It does not have to be the same curriculum for all districts, but any one used should be high quality. That would be a big deal. And then the assessment can follow, because teachers will get smarter about their content area and student learning through that high-quality curriculum.

How should parents or even teachers think about the range of tests students take each year? What is the right balance?

A friend of mine had this great quote, “A collection of tests is no more a system than a pile of bricks is a house.” We do have a lot of tests and the inefficiencies are massive. Most of it is not the state’s fault, but the state has this big pink elephant that everybody sees.

When I first got on my local school board, my daughter was in fourth grade. I went to a parent-teacher conference, and the teacher, who was terrific, gave us the results of 11 different tests. I asked why. She said, “I don’t know. We’re required to do this.” So in my first board meeting I asked the superintendent to create an assessment task force with the district to figure out what tests were being used and why.

You need a group of stakeholders to start by mapping out what assessments you need to make learning better in your district. Then you can take an honest look at the assessments currently in use to see if they meet those learning needs – and with what evidence. And then you can make high-quality changes as needed.

What would you say to educators and parents about the importance of validity, comparability, and reliability in assessments? How do you explain those?

Validity is the overarching consideration for any test. It is saying that I’m making an inference about what you know and can do as a result of that score. If you get a particular score on a writing test, you could infer someone is a good writer. Validity is the extent to which inferences you’re making from test scores are supported by evidence, logic, and reason.

Reliability is just the consistency of the results. It is essentially how consistently a result would appear if we could test you over and over again. Now, you could be consistently wrong on a test. So, reliability is only good if it’s also valid.
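The point about being "consistently wrong" can be illustrated with a small sketch. Everything in it is hypothetical: the true score of 750, both sets of repeated scores, and the use of average error as a rough stand-in for the "wrong" part. It is not a full account of validity, which concerns the inferences drawn from scores, only a picture of why consistency alone is not enough.

```python
from statistics import mean, pstdev

TRUE_SCORE = 750  # hypothetical "true" level for one student

# Hypothetical results if the same student could be tested five times.
consistent_but_off = [720, 721, 719, 720, 720]  # tightly clustered, far from 750
noisy_but_centered = [730, 770, 745, 760, 745]  # scattered, centered on 750

def summarize(scores, true_score=TRUE_SCORE):
    """Return (spread, average_error) for a set of repeated scores."""
    spread = pstdev(scores)                    # small spread = consistent results
    average_error = mean(scores) - true_score  # far from 0 = consistently wrong
    return spread, average_error

spread_a, error_a = summarize(consistent_but_off)
spread_b, error_b = summarize(noisy_but_centered)

print(f"consistent but off: spread={spread_a:.2f}, average error={error_a}")
print(f"noisy but centered: spread={spread_b:.2f}, average error={error_b}")
```

The first set is far more consistent than the second, yet it misses the hypothetical true score by 30 points every time, which is the situation Marion describes.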

Comparability matters a lot when we’re talking about assessments used for monitoring trends over time. Or when we’re using them in accountability systems.

Let’s talk about cut scores, which can be mysterious to a layperson. They have so much meaning when we’re talking about comparability, validity, and reliability. How should we think about cut scores?

It was such a great idea at the founding of the standards-based education movement that we would have content standards that everybody should know. And then we would have achievement standards that say how good is good enough. But the implementation has been tricky.

But this is all a human endeavor. We have thought-out and defensible ways of establishing these cut scores, but they are human judgments. The panelists involved in determining them build some consensus around what proficiency, advanced, or basic mean when it comes to a set of test questions.

These cut scores come with a narrative, which is important. The better the narrative descriptions, the better they distinguish the knowledge and skills that a kid needs to know and be able to do at each stage. These narratives should be written in a way that a layperson should be able to read them and agree that they describe a student who is basic, proficient, or advanced.

It gets tricky for kids who are close to a cut score, who are a coin flip on one side or the other. If Anne scored 762 and Bill scored 760, you all are essentially the same. But if the cut score’s 761, Anne is proficient and Bill is basic. That’s a big label, right? Any time you set a cut score, it’s a big deal because the label carries a lot of inferential baggage, which goes back to our validity discussion.
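The Anne and Bill example can be worked through in a short sketch. The scores (762 and 760) and the cut score (761) come from the example above. The rule that a score at or above the cut counts as proficient, and the margin of 5 points used to flag scores as "close to the cut," are assumptions made only for illustration, not values from any real test.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    name: str
    score: int
    label: str
    near_cut: bool

def classify(name, score, cut_score=761, margin=5):
    """Label a score against one cut score and flag scores close to it.

    cut_score=761 is from the interview example; margin=5 is a hypothetical
    stand-in for measurement error, and treating a score equal to the cut
    as proficient is also an assumption.
    """
    label = "proficient" if score >= cut_score else "basic"
    near_cut = abs(score - cut_score) <= margin
    return Classification(name, score, label, near_cut)

results = [classify("Anne", 762), classify("Bill", 760)]
for r in results:
    note = " (close to the cut)" if r.near_cut else ""
    print(f"{r.name}: {r.score} -> {r.label}{note}")
```

Both students land within the assumed margin, yet they receive different labels, which is why the label deserves more caution than the two-point gap suggests.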

What innovations are you seeing in testing? What are you paying attention to?

Many states are trying to “innovate.” They’re unfortunately trying to do it within the constraints of the Every Student Succeeds Act (ESSA). And ESSA doesn’t give you much wiggle room. You have to still meet the same accountability requirements.

States are going to have to figure out some way of evaluating students if kids are learning remotely for at least part of this school year. As they do, they should assume that students will have their calculator and access to Google. Still, the teachers can give them questions that require hard thinking.

For example, they could give students prompts that require them to dig deeply into all the resources they have. Some will know that they need to explore, say, Wikipedia on a question, but others will need some guidance. That would be part of the teaching: how to use these other resources.

In short, teachers could ask students much richer questions and provide much richer instruction. But it should never just be one or the other. Whatever you do in assessment should have already been learned in instruction. I can’t give you a really complex test question if I’ve never helped you learn to think complex thoughts.

Once we all go back after the pandemic ends, and schools see how these new assessments work, they won’t give up on them. They will see that their kids have learned to think more deeply and demonstrate that thinking in deeper ways. We hope this will lead to more individual work.

This will vary for kids with learning disabilities, but there are ways to support that. And for little kids, it’s trickier. Still, schools should push them towards deeper levels of understanding, which they will need as they advance beyond third and fourth grade.

What have we learned about online testing over the last three to six months? Are you less optimistic or do you think these obstacles can be overcome?

We have moved to online testing over the last 12 years, so testing companies have pretty robust platforms. We’ve gotten that down pretty well in our fixed settings, like schools.

But online testing is harder when you move to remote settings and don’t know what technological devices kids have. Some interim assessment companies have figured out how they can administer their tests as long as you have some sort of stable browser. But the ability to securely administer assessments remotely at a large scale is still lacking in terms of the trustworthiness of the data. Just like we’re warp speeding on a vaccine for the coronavirus, the industry is going to have to move with warp speed to deal with administering these assessments in hybrid environments.

The most likely scenario for getting decent data over the next year, even if schools are in this hybrid situation, is cycling kids into the building over a couple of days, keeping social distancing and using masks, and doing the testing in person. And I do think we’ll need data next year, just to see what the heck’s going on.

I hope that when we look back in five years, and think how the system has evolved, that we will have learned the good lessons. We won’t have gone back to the status quo, but we will have learned about more personalized instruction and deeper learning. That is my hope.
