Disruptions Present Opportunities to Do Something New in School Testing

Larry Berger, CEO of Amplify, spoke about the purposes of different assessments, the way textbooks are selected, and how we could use the COVID-19 crisis to think anew about testing students.

Larry Berger is the CEO of Amplify, an education company that delivers digitally enhanced curriculum, assessments, and educator support to districts across the U.S. A Rhodes Scholar and former White House Fellow, Berger also serves on the boards of the Southern Education Foundation, Lapham’s Quarterly, and the Academy of American Poets.

In a conversation with Holly Kuzmich, the George W. Bush Institute’s executive director; Anne Wicks, the Ann Kimball Johnson Director of the Education Reform Initiative at the Bush Institute; and William McKenzie, the Bush Institute’s senior editorial advisor, he spoke about the purposes of different assessments, the way textbooks are selected, and how we could use the COVID-19 crisis to think anew about testing students.

What role does assessment play in teaching and learning, and what role does it play in policy and accountability?

There are a lot of types of assessment. Some of the mistakes we make in education happen when we hope that one kind of assessment, designed to play a particular role, can play several other roles. It’s like hoping a thermometer can take your pulse.

Some assessments are designed to address what the teacher is teaching this week. These are useful to the teacher, because they reflect how students are doing with what was just taught.  And they are useful to the student because they indicate “I’ve mastered this,” or “I still have work to do.”

However, the more an assessment is tailored to a given classroom, the less well it can measure the classroom next door, with its own idiosyncrasies. So these formative, instructional assessments are hard to use for policy and accountability.

Conversely, the further we get from the specifics of what a given teacher is teaching, asking more general questions designed to compare where kids are in their overall academic progress, the more valid the assessment becomes for policy and accountability. These summative tests will tell you that a student is advanced for a fourth grader, or how all of your fourth graders did compared with last year’s fourth graders, but they won’t tell a teacher what a kid doesn’t understand about a particular question. Was there some background knowledge they were missing? Did they misunderstand the concept itself? Or did they just make an error in working the problem?

Most districts and schools also use interim or benchmark assessments.  There are many flavors of these, but the most widely used ones today are designed to predict performance on end-of-year tests. They are hard to use instructionally except to say whether a student is on grade level or how much they have grown over time.

Each of these types of assessment is well designed for what it is supposed to do. But because people want to do less testing, they also want each type of measurement to do more things than it can.

There is another type of assessment that is gaining momentum: embedded assessment. These are measures that take the day-to-day work students do in their curriculum and turn it into data that helps measure standards mastery and growth. Embedded assessments may be the best hope for a mode of assessment that can multitask: helping teachers teach, while also measuring performance in the generalized, comparable ways that policy makers and managers need.

A standards-based curriculum and standards-based test are supposed to help smooth out some of that. So what is happening?

In the early No Child Left Behind (NCLB) days, there was a lot of hope that we could get the whole system to align around the standards.  Folks designed interim tests to measure specific state standards and align them with teaching and learning.  This was supposed to be a system design breakthrough because then you’d have measures that were both valid for policy and valid for what the teacher was teaching.

But the process broke down for several reasons – the main one being that the local curriculum was always out of sequence with assessments. They were both aligned to the standards, but the curriculum taught those standards in one order, and the assessments tried to measure them in a different order.  If the first assessment measured how well kids had mastered magnetism, but the curriculum wasn’t teaching magnetism until the end of the year, that wasn’t fair.  When educators couldn’t align their curriculum with those interim assessments, they were more likely to throw out the assessments than the curriculum.

In the past decade, people shifted from interim assessments that measure each standard to ones that are curriculum independent. The current popular flavor of interim assessment gives an indication of overall growth on standards. It doesn’t tell you that this student has a problem with this standard. It instead says this student is at level X and is likely to have difficulty with standard Y, which is considered above that level. But it may not have asked any questions at all about standard Y.
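To make that placement logic concrete, here is a minimal sketch of how such a prediction can work under a simple Rasch model from item response theory. The ability and difficulty numbers are invented for illustration and are not drawn from any real interim assessment.

```python
# A minimal sketch of Rasch-style placement: a student's overall ability
# estimate predicts success on a standard that was never directly tested.
# All numbers here are illustrative assumptions.
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: probability that a student at `ability` answers an item
    of `difficulty` correctly (both are on the same logit scale)."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

student_level = 0.0           # "level X," estimated from whatever items were asked
standard_y_difficulty = 1.5   # standard Y sits above that level on the scale

# No question about standard Y was asked; the prediction comes from the scale.
print(f"Predicted chance of success on standard Y: "
      f"{p_correct(student_level, standard_y_difficulty):.0%}")
```

This is how such a test can flag likely difficulty with standard Y without ever asking about it: the prediction falls out of the shared scale, not out of any direct evidence.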

Could you talk about the textbook adoption process and the impact it can have on this high-quality curriculum and assessment challenge?

Before I started making products that competed in textbook adoption processes, I had this idea that big publishers owned the process and it was all about their reps schmoozing with district leaders to get their textbooks adopted, independent of their quality.  This was a reasonable assumption given how poor many of the textbooks were.

These days, in most places, it is quite a transparent, democratic process, where teachers have the definitive say in what materials get chosen. It doesn’t matter who the superintendent is friends with or how big the company’s marketing budget is. A group of teachers will vote on the one they want to use. Smaller publishers with high-quality programs and good reputations for service can succeed.

The teachers may not always vote on quality. They may not vote on alignment to standards. They might vote on the fact that the textbook looks fun or is easier to teach. But at least there’s some sort of grassroots democracy. And, in theory, the state has played its role by determining that the textbook aligns to the state standards.

I have come around to being a bit of a fan of the textbook adoption process because you really have to win the hearts and minds of the teachers on the committee.  And that forces product developers to focus on building a product that works great for teachers.

I do think there’s room for improvement, especially around making sure that teachers look closely at products and ideally TRY the different curricula before they make their decision. In many places, the teachers merely “flip through” the materials in a big conference room. This may have worked when the curriculum was a textbook, but these days the curriculum is likely to be a blend of software, texts, data, and teacher tools. Teachers need time to explore, and ideally try out, these materials before they buy.

What needs to change? Is it the publishers saying they don’t want any change to the current system? Is it inertia at the state level? Why do we keep doing this the same way?

So much of this is habit. California law allows you to spend your textbook money on whatever you want, but, in practice, nobody does that. Almost every district opts into one of the state-approved curricula.

Testing is sort of like that too. There were two or three providers making very similar tests, and whichever test won, the scoring was done by National Computer Systems (NCS). They had the tornado-rated shelters with giant scanners, and nobody else really did. But things got disrupted when Pearson bought NCS, and suddenly the company that had those giant scanners and did the scoring was owned by one of the testing competitors. Everyone had to scramble, and things got a bit dynamic and competitive for a while.

Then, the Obama administration wanted to aggregate demand into big testing consortia, what became the Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced. It was an interesting idea about being able to really invest in two great tests, rather than each state choosing its own. But the consortia weren’t as innovative as we hoped they would be. They were under time pressure, and they were governed by people who were used to buying traditional tests. Suddenly we had new tests that were a lot like the old tests, and we had shrunk the competitive market in the process.

If I could make one move to disrupt the way the testing system works, I would shift the nature of the procurement from a competition among already existing tests to a competition for the design and development of a better test. For example, Texas could announce a five-year development process to build a more futuristic test that will be used when the current contract expires. The competition to design that test begins now, and in one year the state picks its development partner based on the vision and prototypes that partner puts forward. Then that vendor has the next several years to get it built. In five years, the state should have an original test designed from the ground up.

Aerospace companies do not all build space shuttles on spec and hope NASA buys theirs.  That would be too expensive and uncertain.  Instead, they draft a compelling vision and compete to win the contract that funds them to build something no one has built before.  Let’s do that with testing.

We interviewed a Texas teacher who is part of a Texas Education Agency process that involves teachers in reviewing, refining, and selecting STAAR questions. The state also has a process to teach selected teachers how to write these questions. Is something similar happening in other states?

I see that in other places. But I don’t know how significant a change that ends up representing. The test companies have always hired former educators who understand the standards to write items. Once those items are created, they’re fed into the same psychometric machine that determines which ones the testing companies will keep or throw out, and what changes they need to make. Even if some teachers are involved in developing the items upfront, they won’t necessarily see their influence in the end product.

That said, it is fascinating to ask a group of people – teachers, parents, citizens – that just hate testing what they think, say, fourth graders should know. When you read them a few test items without saying they are test items, they will say, “Oh yeah, kids should know that. The schools should teach kids that by fourth grade. It would be a problem if they didn’t know that.  If that kind of thing were on the test, I’d be fine with it.”

Are there testing innovations that you are paying attention to and are excited about? And when we do this well, what happens for kids?

You hear some people saying that the pandemic is disrupting our testing system, and we need to hold the line and get those tests back in place to structure our education system. And then there’s a huge group that’s saying let’s treat the pandemic as an excuse not to measure or have any accountability.

But in my view, disruptions are great opportunities to try something new. When Hurricane Katrina wiped out the New Orleans school district, there was a chance to experiment with portfolio management of public schools. So how could we respond to this moment and do something new?

To me, there are at least two big moves we should make.

The first is to start work on the next-generation map of what we want kids to know and be able to do, and to ground this map in as much empiricism as possible. The technology is getting to the point where we could look at millions of kids doing thousands of tasks each and see the ways students get to mastery (or don’t).

That could enable standards that are more of a probabilistic map than a list. The map could be very detailed in a field like math. You could determine that these elements should be taught in this order for real reasons distilled from billions of data points.

And when we don’t know that a certain order matters, we could free up teachers to use their preferences.  They could teach genes and then proteins and then traits, or the reverse.  There may turn out to be no reason why it is better to learn one before the other.
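As a thought experiment, one could imagine distilling such orderings from mastery data. The sketch below, with invented data and an invented scoring rule, checks whether students who master one skill tend to have mastered another first, which is one crude signal that an ordering matters.

```python
# A sketch of inferring a prerequisite ordering from a binary mastery matrix.
# The data, skills, and scoring rule are all invented for illustration.
import numpy as np

def prerequisite_score(mastery: np.ndarray, a: int, b: int) -> float:
    """How strongly does skill `a` look like a prerequisite of skill `b`?

    `mastery` is a (students x skills) matrix of 0/1 flags. If students
    almost never master `b` without `a`, but often have `a` without `b`,
    then `a` plausibly needs to come first.
    """
    has_a = mastery[:, a] == 1
    has_b = mastery[:, b] == 1
    p_a_given_b = (has_a & has_b).sum() / max(has_b.sum(), 1)  # high if b needs a
    p_b_given_a = (has_a & has_b).sum() / max(has_a.sum(), 1)  # low if a comes first
    return p_a_given_b - p_b_given_a  # positive: teach a first; near zero: no clear order

# Toy data: most students master fractions before ratios, as a teacher might expect.
rng = np.random.default_rng(0)
fractions = rng.random(10_000) < 0.7
ratios = fractions & (rng.random(10_000) < 0.6)  # ratios mastered only after fractions
mastery = np.column_stack([fractions, ratios]).astype(int)

print(prerequisite_score(mastery, a=0, b=1))  # clearly positive: fractions first
print(prerequisite_score(mastery, a=1, b=0))  # clearly negative: not the reverse
```

At scale, with real data, this kind of signal is what would let the map say where order matters for real reasons, and where teachers are free to choose.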

There would still be policy decisions to make about standards, but we would have a much more quantitative basis for our standards than is possible today. And perhaps it would all be less political because it would be grounded in data rather than expert analysis.

We can picture the map. It’s something a lot of people would understand: parents would get it, and teachers would get it even more readily.

I do think teachers have “maps” in their head of the journey they are taking kids on from not knowing what a fraction is, to mastering fractions, ratios, percentages, and proportional reasoning.  Let’s take their intuitive map and make it as objective as we can.

Then the curriculum is about helping kids make progress on the map, and the assessment is about helping teachers and kids know where they are on the map.  And the line between instruction and assessment starts to blur.

Which gets us to the other breakthrough that I think is worth working on intensively: embedded assessment.

During teaching, the curriculum program could be learning as much as it can about where a student is – what they’ve mastered and what they could learn next.  Sometimes the curriculum would know enough – just from the work that students are already doing – that a separate assessment would be redundant.  Other times it might recommend that a student take a mini-assessment so we can better understand where they are.

Because this would all happen close to the instructional moment, it would be useful to teachers. And because it is part of a sophisticated probabilistic map where we can track different rates of progress in comparable, accountable ways, it should be useful to policy makers. And if the curriculum is doing much of the assessing, it should cut down on the amount of testing kids and families need to endure.
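One way to picture that logic is a simple mastery-tracking loop, loosely in the spirit of Bayesian knowledge tracing. Everything here, from the parameters to the decision thresholds, is an illustrative assumption, not a description of any real product.

```python
# A sketch of embedded assessment: update a mastery estimate from everyday
# classwork, and only call for a mini-assessment when the estimate is uncertain.
# Parameters and thresholds are illustrative assumptions.

P_LEARN, P_SLIP, P_GUESS = 0.15, 0.10, 0.20

def update_mastery(p_mastered: float, answered_correctly: bool) -> float:
    """Bayesian update of the probability a student has mastered a skill."""
    if answered_correctly:
        evidence = p_mastered * (1 - P_SLIP)
        posterior = evidence / (evidence + (1 - p_mastered) * P_GUESS)
    else:
        evidence = p_mastered * P_SLIP
        posterior = evidence / (evidence + (1 - p_mastered) * (1 - P_GUESS))
    # The student may also have just learned the skill on this opportunity.
    return posterior + (1 - posterior) * P_LEARN

def next_step(p_mastered: float) -> str:
    """What should the curriculum do, given its current estimate?"""
    if p_mastered > 0.95:
        return "mark as mastered; a separate test would be redundant"
    if p_mastered < 0.30:
        return "reteach; testing now would not help"
    return "recommend a short mini-assessment to resolve the uncertainty"

p = 0.5  # prior before seeing any classwork
for correct in [True, True, False, True]:  # one day of embedded work
    p = update_mastery(p, correct)
    print(f"p(mastered) = {p:.2f} -> {next_step(p)}")
```

The point is the decision rule: when the everyday work already settles the question, no separate test is needed.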

This is how good data-driven systems work – the measurement is precise and relevant to the work that needs to get done, and useful for helping everyone improve.  It would be nice if our educational measurement systems worked that way.