Rethinking Value-Added Models in Education: Critical Perspectives on Tests and Assessment-Based Accountability by Audrey Amrein-Beardsley

Proponents claim that many sources of error either can be accounted for with some kind of statistical manipulation or are insignificant. Research doesn’t support this. Here are the problems that don’t go away.
1) Items selected for tests are the ones that discriminate best rather than those that test the most important material.
2) The difference between a score of 50 and 60 should mean the same as the difference between a score of 80 and 90, but it doesn’t. To account for this, scores are normed. This assumes that teacher performance conforms to a bell-shaped distribution. As a result, when some teachers do better, it is at the expense of others.
3) Correlations between student scores and student demographics are so strong that one can effectively be used to predict the other.
4) Tests were designed to measure student achievement, not teacher performance.
5) There are so many variables beyond the teacher’s control that they can’t possibly all be accounted for.
6) Students are very seldom placed in classes randomly.
7) VAM assumes that learning is linear and consistent over time and that students with different aptitudes will learn at the same rate. Research doesn’t support this.
8) Tests are given each spring. This means that summer losses and gains occur between the pretest from the previous year and the post-test for a given teacher.
9) The first part of the testing year takes place in the previous teacher’s classroom. Numerous other adults also interact with students and affect their learning in addition to the classroom teacher.
10) Data is more likely to be missing for high-needs students.
11) Small sample sizes lead to more variability in scores. Statistically, a sample of 20 or so is on the small side. Teachers with fewer students are more likely to land at either extreme, with very low or very high scores.
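The small-sample point can be illustrated with a quick simulation. This is only a sketch under assumptions of my own, not anything taken from the book: every simulated teacher is equally effective, and each student's gain is a random draw from the same bell curve (mean 0, standard deviation 1). The function name and the class sizes of 20 and 100 are hypothetical choices for illustration.

```python
import random
import statistics


def class_mean_spread(class_size, n_classes=2000, seed=0):
    """Standard deviation of class-average gains across simulated classes.

    Assumption: all teachers are identical, so any spread in the class
    averages comes purely from which students happened to be in the room.
    """
    rng = random.Random(seed)
    class_means = []
    for _ in range(n_classes):
        gains = [rng.gauss(0.0, 1.0) for _ in range(class_size)]
        class_means.append(statistics.fmean(gains))
    return statistics.stdev(class_means)


spread_20 = class_mean_spread(20)
spread_100 = class_mean_spread(100)
print(f"spread of class averages, 20 students:  {spread_20:.3f}")
print(f"spread of class averages, 100 students: {spread_100:.3f}")
```

Under these assumptions the spread shrinks in proportion to one over the square root of the class size, so the averages for classes of 20 should scatter a bit more than twice as widely as those for classes of 100, even though no teacher is better or worse than any other.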

6. Reliability and Validity

This chapter offers good explanations of the key statistical factors involved in VAM. If these scores were reliable, you would expect experienced teachers to get the same or nearly the same scores each year. Year-to-year correlations of VAM scores for individual teachers are low enough in most cases that you can see how little the individual teacher impacts the scores. Students who take more than one version of a test also show wide variations in scores. Thus the scores are not reliable. Audrey explains why variability among students and small sample sizes are causes.
Validity is a measure of how well a number represents reality. If the scale at the doctor’s office gives you one number and your scale at home gives you something very different, you would suspect that your home scale was not valid. When VAM scores prove to be unreliable, they cannot be valid. If you are interested in details like content-related, criterion-related, consequence-related, and construct-related evidence of validity, the last part of this chapter is for you.
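To see why noisy scores produce low year-to-year correlations, here is a minimal simulation sketch. The setup is my own assumption rather than the book's: each teacher has a fixed "true" effect, and each year's score is that effect plus independent noise from the particular students taught that year. The noise level (twice the spread of the true effects) is a hypothetical value chosen only for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def year_to_year_correlation(n_teachers=5000, true_sd=1.0, noise_sd=2.0, seed=1):
    """Correlation between two years of simulated scores for the same teachers."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        true_effect = rng.gauss(0.0, true_sd)  # stays the same in both years
        year1.append(true_effect + rng.gauss(0.0, noise_sd))  # year-one noise
        year2.append(true_effect + rng.gauss(0.0, noise_sd))  # year-two noise
    return pearson(year1, year2)


r = year_to_year_correlation()
print(f"simulated year-to-year correlation: {r:.2f}")
```

With these made-up numbers the expected correlation is the true-effect variance divided by the total variance, 1 / (1 + 4) = 0.2. The teachers themselves do not change at all between years, yet their scores barely agree from one year to the next.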

7. Bias and the Random Assignment to Classrooms

Many teachers are rewarded or punished based on where they teach and whom they teach as opposed to how well they teach. Teachers of poor, ELL, and special education students fall into this category. Teachers of gifted students also have a difficult time showing growth, as their high-scoring students cannot improve test scores that are already at the top. The term construct-irrelevant variance (CIV) applies to situations where variance in scores has nothing to do with the teacher’s ability. When the concept of confidence intervals is applied, a teacher’s true score can be anywhere within a range of 35 or more percentile points for a 95% confidence interval. There is also controversy about adjusting scores based on student demographic features. Some feel that the student’s previous test can serve as a control, while others use statistical adjustments to control for some variables.
The nonrandom nature of student placement is well researched here. It is clear that very few principals use anything like random assignment. They are more likely to cluster students with common characteristics and give in to parent placement requests. Also, because testing covers only ELA and math in grades 3-8 and high school, only about 30% of the teaching staff qualifies for VAM scoring. The other 70% often get a school-wide VAM score, sometimes with consequences attached.
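A rough worked example shows what a 35-point confidence interval implies. This sketch assumes the usual normal approximation (estimate plus or minus 1.96 standard errors), which is my simplification and not necessarily the method used in the studies the book cites. The standard error of 9 percentile points is a hypothetical value picked so that the width lands near 35.

```python
def confidence_interval_95(estimate, standard_error):
    """95% interval under a normal approximation, clipped to the 0-100 percentile scale."""
    margin = 1.96 * standard_error
    lower = max(0.0, estimate - margin)
    upper = min(100.0, estimate + margin)
    return lower, upper


# Hypothetical teacher scored at the 50th percentile with a standard error of 9 points.
low, high = confidence_interval_95(50, 9)
print(f"95% interval: {low:.1f} to {high:.1f} (width {high - low:.1f} points)")
```

For this hypothetical teacher the interval runs from roughly the 32nd to the 68th percentile, so the same data are consistent with a teacher who is somewhat below average or comfortably above it.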

8. Alternatives, Solutions, and Conclusions

In the final chapter, Audrey takes on the task of suggesting the types of teacher evaluation models that school districts should consider in place of what they may currently have. The key concepts are that any system should have multiple sources of data, be constructed with input from all involved, and rely heavily on human judgement. She would dial the impact of any value-added scores way back and not let bad scores alone lead to dismissal. She also sees a role for input from peers along with student and parent surveys.
I think it’s a good idea for administrators to listen to peers, students, and parents, but I would not want to see such input weighted in a final evaluation. I believe that any experienced observer can tell good teaching when they see it. As a principal, I focused on what the students were doing and what they produced. Were they actively engaged in the lesson? Did they look like they were enjoying what they were doing? Did they spend some of the time working with other students? While this book is written for an academic audience, it is one that should be on the shelf of every school’s professional development library. We can also hope that it gets into the hands of as many policy makers as possible. See what you can do to make that happen.