David Bressoud’s latest column, “Measuring Teacher Quality,” which focuses on a study of college student evaluations of teachers as compared to student performance, has me thinking about how we assess teaching at all levels. The study, conducted by Scott Carrell and James West at the U.S. Air Force Academy,
strongly suggests that such evaluations are even less useful than commonly believed and that the greatest long-term learning does not come from those instructors who receive the strongest evaluations at the end of the class. [Bressoud]
While K-12 students don’t often have a formal role in evaluating their teachers’ effectiveness*, Bressoud’s column raises the big questions we should all keep asking about measuring good teaching: How trustworthy are the proxies we choose? Do they reflect what we think is important? What might we be missing?
A while ago, I spent three years on the committee at Middlebury that reviews professors for promotion and tenure. We considered teaching, scholarship, and service. The rules, voted in by the faculty, prohibit evaluating teaching strictly on the basis of student evaluations (now called “course response forms” here); we also visited classes and read letters from colleagues and former students.
I grew to regard the end-of-course evaluation forms as sources of valuable but imperfect data. Students who rated teaching as excellent on one part of the form made comments elsewhere ranging from “he made us work like hell, and we loved every minute of it” to “a perfect Winter Term course — hardly any work.” Even the most thoughtful and perceptive student can’t know the full disciplinary and pedagogical contexts in which an instructor plans and delivers a course, and a disgruntled student may see an anonymous survey as a place to exact retribution (as one of my sons pointed out, having heard a classmate describe doing just that). Then there’s the question of student biases; my service on that committee came a few years after mathematician Neal Koblitz wrote a column for the Association for Women in Mathematics newsletter called “Are Student Ratings Unfair to Women?” So various things may compromise students’ ability to assess “effectiveness of instruction,” and my committee colleagues and I did our best to read through thick folders of those forms while keeping their value in reasonable perspective.
Letters from department colleagues and our observations in classrooms gave us insights that we couldn’t have gotten from students responding for ten minutes in the last week of class. It was a privilege (as well as a major time sink) to watch so many talented instructors weaving careful preparation together with on-the-spot responses to conditions in the room, in ways that I might have missed were I not an experienced teacher myself.
The Carrell and West study uses final course grades as a measure of teaching outcomes. They were able to do this because at the Air Force Academy, calculus grading is done en masse. As Bressoud points out, the study addressed some common concerns by comparing results
on course assessments for each instructor after controlling for variables in student preparation and background that included academic background, SAT verbal and math scores, sex, and race and ethnicity. … The authors reference a 2010 study … that shows a strong positive correlation between the quality of fifth grade teachers and student performance on assessments taken in fourth grade, suggesting a significant selection bias: The best students seek out the best teachers. This may be even truer at the university level where students have much more control over who they take a class with. For this reason, Carrell and West were very careful to measure the comparability of the classes. At USAFA, everyone takes Calculus I, and there is little personal choice in which section to take, so such selection bias is less likely to occur.
Compared to this level of controlling for variables, the “value added” approach that some states have adopted for evaluating public school teachers looks downright crude.
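To see why random section assignment matters so much here, consider a toy simulation (my own sketch, not the Carrell–West data: the student counts, ability scores, and teacher effects below are all made up). When sections are assigned at random, the gap in average scores between two teachers’ sections estimates the true difference in teaching effect; when the better-prepared students self-select into the popular teacher’s section, the same comparison wildly overstates that teacher’s contribution.

```python
import random

random.seed(0)

# Hypothetical setup: each student has a latent "preparation" score, and a
# teacher adds a fixed amount of learning on top of it.
N = 10_000
students = [random.gauss(0, 1) for _ in range(N)]
teacher_effect = {"A": 0.5, "B": 0.0}  # teacher A is genuinely better by 0.5

def mean(xs):
    return sum(xs) / len(xs)

# Random assignment: alternate students between the two sections.
random_a = [s + teacher_effect["A"] for s in students[::2]]
random_b = [s + teacher_effect["B"] for s in students[1::2]]

# Self-selection: the better-prepared half all choose popular teacher A.
ranked = sorted(students)
select_a = [s + teacher_effect["A"] for s in ranked[N // 2:]]
select_b = [s + teacher_effect["B"] for s in ranked[:N // 2]]

# Under random assignment the observed gap is close to the true effect (0.5);
# under self-selection it conflates teacher quality with student preparation,
# so the gap comes out several times larger.
print(round(mean(random_a) - mean(random_b), 2))
print(round(mean(select_a) - mean(select_b), 2))
```

This is the selection bias the fifth-grade/fourth-grade correlation exposes: a naive comparison credits teacher A with the preparation of the students who chose her. The USAFA setting, where students don’t choose their sections, is close to the first scenario rather than the second.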
An interesting feature of the Carrell and West research is the focus on longer-term learning. In this case, they found a relationship between students’ Calculus I instructors and their grades in Calculus II. (This is probably no surprise to K-12 teachers.) In fact, students of the more popular instructors — that is, those with higher scores on student evaluations — had slightly better grades in Calculus I but worse grades in Calculus II. Implicit in the pejorative phrase “teaching to the test” is the idea that focusing on short-term gains may come at the cost of long-term learning; this study offers some evidence for that idea.
Carrell and West lay out the scoring rubrics that USAFA math instructors used to grade Calculus I and II students, but they (understandably) do not display the actual exams. The thing is, when I teach Calculus I, my primary goal is not that my students get good grades in Calculus II. My goal is that they don’t end up like the friends who’ve told me “I took calculus in college but I couldn’t tell you what it’s about.” I’m assuming that if I can succeed in the longer-term goal, then success in Calculus II or other academic work that requires an understanding of Calculus I will happen naturally.
It would be impossible to locate a large enough random sample of my students from, say, ten years ago and ask them to tell me what calculus is about. This was the problem faced by Ken Bain when he was conducting the study that became What The Best College Teachers Do. He and his colleagues “looked for something that would tell us more immediately that the impact was lasting. The concept of deep learners, first developed by Swiss theorists in the 1970’s, helped us spot indications of sustained influence.” He continues (page 9):
We assumed that deep learning was likely to last, and so we listened closely for evidence of it in the language students used to describe their experiences. … We were drawn to classes in which students talked not about how much they had to remember but about how much they came to understand (and as a result remembered). … We looked for signs that students developed multiple perspectives and the ability to think about their own thinking; that they tried to understand ideas for themselves; that they attempted to reason with the concepts and information they encountered, to use the material widely, and to relate it to previous experience and learning. Did they think about assumptions, evidence, and conclusions?
Given what I’ve read in How People Learn and elsewhere, this seems like a reasonable proxy for the longer-term learning I hope to foster. It’s still harder to execute than a standardized test or a typical Calculus I or II exam, of course. I’m now in the midst of reading Bain’s descriptions of what the professors he identified as “effective,” by the evidence-of-deep-learning metric, actually do.
The last sentence from the above quote strikes me as a meta message. Thinking about assumptions, evidence, and conclusions is exactly what we as a society should be doing every time we choose a method of measuring teaching effectiveness at any level, and by and large, we don’t, even when the stakes are high. This brings me to another book in my half-finished pile, Thinking, Fast and Slow by Daniel Kahneman. From page 104:
“Do we still remember the question we are trying to answer? Or have we substituted an easier one?”
It’s well past time for some slow thinking on whether the easier questions about student performance on particular tasks at particular times are worthy substitutes for the harder questions about what skills and habits of mind and attitudes about learning we want students to have at age 18 and beyond.
*New York City may include some kind of student input into teacher evaluation in its new system; see here and notice that “student surveys” is part of the “additional measures may include” list. (This summary was not easy to find on the schools.nyc.gov page; no, thank you, I really wasn’t looking for a 30-minute video about the new system.)