I thought of making this post a satiric dialogue. But the topic is not very funny, and there’s the danger that satire will miss its mark. So, for once, an academic-life post in all seriousness.
UTA, perhaps like many other universities, has started to evaluate teachers largely on the basis of numbers: scores derived from surveys that students fill out at the end of each semester. Here’s how it works. Students go online and respond to statements about the course they’ve just taken, statements like “The instructor used teaching methods that helped me learn” and “The instructor was well prepared for each class meeting.” The allowed responses range from 1 for “strongly disagree” through 5 for “strongly agree.” We’ve all done surveys like that, though they’re usually about dish detergent or the newest Iron Man movie.
The scores that these surveys generate appear in reports to UTA administrators and state-level officials. One standard method of reporting survey data is a table where each course has a separate row, and the score for each question is listed in a separate column.
Here’s how I did, for instance, in one of my World Literature courses in Spring 2013:
4.3 4.3 4.4 4.7 4.6 4.2 4.8 3.6 4.1 4.2 3.9 4.2 3.9
Pretty good, huh? Or pretty bad. Or who knows? “Who knows” will be the theme of much of this blog post.
I’m about to complain about being evaluated by numbers. The great danger of an English professor complaining about numerical evaluations is that I’ll be perceived as some touchy-feely poet who can’t bear the cold light of hard data. I’ll just have to run that risk. I have no ethos here. Pathos will get me nowhere. The following critique will be logos all the way.
Why is the line of numbers I gave above a weak way of evaluating History of World Literature I? Where do I begin?
- That’s not very many responses, and it’s not many people to begin with. Thirty students took World Literature. Ten of them filled out the survey. If only 10 answer the survey, each response drags the mean (the “average”) for each question pretty far in one direction or another. A statistician might tell you that a response rate of one-third is an awfully good sample, but even if all 30 answered, just a few really low or really high “outlier” responses can drag the mean down or up in ways that make that mean – 3.5 or 4.5 or whatever it may be – less indicative of the whole (and even the whole 30 isn’t very meaningful). But it gets worse:
- There’s no context. We don’t know if 3.5 or 4.5 was bad or good in these circumstances. Was ENGL 3361 World Literature hard, easy, required, elective, a “gateway” course, a course self-selected by specialists, a course for majors, minors, merchant chiefs? The line of numbers tells you nothing about this complicated factor, in part because
- There’s no baseline. Look again at my scores: 4.3 4.3 4.4 4.7 4.6 4.2 4.8 3.6 4.1 4.2 3.9 4.2 3.9. You know that 1 is bad and 5 is good. But that’s all you know. You are given other statistics, like the mode and standard deviation of the ten responses to each question, which are next to meaningless if you know the mean (and, in fact, you see all ten individual responses listed in a bar graph, so even the mean is pretty superfluous). But you don’t know what a typical score for UTA looks like. You don’t know what an average score for an average English course looks like. You don’t know what the usual score for a 3000-level course looks like. You don’t know what a typical score for ENGL 3361 History of World Literature I looks like, and in fact in the last of these cases, you can’t know, because I’m the only person who teaches ENGL 3361. There is no way of telling whether the survey is measuring me, or the subject I’m teaching, because those two variables always run together.
- But you think you do. Come on, be honest. You’re already looking at my line of 4.3 4.3 4.4 4.7 4.6 4.2 4.8 3.6 4.1 4.2 3.9 4.2 3.9 and thinking “well, Dr. Morris is clearly great at whatever 4.7 and 4.8 represent, but he’s pretty lousy at those 3.9 areas, and he really needs an intervention on that 3.6 category.” Now, you may be right about that. But you don’t know, and how could you? Still, you are ready to make all kinds of judgments on how well I teach a course you’ve never attended, in part because
- Numbers like these convey a false precision. Look again. You’re pretty happy about my performance on that 4.2 near the end of the list, right? But less so on the 3.9s that surround it. Admit it, that’s what you’re thinking, and thinking it harder the more I argue against it. But remember: those scores are the means of ten responses. One student responded “1” to each question: the eternal sorehead. One answered “4” to each; four answered “5.” The rest were split between 3s, 4s, and 5s. Overall, the difference between the 4.2 answer and the 3.9s is a couple of students answering “4” instead of “3,” on a question they spent about half a second thinking about. Now let’s say you’re comparing me as a teacher to someone else, and I get a 3.9 where they got a 4.2, or vice versa. You see the problem? It’s akin to the illusion that makes you think $29.95 is cheap and $30.15 is expensive. And it sometimes gets worse. I have seen means on these questions, derived from fewer than 20 student responses, expressed to the second decimal place: i.e. not just 4.2 or 3.9, but 4.27 or 3.96. I stress that that second decimal place cannot have a meaning in any possible mathematical world. Heck, the first decimal place doesn’t have much. And it’s not just a problem in the mathematics,
- It’s a problem of telemetry. Instead of watching me teach, instead of listening to my thoughts about teaching, instead of asking my colleagues or immediate supervisors about me, instead of really asking my students anything meaningful, you’ve been content to judge me as a teacher (inevitably! you’re still doing it, over my protests!) on the basis of numbers generated by a few staticky sensors attached more or less far from my classroom. And you’re content to do so, because a row of numbers is a lot handier than trying to figure out what goes on in that classroom. And because the evaluation is based on telemetry, is falsely precise, and has no context
- The reading of such evaluations becomes a WAG. I’ve heard eminent scholars look at a row of numbers like my World Lit scores and opine that someone’s teaching is good, bad, somewhere in between, higher than others they’ve seen, lower, or whatever, based entirely on impressions they’ve accumulated by looking at other rows of such telemetric numbers, similarly without context or baseline. And when I’ve raised objections like those above, they pause, nod, say “of course,” and come back with
- But administrators (and Regents and Coordinating Boards) like numbers. Which is fine, but if they like arbitrary, meaningless numbers, it doesn’t give me much confidence in administrators or Regents or Coordinating Boards.
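If you doubt the arithmetic behind the small-sample complaint, it’s easy to check. Here’s a quick Python sketch with two entirely made-up response sets (these are not my actual survey data): ten students who all strongly agree, versus the same ten with one eternal sorehead among them.

```python
from statistics import mean

# Hypothetical data: ten students all answer "5" ("strongly agree").
unanimous = [5] * 10

# Same class, but one student answers "1" to everything.
with_sorehead = [5] * 9 + [1]

print(mean(unanimous))      # 5.0
print(mean(with_sorehead))  # 4.6
```

One response out of ten moves the mean by four-tenths of a point: the entire distance, in most readers’ minds, between a “great” teacher and a merely “good” one.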
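The false-precision point can be checked the same way. The split below is my own invented reconstruction, built only to match the description above (one “1,” one “4,” four “5”s, the rest scattered) and the reported means; the real responses could have been distributed differently.

```python
from statistics import mean

# Hypothetical ten-response sets consistent with the description:
# one sorehead "1", one "4", four "5"s, four answers among 3s, 4s, 5s.
question_a = [1, 4, 5, 5, 5, 5, 3, 4, 5, 5]  # sums to 42
question_b = [1, 4, 5, 5, 5, 5, 3, 3, 4, 4]  # three answers one point lower

print(mean(question_a))  # 4.2
print(mean(question_b))  # 3.9

# Note that with ten responses, the mean can only ever be a multiple
# of 0.1 -- a reported "4.27" is arithmetically impossible here.
```

Three snap answers of “4” rather than “5,” and a teacher drops from the happy side of 4 to the suspect side, which is the whole $29.95-versus-$30.15 illusion in miniature.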
Now let’s assume these numbers were sterling numbers, and gave a perfect depiction of what students took away from World Literature I. Let’s even assume that quantifying the quality of a complex humanities subject is a good idea. Those assumptions are false in so very many ways, but let’s make them. Are our problems over?
Perhaps not, because
But over and over, faculty and administrators, and English faculty as much as anybody else, still look at those rows of numbers and believe, in their hearts, that they tell a terrible and objective truth. Numbers don’t lie, after all. And I doubt these numbers are lying. They’re just not saying anything at all.
One should never just complain; one should suggest better alternatives. This post is now too long to do so, but I’ll try to compose a more positive and proactive one soon.