Numbed by Numbers

I thought of making this post a satiric dialogue. But the topic is not very funny, and there’s the danger that satire will miss its mark. So, for once, an academic-life post in all seriousness.

UTA, perhaps like many other universities, has started to evaluate teachers largely on the basis of numbers: scores derived from surveys that students fill out at the end of each semester. Here’s how it works. Students go online and respond to statements about the course they’ve just taken, statements like “The instructor used teaching methods that helped me learn” and “The instructor was well prepared for each class meeting.” The allowed responses range from 1 for “strongly disagree” through 5 for “strongly agree.” We’ve all done surveys like that, though they’re usually about dish detergent or the newest Iron Man movie.

The scores that these surveys generate appear in reports to UTA administrators and state-level officials. One standard method of reporting survey data is a table where each course has a separate row, and the score for each question is listed in a separate column.

Here’s how I did, for instance, in one of my World Literature courses in Spring 2013:

4.3 4.3 4.4 4.7 4.6 4.2 4.8 3.6 4.1 4.2 3.9 4.2 3.9

Pretty good, huh? Or pretty bad. Or who knows? “Who knows” will be the theme of much of this blog post.

I’m about to complain about being evaluated by numbers. The great danger of an English professor complaining about numerical evaluations is that I’ll be perceived as some touchy-feely poet who can’t bear the cold light of hard data. I’ll just have to run that risk. I have no ethos here. Pathos will get me nowhere. The following critique will be logos all the way.

Why is the line of numbers I gave above a weak way of evaluating History of World Literature I? Where do I begin?

  • That’s not very many responses, and it’s not many people to begin with. Thirty students took World Literature. Ten of them filled out the survey. If only 10 answer the survey, each response drags the mean (the “average”) for each question pretty far in one direction or another. A statistician might tell you that a response rate of one-third is an awfully good sample, but even if all 30 answered, just a few really low or really high “outlier” responses can drag the mean down or up in ways that make that mean – 3.5 or 4.5 or whatever it may be – less indicative of the whole (and even the whole 30 isn’t very meaningful). But it gets worse:
  • There’s no context. We don’t know if 3.5 or 4.5 was bad or good in these circumstances. Was ENGL 3361 World Literature hard, easy, required, elective, a “gateway” course, a course self-selected by specialists, a course for majors, minors, merchant chiefs? The line of numbers tells you nothing about this complicated factor, in part because
  • There’s no baseline. Look again at my scores: 4.3 4.3 4.4 4.7 4.6 4.2 4.8 3.6 4.1 4.2 3.9 4.2 3.9. You know that 1 is bad and 5 is good. But that’s all you know. You are given other things, like the mode and standard deviation of the ten responses to each question, which are next to meaningless if you know the mean (and, in fact, you see all ten individual responses listed in a bar graph, so even the mean is pretty superfluous). But you don’t know what a typical score for UTA looks like. You don’t know what an average score for an average English course looks like. You don’t know what the usual score for a 3000-level course looks like. You don’t know what a typical score for ENGL 3361 History of World Literature I looks like, and in that last case you can’t know, because I’m the only person who teaches ENGL 3361. There is no way of telling whether the survey is measuring me or the subject I’m teaching, because those two variables always run together.
  • But you think you do. Come on, be honest. You’re already looking at my line of 4.3 4.3 4.4 4.7 4.6 4.2 4.8 3.6 4.1 4.2 3.9 4.2 3.9 and thinking “well, Dr. Morris is clearly great at whatever 4.7 and 4.8 represent, but he’s pretty lousy at those 3.9 areas, and he really needs an intervention on that 3.6 category.” Now, you may be right about that. But you don’t know, and how could you? Still, you are ready to make all kinds of judgments on how well I teach a course you’ve never attended, in part because
  • Numbers like these convey a false precision. Look again. You’re pretty happy about my performance on that 4.2 near the end of the list, right? But less so on the 3.9s that surround it. Admit it, that’s what you’re thinking, and thinking it harder the more I argue against it. But remember: those scores are the means of ten responses. One student responded “1” to each question: the eternal sorehead. One answered “4” to each; four answered “5.” The rest were split between 3s, 4s, and 5s. Overall, the difference between the 4.2 answer and the 3.9s is two or three students answering “4” instead of “3” on a question they spent about half a second thinking about (the short sketch after this list works through that arithmetic with made-up numbers). Now let’s say you’re comparing me as a teacher to someone else, and I get a 3.9 where they got a 4.2, or vice versa. You see the problem? It’s akin to the illusion that makes you think $29.95 is cheap and $30.15 is expensive. And it sometimes gets worse. I have seen means on these questions, derived from fewer than 20 student responses, expressed to the second decimal place: not just 4.2 or 3.9, but 4.27 or 3.96. I stress that that second decimal place cannot have a meaning in any possible mathematical world. Heck, the first decimal place doesn’t have much. And it’s not just a problem in the mathematics,
  • It’s a problem of telemetry. Instead of watching me teach, instead of listening to my thoughts about teaching, instead of asking my colleagues or immediate supervisors about me, instead of really asking my students anything meaningful, you’ve been content to judge me as a teacher (inevitably! you’re still doing it, over my protests!) on the basis of numbers generated by a few staticky sensors attached more or less far from my classroom. And you’re content to do so, because a row of numbers is a lot handier than trying to figure out what goes on in that classroom. And because the evaluation is based on telemetry, is falsely precise, and has no context
  • The reading of such evaluations becomes a WAG. I’ve heard eminent scholars look at a row of numbers like my World Lit scores and opine that someone’s teaching is good, bad, somewhere in between, higher than others they’ve seen, lower, or whatever, based entirely on impressions they’ve accumulated by looking at other rows of such telemetric numbers, similarly without context or baseline. And when I’ve raised objections like those above, they pause, nod, say “of course,” and come back with
  • But administrators (and Regents and Coordinating Boards) like numbers. Which is fine, but if they like arbitrary, meaningless numbers, it doesn’t give me much confidence in administrators or Regents or Coordinating Boards.
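
For readers who would rather see the arithmetic than take my word for it, here is a minimal sketch in Python. The response values are invented, not the actual survey data; I have only arranged them to match the pattern described above (one all-“1” sorehead, a cluster of 4s and 5s). The point is how far a single outlier, or a few students toggling between “3” and “4,” can move a ten-response mean.

```python
# Hypothetical Likert responses (1-5) to a single survey question; these are
# NOT the real survey data, just values arranged to match the pattern above.
from statistics import mean

ten_responses = [1, 3, 4, 4, 4, 5, 5, 5, 5, 5]    # one sorehead, mostly 4s and 5s
print(round(mean(ten_responses), 1))              # 4.1

# Drop the single "1" and the same instructor gains three tenths of a point.
print(round(mean(ten_responses[1:]), 1))          # 4.4

# The whole gap between a 3.9 and a 4.2 is three students clicking "4"
# instead of "3" on a question they considered for half a second.
low  = [1, 3, 3, 3, 4, 5, 5, 5, 5, 5]
high = [1, 4, 4, 4, 4, 5, 5, 5, 5, 5]
print(round(mean(low), 1), round(mean(high), 1))  # 3.9 4.2
```

Nothing here depends on these particular made-up values; any mean of ten one-click answers behaves this way, which is the whole trouble.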

Now let’s assume these numbers were sterling numbers, and gave a perfect depiction of what students took away from World Literature I. Let’s even assume that quantifying the quality of a complex humanities subject is a good idea. Those assumptions are false in so very many ways, but let’s make them. Are our problems over?

Perhaps not, because

  • Every instructor does pretty well on the numbers. Or at least, every instructor does about the same on the numbers, whatever that may mean. Granted, that’s my own hazy impression, but I’ve looked at a few rows of these numbers in my time, and they all look very much like the ones I got for World Lit. Even given all the problems with the numbers themselves, do they distinguish usefully among faculty? They actually might, just on the “eyeball” test, if you suddenly saw a row of 1.0s sticking out among faculty who were otherwise at 3.9 and 4.2. But on the whole, the monotonous rows of near-identical numbers hovering around 4.0 don’t tell you anything – yet rankings of faculty for all sorts of purposes are made on the basis of numbers that pretty much represent a collective “that was OK” from the student body. Or, I guess, because
  • We don’t really know what the students are saying. They are answering a number of anodyne questions by clicking radio buttons on a web form, an activity we all associate with those on-line quizzes that tell you What Kind of Dinosaur You Really Are or what year you are fixing to die. And what kinds of statements are the students asked to assess?
  • We are all interested in the future, for that is where you and I are going to spend the rest of our lives. No, seriously. Two of the statements on the survey are “I acquired knowledge that will be useful in my future” and “I acquired skills that will be useful in my future.” Now, think about that for a moment. Those are perhaps admirable course goals – future knowledge and future skills – but the problem is, how the heck does one know, from the perspective of the present, how to respond to those statements about the future? This is not only a silly demand, it’s logically unpossible. Which leads to a larger problem with these surveys:
  • Everybody pretty much does seem to think they’re nonsense. Because if you ask a 21-year-old whether they’ve just learned something that they’re going to find useful at 42, they’re smarter than that. They know they don’t know; or at best, they know they’re being asked for banana oil. So you lose their respect, and they take the whole exercise less seriously. Instructors take it less seriously, administrators take it less seriously, and all the way up and down the line, increasing amounts of time are being wasted by people going through the motions of attending to something that nobody takes seriously. So as a result,
  • There’s not much an instructor can do to improve. I know what I need to do to improve my research: publish more. I know what I need to do to improve my service: attend more meetings. But if my teaching is evaluated by a list of survey numbers, how do I change them? What steps can I possibly take to turn a 3.9 into a 4.2? We’ve seen how whimsical and haphazard these measures are: do I even want to improve on some of this stuff? Should I try to sell my students better on the idea that they will use these “skills” someday? (Is reading Boccaccio a skill?) And the “telemetry problem” means that to “improve,” I have to guess how my actions in the classroom will show up on some fuzzy and indirect indicators: not on how well my students did on an essay exam about Boccaccio, but on how they felt before the exam about whether they’d use their knowledge 20 years from now. It’s like being evaluated on your engineering research on the basis of whether a thermometer in a building across campus rose or fell by a degree or two. And beyond even that,
  • Is the last week of a course the best time to ask students what they’ve learned? It is probably a good time to ask them whether their instructor was chronically late, or drunk, or kept hitting on them; or, more positively, whether the instructor dressed well, smiled, or deserved “chili peppers” for hotness that would not stop. But I am not sure it is the best time to ask what World Literature taught them about Homer and Dante and Montaigne. This problem obviously predates web surveys of student satisfaction; it was inherent in older “narrative” student evaluations, too. But it hasn’t been addressed. Oddly enough, the ubiquity of the Web and its attendant social media mean that one could now design longitudinal studies that tracked the influence of college courses across decades of a student’s life. But nah, that would require effort and patience. I snark; but I still hold that college teaching deserves consideration by means of more than an immediate reaction, more than a snap opinion about whether certain skills have been delivered.

But over and over, faculty and administrators, and English faculty as much as anybody else, still look at those rows of numbers and believe, in their hearts, that they tell a terrible and objective truth. Numbers don’t lie, after all. And I doubt these numbers are lying. They’re just not saying anything at all.

One should never just complain; one should suggest better alternatives. This post is now too long to do so, but I’ll try to compose a more positive and proactive one soon.

Published in: Tim Morris | October 19th, 2013

    3 Comments

    1. Kathryn Warren said (October 20, 2013 at 4:21 pm):

      While I agree with you that students can’t possibly know whether what they learn in a given class will be useful to them in a future at the end of the semester, I still think it’s useful to learn something about their experience in the course, and the end of the semester is a good time to do that because their experience is fresh in their minds. But I don’t know how to best elicit “useful feedback.” Tim, I’m wondering if you think the comments students add (when they add comments; many of them don’t) are any more reliable as a means of evaluation. While allowing them to use words would seem to be a step in the right direction, sometimes the written feedback I get really isn’t any more illuminating than the numbers. –Kathryn

    2. Kathryn Warren said (October 20, 2013 at 4:22 pm):

      In *the* future, not *a* future (although “a” future sounds kind of mysterious, as though the knowledge might be useful in one future but not another).

    3. Desiree Henderson said (October 21, 2013 at 10:15 am):

      Thank you Tim for articulating this critique in greater detail and with greater accuracy than I have ever been able to. I would only add that the specific problems with this specific mechanism of assessment only compound a general problem with all student evaluations: the fact that race, gender, and (perceived) sexuality have a profound impact on how faculty are evaluated. Because there is no good way to control for social bias (except, perhaps, to educate our students out of their biases?), student evals should always already be taken with a pound of salt. But, as you point out, transforming biased written comments into numbers tends to erase the presence of those biases and make the feedback appear “legitimate” and “reliable.”

      Which it ain’t.
