The IDEA Blog




What’s in the Study: Exposing Validity Threats In the MacNell, Driscoll, and Hunt Study of Gender Bias
December 22, 2014

By Steve Benton and Dan Li 

Recent headlines proclaimed that “Students Praise Male Professors” and “Students Give Professors Better Evaluations if They Think They’re Male.” These hasty conclusions followed the publication of an article by Lillian MacNell, Adam Driscoll, and Andrea N. Hunt, “What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching,” in Innovative Higher Education. Although MacNell et al. (2014) believe their research demonstrates gender bias, a closer look at the study’s design and analysis reveals much that calls that assertion into question.

In the study, students (N = 72) enrolled in an introductory-level, five-week summer anthropology/sociology course were randomly assigned to six discussion sections. The course was taught entirely online. The professor took two sections and assigned two each to two instructors, who moderated discussion boards and graded assignments. In one of their two sections, each instructor falsely identified their gender, thereby creating “actual gender” and “perceived gender” sections. Toward the end of the term, 90% of students responded to a 15-item online “course evaluation.” The researchers found no “significant” differences in student ratings between actual female and actual male sections; however, the ratings of the “perceived male” section were significantly higher (p < .05) than those of the “perceived female” section on fairness, praise, and promptness.

Here are our chief concerns about the study:

  • Researcher expectancy effects. Researcher expectancy effects can occur when those carrying out a study know what is expected. MacNell et al. report that “All instructors were aware of the study and cooperated fully.” In other words, the instructors knew in which section they were identified as a person of the opposite gender. The authors should have employed a double-blind procedure so that neither instructor would have known which section was the “perceived gender” section.

As Krathwohl (1993) points out, “Researchers or their assistants may inadvertently tip the scales in favor of an experimental treatment in a variety of ways…for example with encouragement and clues” (p. 468). Notably, two of the three items that differed significantly between the perceived female and male sections were associated with encouragement (i.e., praise) and objectivity (i.e., fairness). Both could have been subject to inadvertent expectancy effects, because the instructors might have responded differently on the discussion boards across their two sections. Given the complexity of online interaction with individual students, it would have been difficult, if not impossible, for the instructors to “maintain consistency in teaching style” (p. 6). Although the authors apparently want us to assume the instructors behaved exactly the same way in each section they taught (the one for their actual gender and the one for their perceived gender), they provide no information about what actually occurred in those course sections. They could have performed a content analysis of the discussion boards, but they did not.

Related to this issue is the fact that the participating students were enrolled in an anthropology/sociology course. Was gender bias a topic in the course? Did the instructors inadvertently express views about gender bias?

  • Inappropriate design and analyses. The design and statistical analyses were flawed in several ways. First, the authors performed sophisticated analyses (i.e., principal components analysis, structural equation modeling, MANOVA) on a sample of only 72 participants who responded to 15 items. Such statistical procedures require a much larger sample size relative to the number of variables measured.

Although the authors viewed their study as a 2 x 2 factorial design, they failed to test the interaction effect of Actual Gender by Perceived Gender.

Class Section   Teacher’s Perceived Gender   Teacher’s Actual Gender
A (n = 8)       Female                       Female
B (n = 12)      Female                       Male
C (n = 12)      Male                         Female
D (n = 11)      Male                         Male
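
The section sizes in the table can be set against the design described earlier with some simple arithmetic. The sketch below assumes that randomly assigning 72 students to six sections was meant to yield equal sections; the variable names are our own, and we treat the n values simply as the counts reported in the table.

```python
# Figures taken from the post: 72 students, six sections, and the four
# section sizes in the table. Variable names are our own (hypothetical).
total_students = 72
total_sections = 6
section_n = {"A": 8, "B": 12, "C": 12, "D": 11}

# ASSUMPTION: random assignment was intended to produce equal-sized sections.
expected_per_section = total_students / total_sections

# How far each reported section falls short of the expected size.
shortfall = {s: expected_per_section - n for s, n in section_n.items()}
analyzed_total = sum(section_n.values())
expected_total = expected_per_section * len(section_n)

for s, n in section_n.items():
    share = n / expected_per_section
    print(f"Section {s}: n = {n}, shortfall = {shortfall[s]:.0f}, "
          f"share of expected = {share:.0%}")
print(f"Four sections combined: {analyzed_total} of {expected_total:.0f} expected")
```

Under that assumption, three sections sit at or within one student of the expected size, while Section A is the clear outlier.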


If the data were consistent with the authors’ hypothesis that “students would rate the instructors they believed to be male more highly than ones they believed to be female, regardless of the instructors’ actual gender” (p. 5), we would expect students in Section C, who believed their female instructor was male, to give higher ratings than their peers in Section A, who rated the same female instructor but knew she was female. By the same reasoning, students in Section D should have rated their male instructor higher than those in Section B. We were perplexed as to why the authors did not conduct such comparisons. Moreover, they reported descriptive statistics only for pairs of sections combined, which masked the actual distribution of ratings in the individual class sections. When we contacted the lead author, she did not respond to our request for descriptive statistics for each of the four class sections.
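
The comparisons we have in mind can be sketched in a few lines of Python. This is a minimal illustration, not a reanalysis: the ratings below are invented placeholders (we do not have the per-section data), and the function names are our own. Only the section labels and sizes follow the table.

```python
from statistics import mean

# HYPOTHETICAL ratings on a 1-5 scale, invented purely for illustration.
# They are NOT data from MacNell et al.; only the section labels and
# sizes (A n=8, B n=12, C n=12, D n=11) follow the table in the post.
ratings = {
    "A": [4, 3, 4, 5, 3, 4, 4, 3],              # perceived female, actual female
    "B": [4, 4, 3, 5, 4, 3, 4, 4, 3, 4, 5, 4],  # perceived female, actual male
    "C": [4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4],  # perceived male, actual female
    "D": [4, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4],     # perceived male, actual male
}

def cell_means(data):
    """Mean rating for each of the four sections (cells of the 2 x 2 design)."""
    return {section: mean(values) for section, values in data.items()}

def factorial_contrasts(means):
    """Effect of perceived gender within each instructor, plus the interaction."""
    female_instructor = means["C"] - means["A"]  # same female instructor
    male_instructor = means["D"] - means["B"]    # same male instructor
    return {
        "perceived_effect_female_instructor": female_instructor,
        "perceived_effect_male_instructor": male_instructor,
        # Interaction: does the perceived-gender effect differ by actual gender?
        "interaction": female_instructor - male_instructor,
    }

means = cell_means(ratings)
contrasts = factorial_contrasts(means)
for name, value in contrasts.items():
    print(f"{name}: {value:+.2f}")
```

A real analysis would go further and test these contrasts inferentially, for example through the interaction term of a two-way analysis of variance. The sketch only shows which cell comparisons a 2 x 2 design calls for.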

  • Differential loss of subjects between groups. With 72 students randomly assigned to six class sections, each section should have contained 12 students. However, this was not the case. There were only 8 students in the section where the female instructor was perceived to be female, while the other three sections analyzed contained 11 or 12 students. In a sample this small, such uneven attrition may have had a noticeable influence on the ratings of instructors. Unfortunately, the authors did not explain the variations in class size.
  • Inappropriate Type I error rate. The authors did not report an a priori Type I error rate (i.e., the probability of rejecting a true null hypothesis, that is, the hypothesis of no difference). Then, in the results section, they decided to use an unconventional .10 level on the student ratings index when .05 is typically used. Although they provided a rationale for this decision, we side with Krathwohl (1993), who recommends that researchers depending on a single study reduce the Type I error rate to .01 or .001 (the opposite of what MacNell et al. did). At the very least, the authors should have suspended judgment about any conclusions (Keppel, 1991) rather than boldly stating, “This study demonstrates that gender bias is an important deficiency of student ratings of teaching.”
  • Student gender. Readers are given no information about the breakdown of student gender within each of the class sections. Existing research has suggested that student gender may have a modest but significant effect on the ratings of male and female instructors (Centra & Gaubatz, 2000). While MacNell et al. claim to have collected information about student gender, they did not report it and thus the gender composition of the subjects remains unknown to readers. Therefore, the observed differences in ratings cannot be fully attributed to gender bias if the effects of student gender were not controlled.
  • Questionable validity of instrument. The 15-item instrument, apparently designed for this study, consisted of Likert-type items inviting students to respond from 1 = Strongly disagree to 5 = Strongly agree. Six items were intended to measure effectiveness (e.g., professionalism, knowledge, objectivity); six were intended to measure interpersonal traits (e.g., respect, enthusiasm, warmth); two were included for communication skills; and one was intended “to evaluate the instructor’s overall quality as a teacher.” No information about the exact wording of the items was provided. Moreover, the authors provided no theoretical rationale for item development and no evidence that the student ratings index correlates with any other relevant measures.
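
To make the Type I error concern above concrete, the following sketch simulates many two-group comparisons in which the null hypothesis is true by construction and counts how often each threshold declares a difference. It is an illustration under stated assumptions, not a reconstruction of the authors’ analysis: we assume normally distributed scores with known variance, a simple two-sided z-test, and 12 observations per group. The function name and its parameters are hypothetical.

```python
import random
from statistics import NormalDist

def false_positive_rate(alpha, n_per_group=12, trials=5000, seed=0):
    """Share of null-true experiments declared 'significant' at level alpha.

    ASSUMPTIONS (ours, for illustration): scores are standard normal with
    known sigma = 1, and groups are compared with a two-sided z-test.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    standard_error = (2 / n_per_group) ** 0.5  # SE of a difference in means
    rejections = 0
    for _ in range(trials):
        # Both groups come from the SAME distribution, so any "effect" is noise.
        group_1 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_2 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(group_1) / n_per_group - sum(group_2) / n_per_group
        z = diff / standard_error
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / trials

rates = {alpha: false_positive_rate(alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, rate in rates.items():
    print(f"alpha = {alpha:.2f}: simulated false-positive rate = {rate:.3f}")
```

Because the null hypothesis holds in every simulated experiment, the share of false positives tracks the chosen threshold. Moving from .05 to .10 therefore roughly doubles the chance of reporting a difference that is not there.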

As we noted previously, the only significant differences (p < .05) were found on three traits: fairness, praise, and promptness. We contend that those three characteristics are not necessarily an indication of overall teaching effectiveness. In fact, the one item that measured “overall quality” was conspicuously left unanalyzed. Why did the authors choose not to report any analysis of this important variable, one that is typically reported in studies of student ratings?

In conclusion, the MacNell et al. study falls short of other studies investigating gender and student ratings. Centra and Gaubatz (2000), for example, analyzed student ratings of instruction from 741 classes in 2- and 4-year institutions across multiple disciplines. They found a statistically significant but not practically meaningful student-gender by instructor-gender interaction: female students, and sometimes male students, gave slightly higher ratings to female instructors. Centra (2009) also found that female instructors received slightly higher average ratings.

In a review of 14 experimental studies, Feldman (1992) found few gender differences (in only two of the studies) in global ratings. In a follow-up study, Feldman (1993) found a very weak average correlation between instructor gender and student ratings (r = .02). Reflecting on the experimental studies, he wrote, “Any predispositions of students in the social laboratory to view male and female college teachers in certain ways (or the lack of such predispositions) may be modified by students’ actual experiences with their teachers in the classroom or lecture hall” (Feldman, 1993, p. 152).

Indeed, the MacNell et al. (2014) study itself found no differences in ratings between the sections taught by the actual female instructor and those taught by the actual male instructor.

This is not to say that gender bias does not exist. We grant that it can be found in all walks of life and professions. But a single study fraught with confounding variables should not be cause for alarm. The gender differences in student ratings that have been reported previously (e.g., Centra & Gaubatz, 2000; Feldman, 1992, 1993) are not large and should not greatly affect teaching evaluations as long as ratings are not the only measure of teaching effectiveness.

As has always been the case, The IDEA Center recommends that student ratings count for no more than 30% to 50% of the overall teaching evaluation. Moreover, ratings and other evaluative material (e.g., student products, peer observations, course documents) should be collected from at least 6 to 8 classes before summative decisions are made about an individual faculty member.


Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44, 495-518.

Centra, J. A. (2009). Differences in responses to the Student Instructional Report: Is it bias? Princeton, NJ: Educational Testing Service.

Centra, J. A., & Gaubatz, N. B. (2000). Is there a gender bias in student evaluations of teaching? Journal of Higher Education, 71, 17-33.

Feldman, K. A. (1992). College students’ views of male and female college teachers: Part I-Evidence from the social laboratory and experiments. Research in Higher Education, 33, 317-375.

Feldman, K. A. (1993). College students’ views of male and female college teachers: Part II-Evidence from student evaluations of their classroom teachers. Research in Higher Education, 34, 151-211.

Keppel, G. (1991). Design and analysis: A researcher’s handbook (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.

Krathwohl, D. R. (1993). Methods of educational and social science research. New York: Longman.



