What to Do About Assessment?

Sample Assessment Description

We present the following assignment as a case study in the necessity of thinking through all the repercussions of an assessment design, rather than simply creating something that looks, on the surface, fit for purpose.

In the flurry of “What do we do now??” that followed the November 2022 launch of ChatGPT, many GenAI enthusiasts and GenAI skeptics ran quickly in the same direction when it came to ensuring valid assessment: “Let’s use it.” The true enthusiasts saw ChatGPT as a tool that could become as important as the calculator, spellcheck, or Google: a tool they could use seamlessly every day to improve their productivity and their lives. The skeptics were less happy about the tool and sought to “ChatGPT-proof” their assessments, potentially by incorporating ChatGPT into the assessments themselves.

At the time, one very common proposal, which would, at first blush, fulfill the requirements of

  1. assessing learning,
  2. teaching students about the limitations of LLM-based tools, and
  3. preventing unauthorized use of ChatGPT

was any variation on the idea of “Ask ChatGPT to do your homework, and then critique/correct it.” The approach of having students critique or correct material is common in language classrooms (“identify and correct all the grammatical errors”) and in computer science and math assignments (debugging code and redoing or correcting calculations). Some science courses have students critique experiment designs, looking for flaws or improvements, and pharmacy students may be asked to review prescriptions, looking for calculation errors or conflicting medications.

But this assessment approach is not common in other disciplines, where students usually spend more time creating work than critiquing others’ work. Nonetheless, in the frenzy of “What do we do about ChatGPT?”, this was viewed as an ideal assessment solution, as it fulfilled the three requirements listed above. Under closer examination, however, and outside the few specific disciplines that already use this assessment technique for particular goals, the approach has a number of drawbacks.

The first drawback is that “critiquing something that an LLM wrote” likely doesn’t address all of the intended learning outcomes of the original writing assignment: “critiquing” is very different from “brainstorming, outlining, organizing, writing, proofreading.” If the assessment changes, educators need to make sure that the learning outcomes are appropriate to the new assessment (and that the original learning outcomes, if they are essential requirements of the course, are addressed in some other assessment).

Secondly, in situations where students are asked to critique or correct a passage, some code, a spreadsheet, a calculation, etc., all students are given the same material to work with. The material to be corrected or critiqued is intentionally designed with certain flaws that lead to specific learning in the discipline: the errors are not random but are created for a particular function, linked to the learning outcomes. When students use ChatGPT to generate their own artifact to critique or correct, not only may those specific flaws be absent, but every student will be working on a different, uncontrolled text. Some students may inadvertently generate a text with no errors, or with nothing to critique; these students could lose marks because they “didn’t do any work.” Another student might generate a text with multiple flaws and, failing to identify them all, lose marks for the oversight. Further, errors or hallucinations in LLM-generated texts are often nonsensical, which means that students may not even know what kinds of “errors” they are looking for and might misidentify things simply so that they have something to “critique” or “correct.” Alternatively, the errors may be so specialized or subtle that only an expert could identify them, and students would miss them entirely. Or students might spot some errors, but only in non-germane details, such as a wrong place name or a misattributed author, with nothing to do with the actual topic of the assignment. An instructor would never design an assessment with random errors; rather:

  1. Any instructor requiring students to correct or critique work would have chosen that design based on the learning outcomes, and
  2. The errors in the passage (or calculation or code) would be specific and consistent with teaching a certain concept (“how to form irregular plurals;” “errors in logic;” “how to properly calculate medication doses;” etc.); a brief sketch of such a deliberately seeded error follows this list.
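
To make the contrast concrete, here is a hypothetical sketch (our own illustration, not drawn from any particular course) of the kind of deliberately seeded error an instructor might build into a “find and fix the bug” exercise, using the medication-dose example above. The function name and values are invented for this illustration; the point is that the single flaw is intentional and tied to one concept (unit handling), unlike the unpredictable errors an LLM might produce.

    # Hypothetical debugging exercise: the instructor seeds exactly one
    # deliberate error, tied to one learning outcome (unit handling),
    # so every student confronts the same, purposeful flaw.

    def dose_volume_ml(weight_kg, dose_mg_per_kg, concentration_mg_per_ml):
        """Return the volume (mL) of a liquid medication for a weight-based dose."""
        total_mg = weight_kg * dose_mg_per_kg
        # Seeded bug: multiplies by the concentration instead of dividing
        # by it, yielding a nonsensical unit (mg^2/mL) instead of mL.
        # The fix students should find:
        #   return total_mg / concentration_mg_per_ml
        return total_mg * concentration_mg_per_ml

    # A 20 kg patient, a 15 mg/kg dose, and a 50 mg/mL suspension:
    # the correct answer is (20 * 15) / 50 = 6.0 mL, not 15000.
    print(dose_volume_ml(20, 15, 50))

An LLM-generated text offers no such guarantee: its flaws, if any are present at all, are unpredictable in number, kind, and relevance to the concept being taught.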

And finally, there is a drawback that affects instructors more than students: in order to properly ascertain whether a student has competently critiqued or corrected the LLM-generated text, the instructor would have to read not only every student’s critique but also every student’s source text, to verify that the student had, in fact, caught all the problems, thereby doubling the grading load.

For all of these reasons, this assessment design is suboptimal.

However, there is a type of assignment that takes a similar approach, fulfills the requirements listed above (assessing learning, showing students the limitations of LLM-based tools, and preventing unauthorized use of ChatGPT), and has been implemented in various contexts and disciplines. We will look at David Nicol’s approach to “inner feedback” later in this section.