What to Do About Assessment?
Low-Stakes Activities Using ChatGPT
ChatGPT’s initial appearance on the educational landscape in late 2022 was a shock to many; we observed three main camps of instructors: those who
- sought to outlaw any GenAI use entirely,
- ran to embrace it, and
- were blissfully unaware of (or in denial about) its impact.
Educators in the first two camps, the “early recognizers,” if not early adopters, each learned hard lessons that potentially drew them away from their initial stance. Enthusiasts were confronted with a number of issues around bias, privacy and security, reliability, accessibility, and availability (e.g., Google Bard was available in the US as of March 2023, then rolled out to Europe and Brazil in July 2023, but was never available in Canada; Gemini replaced Bard around the world and finally arrived in Canada in February 2024), while avoiders soon realised how truly difficult it is to police the use of LLM-based tools. These hard lessons were sobering for both camps: it wouldn’t be as easy as simply banning these tools, but neither could it be as easy as throwing the doors open and requiring everyone to engage fully with them. The same points that we discussed in the section on GenAI detectors apply to the LLM-based tools themselves: student copyright; privacy and security of data and identity; bias and potentially toxic speech; unreliability; etc.
As of the time of this writing, Arizona State University was the first—and only—higher education institution to partner with OpenAI, gaining access to ChatGPT Enterprise (ChatGPT 4 with no usage caps), which it will use for “coursework, tutoring, research and more” (Field, 2024). We can expect that other such deals will follow (this one was reportedly more than six months in the making, and no financial details were released), but this is the first instance of a higher education institution entering into an agreement with an AI company that, presumably, will require concrete action on student safety, privacy, and intellectual property protections, and, potentially, the ability for students to opt out.
However, the lack of a formal relationship with one of the private companies that builds these blackbox tools does not appear to be slowing down all institutions: in Indecision About AI in Classes Is So Last Week, “[p]rofessors and administrators from five major public universities provide advice on how to get moving ahead with AI in the classroom right now” (Ward et al., 2023). Yee et al. advocate for helping students develop AI fluency with (over 60) ChatGPT Assignments to Use in Your Classroom Today (Yee et al., 2023). Britain’s National centre for AI in tertiary education has extensive and detailed suggestions for leveraging AI in assessments: Assessment ideas for an AI-enabled world. It may suddenly feel like the world has moved on to fully embrace GenAI in the classroom—no matter what the context—and students are feeling it too. As described in the introduction, 54% of Canadian university students are using GenAI in their studies, and 80% of those students are using it to answer more than one question per day. Fifty-six percent of Canadian students think that universities should promote the use of GenAI in assessments and 66% think that universities should change the way they assess students (Chegg.org, 2023).
If overhauling your entire course or assessment strategy seems impossible, you can always implement small, low-stakes activities and assessments that use ChatGPT. These give you the chance to frame how you see GenAI tools best used in your class and your field, and to teach your students not only how to use the tools but also the ethics and best practices for using them, all while experimenting in a low-stakes context.
We will close this chapter by describing a few smaller activities and assignments, some for in-class and some for online, that you can do with your students.
Generate, Then Regenerate
Have your students interact with ChatGPT on any topic: hobbies, sports, trivia, etc. Have them ask a few questions, and at some point, get them to click “Regenerate” on one of the prompts they’ve already used and compare the two answers they received. They can examine the quality of writing, the accuracy of the response, or evaluate the replies based on other criteria. Next, have your students enter a prompt related to a course topic and then click regenerate. As before, have them compare the outputs, evaluate their qualities, and determine which is better (and whether either is actually good or accurate, or whether both have failings). Students can get into pairs or groups to compare their prompts and outputs.
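For instructors who want to explain why “Regenerate” produces different answers, the underlying mechanism is sampling: at a non-zero temperature the model draws each next token from a probability distribution rather than always choosing the single most likely word. The following is a minimal sketch, not part of the activity itself; it assumes the openai Python package (v1+) and an API key in the environment, and the model name is purely illustrative.

```python
# A minimal sketch: requesting two completions of the same prompt to mimic
# "Regenerate". Assumes the openai Python package (v1+) and an OPENAI_API_KEY
# in the environment; the model name is illustrative, not prescribed.
from openai import OpenAI

client = OpenAI()

prompt = "Explain, in two sentences, why the sky is blue."

response = client.chat.completions.create(
    model="gpt-4o-mini",          # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.8,              # non-zero temperature -> sampled, varying output
    n=2,                          # ask for two independent completions
)

for i, choice in enumerate(response.choices, start=1):
    print(f"--- Answer {i} ---")
    print(choice.message.content)
```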
Online students can do the first part of this activity on their own, at their own pace, and then share their reflections in a discussion board or live chat session.
This can be an ungraded activity, or students can hand in their prompts, outputs, and reflections for grading. The reflection is the most important thing to grade and provide feedback on.
This activity can also be done on multiple LLM-based tools, comparing the accuracy of responses across ChatGPT, Copilot, Perplexity, Gemini, etc. Students can determine whether one tool is better than another for specific prompts, types of outputs, different topics or subjects, or according to other criteria. Students can also change the context instructions to see whether the tool alters its output for similar prompts.
Using the same prompt to generate different answers can be the beginning of a class-wide conversation about the strengths and limitations of using various GenAI tools, as well as a discussion of both ethics and best practices in the classroom and in the discipline.
Prompt Engineering Skills
Once we acknowledge that
- GenAI is here to stay,
- most people will use it in their jobs in one way or another (including higher education STEM instructors!), and
- 66% of Canadian post-secondary students would like their courses to include training in AI tools relevant to their future career
it seems clear that we must teach students how to use it. At some point in their education, students learned to print, and some learned (or will learn…) cursive writing. They learned to type, to use a word processor, to write an essay. They learned addition and subtraction, then how to use a calculator (or abacus or slide-rule) properly, and many use a variety of discipline-specific tools and programs, all of which they learned throughout the course of their education, usually in a formal instructional setting. A tool as far-reaching and as impactful as ChatGPT should be included in formal, potentially discipline-specific, instruction.
Once students and teachers have had discussions about bias, ethics, reliability, etc. of LLM-based tools, they can get down to learning how to optimize the output. Note: these discussions have the potential to be very interesting, not only because the topic of GenAI is likely outside of most instructors’ area of expertise, but also because the technologies, capabilities, and tools themselves are evolving so quickly and because this is an area where students may be better informed than their teachers. Or, perhaps students are better informed about some aspects, but not others (“Have you seen all of these amazing things it can do? Oh, but no, I had no idea about the copyright implications!”).
The major stride forward from previous AI tools to ChatGPT et al. is the ability to converse with the tool in natural language instead of through explicit, structured commands. It is the conversational aspect of these tools (as opposed to a traditional, pre-LLM search engine where you ask one thing and get that one thing [hopefully]) that is one of their greatest strengths. And, just as with humans, there is a skill in being a good conversationalist, interviewer, or examiner. Many of the successful approaches for LLMs are different from those you would use with a human interlocutor, but others are shockingly similar. Enter the field of prompt engineering.
You have perhaps heard of GIGO (garbage in, garbage out), the idea that flawed or nonsense input produces bad output. Prompt engineering seeks to achieve the opposite by designing prompts (the inputs) for AI models in order to elicit specific, high-quality, or otherwise optimized outputs. In this case, the “garbage” consists of questions that are too vague, too broad, cover too many topics, or are otherwise unclear, and these will produce equally unfocussed and disappointing results. Crafting queries that will unleash the power of an LLM-based tool trained on billions of words requires linguistic precision as well as an understanding of the model’s mechanics.
Lance Eliot goes so far as to say, “The use of generative AI can altogether succeed or fail based on the prompt that you enter” (Eliot, 2023). There is a whole domain of prompt optimization attempting to figure out how to best query these ever-evolving tools, and there are many ways to introduce your students to the practice of writing good prompts.
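To make the contrast concrete for students who code, here is a minimal sketch (not a definitive recipe) comparing a vague prompt with one that adds a role, an audience, a format, and constraints. It assumes the openai Python package and an API key; the model name and both prompts are illustrative.

```python
# A minimal sketch contrasting a vague prompt with an "engineered" one.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the
# environment; the model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

vague_prompt = "Tell me about photosynthesis."

engineered_prompt = (
    "You are a teaching assistant for a first-year biology course. "      # role
    "Explain photosynthesis to students who have not taken chemistry, "   # audience
    "in no more than 150 words, "                                         # constraint
    "as a numbered list of the main steps, "                              # format
    "and end with one everyday analogy."                                  # added context
)

for label, prompt in [("Vague", vague_prompt), ("Engineered", engineered_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"=== {label} prompt ===")
    print(response.choices[0].message.content)
```

Comparing the two outputs side by side usually makes the value of specificity, audience, and format constraints immediately visible.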
Deconstruct, Then Reconstruct
As an introduction to prompt engineering, the instructor can provide students with a range of prompts and their outputs (potentially just from one tool, or multiple). Have the students analyze what makes a prompt effective or ineffective. Then, ask students to reconstruct the ineffective prompts to improve the responses, which they can then test out. The goal is for students to be able to identify key elements of prompt engineering, such as clarity, specificity, and context. This can be done alone, or in groups. In an online course, students could use discussion boards or a tool such as Padlet to show each other their original prompts and their improved prompts. Students can work iteratively on the prompts in a group chat, discussion board, or shared document.
Design Workshop
Once students are comfortable with the basic components of a good prompt, you can move to the application level: students are assigned a specific question or problem they must get the AI tool to help solve. They are tasked with
- creating prompts to elicit the best possible answers,
- testing their prompts, and
- iterating them based on the results.
Students can do this task on their own or in pairs/groups. The problems should be complex enough that multiple prompts are required.
Nuance in Prompting
In order to understand how slight differences in prompts can affect the output, have students create multiple prompts for the same task, varying their structure, tone, specificity, etc. Once they have a series of prompts, students can enter them and judge the output, marking a rubric or taking notes on how these differences impact the responses from the AI tool. Students can also create prompts in other languages, if they know them, to determine how well the tools function in languages other than English.
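For courses where students can program, the same exercise can be run systematically. The sketch below (assuming the openai Python package and an API key; the variant labels, prompts, and model name are all hypothetical) collects the output of several prompt variants into a CSV file that students can then score against their rubric.

```python
# A minimal sketch: run several variants of the same task and save the
# outputs for rubric-based comparison. Assumes the openai package (v1+) and
# an OPENAI_API_KEY; labels, prompts, and the model name are hypothetical.
import csv
from openai import OpenAI

client = OpenAI()

variants = {
    "terse":    "Summarize the greenhouse effect.",
    "formal":   "Provide a formal, referenced summary of the greenhouse effect.",
    "specific": "Summarize the greenhouse effect in 100 words for a grade-10 "
                "audience, naming the three most important greenhouse gases.",
    "french":   "Résumez l'effet de serre en 100 mots pour des élèves de dixième année.",
}

with open("prompt_variants.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["variant", "prompt", "output"])
    for label, prompt in variants.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        writer.writerow([label, prompt, response.choices[0].message.content])
```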
Bias Recognition
Based on conversations in class, your students will hopefully be aware of the bias inherent in LLM-based tools (although this is evolving, and recent mitigation efforts such as those undertaken by Google in Gemini in late February 2024 are definitely problematic). There are a number of scenario-based activities you can do to have them create concrete examples for themselves of bias—and hopefully of bias mitigation—in their work.
Gender Bias
Scenario: An AI-based tool is asked to describe professionals in various fields, such as engineering, nursing, teaching, and coding. The tool consistently assigns stereotypical gender to certain professions.
Discussion Points: Discuss the implications—personal, societal, professional, etc.—of reinforcing stereotypes through AI responses. How can prompts be structured to neutralize gender assumptions? How should outputs be vetted to mitigate bias? What are the latest tools doing to address bias—and is it working?
Socio-economic Bias
Scenario: An AI-based tool is asked to suggest solutions for urban transportation challenges. It consistently favours high-tech, high-cost solutions over more accessible or low-cost types of mass transit.
Discussion Points: How does the bias towards technologically advanced solutions affect the usefulness of AI recommendations in other socio-economic contexts? How can students encourage the tool to consider—and plan for—diverse economic realities?
Bias in Medical Advice
Scenario: Advice on health and wellness topics generated by an LLM-based tool reflects existing biases in medical research, such as under-representation of certain demographics (women, pregnant people, older people, children, etc.) in clinical trials, confounding bias (where correlative factors are not accounted for, leading to a false causal association), or language and cultural bias.
Discussion Points: Discuss the potential consequences of biased health advice (for individuals and society as a whole) and explore how prompts can be designed to ask for information that might be more inclusive of under-represented groups.
Bias in AI-Assisted Research Data Analysis
Scenario: A group of students uses an AI-based tool to analyze genetic data for a project. They notice that the AI’s interpretations and predictions heavily favour data from populations of European descent, reflecting biases in the underlying training datasets.
Discussion Points: What is the impact of dataset composition on AI analysis in scientific research? How can prompts be structured to account for or highlight the limitations of the data? Can outputs be trusted?
Ethical Implications of AI in Environmental Modelling
Scenario: Students employ an AI tool to model climate change impacts in various contexts. However, the AI disproportionately focuses on scenarios relevant to countries in the Global North, overlooking the nuances and specific needs of regions in the Global South.
Discussion Points: Discuss the importance of inclusive and globally representative environmental data. How can prompts ensure that AI models consider diverse ecological and socio-economic impacts? Do the LLMs have a rich enough training set to account for experiences outside Europe and North America?
Bias in Facial Recognition Technologies
Scenario: Students develop an AI project that involves facial recognition technology. They discover the model performs poorly on faces from certain ethnic backgrounds due to biases in the training data.
Discussion Points: What are some ethical considerations and societal impacts of biased facial recognition technologies? Discuss prompt engineering strategies to test for and mitigate these biases. Given that racial bias is a long-standing problem in AI-based tools, are things improving, or are errors and discrimination only becoming more widespread?
Ethical Use of AI in Academic Research
Scenario: A research team uses AI to automate the literature review process for a new scientific study. They find that the AI tends to cite papers from a limited set of journals, potentially biasing the review.
Discussion Points: Given the importance of diversity in scientific literature, what is the role of AI in ensuring a broad and unbiased review? How can prompts encourage a wider search of sources?
Dealing with Misinformation and Hallucinations
One of the most serious flaws of ChatGPT (and of other LLM-based tools, to a greater or lesser degree) is its propensity to hallucinate, or simply make things up. Sometimes it gets answers partly wrong, and sometimes it fabricates information entirely. Training students to watch for inaccuracies—while honing their critical thinking skills—is important for their future professional success.
Catching Hallucinated Sources
ChatGPT is especially prone to hallucinating sources and citations: it’s common for the tool to acknowledge “a real scholar, perhaps even the exact ideal scholar an expert might quote, but with publication titles or journals listed that sound realistic, yet do not exist” (Yee et al., 2023, p. 46).
Yee et al. suggest an activity where students use AI to generate a bibliography on a particular essay topic, then verify the articles in the library’s database, creating a screenshot to confirm (or reject) that all the sources are accurate. Students could summarize one of the articles, or, with ChatGPT’s help, the entire related body of work, for their classmates to read.
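For students comfortable with a little code, part of this verification can be automated. The sketch below is one possible approach, not a replacement for checking the library’s database: it queries the public Crossref API for the closest matches to a citation’s title, so a missing or wildly different match becomes a red flag worth investigating further. The suspect title is hypothetical, and the requests package is assumed.

```python
# A minimal sketch: query the public Crossref API to see whether a citation's
# title appears in the scholarly record. A missing match is a red flag, not
# proof of hallucination. Assumes the requests package; the title is made up.
import requests

def crossref_lookup(citation_title: str, rows: int = 3) -> list[str]:
    """Return the closest-matching titles Crossref knows about."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation_title, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [item["title"][0] for item in items if item.get("title")]

# Hypothetical AI-generated reference to check
suspect = "A Longitudinal Study of Invented Results in Imaginary Journals"
print("Closest Crossref matches:")
for title in crossref_lookup(suspect):
    print(" -", title)
```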
Challenging Fallacies
Similar to the idea of hallucinations is the existence of logical fallacies. Because LLM-based tools aren’t “thinking” on their own but are predicting the next word in a string of text, and because, as we have already seen, some of the training datasets can be of dubious quality, ChatGPT may reproduce fallacies similar to those it was trained on. Students in all disciplines should be trained to watch for fallacies, and there are a number of logical fallacies that an LLM-based tool could perpetuate in a STEM context.
Appeal to Authority (Argumentum ad Verecundiam)
LLM-based tools might rely too heavily on authority figures in certain fields, suggesting that a claim is true simply because an expert or authority asserts it, without presenting concrete evidence. This can be misleading if the authority’s opinion is not widely accepted or is out of their area of expertise. This challenge is compounded by the opacity of LLMs: sources cannot be traced back to clarify meaning or nuance (Birhane et al., 2023).
Instructors could challenge students to research the purported authority figure and the claims to determine whether ChatGPT has accurately presented the information. It is difficult for non-experts to catch ChatGPT’s inaccuracies on many topics, especially because it “speaks” with such confidence, so when it appeals to authority, it can be a double threat. Instructor/expert guidance for students is important for this type of activity.
Post hoc Fallacy
Similar to the fallacy of correlation implying causation, ChatGPT can fall prey to the post hoc fallacy by mistakenly asserting a causal relationship between two events just because they occur sequentially.
Educators can create activities or discussions focussing on examples of this fallacy and encourage students to identify similar uses by the AI. For example, ChatGPT output might
- suggest that because technological advancement has increased since the popularization of the Internet, it is the Internet that has directly caused all modern technological advancements.
- suggest that the introduction of genetically modified organisms (GMOs) in agriculture directly led to a decline in bee populations (ignoring the complex reasons behind bee decline, such as pesticide use, habitat loss, and climate change).
- claim that because a certain environmental change occurred before a particular species evolved a new trait, the environmental change directly caused the evolution of that trait (disregarding genetic variation, selection pressures, and other environmental factors).
- assert that an increase in vaccination rates in a population directly resulted in the reduction of a completely unrelated disease (conflating correlation with causation and ignoring other health interventions or natural disease progression patterns).
Students can write rebuttals to these fallacies, explaining why the arguments are specious and offering alternate explanations. This activity can be done in pairs or groups, and groups can trade descriptions for their peers to review and expand upon.
False Dichotomy (False Dilemma)
In discussions involving complex problems, ChatGPT output may present issues as having only two possible solutions when, in fact, more options exist. Birhane et al. (2023) warn of this over-simplification occurring when using LLM tools to summarize complex scientific papers. Using an AI tool that frames complex issues as binary choices overlooks the nuanced, multi-stepped, and potentially multifaceted solutions that are often required in STEM problems.
As with previous fallacy types, activities that encourage students to challenge the tool’s output, improve their critical thinking skills, and offer rebuttals and alternate solutions or descriptions are important to integrate into all courses.
Media Attributions
- The images in this chapter were created using DALL·E