12 Critical Thinking about Critical Thinking Tests
With instructors feeling the need to demonstrate the effectiveness of their teaching, there are now hundreds of critical thinking tests. It would be a massive task to comment on their merits; obviously, a chapter written years ago can be of at best limited relevance to that body of work. Given the common requirement that critical thinking skills and dispositions be indicated on a mechanically scorable test of about 50 minutes in length, there are restrictions. Questions must be about fairly manageable issues requiring only limited background material, and they must be worded so as not to be open to different interpretations. There are many other requirements; these alone suggest that such tests will favour invented passages. Should essay questions supplement short-answer tests?
If certain sorts of questions need to be removed, further issues will arise about the scope of the tests. How would those working in argumentation studies, critical thinking, and test development today respond to such concerns? When this chapter was written, I was concerned about social and institutional demands for quickly obtainable, quantifiable results that would measure something as fundamental and important as critical thinking skills. I wondered whether there was something rather absurd about the idea that people should be asked to demonstrate, in short and clear order, their capacity to think critically – and also something absurd about the notion that philosophers should engage in constructing tests to do just that. Professional interests lead to such efforts, and there are hundreds of results. But there is still space for reflection on the assumptions and institutional requirements underlying them.
As I write, in the U.S. presidency of Donald Trump, there is great concern about problems of fake and allegedly fake news, media bias and distortion, ‘alternative facts’, truthiness, and confabulation. In high quarters, there seems to be little respect for facts and truth, consistency and coherence – to say nothing of decent civility. We find increased public attention to the importance of critical thinking; indeed, many cry out for more of it. By implication, there should be increased attention to the teaching and cultivation of critical thinking and to the quality of related testing material. Sadly, some who respond seem carelessly unaware of decades of work by informal logicians and argumentation theorists, substituting instead their own sense that they themselves are competent critical thinkers. The challenges are many and of great importance.
If we are going to claim to teach critical thinking, we will want to check out our claims. On standard accounts, this would involve testing students for critical thinking abilities or skills. If students test higher after a course than before it, we would naturally infer that the course has improved their critical thinking skills. In addition, since we value critical thinking ability, we may wish to test it as a basis for admission to occupations or educational programs. Standard testing could then be used to rate students, and critical thinking ability as thus tested could be a factor in admission decisions.
Tests to measure critical thinking have existed for at least three decades. For senior high school and college-age students, the best known are the Cornell Critical Thinking Tests and the Watson-Glaser tests.1 The Cornell tests come at two levels – Level X, for students from grades five and six up to the high school level, and Level Z, for college-age students. Other critical thinking or reasoning skills tests are aimed at much younger students, in elementary schools, and some are designed to be used at various ages ranging from elementary to senior high school.
Critical thinking tests have, until recently, been given relatively little scrutiny and analysis from philosophers. Nor have they received the political attention given to I.Q. and some other psychological tests. A recent article in Harper’s magazine, scathing in its comments on tests marketed by the American Educational Testing Service and the social attitudes generating a demand for them, had not a word to say about critical thinking tests.2 Since these tests are based on the idea that critical thinking ability is something teachable and acquired by students, rather than something fixed which might be inherited and racially or sexually linked, the context in which they have been developed is radically different from that underlying I.Q. tests.3
Nevertheless, when John McPeck attacked critical thinking tests in his book Critical Thinking and Education, his comments received a surprised but sympathetic response from many philosophers. A seminar on the tests, in which McPeck exchanged comments with Robert Ennis, creator of the well-established Cornell tests, stimulated considerable interest at a symposium on informal logic.4 Audience consensus seemed to be that the debate was fascinating and the result a draw.
Reflection on critical thinking tests brings out many interesting issues about critical thinking, argument analysis, inferring skills and abilities from results, and the implications of using machine-gradable short-answer tests for a variety of social purposes.
1. Mechanized Tests and the Concept of Critical Thinking
The concept of critical thinking is a broad and contestable one. The kind of critical thinking one might do in improving on a play or in revising the fundamental assumptions of a social theory is not likely to be elicited on a mechanically scorable 50-minute test.5 The context is too limited for either: the focus is too verbal for the former and too atomistic and temporally restricted for the latter.6 In teaching there is inevitably some restriction on what is taken as critical thinking. Restrictions are bound to be greater in the context of a short-answer, machine-gradable test that can be taken in an hour or so.
Mechanically scorable critical thinking tests have to deal with articulated thought about small, easily described issues where answers do not diverge because of differences in political or ethical perspective or varying background knowledge. Questions and illustrative material have to be minimally susceptible to divergent interpretations. They have to be clear-cut and expressible in brief phrases. They cannot presume much background knowledge about any particular substantive issue unless those to whom the test is to be given can be expected to have uniform knowledge on that matter – a condition that would rarely be met in practice. Interesting figures of speech, irony and sarcasm, and suggestive ambiguities have to be avoided. Thus tests almost always use invented passages rather than real ones. From all these requirements, we can see that the construction of such tests is indeed a challenging task.
Obviously, these necessary restrictions mean that many aspects of critical thinking could not possibly be included on such tests. We cannot in such a context test for abilities to articulate judicious value judgments about sensitive topics, to find underlying metaphysical or social assumptions, or to identify mechanical or aesthetic flaws not easily characterized in words. Nor can we handle material incorporating and exploiting subtle nuances of meaning. The format of the tests, and the social demands placed upon them, make these things impossible. These matters are too profound to be amenable to the short-answer, timed format, too controversial by nature for sufficient agreement on a key of answers, and too hard to sum up in a few words. If we value these aspects of critical thinking, we are likely to conclude that critical thinking ability is not testable by mechanical short-answer methods.
There is a distinction to be drawn between internal and external criticism of critical thinking tests. Internal criticism focuses on the details of particular tests. What range of items, within the range the format can handle, is included on the test? Is the proportion of various items about right? Are instructions given in clear language? Most significantly, are keyed answers correct, and are they, among the alternatives offered, the only answers that can reasonably be seen as correct? These internal issues arise once one has decided to elicit responses in a mechanical test format and to take those responses as indicative of critical thinking ability. If internal criticism should reveal flaws in popular tests, it is an important issue whether those flaws are endemic to the endeavour or whether they are, as it were, accidental. If we found flaws and decided they were of a type avoidable only by stringent restrictions on the content of tests, we would naturally be led to the stage of external criticism.
External criticism is broader in focus. It asks how significant, for critical thinking ability, are those aspects of critical thinking that cannot in principle be handled on these tests. It may also address the socio-political question of why there is a felt need to measure critical thinking ability by means of a machine-gradable short-answer test, and whether this is a need that philosophers and other academics should try to meet. Here, sections two and three treat internal issues, section four external ones.
2. Test Performance and Critical Thinking Ability
To reach a conclusion about someone’s critical thinking ability on the basis of test performance requires a number of inferences. Interestingly, if we examine the inference stages here, there is an asymmetry between the positive and the negative case. That is, different kinds of things can interfere with the merits of an argument from doing well on a test to having critical thinking ability and with an argument from doing poorly on such a test to lacking it.
To see how this works, let us first consider the positive case. Suppose that a critical thinking test has been constructed and an individual performs well on that test. We wish, then, to reason from that good performance to the judgment that he or she has critical thinking ability. This conclusion will be based on a number of steps, which may be ordered as follows:
(1) S does Q. (That is, S answers a high number of questions on the test correctly. This is a straight behavioral judgment; it is to say that a suitably high percentage of S’s answers coincide with keyed answers.)
Therefore,
(2) S can do Q. (That is, S is able to answer a high number of questions on the test correctly.) There are two presumptions here. The first is that the coincidence of S’s answers with keyed answers is due not to fluke, accident, or cheating, but rather to features of S. The test-taker gets keyed answers because of some ability or mental power he or she has. This leads to the second point. The claim implies that the keyed answers really are correct answers. If keyed answers were wrong, or if they were correct but not uniquely correct, getting answers that coincided with keyed answers would not show that one can get these answers in the sense of having an ability or power to arrive at them.7 It would appear to be accidental, the product of training, due to a similarity in tendency to err between the respondent and the person who constructed the test, or the result of a combination of these factors.
Therefore,
(3) S can do Q’, where Q’ is a set of questions similar to, but not identical with, those on the test. (That is, the questions Q are a subset of Q’, and they represent the whole set sufficiently well that being able to do them correctly is good inductive evidence that one will be able to do the rest correctly. Questions represent an array of questions, and problems an array of problem types. For example, if S can do, on the test, some particular deductions involving class relationships, then S can do many further deductions relevantly similar to those presented on the test.)
Therefore,
(4) S has (a high level of) critical thinking ability. (Given that the questions in Q’ represent the whole array, or a very significant proportion, of the questions that capture the concept of critical thinking, and that S knows how to do questions of this type, we conclude that S has critical thinking ability.) The sequence is summarized schematically below.
It is useful to look at this sequence of inferences and see what issues arise at various points. At the first stage, from (1) to (2), what is at issue is, in part, background circumstances pertaining to teaching, test taking, and the construction of the test. It is important to note that these factors are largely within the control of teachers, professors, and test constructors, and do not pertain to personal idiosyncrasies or problems in motivation or concentration that may affect respondents. If a test is poorly constructed, someone may get a large number of correct answers merely by guessing, or by being attuned to the kind of thing test constructors usually look for and the kinds of constructions they usually make. To see this, consider some extreme cases. Keyed answers might fall into a pattern – abba, bccb, cddc, or the like – which the respondent detects. Or a person might have been taught to take that very test, filling in many correct answers from rote memory, so that the correctness of his answers did not show that he could answer even those questions he got right.8
The inference from (1) to (2) presumes that keyed answers are right and, among alternatives offered, uniquely right. We wish to infer from the coincidence of a subject’s answers with the keyed answers that he has the ability to get these answers. The notion of ‘ability’ in this context is normative and contains within itself the implication that the answers are correct. The ability to answer a problem in this context is more like the ability to sing than like the ability to breathe. A person cannot breathe incorrectly, so if he breathes at all and the breathing is a product of his own powers rather than mechanical intervention, he has the ability to breathe. However, a person can sing badly or well; speaking of the ability to sing, we would typically imply satisfactory performance. Thus, from the fact that a person sings at all, it does not follow that he has the ability to sing, as ‘ability to sing’ would commonly be used in this context. Clearly, in a test context, we are concerned not merely with the capacity to understand test questions and insert answers, but with the ability to answer the test questions correctly. To get any evidence for this from performance on the test clearly presumes that answers keyed as correct really are correct.9 If the test is well designed so that getting good answers is almost certainly a result of characteristics of the respondent, and if there are no contestable answers, then the inference from (1) to (2) should go through.
The second inference, from (2) to (3), raises a new issue: the representativeness of the various questions on the test. Given any question, we can construct a similar question by varying a nonessential feature. If a respondent can detect that the inference in ‘If Fred is thin, he is fit; Fred is fit; therefore Fred is thin’ is flawed, it would be absolutely astounding if he could not detect the same flaw in ‘If Joe is thin, he is fit; Joe is fit; therefore he is thin’. Necessarily, there will be generalizations. The question is how far we can generalize – how we generate Q’ from Q. Surely there would be no controversy about substituting ‘Joe’ for ‘Fred’ in the above example, but other substitutions are not so straightforward. Seeing the question as an example of faulty reasoning using a conditional, someone might suggest that anyone who could get this question right has understood the conditional and would get the right answer on ‘If Jane Fonda folk dances, Jane Fonda is fit; Jane Fonda does not folk dance; therefore Jane Fonda is not fit.’ Alternatively, one might see the question as representative only of instances of affirming the consequent, while taking this to be independent of subject matter. Thus if a respondent could handle the initial example, he should be able to handle the formally similar ‘If mermaids are mathematicians, mathematicians have tails; mathematicians have tails; so mermaids are mathematicians.’ Alternatively again, one might see the question as similar to another instance of affirming the consequent whose component statements are more complex and whose subject matter is more abstract, as in ‘If science and philosophy have the same essential structure, then, if science is partially empirical, philosophy is partially empirical. If science is partially empirical, philosophy is partially empirical. Therefore, science and philosophy have the same essential structure.’
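In standard propositional notation, the two invalid conditional patterns at issue here are affirming the consequent (the Fred and mermaid examples) and denying the antecedent (the Jane Fonda example):

$$\frac{p \rightarrow q \qquad q}{\therefore\; p} \qquad\qquad \frac{p \rightarrow q \qquad \lnot p}{\therefore\; \lnot q}$$

In both, the premises can be true while the conclusion is false, since $p \rightarrow q$ leaves open that $q$ may hold, or $p$ fail, for reasons independent of the conditional.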
Probably most logicians would be comfortable with the first two variations as members of Q’ but uncertain about the third. Psychological evidence is surely relevant here, however. It is possible that formally similar examples, even at the same level of formal complexity (as the first two variations are and the third is not), are not handled similarly by most test respondents. The second variation includes counterfactual material of a bizarre nature, which might confuse respondents. In fact, even the initial example, involving thinness and fitness, is emotionally charged in our culture in a way some other examples of affirming the consequent would not be. It might for this reason be assimilated to other cases of reasoning on clearly emotional material. Psychological evidence indicates that the expectations of logicians in this area are frequently not met. People may be clear on conditional relationships when they deal with familiar, unthreatening subject matter, yet handle them badly when they move to unfamiliar, highly emotional, or highly abstract subject matter.10
Thus, the generation of Q’ from Q depends on both logical and psychological considerations. It is a logical judgment that two examples embody the same logical principle. It is a psychological judgment that someone who can grasp and apply this principle to the first example can do so with the second. Possibly logic and psychology will diverge in unexpected ways. Perhaps people will be able to detect affirming the consequent when the example is mathematical and the conclusion reached false, but not when the example involves health and nutrition and the conclusion represents a widely held and socially powerful belief. Or perhaps people who easily detect persuasive definitions in contexts of advertising will not detect them in contexts of political speeches by authoritative figures, even though their logical structure is essentially the same in both cases. These questions are empirical.
The next and last inference, from (3) to (4), will depend on how representative the questions in Q’ are of issues calling for critical thinking. If we specify a list of all the kinds of things a critical thinker should be able to do11 and compare Q’ with such a list, then the inference from (3) to (4) will depend on how close the match is. Suppose, for example, that we decided we wanted a critical thinker to be sceptical about the most fundamental metaphysical and political assumptions of our society, and we wanted him or her to have a disposition to apply this kind of scepticism within daily life. This desideratum would likely not be captured by Q’. So if we had such an element as an essential part of our concept of critical thinking, we would regard the inference from (3) to (4) as shaky, first conceptually and, as a result, inductively. If, on the other hand, we were to define critical thinking as the ability to do deductive and inductive inferences, evaluate analogies, identify fallacies, and detect vagueness and ambiguity, we might easily find all these aspects in Q’. The inference from (3) to (4) would then appear strong. The concept of critical thinking is an essentially contestable one, so this inference is bound to be questionable.
It is instructive to compare the series of inference steps just explored to a parallel series premised on poor test performance.
(x1) S does not do Q. That is, the respondent does not answer (most) questions correctly. Again, this is a straight behavioral judgment. There is a discrepancy between the answers the test-taker gives and the keyed answers.
Therefore,
(x2) S cannot do Q. This is to say that the respondent fails to answer correctly because of some aspect of his or her capacities pertaining to the questions. The failure is not due to rebelliousness, sleepiness, accident, lack of attention, misreading of instructions, or flaws in questions.
Therefore,
(x3) S cannot do Q’. The questions S cannot do on the test are taken to represent a broader range of questions, and S’s inability to do them indicates inabilities in this broader area as well.
Therefore,
(x4) S has little critical thinking ability. Questions in Q’ represent a core area of critical thinking ability, and S is unable to do them, so he or she is a poor critical thinker.
A very important asymmetry occurs at the first level of inference here. As in the positive case, things may go wrong. But there is a difference, in that what can go wrong includes aspects of the respondent and his or her situation – things outside the control of test constructors and professors. If a respondent got drunk the night before, or is feeling rebellious and wants to undermine the instructor by indicating that he did not learn anything in a course, he can put down incorrect answers for reasons that have nothing to do with any inability to get the answers right. The inference from (x1) to (x2) might also fall down due to flaws in test construction. In fact, this seems more likely in the negative case than in the positive one. To get a whole series of things right by accident on a test is inductively unlikely; to get a whole series of things wrong by accident is much more likely. (One might, for example, misread a crucial word in a set of instructions applying to a large section of the test, or code in answers in the wrong place.) Thus the inference from having a high score to being able to do the questions is in general stronger than the inference from having a low score to not being able to do them. There are fewer ways in which the former can go wrong, instructors and testing professionals have more control over those factors, and the pertinent circumstances are less likely to arise in real life.
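A rough calculation illustrates the asymmetry; the numbers here are invented for illustration and are not drawn from any actual test. On an 80-item test with five options per item, the probability of reaching 60 correct answers (75%) by blind guessing is

$$P(X \geq 60) \;=\; \sum_{k=60}^{80} \binom{80}{k}\,(0.2)^{k}\,(0.8)^{80-k} \;\approx\; 10^{-25},$$

a practical impossibility; whereas a single misread instruction governing a 20-item section can produce 20 wrong answers at a stroke.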
The next stage is closer to the positive case: the inference from (x2) to (x3) depends on the same kinds of factors as that from (2) to (3). In both cases, what is at issue rests on both logic and psychology. How similar are the further cases in a logical sense, and how likely are people to transfer their competence?
As for the last inference, here the negative inference is stronger than the positive one. This is because the contestability of the concept of critical thinking affects it less. To see this, we recall that there is really no disagreement about certain minimal aspects of critical thinking. The disagreement comes when we consider what should be added to these. Should we add having background information on a variety of subjects? Should we add the ability to do fundamental social criticism? Should we add the ability to synthesize diverse accounts? To make apt analogies between apparently disparate areas? To question the social and political implications of standard linguistic usage? To detect aesthetically inappropriate elements? The possibilities seem to extend indefinitely. For anyone who wants a long list, an inference from (3) to (4) will seem doubtful. The term ‘critical thinker’ is to some extent honorific, and we may not wish to allow the title to someone who shows only finite and determinate competence on a short-answer test. On the other hand, on virtually any account of critical thinking, deductive competence, linguistic sensitivity, inductive competence, and the ability to detect fallacies will constitute minimally necessary conditions of critical thinking. If a person lacks these, he or she is not a critical thinker, no matter what else this person can do. In the negative case, once we move beyond the first stage, things go more smoothly than in the positive case. However, the crucial first stage inference is more questionable.
Philosophical issues enter at every stage, and psychological ones at least at the first two. The philosophical issues fall into three main areas: the correctness of keyed answers; the determination of the range of problems represented by the test set; and the extent to which ability to do problems in that range represents critical thinking ability. Critical thinking tests will be inadequate from a philosophical point of view if a substantial portion of keyed answers are not uniquely correct, if the set of problems represented by the test set is too restricted even for the range the test purports to cover, or if that range of sub-areas is insufficiently broad. They will be inadequate from a psychological point of view if they are not constructed so as to preclude such things as getting many right answers by guessing, or being sent into a panic by instructions. More interestingly, they may be psychologically inadequate if the set Q’ is generated from the set Q with insufficient attention to normal competence in matters of transfer. There are, of course, further psychological issues. Since I cannot presume the competence to assess critical thinking tests from a psychological point of view, I shall concentrate on philosophical questions here.12
3. Content Analysis of Several Popular Critical Thinking Tests
3a. The Watson-Glaser Test
The Watson-Glaser Critical Thinking Appraisal is widely used. Circulated by the well-known Psychological Corporation, it has been an established pedagogical and evaluation tool for some decades. Two different tests are available: Form A and Form B. Comments here apply to Form A.13 The test has 80 questions, to be done in 50 minutes, and is divided into five equal sections. The first section deals with inference; the respondent is to decide on ‘the degree of truth or falsity [sic] of the inference’, given some facts. Statements are made; the respondent is to assume that these are true and then say whether a further statement is true, probably true, left undetermined, probably false, or false, given the stated claims.14 The second section involves recognizing assumptions, which are defined in the instructions as something presupposed or taken for granted. From the examples given, it appears that deductively required assumptions and pragmatically required assumptions are both included. The third section is about deductive inference; statements are made and respondents are to say whether other statements follow necessarily from them. The fourth section, though called interpretation, appears to overlap considerably with the first. The respondent is given some statements and asked to say what ‘follows logically beyond a reasonable doubt from the information given’; thus deductive and (presumably) strong nondeductive inference are involved. The only difference between this section and the first is that here respondents give a ‘yes’ or ‘no’ answer; thus the fourth section calls for fewer discriminations than the first. The last section is about the evaluation of arguments. Here again, it is inference, or ‘reasons for and against’, that is in question. Respondents are to ‘regard each argument as true’; that is, they are not to question the merits of arguments on the grounds that false or unlikely claims are contained within them. The point is to determine whether, if true, the claims made in the argument would provide strong or weak reasons for a further claim.
The wording of some of these instructions would make any philosophically educated person cringe.15 The test is full of logical horrors such as ‘Examine each inference separately and make a decision as to its degree of truth or falsity’; ‘For an argument to be strong, it must be both important and directly related to the question’; ‘for the purposes of this test, you are to regard each argument as true’; ‘Try not to let your personal attitude toward the question influence your evaluation of the argument, since each argument is to be regarded as true’; and ‘When the word ‘should’ is used as the first word in any of the following questions, its meaning is, ‘would the proposed action promote the general welfare of the people of the United States?”.16
One might defend such wording on the grounds that the test is intended for use by people who are not philosophers and are not philosophically educated.17 One might urge that speaking of the truth of arguments, of degrees of truth, of inferences as being conclusions, and so on will not be misleading. However, there are several problems with this defense of the test. First, it is not necessary to employ such expressions in order to communicate with non-philosophers. Instructions in some other tests do not employ such expressions, and they are perfectly comprehensible nevertheless. Second, the serviceability of these logical innovations is in question. Third, the instructions will be confusing to anyone who has studied logic or theory of knowledge, even at quite an elementary level. Since the test is supposed to be adequate for college students generally, this is a mark against it. The direction to interpret ‘should’ in terms of the interests of the United States indicates an ethnocentrism which is entirely contrary to genuine critical thinking within the United States, and which makes the test quite unsuitable for use outside the United States.
There are problems that are approachable in the short-answer format but that the Watson-Glaser test rather mysteriously omits altogether. These include at least the following: reasoning by and about analogy; fallacies, both formal and informal; judging credibility of sources; definitions in context; sensitivity to ambiguity, vagueness, and emotionally loaded language; reasoning about and to explanations; causal reasoning and empirical confirmation in experimental contexts. This is a substantial list.
The Watson-Glaser test seems especially narrow when we consider these omissions in the context of the very considerable duplication that exists between sections. Sections 1, 3, and 4 are extremely similar. Section 3 restricts itself to deductive inference from given statements; sections 1 and 4 include deductive inference and strong probabilistic inference as well. The different titles for the sections, ‘Inference’, ‘Deduction’, and ‘Interpretation’, disguise the fact that very similar questions appear in each.
Further problems arise when we come to consider specific questions and keyed answers. Of the 80 questions, eleven are, in my judgment, questionable. Some illustrations are discussed below.
Question 15. (From the ‘Inference’ section: respondents are to say whether, given the initial statements, the further statement is true, probably true, insufficiently determined by the data, probably false, or false.)
‘Some time ago a crowd gathered in Middletown to hear the new president of the local Chamber of Commerce speak. The president said, ‘I am not now asking, but demanding, that labor unions now accept their full share of responsibility for civic improvement and community welfare. I am not asking, but demanding, that they join the Chamber of Commerce.’ The members of the Central Labor Unions who were present applauded enthusiastically. Three months later all the labor unions in Middletown were represented in the Chamber of Commerce. These representatives worked with representatives of other groups on committees, spoke their minds, participated actively in the civic improvement projects, and helped the Chamber reach the goals set in connection with those projects.’
‘Some of the Chamber of Commerce members came to feel that their president had been unwise in asking the union representatives to join the Chamber.’
The keyed answer is that there is insufficient data. However, on best-explanation grounds, the answer ‘probably false’ would seem defensible. The key can be defended on the grounds that we are supposed to consider only the given facts. However, in order to avoid the inference that members of the Chamber of Commerce would be glad to have union participation, we have to suppose that they are not glad to have had their projects and goals so successfully completed. This supposition seems very unreasonable.
Question 16. (Same section, same instructions; same passage as in question 15.)
‘The new president indicated in the speech that the town’s labor unions had not yet accepted their full responsibility for civic improvement.’
The keyed answer is ‘true’. However, the answer ‘probably true’ seems preferable. The difference hinges on the distinction between entailment and pragmatic, or conversational, implication. The president has demanded that unions now do their share. This strongly suggests, but does not entail, that he thinks they did not previously do their share. We are asked whether the president ‘indicated’ in his speech that they had not yet accepted full responsibility. There is an indeterminacy in ‘indicated’ that compounds the problem here. If we are being asked whether the president said outright that they had not yet accepted full responsibility, the answer is no. Yet if we are asked whether he strongly suggested, implied, or virtually said it, the answer is ‘yes’. Standards for what we are entitled to infer seem looser for question 16 than for question 15.
Question 28. (From the section on assumption; respondents are to say whether an assumption is made – taken for granted or presupposed – by a person who makes a sample statement.)
Statement: ‘I’m traveling to South America. I want to be sure that I do not get typhoid fever, so I shall go to my physician and get vaccinated against typhoid fever before I begin my trip.’
Alleged Assumption: ‘Typhoid fever is more common in South America than it is where I live.’
The keyed answer is that this assumption is not made. We can see how this answer is defensible, for one might believe typhoid to be only as common in South America as in the place where one lives, and yet want the vaccination in any event, for the trip. (This seems unlikely, but it is possible.) However, the instructions do not tell us to restrict ourselves to assumptions that are necessarily made. A natural explanation – the ‘best’ by many standards – of wanting this vaccination would be in terms of the belief that typhoid is more common in South America than at home; hence it would perhaps be defensible, by this type of inductive reasoning, to answer that the assumption is made.
Question 31. (Section on assumptions; same instructions as for question 28.)
Statement: ‘If war is inevitable, we’d better launch a preventive war now while we have the advantage.’
Alleged Assumption: ‘If we fight now, we are more likely to win than we would be if forced to fight later.’
The keyed answer is that the assumption is made. Standards seem to be different here than in the previous two questions. There, a natural assumption was ‘not made’ because an alternative could be envisaged. Here, an alternative is easily envisaged, but the assumption is said to be ‘made’ nonetheless. One might assume that we have an advantage now and we may or may not have one later; this is to say not that we are more likely to win now, but that we have a relatively sure thing now, whereas the future is unknown. This view allows that in the future we might, in fact, be more likely to win, but we would not want to gamble on that prospect now. This assumption could equally well underlie the reasoning; as it is more restrained and attributes to the speaker a more modest claim to future knowledge, it is arguably more charitable in a real case to attribute it to a speaker. Again, the problem would be eliminated if we had been asked to find assumptions that were necessarily made.
Question 78. (From the section on Evaluation of Arguments. One is to assume that ‘arguments are true’, and rate them as strong or weak.)
Issue: Should pupils be excused from public schools to receive religious instructions in their own churches during school hours?
Proposed Argument: ‘Yes: religious instruction would help overcome moral emptiness, weakness, and lack of consideration for other people, all of which appear to be current problems in our nation.’
Respondents are supposed to deem this argument strong. If we grant the claims, they provide a strong reason to have religious instruction, but not a strong reason to provide it by excusing pupils to go to church during school hours.
On the basis of this content analysis, the Watson-Glaser test does not fare well. Its range is narrow, even allowing for the fact that some aspects of critical thinking are not within the scope of short-answer tests. Its instructions are philosophically garbled in an unnecessary and unhelpful way. Within the covered range, there are many contestable items; more than ten per cent of the total items are of this type. If students improve their scores by 15 or 20% from a pre-test to a post-test, little can be inferred, because they may merely be more in tune with the test writers on contestable items on the latter occasion than on the former. If one student surpasses another by 15 or 20% – a fact which would surely be taken as relevant were the tests used for admission decisions – the higher score may reflect little more than luck, or accord with the testers’ background or prejudices. The proportion of contestable items is too large. Furthermore, some problematic aspects of the instructions are so extreme that genuinely critical thinkers might become positively angry with the test and thereby do badly. Willingness to go along with the instruction to interpret the word ‘should’ in terms of the welfare of United States citizens is surely inversely related to any capacity for critical thought and analysis.
3b. The Cornell Level Z Test
The Cornell Level Z test is another test of great interest. It is designed for students at the college level and is the product of years of thought by a well-known logician. It is of special interest, in addition, because of a recent expert study done in the spring of 1983. Test author Robert Ennis invited a number of informal and formal logic teachers to do the test and wrote an analysis of their responses.18 The 1985 version of the test is very similar to the 1983 version, so the exercise apparently did not reveal any serious flaws in the test – or so its author judged from the experts’ responses.
The Cornell Level Z Test consists of 52 questions to be done in 50 minutes. Respondents are advised that the test is ‘to see how clearly and carefully you think’. They are told to avoid wild guesses, though to make shrewd guesses, if they have some good clues bearing on the answer. There are seven sections in the test. The first section tests deductive consequence and contradiction. Most of the reasoning is syllogistic in nature, not propositional or modal. The manual for the tests notes that questions are set in strong value-laden contexts, so that there is an emphasis on being able to reason neutrally with suggestive content. This section contains ten questions. The second section is about faults in reasoning, if we take the test instructions for respondents at face value, and about ‘semantics’, according to the manual. It tests for emotionally loaded language, arguments and claims that trade on ambiguity, and tacit persuasive definitions. There are several false dichotomies and one hasty inductive generalization. The greatest emphasis is on problems due to the use of language; this section contains eleven questions. The third section asks respondents to comment, given a specified evidential situation, on the ‘believability’ of statements. The manual indicates that it is about credibility. ‘Believability’, the term used in instructions, would seem to indicate that respondents should consider both the credibility of people as sources of knowledge and the plausibility, given substantive background information, of the claims they make. This section contains four questions.19
The fourth section of the test deals with inductive confirmation reasoning by describing an experimental situation, and getting respondents to comment on the implications of empirical results for an experimental conclusion. The manual indicates that best-explanation criteria are to apply in judging the items. The section has thirteen questions. The fifth section again has to do with inductive experimental reasoning, but has a somewhat different thrust, in that respondents are to comment on the logical significance of various predictions. The manual indicates that it is about planning experiments. Respondents are to comment on which sort of experiment would have the greatest epistemic usefulness in a described situation. Four questions deal with this aspect. In the sixth section, the focus is primarily on word meaning, how words are implicitly defined on the basis of their use in a given context. Four questions are given.
In the seventh and final section, respondents are to comment on unstated assumptions behind arguments and remarks. The problem is construed in a deductivist way: missing assumptions are those that, when supplied, will make a conclusion follow logically from stated premises, or an explanandum follow logically from stated explanans. There are six questions here.
Clearly, the scope of the Level Z test is considerably broader than that of the Watson-Glaser test. Still, reflecting on its content in a general way, we can see a number of features of argumentation, reasoning, and critical analysis that might have been, but were not, included. There are no questions that call for judgments of relevance. There are none about analogies. There are none that deal with conductive (cumulation-of-factors) arguments, calling for consideration of the significance of pros and cons. Despite the presence of a section said on the test to deal with faulty reasoning, a number of fallacies make no appearance. Straw man, ad hominem, the argument from ignorance, guilt by association, such deductive fallacies as denying the antecedent, and many others are entirely omitted. There is no section addressed to the interpretation of discourse, attempting to elicit respondents’ abilities to distinguish what was said from what was not said. (Missing assumptions could have been treated in this context.) No real passages are used; all material has been invented for the test, presumably to ensure maximum clarity and neutrality. No section tests the ability to subsume cases under stated principles. Within the deductive section, there is no material pertaining to propositional or modal logic; only class relations are tested. Nor is there any testing of a general sense of argument structure – whether one or two conclusions are drawn, whether there are subarguments, or whether premises are linked or bear separately on the conclusion.20
If we look at the content of the test, it seems to have been constructed on the basis of a broadly positivist theory of argument. This should not be surprising, in the light of the prominence of that particular theory, especially among philosophers. Inductive reasoning is included, despite its greater difficulties for the tester, and is apparently construed as involving confirmation in experimental contexts and credibility reasoning pertaining to sources of information. There are many questions that call for sensitivity to language – ambiguity, tricky use of definitions, and meaning in context. There are questions on deductive relations, primarily consequences, consistency, and inconsistency. Nevertheless, independent scholarly thought, practical decision-making, judging actual arguments, and participating competently in debate surely require judgments of relevance and a good sense of classification so that stated principles can be properly applied. Legal and moral reasoning depend heavily on analogies, as does the use of models in scientific reasoning. In addition, analogy is a powerful rhetorical device, and the basis of many deceptive arguments. A sense of underlying argument structure can be very important for critical evaluation, as when one premise in an argument is false, and we need to look to see how this affects the cogency of the rest of the argument. Any ability to do this is untested, though it would seem in principle amenable to inclusion on a test in this format. It seems strange to include a section on faulty reasoning and yet pay so little attention to fallacies. There is much attention to abuse of language in the section on faulty reasoning, while fallacies as traditionally construed scarcely appear.21
So far as the questions posed and the keyed answers are concerned, despite great care and a willingness to elicit expert criticism, contestable items remain on the Cornell Level Z (1985). Different analysts would no doubt differ on this matter – a fact that is itself of some significance. My own scrutiny of the test and keyed answers left me with concerns about seven questions, nearly fifteen per cent of the total.22 Here are details for some examples.
Question 12. (In the section on faulty reasoning or, as described in the manual, ‘semantics’.)
DOBERT: I guess you know that to put chlorine in the water is to threaten the health of everyone of Galltown’s citizens, and that, you’ll admit, is bad.
ALGAN: What right do you have to say that our health will be threatened?
DOBERT: ‘Healthy living’ may be defined as living according to nature. Now we don’t find chlorine added to water in nature. Therefore, everyone’s health would be threatened if chlorine were added.
Pick the one best reason why some of this thinking is faulty.
A. Dobert is using emotional language that doesn’t help to make his argument reasonable.
B. Dobert’s thinking is in error.
C. Dobert is using a word in two different ways.
The keyed answer is C. Clearly this makes sense, because Dobert’s first comment would appear to use ‘health’ in the ordinary sense as meaning absence of disease, and his response stipulates a special meaning. However, A is also a reasonable answer, because ‘nature’, which figures in the stipulative definition, is an emotionally positive term. Also, we merely infer that Dobert’s first use of ‘health’ is our ordinary one. Thus, our evidence that a word is used in two different senses could be said to be less convincing than our evidence that Dobert uses emotional language that doesn’t help make his argument ‘reasonable’.
Question 14. (Again, from the section on faulty reasoning or ‘semantics’.)
DOBERT: I understand that you look on this thing as an experiment. I’m sure that the citizens of Gallton don’t want to be guinea pigs in this matter.
ALGAN: This is a demonstration. Nobody ought to object to a demonstration, since the purpose of a demonstration is not to find out something, but rather to show us something that is already known. An additional value of this demonstration of chlorination is that its purpose is also to test for the long-range effects of chlorination on the human body. This objective of the demonstration is a worthy one.
Pick the one best reason why some of this thinking is faulty.
A. Algan has not shown that knowing the long-range effects of chlorination is a worthy objective.
B. Algan is using a word in two ways.
C. There is an error in thinking in this part.
Here B, which is the keyed answer, does in fact seem correct. The problem is that we might be able to defend A as an answer, and whether we can depends in part on how broadly the expression ‘faulty thinking’ is interpreted. Algan simply asserts that finding out the long-range effects of chlorination on the human body is worthwhile; he offers no evidence for that claim. Analyzing his argument, we could point out his inconsistent and stipulative use of the word ‘demonstration’. We could also point out that the claim about long-range effects is rather problematic; it could even be branded as question-begging in the broader context. If thinking involves mainly reasoning, then answer A is out of order. However, the concept of thinking underlying the section seems to be a broader one. In any case, respondents could well be confused by this potential ambiguity.
Question 32. (From section four, on judging inductive inferences to conclusions.) The background information given concerns an experiment in which some ducklings, of three different types, were fed a regular diet, and some were fed a regular diet plus cabbage worms. In the latter group, seventeen were dead, four were ill, and one was healthy at the end of a week; in the former group, one was dead, three were ill, and eighteen were healthy. The question is how added information would affect the conclusion drawn, which was that cabbage worms are poisonous to ducklings. Respondents are to choose between:
A. If true, this information supports the conclusion.
B. If true, this information goes against the conclusion.
C. This information does neither.
The added information is: ‘It is discovered that during the original experiment the regular-fed ducklings had less sunlight than the worm-fed ducklings. It is not known whether or not the difference in amount of sunshine would have an effect on the health of ducklings.’
The key states that this additional information would count against the conclusion, because ‘the differences in sunlight might explain the difference in results.’ The advice to respondents that it is not known whether the difference in sunlight affects the health of ducklings is probably inserted to keep respondents from appealing to the common belief that sunlight is generally healthy. Using that belief, we would arrive at C as the answer, or possibly even at A. (We might reason that since sunlight makes for health, and the worm-fed ducklings died more often even with more sunlight, the good effects of the sun were countered by a bad effect – which must have been that of the worms.) The keyed answer seems correct, then, provided that our background beliefs about sunlight and health are ignored. But that may be difficult to do, and whether it is logically correct to ignore such fundamental background beliefs in the context of inductive reasoning about health is questionable.
Question 45. (From section 6, on definition and assumption identification.)
‘What are you making with that dough?’, asked Mary’s father.
‘Dough!’, exclaimed Mary. ‘Did you ever see anything made with yeast that was baked immediately after it was mixed? Naturally not,’ she said as she put the mixture into the oven immediately after mixing it. ‘Therefore, it’s not dough.’
Of the following, which is the best way to state Mary’s notion of dough?
A. Dough is a mixture of flour and other ingredients, including yeast.
B. Dough is a mixture of flour and other ingredients, not baked immediately.
C. Dough is a mixture of flour and other ingredients, often baked in an oven.
The keyed answer is A. The explanation in the key reads, ‘Mary’s reasoning is that the mixture is baked immediately, so it is not made with yeast; so it is not dough. The selected definition fills the gap between the subconclusion and the final conclusion.’
The problem here is that Mary focuses on two things: the inclusion of yeast, and the idea that when yeast is included, if the resulting mixture is put into the oven immediately, it is not dough. She is really defining what is not dough rather than what is dough, saying that if x contains yeast and x is baked immediately, then x is not dough. (If Y and I, then not D.) Contraposing, we get a definition of dough, perhaps; at least we get a necessary condition: if it is dough, then it does not both contain yeast and go into the oven immediately. (If D, then either not Y or not I.) No offered answer says this. The ‘right’ answer requires that we see Mary as offering a two-stage argument, but there is nothing in the passage quoted to indicate that she is doing this. The question is extremely confusing. The matter is made worse for those familiar with baking, in that the ordinary-language sense of ‘dough’ is much looser than Mary’s. (No doubt this is why the emphasis is put on Mary’s sense of the word in the statement of the question.)
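In the propositional shorthand just used (Y: contains yeast; I: baked immediately; D: is dough), the contraposition runs

$$(Y \land I) \rightarrow \lnot D \quad\equiv\quad D \rightarrow \lnot(Y \land I) \quad\equiv\quad D \rightarrow (\lnot Y \lor \lnot I),$$

a disjunctive necessary condition that none of the three offered definitions expresses.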
The Cornell Level Z test seems superior to the Watson-Glaser test in a number of respects. It has a wider range, including inductive-confirmation and explanatory reasoning, and semantic matters. Instructions are stated clearly – without either deviating from or obscurely trading upon standard logical terminology. The test has been open to expert scrutiny, and its author is sensitive to academic disputes that bear on its content. Nevertheless, there are important criticisms to be made, even from an internal point of view. Contestable items remain, despite great care. Important aspects of reasoning and argument evaluation essential to critical thinking are not covered by the test, even when such aspects would appear to be manageable within the restrictions of the format.
4. Concluding Comments
The two most widely used critical thinking tests would seem rather imperfect, then, although the Cornell Level Z test seems far better than the Watson-Glaser test with regard to breadth and to the philosophical cogency of its instructions. The number of contestable items in either case is significant when we consider the purposes for which these tests are used: observing improvement or non-improvement in a class as a result of teaching and comparing it with a control group, and comparing individuals’ tested critical thinking abilities for admission or employment decisions. In either case, a difference of 15% in score would surely be seen as very significant. And yet both tests have close to 15% contestable items and hence a possible variation in this range that is due not to critical thinking ability but to something else.
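The arithmetic behind that figure, from the counts reported in section three: eleven questionable items out of 80 on the Watson-Glaser test and seven out of 52 on the Cornell Level Z give

$$\frac{11}{80} = 13.75\% \qquad\text{and}\qquad \frac{7}{52} \approx 13.5\%,$$

proportions of the same order as the 15 or 20% score differences that would standardly be treated as significant.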
Referring back to the inference levels discussed in section two above, the contestability of some items will affect not only the first inference – from the coincidence of a respondent’s answer with the keyed answer to his or her being able to get that question right – but also the second, from a respondent’s getting some question right according to the key to his being able to answer a related set of questions correctly. For if the keyed answer is not uniquely right (but is instead either wrong or merely one defensible answer among several, as has been argued for some items on these tests), then resolving the question as indicated in the key is a poor indicator that one would resolve a formally (and, where relevant, psychologically) similar instance in the same way. If test constructors are not aware of the ambiguity – which presumably is the case – they will not see a particular response to that ambiguity as part of the explanation for respondents’ answering as they do. This being so, they will not take it into account in generating set Q’ from set Q; Q’ will be generated in an unreliable way.
As was noted, many aspects of critical thinking are not covered on these two tests. This is true even if we consider critical thinking in a fairly narrow framework, restricting ourselves to the context of articulated criticism of inferences and arguments. It is still more obvious, of course, if we adopt a broader conception of critical thinking. Many aspects still within the argument-analysis area, such as various fallacies, questions regarding relevance, issues of discourse interpretation, and a sense of argument structure, seem amenable to mechanical testing. And yet, if such topics as relevance and analogy had been included, the number of contestable items would probably have increased.
There seems to be a dilemma here. Contestable items might possibly be eliminated from critical thinking tests altogether, but if they were, this would surely come at the cost of great restriction in the scope of these tests. Already such tests necessitate restriction to short illustrations, to invented material, to fairly straightforward wording, to domains where background knowledge and value judgments do not crucially affect judgment, to issues that can be quickly summed up, and so on. It is likely that some omissions are due not to test authors’ regarding the material as irrelevant to the nature of argument and critical thinking, but to the difficulty or impossibility of constructing noncontestable short-answer questions in those areas.
It is instructive, at this point, to compare the scope of the Cornell Level Z Test with a recent curriculum statement by Robert Ennis. Ennis provides an excellent and extensive list of abilities that it would be desirable to cultivate in a course on critical thinking. Examining this list, and comparing it closely with his own test, I found many aspects not covered by the test. These include the ability to identify and formulate questions; seeing similarities and differences in argument; identifying and handling irrelevance; summarizing material; handling tables and graphs; arriving at value judgments after considering alternatives; weighing, balancing and reaching a decision in a value context; defining a problem when one has to decide on an action; and presenting a position orally or in writing, using appropriate logical and rhetorical strategies. These and a number of other aspects Ennis included are not covered on his test. Many could not be, due to the restrictions dictated by the format or to the great unlikelihood of avoiding contestable items, or both.
The range of the Watson-Glaser and Cornell Level Z tests seems narrow. It is narrow in comparison with Ennis’ admirable curriculum proposal, and narrower still when we consider the broader psychological and socio-political aspects emphasized by Richard Paul. Given this, the last stage of inference – from being able to do questions of the type covered on the test to having critical thinking ability in some sense general and wide enough to be of genuine interest and importance – will be questionable. As noted earlier, it will be more questionable in the affirmative case than in the negative. If someone does well on a test with narrow range, that will not give good evidence of his or her having general critical thinking ability; whereas if he or she does poorly on such a test, it is more likely that the ability really is lacking – granting, of course, the earlier stages of inference.
The dilemma that arises here is a perfectly obvious one. By greatly narrowing the scope of tests, we might be able to eliminate contestable items, thereby strengthening the inference that what respondents do on the test reliably indicates their ability to do an appropriately generated set of related problems (problems of the same type). However, by narrowing the range, we weaken the inference from their being able to do these things to their being critical thinkers. Too many aspects central to critical thinking will have been omitted from the test. On the other hand, the scope of tests could be greatly broadened, using something like Ennis’ proposed curriculum – perhaps with inspiring amendments from Richard Paul and others – as a base. If this were done, contestable items would undoubtedly increase, and the earlier inferences would correspondingly weaken.
It seems to be a no-win situation. This tension is not merely the result we would expect on the basis of armchair analysis, but is corroborated by what we found in the Cornell Level Z Test. The test has been carefully constructed by conscientious experts and is relatively limited in scope; yet it still retains approximately 15% contestable items.
A possible response is that this dilemma is apparent rather than real. Until people have tried hard to construct tests based on a broad range of items with zero or very low contestability, we are not in a position to conclude for certain that such a thing cannot be done. One might argue that those who criticize existing tests should, perhaps, be working hard to improve them or to invent new ones.
This response raises broader questions about the role of machine-gradable short-answer tests and the interests they serve. If we regard their role as benign or neutral, and the interests they serve as legitimate, we may think that it is important to try to broaden and strengthen these tests. If, on the other hand, we see the tests as an expression of a general desire to sum up personal differences in a quantitative fashion, in the interests of apparently authoritative bureaucratic decision-making, we will have a quite different attitude. Why do we have critical thinking tests? How much do we need them? How important is it to make them accurate?
If we wish to test for critical thinking, but are willing to relinquish the requirement of short-answer so-called objective tests, we could try to test ability by essays or interviews, or a combination of these with short-answer tests. Such procedures would also, of course, have their flaws. For instance, interviewers or markers might differ in their skill or in their interpretive assumptions, leading to unreliable results. A major problem with such procedures would be that of cost. The 50-minute short-answer test, markable by computer, has obvious practical advantages. Society so often demands quantifiable results, obtainable cheaply, in a relatively short time. Lobbying for courses, programs, fellowships, and grants is facilitated if one can appeal to results presented in this way, on the basis of ‘objective’ tests. Nevertheless, we have seen fundamental problems with these tests.
Is there something rather absurd about a society that seeks numbers based on short pencil-and-paper encounters to represent such fundamental, profound, and wide-ranging human abilities as critical thinking? Is there perhaps also something absurd about philosophers claiming expertise in critical thinking helping to pursue this questionable ambition?
If critical thinking tests have serious theoretical liabilities, philosophers and psychologists should not be relying on them for significant group or individual decisions. Nor should they be encouraging other people to do so. The supposed need for such tests comes from interests of bureaucratic efficiency and academic-political lobbying, not from truly educational, philosophical, or critical interests. Furthermore, as argued in detail here, one is caught in a trade-off situation with these tests.
Perhaps the way out of this dilemma is to refuse the task and use what critical thinking abilities we have to resist those forces in society that demand a single number, obtainable after a 50-minute computer-scorable examination, to represent critical thinking ability.
Notes
I appreciate the help I have had from Robert Ennis, Matthew Lipman, Stephen Norris, and John McPeck in obtaining materials on which this discussion is based.
1. These are the most widely used among college level students. Several other common tests are used mainly for younger students, from fourth or fifth grade to high school level. One is more specialized, focusing solely on students’ abilities to appraise authoritative-sounding observational statements. Several tests, including the Ennis essay-style critical thinking test, were reported to me to be out of print by their publishers in March 1985. (In the case of the Ennis tests, this later turned out not to be correct.)
2. See David Owen, ‘1983: The Last Days of ETS’, Harper’s, May 1983, pp. 21-37.
3. The links between the development of early IQ tests and nationalistic, racist, and sexist attitudes are vividly described and documented in S.J. Gould’s The Mismeasure of Man (New York: Norton, 1981).
4. The papers presented at the symposium were both subsequently published in Informal Logic. See John E. McPeck, ‘The Evaluation of CT Programs: Dangers and Dogmas’, Informal Logic, Vol. VI, no. 2 (July 1984), and Robert H. Ennis, ‘Problems in Testing IL/CT/Reasoning Ability’, Informal Logic, Vol. VI, no. 1 (January 1984).
5. Compare ‘Critical Thinking in the Armchair, the Classroom, and the Lab’.
6. Minimum requirements may be amenable, though fully adequate requirements are not.
7. Compare Don Locke, ‘Natural Powers and Human Abilities’, Proceedings of the Aristotelian Society, Vol. 74 (1973-4), pp. 171-187, and ‘The “Can” of Being Able’, Philosophia (Israel), Vol. 6 (March 1976), pp. 1-20. See also W.E. Cooper, ‘On the Nature of Ability’, Philosophical Papers, Vol. 3 (October 1974), pp. 90-98.
8. Teachers can probably avoid teaching to the test by being alert and conscientious. Conspicuous patterning such as that illustrated in the text can easily be avoided by conscientious test constructors. More subtle cues may be harder to avoid, however. In his caustic article on the ETS, David Owen reports that colleagues at Harper’s who were accustomed to taking SAT tests were able to answer a substantial number of questions about the interpretation of a passage correctly without ever having read that passage! Presumably they did so due to having been sensitized to testers’ background assumptions and style of questioning.
9. It is for this reason that our judgment that keyed answers are uniquely correct is absolutely crucial. One may quote statistics until one is blue in the face, but any argument for the validity of a critical thinking test is otiose unless this initial logico-philosophical condition is met.
10. The point is emphasized in Stephen P. Norris, ‘The Choice of Standard Conditions in Defining Critical Thinking Competence’, Educational Theory, Vol. 35, no. 1 (Winter 1985), pp. 97-107. Norris refers to recent work on deductive logic competence – which presumably would be more neutral than inductive, analogy, or conductive argument ability. This work indicates that linguistic factors, content and context factors, and nonlogical biases can mean that competence in basic areas of deductive inference does not transfer as logicians and philosophers have traditionally expected. One may be able to handle conditionals when they are about a familiar subject, but not when they are about an unfamiliar one, for instance. The locus classicus in this area is Jonathan St. B. T. Evans, The Psychology of Deductive Reasoning (London: Routledge and Kegan Paul, 1982).
11. Robert Ennis supplies an excellent list in ‘Goals for a Critical Thinking/Reasoning Curriculum’ (January 1985; private circulation). He includes focusing on a question, analyzing arguments, asking and answering questions of clarification and challenge, basic support, inference, strategy and tactics, and dispositions. Under each heading, many useful specifications are indicated.
12. These are in any case centrally relevant and necessary, though not sufficient, for the validation of any critical thinking test.
13. If Forms A and B are relevantly similar, the same problems will appear in B. If not, other difficulties arise, because the two forms are designed so as to be usable in pre- and post-testing.
14. The theory of argument presupposed here appears to be broadly positivistic, as all forms of support that are nonconclusive are regarded as rendering the conclusion probable.
15. A point emphasized by McPeck in his discussion in Critical Thinking and Education (Oxford: Martin Robertson, 1981).
16. Noted by McPeck, loc. cit.
17. Robert Ennis has charitably offered this defense of the Watson-Glaser test.
18. A summary of findings is given in C. T. News (newsletter circulated by the Philosophy Department, Sacramento, California), Vol. 2, no. 3 (November 1983). Agreement on the higher level test was about 85% on induction items; it was lower for these items on the less advanced test. Notably, the more advanced respondents were in comparison to the intended test level, the more they tended to contest keyed answers. This suggests that especially competent students might be, in effect, disadvantaged on such tests.
19. The manual for the Level Z test indicates that it is the reliability of the person that is intended to be relevant here. Yet ‘believability’, as used, seems to refer to the probability of the statement itself as well – implying that one would at least partially base one’s judgment on one’s sense of how likely it was that the statement was true, given background knowledge. Though I urged this point as a respondent to the expert survey, it was not taken up. It might be less misleading to speak of the believability of people, as Ennis did when he wrote ‘The Believability of People’, Educational Forum (March 1974), pp. 347-354. In that article, however, he quickly moves from this locution back to speaking of the believability of a statement, qua the statement of the person described.
20. In fact, these aspects should be relatively easy to test in the required format, and the importance of such skills is widely recognized.
21. A close comparison with Ennis’ own curriculum – see note 11 – indicates that many fallacies he himself thinks should be taught do not appear on the Cornell Level Z Test. These include (at least) slippery slope, bandwagon, ad hominem, post hoc, affirming the consequent, denying the antecedent, straw person, faulty argument from analogy, appeal to tradition, and irrelevance.
22. At many points when analyzing natural argumentation, we have seen that legitimate alternative interpretations and appraisals are possible. Given this, it would be surprising if material on critical thinking tests could avoid all such issues. Examination of these tests in this chapter indicates just what we should expect: they do not.