
Valdosta State University
Department of Psychology
PSY 310 Educational Psychology
Instructor: John H. Hummel, Ph.D.

Study Questions/Review
Student Evaluations: Tests and Grades

Chapter 13: Student Evaluations: Tests and Grades

1. Define the following: student evaluation; formative and summative evaluations; norm-referenced tests; criterion-referenced; distracters; absolute grading standards; mastery grading; continuous grading.

2. List and describe the functions of tests and grades, making certain your answer incorporates specific reasons why we evaluate students (note: each reason has several subpoints; incorporate these subpoints into your answer). See the syllabus and text to answer this question.

A. Explain what WYMIWYG stands for, and why it is important. What does the acronym 3C/ROD stand for?

B. Why are both these concepts important to evaluation and teaching? How are they related to the goals of educational psychology? (HINT: Unit one.)

3. List the three reasons why course grades are usually considered inadequate as incentives.

4. List and describe Gronlund's six principles of achievement testing.

5. Test items can be: non-objective (performance, essay and supply short-answer), or objective (list-and-describe, fill-in-the-blank, MC, T/F, and matching).

(a) Describe two differences between essay and short-answers.

(b) Describe two advantages for using non-objective items over objective ones, and two advantages for using objective items over non-objective ones.

6. Describe one difference between supply and select questions. Describe the characteristics of each of the following types of test items, list the advantages/disadvantages of each, and summarize any "do's and don'ts" concerning their use or construction: multiple-choice; True-False; fill-in-the-blank; matching; short-answer essay (note: this is my preferred method for all appropriate levels; in your description, be sure to include how to impartially grade these types of items). Given that an objective test can measure students' knowledge (and all other levels of Bloom's taxonomy), why is it important to require students to take non-objective tests as well as objective ones? (This will require some thought; there are several good reasons and you should include as many as possible in your essay. Is that a cue or what?)

7. Why must there be congruence between test items and behavioral objectives?

(a) Why are good behavioral objectives important to assessment?

(b) Describe the relationship between the time spent in class on a topic and, ideally, its relative weighting on an assessment.

8. Primary, middle, and secondary teachers view the purpose of grading and evaluating differently. Describe how each defines the proper role of evaluation.

9. Why is it important to clearly specify how tests, etc. will be graded and how grades will be assigned (for a test, etc., and for a grading period and year)?

10. Relative grading standards, or grading on the curve, has advantages and disadvantages. List the advantages/disadvantages and explain why few teachers use pure grading on the curve.

11. Define test and evaluate it (describe its components) along its dimensions (HINT: SysMSBESaN--"Sismissbesan").

12. List and describe the four levels of measurement (nominal, ordinal, interval, and ratio).

13. (a) Define norm-referencing. (b) Are teacher-made tests examples of norm or criterion referencing? (c) Norm referencing and criterion referencing are associated with which levels of measurement? (d) Describe the difference(s) between immediate and distant norms.

14. Compare and contrast norm-referenced and criterion-referenced tests on at least three dimensions.

15. Describe the concept of "true score v. performance" (performance is the same as a student's observed score on a test).

16. Describe two ways a teacher can improve a test's reliability/validity.

17. Define the general concepts of validity and reliability.

18. Write a reasonable, logical 100-200 word essay that either supports or opposes the use of frequent testing by teachers.

19. Describe the rule of thumb one uses to determine if one's grading standard is criterion-referenced or relative (e.g., what is the distinction between an absolute and relative standard?).

PSY 310 Review of Chapter 13: Student Evaluations: Grades & Tests

What You Measure Is What You Get (i.e., WYMIWYG) is in many ways a variation of the self-fulfilling prophecy that is associated with motivation; if you expect low (or realistically high) performance from students, that's what you get. {Note: A copy of the original WYMIWYG is attached.} If teachers want their students to master/learn content at the higher levels of the Bloom et al. (1956) taxonomy, then they must (a) have objectives that reflect these levels, (b) teach the content at these levels, and (c) assess at these higher levels. The vast majority of formative (and summative?) assessments tap only the first two levels of Bloom's taxonomy (i.e., knowledge and comprehension). For students to develop appropriate critical thinking skills, they must have many experiences, including formal assessments, at the application-through-evaluation levels. Related to WYMIWYG is the acronym 3C/ROD by Bushell and Baer (1994). It stands for Close, Constant Contact with Relevant Outcome Data and means that educators should identify measurable/observable outcomes, identify how to teach these, and test the students' skills as they teach them.

1. Very few teachers (or students) like to test and assign grades. However, student progress and performance must be evaluated. Our evaluations of students let us know how the students are doing (Are they learning the material? Do they need extra help and/or remediation? Are their study techniques effective?) and whether our methods of presenting information need improvement. They also serve as an important source of communication between the home and school, and as a selection/placement mechanism for programs, etc.

2. The reasons why we evaluate students include the following:

a. Evaluations as incentives. Tests and grades can motivate students if they are important to students; are fair and objective (e.g., related to the student's performance and the behavioral objectives); and occur frequently. Frequent short quizzes are better than longer tests because students score higher on them (they cover a smaller amount of material) and because there is less delay between taking the test and receiving feedback.

b. Evaluation as feedback to students. Grades/papers with comments are more helpful than grades alone; the comments can serve as corrective feedback to identify strengths and weaknesses and how to improve.

c. Feedback to teachers. Lets them know the degree to which their instruction is effective.

d. Feedback to parents. Lets them know how the student is doing and whether they need to intervene at home to improve student performance (if necessary).

e. Information for selection purposes. Schools, in a sense, are constantly sorting students (hopefully ONLY on the basis of ability, but this, of course, is not truly possible). Which programs/courses one can take; which schools, occupations, etc., one may enter are affected by the student's record of achievement/evaluation. Because this has such significant import to the student's life, it is important that tests, grades/evaluations be as objective, reliable and valid as possible.

3. Types of assessments.

A test is defined as a systematic procedure used to measure a sample of a student's behavior/performance that is evaluated against standards and norms.

Systematic procedure: Teachers need to carefully select (by following a set protocol) which content to test students on. One's systematic protocol should ideally cover smaller amounts of content (resulting in more frequent assessments), reflect only those objectives actually taught (and proportionately with respect to engaged time), and cover all levels of Bloom's taxonomy (i.e., WYMIWYG: if students never have to apply, analyze, synthesize, or evaluate, they won't acquire these skills!).

Measure a sample: Because one cannot measure all content taught, the teacher should ensure that items on the test reflect a fair representation of the content. The students' performance on the content covered on the test should be an accurate estimate of how they would have performed on content NOT covered on the test (i.e., the relationship between the observed score and the true score).

Behavior/performance: The knowledge students learn cannot be directly measured. Instead, teachers have to assess their answers (performance) on the test items. Given that it is a fair test, one can confidently assume (if we test often) that the students' scores on the actual items given accurately reflect what they have learned.

Evaluated against standards and norms: The quality of the student's performance on an assessment is judged/scored/graded either against the performance of other students who have taken the assessment, or against absolute standards (e.g., a key of correct answers).

a. Formative tests are used to pinpoint strengths and weaknesses, while summative tests are used to formally assess the student's knowledge. Formative tests are usually more frequent than summative ones. Formative tests CAN be viewed as teacher-made and summative ones as standardized. Formative tests can also be construed as "practice" teacher-made tests while summative tests are "real" (i.e., informal vs. formal). Finally, depending on which approach one adopts, formative tests can be viewed as teacher-made formal assessments (e.g., quizzes) that cover less content than do the more comprehensive summative tests (e.g., exams).

b. Norm-referenced tests compare students' performance with one another. Criterion-referenced ones assess a student's knowledge/performance against an absolute standard. Formative evaluations are invariably criterion-referenced.

4. Grades (every 6 weeks, etc.) are not generally good incentives because they are only indirectly tied to specific student behaviors, and because they occur too infrequently. This is why frequent tests are valuable: they help to keep the student on track; provide more immediate feedback (hopefully corrective); and provide a good predictor of what the student's reported grade will be so there are no surprises when report cards go home. [Every teacher should clearly specify how 6-week and course grades will be computed at the beginning of the year and at the beginning of each new reporting period (especially if there is a change occurring).]

a. Often, different teachers teach the same course. Because each teacher emphasizes different points, may have different behavioral objectives, etc., many departments develop department-wide summative exams. This helps ensure quality control across teachers but does necessitate that the teachers agree on common topics and BOs upon which the summative tests are based.

5. Achievement tests measure what one has been taught. Many teachers believe they measure what students have learned; but I am not certain this is the case. Gronlund (1982) has six principles of achievement testing:

a. There must be a high level of congruence between what has been taught (BOs) and the test items.

b. The test should be a representative sample of what was taught: if you spent 60% of the allocated/engaged time on topic A, 30% on topic B, and only 10% on topic C, 30% of the test should be on topic B, etc.

c. The test should be constructed with "appropriate" test items. For example, knowledge of facts, etc. can be assessed via fill-in-the-blank, T/F, or MC items. Still, overuse of/dependence on objective items denies students the opportunity to practice/refine their writing, and may lead to superficial analysis/evaluation/synthesis on their part. I believe that a good achievement test should reflect a balance of item types if one insists on using any objective items (students probably should take some objective tests, for example, probes and some of the formative evaluations; I believe summative tests should primarily be a sample of the student's knowledge as assessed through their writing). Also, research shows that when students "expect" tests to be objective they "prepare" differently than if they expect a supply test (or a combination of select and supply). This research shows that students who prepare for an exam that they expect to be essay will do as well on an objective test as they would have if the essay test had been given, but the reverse is not true: if students prepare for an objective test and are given an essay one, they do more poorly on it than if they had been given an objective test. Life is not true/false or multiple choice!

d. The test should fit the purpose it will be used for. Possible test uses: predicting how the student will do in a course/program; diagnostics (pinpointing strong and weak areas); serving as either formative or summative evaluation of what the student has learned (remember, summative tests cover broader areas than formative ones, but both should be linked to the BOs, i.e., what was taught).

e. All tests must be reliable (i.e., consistently measure what they measure), and be interpreted with caution. Teachers can improve the reliability of their tests by making each test longer (which almost requires that you use more objective items than you should) and/or, since "mo is betta," by giving more tests.

No instrument is perfect: even on a good test, a student's true score is somewhere around the point where they scored (on a standardized test, the size of this band is known as the standard error of measurement). This error factor is another good reason to quiz/test frequently over small units of material: error on any one test is diluted by the number of tests given. Validity can be improved in several ways: "mo is betta," so give tests over smaller amounts of content; key each item to a specific objective; make sure each question is worded well and at an appropriate level for the students; and make sure there is proportionality between the weight of items and the amount of time spent in class on the items' objectives.

f. Tests should be used to improve students' learning. Results should be clearly communicated to students and, when necessary, remediation/recycling over the information can be scheduled either by the teacher or the student. If it is believed that the students' performance is, to some extent, a product of how the information was taught, then the lesson plan must be revised and retaught.
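Principle (b) above, proportional sampling, amounts to simple arithmetic and can be sketched in a few lines of Python. This is only an illustration: the function name `allocate_items`, the 20-item test length, and the rounding rule are assumptions made here, not part of Gronlund's principles.

```python
def allocate_items(class_time_share, total_items):
    """Split a test's items across topics in proportion to class time.

    class_time_share maps each topic to the share of engaged time spent on it
    (any consistent unit: percent, minutes, class periods).
    """
    total_time = sum(class_time_share.values())
    # Exact (possibly fractional) number of items each topic deserves.
    exact = {topic: total_items * share / total_time
             for topic, share in class_time_share.items()}
    # Start from the whole-number part of each allocation.
    counts = {topic: int(value) for topic, value in exact.items()}
    # Hand any leftover items to the topics with the largest fractional parts,
    # so the counts always add up to total_items (an assumed rounding rule).
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# The 60% / 30% / 10% example from principle (b), on a hypothetical 20-item test.
blueprint = allocate_items({"A": 60, "B": 30, "C": 10}, total_items=20)
print(blueprint)  # {'A': 12, 'B': 6, 'C': 2}
```

The same blueprint could be expressed in points rather than item counts when items carry different weights.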

6. Developing Test Items. Items should be based on BOs (or BOs based on items; the former is a better strategy). Remember, the test is a sample of what has been taught/learned; it usually won't cover every specific piece of information. In fact, summative evaluations that follow 2 or more units [separate topics] are supposed to be more general than specific, measuring concepts, understanding, etc. Does this sound like it violates some of the instructional principles we discussed? It does to me too; take it with a grain of salt.

a. Types of items. OBJECTIVE/SELECT: Multiple-choice (4 or 5 distracters; give a question/stem and have the student select/recognize the "best" among those presented; allows you to cover a lot of content; can be used to measure all levels of Bloom's taxonomy; good ones are hard to construct, and they, like all objective items, deny the student the opportunity to practice/refine their writing and thinking abilities). Additional information on how to construct them, and on their advantages/disadvantages, is presented in the text. True/False. A variation of MC; students have a higher chance of guessing the correct response. Best used when requiring students to choose between the two sides of a dichotomy. Matching. Also measures recall. Several variations (length of lists, whether items can be used more than once, etc.) make it useful, although some students get more confused with matching than with other types of objective items. The common characteristic of objective items is that there is a limited number of correct answers (usually one!). Not all objective items present a set of answers from which one chooses the most correct; some require the student to provide the correct answer. Among these "objective supply" items are list-and-describe, define, and fill-in-the-blank questions.

NONOBJECTIVE: Short answer and essay (aka supply). Harder to grade (time and fairness). Provides leeway to students that other item forms don't have. Students have to practice their organizational and thinking abilities as well as the mechanics of writing. Should always be graded using a premade key. PERFORMANCE: These exams require students to "do" something with the information they have learned: compose a letter, design a building, etc.

7. Grading: Some flexibility is almost always accorded to the individual teacher. In all grading strategies, the teacher should always use a key. With absolute standards, the student's score is judged/interpreted according to preset standards that don't change; with relative standards, one judges the student's performance against other students' scores (for example, allowing only so many students to earn a B or A); and with mastery grading, one continues working on the material until at least minimal competency is acquired.

In his 1991 (Vol. 2) Behavior Analyst article, "A behavioral perspective on college teaching," Jack Michael describes two types of competition that teachers at all levels of schooling employ: vicious competition and friendly competition. Absolute grading reflects the latter, and normative/relative grading the former.

A. Report cards: These are based on your formal assessments during the reporting period; the relative weighting of the various assessments should be explained to the students at the beginning of the reporting period. Grades received should not be a surprise to the student; if they are, you are not communicating well enough with them on assessments (feedback). Grades are privileged information, not to be shared or discussed with anyone other than the student, the parents, or other professionals with a need to know (unless prior written approval is given by the parent or by the student, if of majority).
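The difference between the absolute and relative standards described in item 7 can be made concrete with a short Python sketch. Everything specific in it is an assumption chosen for illustration: the 90/80/70/60 cutoffs, the equal 20% quotas, and the function and student names are hypothetical and do not come from the text or from any particular grading policy.

```python
def absolute_grade(score, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    """Absolute standard: compare one score with preset cutoffs (assumed values)."""
    for minimum, letter in cutoffs:
        if score >= minimum:
            return letter
    return "F"


def curved_grades(scores, quotas=(("A", 20), ("B", 20), ("C", 20), ("D", 20), ("F", 20))):
    """Relative standard: rank the class and hand out letters by quota.

    quotas gives the percent of the class allowed each letter (assumed values).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
    grades, start, cumulative = {}, 0, 0
    for letter, percent in quotas:
        cumulative += percent
        end = round(cumulative * len(ranked) / 100)
        for name in ranked[start:end]:
            grades[name] = letter
        start = end
    # Anyone left over because of rounding receives the last letter.
    for name in ranked[start:]:
        grades[name] = quotas[-1][0]
    return grades


# A hypothetical class in which every student scored 90 or above.
scores = {"Ann": 95, "Ben": 93, "Cal": 92, "Dee": 91, "Eli": 90}
print({name: absolute_grade(s) for name, s in scores.items()})  # every student earns an A
print(curved_grades(scores))  # the same scores are spread from A down to F
```

Under the absolute standard a grade depends only on the student's own score; under the curve it depends on where that score falls relative to classmates, so identical performance can earn different letters in different classes.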

The following is a personal communication (Internet) that relates to grading practices (and instruction to a lesser degree). I hope you too find it fascinating.

From: "John W. Eshleman" <73767.1466@compuserve.com>

Date: Sat, 30 Mar 1996 18:26:58 EST

To: Multiple recipients of list BEHAV-AN <BEHAV-AN@VM1.NODAK.EDU>

Subject: From Errors to Shaping

From Eliminating Errors to Shaping Responses

Part One by John Eshleman, Saturday March 30, 1996

Errors are anathema in Programmed Instruction. A sign of a good program, in fact, occurs when few or no errors are occasioned by it. Moreover, part of the task when creating a program of instruction focuses on adding or modifying frames so as to reduce the likelihood of errors made by the student.

In Generative Instruction, and with Precision Teaching, on the other hand, errors are treated differently from the way they are in Programmed Instruction. Whereas there is an attempt in Programmed Instruction (PI) to ensure that errors by the student rarely or never happen in the first place, in Generative Instruction (GI) errors are allowed to occur, but are corrected in any of several ways. In addition, errors are also counted and charted in GI, along with correct responses, with both charted as frequencies on Standard Celeration Charts.

In my own experiences, and also thinking about this difference, I have decided that virtually everyone has missed the boat, so to speak, where the matter of errors is concerned. I have asked myself, for instance, what are errors? And furthermore, what would be the best "behavioral" way of dealing with them? I think that the notions we have about "errors" and what to do about them have remained uninformed by our own science, as I shall attempt to describe below. I think that the concept of "correct vs. error" manifests a logical fallacy, and that recognition and rejection of that fallacy will permit a more "behavioral" approach to instruction -- one based on actual shaping.

The Either-Or Bifurcation

One of the properties of verbal behavior concerns its correctness. A verbal community reinforces certain behavior, which may be deemed as "correct" responses, and punishes or extinguishes other behavior, which may be deemed as "incorrect" responses. This seems to be the case generally, and thus would also be the case with respect to verbal behavior. Much of what goes under the label "academic behavior" is verbal behavior, and thus also means behavior consequated based on its correctness. A response is typically seen as either correct or erroneous. Given the stimulus "5 + 4 =", a response of "nine" would be correct, and a response of "eight" an error. The verbal community typically reinforces by some means (or at least fails to punish) the former and punishes (or at least fails to reinforce) the latter. And since we do want people to learn correct responses, and to not make erroneous ones, that practice of reinforcing the former and punishing the latter makes a certain amount of sense. It does so because there are practical effects that result.

Yet, given "5 + 4 = ", and a response of "eight," "eight" isn't bad. It's a pretty close guess. Closer than "three," for instance. And closer than "a hundred." But while a student is learning the correct response to "5 + 4 =" a response of "eight" appears to be taken to be just as bad as any other response other than "nine." We mark it as "incorrect." Enough times of this and we learn that errors become the occasion for punishment, even if only mild disapproval or correction.

Similarly, if the student is learning to spell "manufacture," and writes "manoofacure" the typical response by the instructor or by the instructional system (and, hence, the verbal community), would be to mark it as an error. Ultimately it certainly qualifies as such, of course. But what's at issue here concerns how such an error is treated during the instructional process.

The response "manoofacure" would be viewed as an error and treated accordingly, most likely, even though the student may still be in the process of learning it. Academic responses, then, for the most part, divide into corrects and errors. We all are familiar with this: either we got the answer right, or we got it wrong. What happens in this situation, then, is a bifurcation. A response is taken to be either correct or erroneous. This is "either-or" logic, and represents essentially the same thing as XOR (exclusive-OR) logic. In exclusive-OR logic, for instance, either one of the alternatives may be correct, but not both simultaneously. (In the case of simple OR logic, either alternative or BOTH of them can be true simultaneously.) [An alternative to "either-or" exclusive-OR logic would be "both-and" logic, though that's probably not the only alternative.]

Exclusive-OR logic works with electronic circuits, and has other applications, and occasionally appears with actual bifurcations. But many bifurcations are not truly bifurcations at all. Indeed, the term 'bifurcation' sometimes gets included in lists of logical fallacies. Many apparent bifurcations mask over third alternatives, or the "many shades of gray" between the two apparent alternatives. I think, and this is just my opinion, that the "either-or" bifurcation represents just such a fallacy in many cases, and that it also permeates our culture in a rather pervasive fashion. Certainly it appears to be pervasive throughout the educational system and in circumstances where academic behaviors are taught. As a result, we are seduced into speaking about two alternatives where there may be more.

In teaching academic skills the "either-or" bifurcation certainly seems to predominate. When you take a multiple choice test your answer is either correct or incorrect. When you do a flashcard the answer you give is either right or wrong. The same occurs even with Precision Teaching's SAFMEDS: a response is either correct or it is an error. Fill-in-the-blank responses, or constructed responses, are taken to be either correct or erroneous. Even on more complex and subjective activities such as essay answers a student is given credit for being correct and not given the same for being incorrect. And, on many computer-based training software programs the machine is programmed to record whether a response given is either correct or if it is an error.

Steve Graf's Alternative: Shape the Behavior

The "EITHER it's correct OR it's an error" rule has persisted throughout the history of education, certainly appropriately to some degree since ultimately correctness has practical effects. But should the rule apply DURING the process of instruction? I think not, and here's the reason why: A verbal response made by the student during instruction, instead of being viewed as an error if it otherwise seems to be, could be viewed as an approximation to the correct response. Does this suggestion sound familiar? To anyone familiar with the concept of shaping, it should. In shaping, of course, one reinforces successive approximations to some target or terminal behavior. One does not view those approximations as "errors" to be punished or extinguished simply because they are not yet correct!

Some years ago, during the 1980s, Steve Graf, Ph.D., at Youngstown State University (located in Youngstown, Ohio), created a system of evaluation that broke out of the "either-or" logic of the correct vs. error bifurcation. What Graf did was first of all dispose of the names "correct" and "error." Second, he set up a situation whereby student responses could be categorized by any of four or five different categories. These categories were labeled by Graf as "bullseye," "close," "try," and "skip." Occasionally there would be an additional category labeled "no-chance" (after the usages suggested in Pennypacker, Koenig, and Lindsley, 1972). A "bullseye" would be a perfectly correct response (preserving the "best possible score" attribute of targets). A "close" would be a response that, one way or another, was an approximation to the "bullseye." It would have some, but not all, of the attributes of a "bullseye." Or it would be a thematic variation of a correct response, albeit in other words. A "try" would be a response that, while on-topic, would not be an approximation of any kind. A "try" would be a classic case of a completely incorrect response. Finally, a "skip" would be a response that also was not an approximation of any kind, but which was otherwise off-topic or irrelevant. Saying "I don't know" would be an example of a "skip."

In traditional educational terms, only the "bullseye" would be construed as correct. "Close," "try," and "skip" responses would be treated as errors. Thus, "close," "try," and "skip" responses would be treated the same by traditional "either-or" methods.

What Graf next did, beyond simply identifying separate categories above and beyond the "correct vs. error" paradigm, was to set up a unique differential reinforcement system that illustrates how these categories could be used as approximations to actual shaping. Every category except the skips could earn a student credit points necessary to passing Graf's university course. Thus, under Graf's system, "bullseyes," "closes," and "tries" are generally treated somewhat the same, with some differences. What Graf would do, for instance, would be to assign different multipliers to the different categories. An example situation that he once described to me would look like this:

Bullseye -- X8

Close -- X4

Try -- X2

Skip -- X1

That is, a "bullseye" response would multiply the credit point earned by a factor of eight. A "close" would also multiply the credit point earned, but only by a factor of four. A "try" would also multiply the credit point earned, but by a factor of two. A "skip" would multiply by one, which would not change anything since multiplying by one retains the original value of the number you're multiplying. The payoff, of course, was differential.

"Bullseyes" earned the most credit points. But -- and this is the significant point -- even the "close" and "try" responses would earn credits, though not as much. That is, as approximations to the target "bullseye" response, they would be consequated with credits, and to the extent that credits held any reinforcing value, then the approximations would be reinforced. Moreover, since the "closes" are more successively proximal to the "bullseye" than are the "tries" they also earned more.

Graf has not published this information to my knowledge, hence I don't have any references to cite. It's all personal communication. I lack some of the details of what he has done, and thus cannot describe his entire instructional system. I presume, therefore, that the goal was to get students to emit "bullseye" responses. What I find important about his work is that it breaks out of the "either correct or error" mold; that it begins to introduce actual shaping into the instructional process; that it constitutes practicing what behaviorists preach; and that it has clear implications for both Programmed Instruction and its successor, Generative Instruction. I'll cover these latter points in a follow-up to this message.


Pennypacker, H. S., Koenig, C., & Lindsley, O. R. (1972). Handbook of the standard behavior chart. Kansas City, KS: Precision Media.

Copyright (c) 1996 by John Eshleman. All rights reserved.
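The multiplier scheme that Eshleman attributes to Graf (bullseye X8, close X4, try X2, skip X1) can be sketched in Python alongside conventional either-or scoring. This is a hedged illustration only: Eshleman says he lacks the details of Graf's system, so the one-point base credit per response, the function names, and the sample responses are assumptions introduced here. The sketch follows the multiplier table literally, so a "skip" keeps the unmultiplied base value.

```python
# Multipliers as listed in Eshleman's message.
GRAF_MULTIPLIERS = {"bullseye": 8, "close": 4, "try": 2, "skip": 1}


def graf_credit(category, base_points=1):
    """Credit for one response: base credit (assumed) times the category multiplier."""
    return base_points * GRAF_MULTIPLIERS[category]


def graf_total(responses, base_points=1):
    """Total credit for a list of categorized responses under the multiplier scheme."""
    return sum(graf_credit(category, base_points) for category in responses)


def either_or_total(responses):
    """Traditional scoring: only a bullseye counts; everything else is an error."""
    return sum(1 for category in responses if category == "bullseye")


# Hypothetical set of five categorized responses from one student.
responses = ["bullseye", "close", "try", "skip", "bullseye"]
print(graf_total(responses))       # 8 + 4 + 2 + 1 + 8 = 23
print(either_or_total(responses))  # 2 correct, 3 "errors"
```

The contrast shows the point of the message: under either-or scoring the "close" and "try" responses are indistinguishable from a "skip," while under the multiplier scheme closer approximations earn more.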

What You Measure Is What You Get

John H. Hummel and William G. Huitt

Valdosta State University

The country's current focus on promoting critical thinking skills is our collective reaction to a problem that has been developing for some time. Until recently we assumed that critical thinking would automatically develop as students acquired knowledge and primary importance was given to the discussion of specific disciplines to be studied. We were able to ignore issues such as: What is critical thinking? How can it be measured? How can it be promoted? until it became obvious that the level of critical thinking of too many high school and college graduates was insufficient to the demands of modern society.

In order to effectively and efficiently accomplish the objective of improving students' critical thinking abilities, we need to address the issues of defining and measuring critical thinking. These must be done well before we can develop and test empirical intervention strategies that will promote students' critical thinking skills in all grades.

Critical thinking is probably the most current label for what many call analytical reasoning, synthesis, problem-solving, or higher mental processes (Scriven & Paul, 1992). Common threads that tie the various definitions of critical thinking together are the terms used to describe the processes and outcomes associated with thinking critically: the development of concepts and principles, the application of facts, concepts, and principles to solve problems and make decisions, and the evaluation of these solutions for effectiveness (Chance, 1986; Ennis, 1987). Almost four decades ago, Bloom, Engelhart, Furst, Hill, and Krathwohl (1956) published their now widely-accepted taxonomy for classifying objectives and assessment items for the cognitive domain. Their system specifies six levels of understanding and mastery, and each higher level subsumes the properties of the lower levels. The levels of the taxonomy are, from lowest to highest, knowledge, comprehension, application, analysis, synthesis, and evaluation. Subsequent research has led to the conclusion that the taxonomy is indeed a hierarchy with the exception that perhaps evaluation and synthesis are misplaced (Seddon, 1978).

Typically, students' achievement and critical thinking skills are assessed using a forced-choice format. Unfortunately, most items used in these assessments address levels of knowing and thinking not typically associated with critical thinking. Many researchers (e.g., Carter, 1984; Gage & Berliner, 1992; Woolfolk, 1993) agree that the objective test items used at all levels of education overwhelmingly tap the lower (i.e., knowledge and comprehension) levels of the Bloom et al. (1956) taxonomy. Other researchers who developed alternative taxonomies have drawn a similar conclusion (e.g., Stiggins, Rubel, & Quellmalz, 1988).

These problems are crucial in that the types of assessments used in education affect how students learn and how teachers teach (Fredericksen, 1984). This conclusion is so central to teaching and assessment practices at all levels of education that in our preservice and inservice teacher education classes we use the acronym WYMIWYG to emphasize its importance. WYMIWYG specifies a concept we believe ought to be a guiding principle for all educators: What You Measure Is What You Get. If educators develop assessments aimed at higher-level thinking skills, (a) they will be more likely to teach content at those levels, and (b) students, according to Redfield and Rousseau (1981), will master and perform at those levels. Students not only need to know an enormous number of facts, concepts, and principles, they also must be able to think effectively about this knowledge in a variety of increasingly complex ways.

Getting Started

We believe all educators can immediately begin improving students' abilities to think critically by implementing a few basic strategies. First, teachers must ensure that there are instructional/behavioral objectives that cover the lesson's content. Care should be taken to match objectives with an outside assessment, a next level of learning, or a stated requirement for success in a given field or career (Wiggins, 1991). Objectives that are too broad or general should be rewritten to specify what students should be able to do after mastering the objective, and if teachers identify expected outcomes not covered in existing objectives, new ones should be developed. After validating the congruence or overlap between the objectives and content taught, educators should then analyze each objective to determine its level vis-a-vis a validated taxonomy of the cognitive domain (e.g., Bloom et al., 1956; Ebel, 1965; Gagne, 1985; Stiggins et al., 1988). If appropriate, objectives should be rewritten to reflect a higher level of the taxonomy. (As one rewrites objectives to require students to use critical thinking skills, one may also have to revise the instructional techniques used to teach the course content.) Next, one should evaluate the assessment instruments used to establish whether students have mastered the content at the stated level of the taxonomy. If test items are used that only require lower-level thinking skills such as knowledge and comprehension, students will not develop and use their higher-order skills even if instructional methods that employ these skills are implemented. This follows the maxim that individuals do not do what is expected, only what is inspected.
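For teachers who keep their objectives and test items in a spreadsheet or list, the audit described above can be roughly sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the objective labels, and the sample items are hypothetical, and deciding which taxonomy level an item actually taps remains the teacher's judgment. The sketch tallies items by level and flags any objective that no test item assesses at its stated level or higher.

```python
from collections import Counter

# Levels of the Bloom et al. (1956) taxonomy, lowest to highest.
BLOOM_LEVELS = ["knowledge", "comprehension", "application",
                "analysis", "synthesis", "evaluation"]


def level_rank(level):
    """Return the position of a level in the taxonomy (0 = lowest)."""
    return BLOOM_LEVELS.index(level)


def level_distribution(items):
    """Count how many test items were classified at each level."""
    return Counter(level for _, level in items)


def audit_alignment(objectives, items):
    """Return the objectives that no item assesses at the stated level or above.

    objectives: dict mapping an objective label to its stated level
    items: list of (objective label, level the item was judged to require)
    """
    highest = {}
    for objective, item_level in items:
        rank = level_rank(item_level)
        if rank > highest.get(objective, -1):
            highest[objective] = rank
    return [objective for objective, stated in objectives.items()
            if highest.get(objective, -1) < level_rank(stated)]


# Hypothetical data: three objectives and four classified test items.
objectives = {"obj1": "application", "obj2": "analysis", "obj3": "knowledge"}
items = [("obj1", "knowledge"), ("obj1", "application"),
         ("obj2", "comprehension"), ("obj3", "knowledge")]

flagged = audit_alignment(objectives, items)
distribution = level_distribution(items)
print("Objectives assessed below their stated level:", flagged)
print("Items per level:", dict(distribution))
```

In this made-up example only the second objective is flagged, because its single item was judged to require comprehension although the objective is stated at the analysis level. A test whose tally is dominated by knowledge and comprehension items would illustrate the WYMIWYG concern directly.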


Convincing educators, including college teachers, to demand precise, operational definitions of critical thinking is going to be no easy matter. In addition, getting teachers and standardized test developers to assess students at the higher levels of the taxonomy will not be an easy task. It takes time to prepare good assessments (e.g., tests, demonstrations, exercises) that require students to think critically; it takes even longer to prepare the necessary keys and to grade assessments such as essay exams and term papers. Everyone is willing to say that good teaching and assessment (especially as they relate to critical thinking) are important, but not enough educational resources are committed to support and promote these activities.

Of course, simply having teachers give more essay-type or activity-oriented assignments (even good ones that tap into the higher cognitive domains) will not necessarily improve students' critical thinking skills. Likewise, outcomes assessment efforts, including standardized tests, for high school or college graduates will not, by themselves, produce improvements in students' critical thinking skills (though such assessments, if valid, may help to emphasize and document the extent of the problem!).

Many teachers at all levels will likely need to be provided with inservice instruction so they can (a) discover/rediscover the value of instructional techniques that include well-developed objectives and task analyses, and (b) incorporate into their pedagogy new teaching strategies (e.g., Gray, 1993; Hummel & Rittenhouse, 1990) that research shows help to promote/develop the skills associated with critical thinking. Fortunately, a variety of alternatives are becoming available (e.g., Georgia's Critical Thinking Skills Program, 1993; Oxman, 1992). However, unless students' critical thinking skills are measured regularly and those measurements are given prominent attention by educators, the sustained efforts required to make changes in our educational system are not likely to occur.


References

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives, handbook I: Cognitive domain. New York: David McKay.

Carter, K. (1984). Do teachers understand principles of writing tests? Journal of Teacher Education, 35, 57-60.

Chance, P. (1986). Thinking in the classroom: A survey of programs. New York: Teachers College, Columbia University.

Ebel, R. L. (1965). Measuring educational achievement. Englewood Cliffs, NJ: Prentice-Hall.

Ennis, R. H. (1987). A taxonomy of critical thinking dispositions and abilities. In J. B. Baron & R. J. Sternberg (Eds.), Teaching thinking skills: Theory and practice. New York: W. H. Freeman, 1-26.

Fredericksen, N. (1984). The real test bias: Influences on teaching and learning. American Psychologist, 39, 193-202.

Gage, N. L., & Berliner, D. C. (1992). Educational psychology (5th ed.). Boston: Houghton Mifflin.

Gagne, R. M. (1985). Conditions of learning (4th ed.). New York: Holt, Rinehart, & Winston.

Georgia's Critical Thinking Skills Program. (1992). Atlanta: Georgia Department of Education.

Gray, P. (1993). Engaging students' intellects: The immersion approach to critical thinking in psychology instruction. Teaching of Psychology, 20, 68-74.

Hummel, J. H., & Rittenhouse, R. D. (1990, May). Revising Woods' taxonomy of instrumental conditioning. Paper presented at the annual meeting of the Association for Behavior Analysis, Nashville, TN.

Oxman, W. (Ed.). (1992). Critical thinking: Implications for teaching and teachers. Upper Montclair: Montclair State College.

Redfield, D. L., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning behavior. Review of Educational Research, 51, 181-193.

Scriven, M., & Paul, R. (1992, November). Critical thinking defined. Handout given at the Critical Thinking Conference, Atlanta, GA.

Seddon, G. M. (1978). The properties of Bloom's taxonomy of educational objectives for the cognitive domain. Review of Educational Research, 48(2), 303-323.

Stiggins, R. J., Rubel, E., & Quellmalz, E. (1988). Measuring thinking skills in the classroom (Revised edition). Washington, DC: National Education Association.

Wiggins, G. (1991). Teaching to the authentic test. Educational Leadership, 46(7), 41-47.

Woolfolk, A. E. (1993). Educational psychology (5th ed.). Boston: Allyn & Bacon.


Last Updated: May 20, 1997