
OCCIDENTAL COLLEGE DEPARTMENT OF COGNITIVE SCIENCE

Reflection Expanded
Expanding the Cognitive Reflection Task
Samuel C. Boland Spring 2013 Senior Comprehensive Project

The Cognitive Reflection Task (CRT; Frederick, 2005) is designed to measure an individual's tendency to engage in cognitive reflection, that is, the propensity to think about one's own responses analytically. In this study, I explain the CRT, the construct behind it, and various correlating measures, and I propose an expansion of the test. Twelve new questions were tested against the original three on measures such as SAT/ACT score, age, gender, and cognitive heuristics-and-biases tasks; of these, six were selected for possible inclusion in the CRT on the basis of high correlations with the original items.


INTRODUCTION
COGNITIVE REFLECTION AND DECISION MAKING
The Cognitive Reflection Task (CRT), introduced by economist Shane Frederick in his 2005 paper Cognitive Reflection and Decision Making (Frederick, 2005), is a test designed to measure cognitive reflection. Frederick drew on Stanovich and West (2000) for a formal definition of this cognitive ability. According to those authors, human cognition can be broadly characterized in terms of two systems, the exact nature of which differs depending on the construct being measured. Most such accounts, however, share a System 1 and a System 2. System 1 is fast and immediate. It is utilized in such actions as driving, walking, simple arithmetic, and other non-cognitively-taxing activities. System 2, on the other hand, is slow and analytic; it must be specifically activated, and it requires sustained effort and active concentration to maintain. It is implicated in complex tasks such as learning a new skill, complex mathematics, reading dense books, or writing a paper. System 1 is the default system for most activities; it would make little sense to devote intense cognitive effort to walking in a straight line, or to most other daily tasks. Thus, it is more easily activated. Activating System 2 requires a specific desire to do so, along with sustained motivation and ability throughout.

Frederick's paper analyzes the relationship between an individual's affinity for cognitive reflection and other cognitive measures. To do this, Frederick created a series of questions designed to engage both System 1 and System 2. Specifically, each question must have a pre-potent, or "gut," response: one which seems immediately obvious but which is incorrect. The question must also have a correct, analytically derivable response, which can only be arrived at through the application of System 2. System 2 will only be activated, however, if participants catch themselves, that is, notice that they have made an error. To do that, they must have been reviewing their previous actions and responses, in effect reflecting upon their recent mental past. Hence the name: the Cognitive Reflection Task. To this end, Frederick created three questions that satisfy this pre-potency condition. They are:

1. A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?


   a. Intuitive Answer: 10 cents.
   b. Correct Answer: 5 cents.
   c. Why: People seem to be eager to simply subtract $1.00 from $1.10, as it is a cognitively simple procedure. However, after a small amount of reflection, it becomes obvious that this way of completing the question violates the stipulation that the two items together cost $1.10, since a $0.10 ball would make the bat $1.10 and the pair $1.20. Instead, one must solve a simple pair of equations: {X + Y = 1.10, X - Y = 1.00; solve for Y}. (A worked restatement of all three solutions appears after this list.)
   d. This question comes from Kahneman and Frederick (2002) and Kahneman and Frederick (2005), and formed the springboard from which Frederick created the next two questions.

2. If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?
   a. Intuitive Answer: 100 minutes.
   b. Correct Answer: 5 minutes.
   c. Why: It takes 1 machine 5 minutes to make 1 widget, so it will take 100 machines 5 minutes to make 100 widgets. The obvious answer is to scale all of the variables up to 100; however, in creating this question Frederick picked a special case where all of the numbers are the same, which may instantiate a mental schema wherein X = Y = Z for all members of the set and so produce an incorrect response.

3. In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?
   a. Intuitive Answer: 24 days.
   b. Correct Answer: 47 days.
   c. Why: People seem to tend to assume linearity in mental calculations, perhaps because many quantities in day-to-day life behave linearly, or at least approximately linearly, on the scales that we perceive. However, a patch that doubles every day grows exponentially, which must be taken into account in order to answer this question correctly.
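All three solutions can be written out in a few lines of algebra; the following restatement is mine, not Frederick's:

```latex
\begin{aligned}
&\textbf{Bat and ball:} && x + y = 1.10,\quad x - y = 1.00 \;\Rightarrow\; 2x = 2.10,\; x = 1.05,\; y = 0.05. \\
&\textbf{Widgets:} && \text{rate} = \frac{5\ \text{widgets}}{5\ \text{machines}\times 5\ \text{min}} = \tfrac{1}{5}\ \tfrac{\text{widget}}{\text{machine}\cdot\text{min}}
  \;\Rightarrow\; t = \frac{100}{100 \times \tfrac{1}{5}} = 5\ \text{min}. \\
&\textbf{Lily pads:} && S(d) = S(48)\cdot 2^{\,d-48} \;\Rightarrow\; S(47) = \tfrac{1}{2}\,S(48),\ \text{so half coverage falls on day 47.}
\end{aligned}
```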

In Kahneman and Frederick (2005), Frederick notes that "the critical feature of this [bat and ball] problem is that anyone who reports 10 cents has obviously not taken the trouble to check his or her answer. The surprisingly high rate of errors in this easy problem illustrates how lightly System 2 monitors the output of System 1: people are often content to trust a plausible judgment that quickly comes to mind."

CORRELATING MEASURES TO THE ORIGINAL CRT


Frederick was interested in how this measure might correlate with other cognitive measures, specifically various measures of economic cognition. One of these is temporal discounting: the tendency or ability to put off a small immediate reward for a larger later reward, and the ability to accurately gauge whether it is a better choice to receive a reward (such as money) now or later, including an understanding of the effects of inflation and compound interest. One question in this vein that was also utilized in the present study is: "Would you rather be given $3400 this month, or $3800 next month?" The high-CRT group preferred receiving more money later, with high statistical significance (N = 806). Frederick explained this task in terms of the annual discount rate, that is, the percentage of annual interest that would be required for an amount of money A to grow to an amount of money B. Frederick stated that the annual rate implied by waiting for $3800 rather than taking $3400 is about 280%, much higher than any savings program could offer, so by his judgment the best answer is to wait.

In the field of risk aversion, the high-CRT groups were more willing to gamble when the expected value of an uncertain payoff was much higher than that of a certain payoff, whereas the low-CRT groups were more eager to take the certain money and not risk losing it. For some questions of this sort, the expected value would be maximized by picking the gamble; for others, it would be maximized by taking the certain money. To quote Frederick: "In the domain of gains, the high CRT group was more willing to gamble particularly when the gamble had higher expected value, but, notably, even when it did not" (Frederick, 2005).
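Frederick's 280% figure can be reproduced by compounding the implied one-month growth factor over twelve months. The short sketch below is my own restatement, not Frederick's calculation:

```python
# Annualize the implied return of waiting one month for $3800 instead of taking $3400 now.
immediate, delayed = 3400, 3800

monthly_growth = delayed / immediate      # ~1.118 growth factor per month
annual_growth = monthly_growth ** 12      # compounded over 12 months, ~3.80
annual_rate = annual_growth - 1           # ~2.80, i.e. roughly a 280% annual rate

print(f"Implied annual rate of return: {annual_rate:.0%}")  # roughly 280%
```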


Another example was sensitivity to the gambler's fallacy: the belief that statistically independent events are actually causally linked in some way, generally involving a sense of "luck." Some individuals appear to believe that luck is some sort of tally keeping track of wins and losses through time, and that a string of losses increases the chances of a later victory. One question designed to elicit this response is: "When playing slot machines, people win something about 1 in every 10 times. Julie, however, has just won on her first three plays. What are her chances of winning the next time she plays?"

Frederick found high correlations between these and related measures and the CRT, seemingly pointing towards an underlying relationship between cognitive reflection and these various measures of economic cognition. Risk aversion, temporal discounting, and sensitivity to the gambler's fallacy all correlate with the ability to reflect upon immediate actions and their implications, even when those implications are not immediately obvious (Frederick, 2005). Frederick found further strong correlations between the CRT and other measures such as the SAT, the ACT, the WPT (Wonderlic Personnel Test), and the NFC (Need for Cognition scale), all at P < .01. Further, he found a significant relationship between gender and CRT score, with men tending to perform better than women. The correlation with the SAT is to be expected, as both the SAT Verbal and SAT Math sections require cognitive reflection to perform well. While SAT questions do not generally have a pre-potent response as the CRT questions do, they frequently require a reframing of the question in order to be answered correctly. Thus, a moderate but significant correlation is not a surprise.

Frederick collected thousands of data points across multiple locations. These were primarily colleges and universities, but some sampling of the general public was also conducted. Data were collected from MIT, Princeton, Carnegie Mellon, Harvard, the University of Michigan at Ann Arbor, Bowling Green University, the University of Michigan at Dearborn, Michigan State University, and the University of Toledo. He also collected data from the public at a Boston fireworks display and through a web-based survey. In total, 3,428 people participated in Frederick's study, although not all filled out all fields (that is, not all had taken the SAT or ACT, and so were not part of analyses involving those scores). The CRT thus appears to be a rather powerful test, tapping into some low-level construct shared by such seemingly disparate cognitive measures as economic cognition, SAT and ACT scores, the WPT, and even a test designed solely to measure a respondent's desire to think (the NFC).

TOPLAK, STANOVICH, AND WEST (2010)


Heuristics-and-biases questions are those designed to measure an individual's propensity to fall into common cognitive traps. Thinking deeply about complex issues is time-consuming and cognition-intensive, and so humans seem to have developed, or been born with, certain mental shortcuts, or heuristics, that allow complex problems to be solved quickly with little effort. Sometimes, however, these heuristics act more like illogical biases, causing suboptimal performance (Kahneman & Tversky, 1974, 1983). A 2010 paper by Toplak, Stanovich, and West sought to find further correlations between the CRT and measures of cognitive ability, specifically heuristics-and-biases questions (Toplak, Stanovich, & West, 2010). Through a series of regression analyses, they found that the CRT was a more potent predictor of performance on heuristics-and-biases questions than other, more traditional predictors such as self-report measures. They approached the CRT as a test of one's propensity towards being a "cognitive miser," that is, the propensity to expend the least amount of effort possible in coming to a conclusion. Previous literature has found a strong connection between such cognitive miserliness and common reasoning errors (Stanovich, 2009b; Tversky and Kahneman, 1974). One possible reason they put forward for the CRT's efficacy in this field is that, unlike most other measurements designed to probe miserly cognitive behavior, the CRT contains the aforementioned pre-potent response as well as a correct response, meaning that a strong immediate response must be actively inhibited in favor of a less obvious one: a cognitively expensive procedure that cognitive misers would not engage in.

CORRELATING MEASURES OF TOPLAK ET AL.


Toplak et al utilized 15 classic heuristics-and-biases tasks drawn from multiple studies, designed to measure various subfields of human cognition such as probabilistic reasoning, hypothetical thought, and statistical thinking. Not all of the questions correlated significantly with the CRT in Toplak's study, but the aggregate measure of these questions correlated at .49, P < .001, demonstrating the existence of a link between the two.

PROFESSOR SHTULMAN'S ADDITIONAL QUESTIONS


Professor Shtulman added two additional questions to the CRT, which I am familiar with from my summer research experience at Occidental College in summer 2012. The purpose of the addition was to make the CRT section of his exam five questions long, the same length as the other portions, so as not to tip off test-takers that the CRT section is different. These questions were:

1. A house contains a living room and a den that are perfectly square. The living room has four times the square footage of the den. If the walls in the den are 10 feet long, how long are the walls in the living room?
   a. Intuitive Answer: 40 feet.
   b. Correct Answer: 20 feet.
   c. Why: The area of a room increases with the square of the length of its walls; however, it is cognitively easier to perform mental calculations assuming linearity.
2. A store owner reduced the price of a pair of $100 shoes by 10%. The next week, he reduced the price by a further 10%. How much do the shoes cost now?
   a. Intuitive Answer: $80.
   b. Correct Answer: $81.
   c. Why: The second reduction was applied to the already-reduced price, not the original price. The shoes go from $100 to $90 after the first reduction and from $90 to $81 after the second, rather than simply losing 20% of $100.
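Both answers follow from a line or two of arithmetic; this restatement is mine:

```latex
\text{Den: } 10^2 = 100\ \text{ft}^2,\quad
\text{Living room: } 4\times 100 = 400\ \text{ft}^2 = 20^2
\;\Rightarrow\; \text{20-foot walls}; \qquad
\$100 \times 0.9 \times 0.9 = \$81.
```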

PROLIFERATION OF THE CRT


A footnote in Toplak et al (2010) points out that the CRT is in danger of becoming a self-report measure, rather than a performance measure, due to the proliferation of its questions on the internet and between people. Toplak et al note that the ultimate answer to this quandary is the creation of more CRT items that vary in surface characteristics. An entirely non-rigorous check seems to support this concern: a Google search for "bat ball riddle," referencing the first question of the original CRT, returns over 1.8 million results. The number of results is not surprising in itself; what is surprising is the sheer number of relevant results many pages into the search. At the time of writing, relevant answers can still be found as deep as page 20, a remarkable depth, anecdotally speaking. This could reflect a number of things (Google could have changed its search algorithms, for example), and the cause of the phenomenon cannot be ascertained in such a cursory analysis. That being said, the possibility exists that individuals are spreading the CRT questions through the internet and other channels of communication, aided by the small size and high memorability of the questions. This is especially dangerous for the CRT, as the very nature of the questions relies on individuals noticing that their answers were wrong. If a tainted participant were to take the CRT and see a problem he is familiar with, he may (correctly) assume that the other questions in the set are of a similar nature, putting him on guard against the very thing that the CRT was designed to test for in the first place.

EXPANSION OF THE CRT


As stated above, this is a problem for experimenters who wish to utilize the CRT. The idea behind the current study is to expand the CRT, inoculating it somewhat, and for a short while, against corruption by individuals who take the test with pre-existing knowledge of the questions. Further, if successfully extended, this larger CRT would provide a pool of questions for researchers to draw upon, increasing the utility of the test; the possibilities of expanding the CRT are discussed in the discussion section below.

To that end, the goal of this study is to expand the CRT. The plan is simple. First, find questions that are candidates for inclusion in the CRT, specifically questions that are CRT-like: they have a pre-potent response that is incorrect and a correct response that requires the recruitment of System 2. Second, find measures that correlate and do not correlate with the original CRT (to achieve both convergent and divergent validity). Then, combine the correlating measures and the expanded CRT into one large test, and administer it.


Once that is complete, examine the correlations between the original CRT questions, the new CRT questions, and the correlating measures. The set of new CRT questions will require pruning to remove those that do not behave like the CRT on the given measures. This is accomplished by multiple passes of analysis, determining how each individual question relates to each correlating measure and checking whether it behaves similarly to the original CRT questions on that measure. If a new question correlates with the measures in the same way the original CRT does, it is retained; if it does not, it is removed.

After searching through books of riddles, LSAT practice exams, and various internet sites devoted to riddles and jokes, I came up with a list of 10 potential new questions for the CRT. In the end, they were all pulled from anonymous internet sources; no physical books or LSAT practice questions were used, as I could not find CRT-like questions in them that fit my criteria. It may seem odd to pull questions from the internet in order to protect the test from proliferation on the internet, but two points are worth noting. First, almost all information available in books is now available on the internet, and security through obscurity of the source (such as an old book of riddles) is not particularly strong once the source is discovered and digitized. Second, the point of this expansion is not to create a bullet-proof set of questions, but to reduce the damage to the CRT in the event that an individual knows one of the questions. The damage to the participant's mindset is unavoidable: if they know that one of the questions is a trick, they may extend that idea to all of the questions. However, if a participant knows the bat-and-ball question from the original CRT, that invalidates one third of the test. If they know one answer on the expanded CRT, they invalidate only 1/n of the questions, where n is the expanded size. Unless one of the original questions were somehow invalidated, n will always be greater than 3, providing a better buffer against statistical invalidation of the test for that participant. The new questions are listed below in no particular order.

THE ADDITIONAL QUESTIONS
1. Some months contain 30 days, others contain 31 days. How many contain 28 days?
   a. Intuitive Response: Only one month has 28 days.
   b. Correct Response: 12 months, as every month contains at least 28 days.


2. A red clock and a blue clock are both broken. The red clock doesn't move at all. The blue clock moves but loses 24 seconds every day. Which clock is more accurate?
   a. Intuitive Response: The blue clock, as it is at least still running.
   b. Correct Response: The red clock, as it is correct twice a day. The blue clock must drift through a full 12 hours (assuming an analog clock; a digital clock with an AM/PM indicator, or one on 24-hour time, must drift a full 24 hours) in 24-second daily increments, so it shows the correct time only once every 1,800 days (or 3,600 days), roughly five (or ten) years. This further assumes that the clock loses its 24 seconds in one chunk at the end of each day; if it instead simply runs slow, cumulatively losing 24 seconds over the course of the day, it will be accurate even less often.

3. You are in third place in a race. You overtake the person in second place. What place are you in now?
   a. Intuitive Response: First place; since you just beat the person in second place, you must be in first.
   b. Correct Response: Second place. You passed the previous second-place runner, who is now in third place, but the original first-place runner is still ahead of you.

4. You have a book of matches and enter a cold, dark room. You know that in the room there is an oil lamp, a candle, and a heater. What do you light first?
   a. Intuitive Response: Any of the above, depending on the individual's preference for light versus heat.
   b. Correct Response: A match must be lit before anything else.

5. Divide 30 by ½ and add 10. What is the answer?
   a. Intuitive Response: 25, i.e. ((30 / 2) + 10) = 25.
   b. Correct Response: 70. The question says to divide by ½, not to multiply by ½ (divide by 2), so the calculation is ((30 / 0.5) + 10) = 70. This is very similar to SAT reading comprehension questions, which sometimes actively obscure the answer through non-standard wording of a problem.

6. If within a family there are nine brothers, and each brother has one sister, how many people are in the family, including the mother and father?
   a. Intuitive Response: 20. If each brother has one (unique) sister, then there are 9 brothers, 9 sisters, and 2 parents = 20 people.
   b. Correct Response: 12. Nowhere does the question say that each brother has a different sister, only that each has *a* sister. A single sister is shared by all nine brothers, so the family is 9 brothers + 1 sister + 2 parents = 12 people.

7. An airplane travelling at 400 mph crashes on the US/Canadian border. Where are the survivors buried?
   a. Intuitive Response: Either where they are from, or wherever their families wish them to be buried.
   b. Correct Response: Survivors are not buried, as they survived (and burying them would be cruel).

8. If it takes 20 minutes to hard-boil one goose egg, how long would it take to hard-boil 4?
   a. Intuitive Response: 80 minutes.
   b. Correct Response: 20 minutes; just put them all in the same pot.

9. A doctor gives you three (3) pills and tells you to take one every half an hour. How long will it be until you no longer have any pills?
   a. Intuitive Response: 1.5 hours, i.e. three pills × 30 minutes = 1.5 hours.
   b. Correct Response: 1 hour. This is a problem of counting the fence posts: if you take one pill immediately, you have two left; when you take another half an hour later, you have one left; and when you take the last pill at the one-hour mark, you have none left. Thus you are out of pills after one hour.

10. You have a ribbon that is 30 inches long. How many cuts with a pair of scissors would it take to divide it into 1-inch-long pieces?
    a. Intuitive Answer: 30.
    b. Correct Answer: 29. (The fence-post arithmetic for this and the previous question is restated after this list.)
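Questions 9 and 10 share the same fence-post structure; as a worked restatement (mine), with n items the count of intervals or cuts is n - 1:

```latex
\text{Pills: } t = (3 - 1)\times 30\ \text{min} = 60\ \text{min};
\qquad
\text{Ribbon: } \text{cuts} = 30\ \text{pieces} - 1 = 29.
```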

All of these questions are, at the very least, weakly CRT-like, in that they have a pre-potent response and a correct response. In some cases, however, the pre-potent response is not particularly strong, such as in the questions involving the clocks or the ribbon. Further, not every question is strictly mathematical in nature, as the original questions were. However, I still believe that they may tap into the difference between Systems 1 and 2 and effectively measure an individual's proclivity to engage in cognitive reflection.

THE CORRELATING MEASURES


These measures were taken from Frederick (2005), Toplak et al (2010), and a variety of other sources. The reported sources are the original studies in which these questions were used, as far as I can ascertain. For others, such as the gambler's fallacy question, no original could be determined, and so the source in which the question was found is cited.

1. Gambler's Fallacy: "When playing slot machines, people win something about 1 in every 10 times. Julie, however, has just won on her first three plays. What are her chances of winning the next time she plays?" (Frederick, 2005)
   a. The point of this question is to gauge an individual's proclivity to believe in luck, karma, or fate, or, more specifically, to commit the cognitive fallacy of treating unrelated probabilistic events as related. The correct answer was 1/10, .1, 10%, or equivalent; any other answer was coded as false.
2. Sample Size Sensitivity: "A game of squash can be played to either 9 or 15 points. Player A is a better player than player B. Which amount of points to finish the game (9 or 15) gives A a higher chance of winning?" (Kahneman & Tversky, 1982)
   a. The correct answer is 15. Much as a flipped coin settles toward a 50-50 distribution given enough flips, reflecting the true underlying probabilities, a longer game gives the better player's underlying advantage more opportunity to assert itself than a shorter game does. (A simulation sketch illustrating this point appears after this list.)
3. Regression to the Mean: "After the first two weeks of the major league baseball season, newspapers begin to print the top 10 batting averages. Typically, after 2 weeks, the leading batter often has an average of about .450. However, no batter in major league history has ever averaged .450 at the end of the season. Why do you think this is?" (Lehman, Lempert, & Nisbett, 1988)
   a. When a batter is known to be hitting for a high average, pitchers bear down more when they pitch to him.
   b. Pitchers tend to get better over the course of a season, as they get more in shape. As pitchers improve, they are more likely to strike out batters, so batters' averages go down.


   c. A player's high average at the beginning of the season may be just luck. The longer season provides a more realistic test of a batter's skill.
   d. A batter who has such a hot streak at the beginning of the season is under a lot of stress to maintain his performance record. Such stress adversely affects his playing.
   e. When a batter is known to be hitting for a high average, he stops getting good pitches to hit. Instead, pitchers play the corners of the plate because they don't mind walking him.
      i. The only correct answer is C.
      ii. This question, much like the previous one, tests how well an individual understands such statistical concepts as the law of large numbers and regression to the mean.

4. Covariational Reasoning: "A doctor has been working on a cure for a mysterious disease. Finally, he created a drug that he thinks will cure people of the disease. Before he can begin to use it regularly, he has to test the drug. He selected 300 people who had the disease and gave them the drug to see what happened. He also observed 100 people who had the disease but who were not given the drug. When the treatment was used, 200 people were cured, and 100 were not. When the treatment was NOT used, 75 people were cured, and 25 people were not. On a scale of 1 to 10, how strong of an effect did the treatment have, if any, either positive or negative?" (Toplak et al., 2010)
   a. As can be seen, this is not a good treatment. When the treatment was used, 200/300, or two thirds, of the people were cured. When the treatment was not used, 75/100, or three quarters, were cured. Thus, the treatment was actually either slightly negative or completely useless.
   b. Any answer under 5 was scored as correct. This may need to be changed.

5. Methodological Reasoning in Everyday Life: "The city of Middleopolis has had an unpopular police chief for a year and a half. He is a political appointee who is a crony of the mayor, and he had little previous experience in police administration when he was appointed. The mayor has recently defended the chief in public, announcing that in the time since he took office, crime rates decreased by 12%. Which of the following pieces of evidence would most deflate the mayor's claim that his chief is competent?" (Lehman et al., 1988)
   a. The crime rate in the city closest to Middleopolis in location and size has fallen by 18%.
   b. An independent survey of the citizens of Middleopolis reports 40% more crimes than are in the police records.
   c. Common sense indicates that there is little that a police chief can do to lower crime rates, as these are mostly social and economic matters beyond his or her control.
   d. The police chief was discovered to have business contracts with people in organized crime.
      i. Only A contains data specific to the claim; all of the others are unfounded conjecture, despite the fact that they may seem like they would make sense.

6. Sunk Cost Fallacy: This was composed of two questions. The first part was:
   a. "Imagine that you are staying in a hotel room, and that you have just paid $9.95 for a pay-per-view movie. Five minutes into the movie, you find yourself bored with it. Do you change the channel or continue watching the movie?"
   b. And the second part was: "Imagine that you are staying in a hotel room, flipping channels on the TV. You come across a movie that is just starting. Five minutes into the movie, you find yourself bored with it. Do you change the channel or continue watching the movie?" (Toplak et al., 2010)
      i. If a participant selected the same answer to both of these questions, they were scored as correct; if they chose different responses, they were scored as incorrect. This was designed to measure sensitivity to the sunk cost fallacy: the tendency of a person to continue an unpleasant activity because they have already expended value (time, money) on it.
      ii. It should be noted that no individual who stated that they would switch the channel in the pay condition reported that they would stay on the channel in the free condition. It was only when an individual had spent money that they were willing to sit through a movie that they did not like: they've already paid, or so the reasoning goes, so they should get their money's worth. This is unreasonable, as the money is already gone and the time could be spent in a more useful way.
   c. Outcome Bias: Like the previous measure, this came in two parts (Baron & Hershey, 1988).
      i. Part one: "There is a 55-year-old man with a serious heart condition. He had an operation to fix the problem, which succeeded. The probability of him dying from the surgery was 8%. Please rate how good of a decision this was on the following scale, with 1 being 'incorrect, a very bad decision' and 7 being 'clearly correct, an excellent decision.'"
      ii. Part two: "There is a 55-year-old man with a hip condition. He had an operation to fix the problem, which did not succeed: the old man died on the operating table. The probability of him dying from the surgery was 2%. Please rate how good of a decision this was on the following scale, with 1 being 'incorrect, a very bad decision' and 7 being 'clearly correct, an excellent decision.'"
         1. A participant was rated as correct only if they rated part two as a better decision than part one. Even though the patient in part two died, he faced only a quarter of the risk of dying (2% vs. 8%) compared to the patient in part one, who merely happened to survive. Judging the decision by its outcome is outcome bias, reflected in the phrase "hindsight is 20/20."
7. Temporal Discounting 1: "Would you rather be given $3400 right now or $3800 one month from now?"
   a. This was coded as correct if the individual chose the second answer, because in the current and foreseeable economic climate, interest rates would not allow $3400 to grow to $3800 in one month. However, it *is* conceivable to turn $3400 into $3800 within a month through other activities, such as arbitrage or short-term loans; this was pointed out to me after the experiment was conducted. (All temporal discounting questions were taken from Frederick, 2005.)
8. Temporal Discounting 2: "What is the highest amount of money that you would pay for a book that you really want to be shipped to you overnight?"
   a. This was coded as 1 for correct and 0 for incorrect. To determine this, the given answers were averaged; all answers above the average were coded 0, and all below were coded 1.
9. Temporal Discounting 3:

   a. "On a scale of 1 to 10, where 1 is 'very little' and 10 is 'quite a bit,' please rate how much you think about monetary inflation."
   b. This was not scored as correct or incorrect, but was used as a scale.
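To illustrate the sample-size point from measure 2 above, the following small simulation is mine, with an arbitrarily assumed point-win probability of 0.6 for the better player and a simplified race-to-N scoring rule (real squash scoring is more complicated):

```python
import random

def play_game(p_a: float, target: int, rng: random.Random) -> bool:
    """Simulate a race to `target` points; return True if player A wins."""
    a = b = 0
    while a < target and b < target:
        if rng.random() < p_a:   # A wins this point with probability p_a
            a += 1
        else:
            b += 1
    return a == target

def win_rate(p_a: float, target: int, n_games: int = 20_000, seed: int = 1) -> float:
    rng = random.Random(seed)
    return sum(play_game(p_a, target, rng) for _ in range(n_games)) / n_games

# A hypothetical better player who wins 60% of individual points.
print(f"A wins a race to 9 points:  {win_rate(0.6, 9):.2f}")   # roughly 0.80
print(f"A wins a race to 15 points: {win_rate(0.6, 15):.2f}")  # higher, roughly 0.86
```

The longer game gives the stronger player's advantage more opportunity to assert itself, which is why 15 is the correct answer.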

METHODS
PARTICIPANTS AND PROCEDURE
A total of 59 participants (43 from within Occidental College, 16 from the general public) took part in the study. Individuals from outside the college were recruited in an attempt to reduce the effect of WEIRD populations (Heinrich et al., 2010). Participants were recruited through social networking, Sona-systems, and word of mouth. Individuals who were eligible received .5 course credits for participation in the study. The test was administered as a Google form: one version for within Occidental that collected @oxy.edu email addresses, and one for outside the college that was completely anonymous. Ages of participants ranged from college-aged to the mid-50s, with all participants over 18 years of age. The test was not timed, but anecdotal reports suggest that it took approximately 30-45 minutes to complete.

TASKS AND MEASURES


Participants completed a combined survey of demographic data, self-reported testing data (SAT and ACT scores, as well as age), the above-mentioned heuristics-and-biases tasks, the original CRT (from here on referred to as the oCRT), Professor Shtulman's two additional questions, and my 10 CRT question candidates. (For further discussion, Professor Shtulman's expansion will be combined with mine under the moniker of the mCRT.) Mean performance on the oCRT was 1.55 questions correct among the Occidental students alone. Mean performance for the mCRT is reported later, as its items must first be filtered.

RESULTS


INTER-CRT CORRELATIONS
Question                        Correlation to CRT    P Val
Living Room/Den                 .346**                .007
Family of Brothers              .443**                .000
The Doctor gives you 3 pills    .572*                 .000
Months with 28 days             .314*                 .016
Passing in a Race               -.185                 .161
Planecrash                      .364**                .005
Divide by ½                     .287*                 .027
Ribbon Cut                      .250                  .056
Shoe Price                      .294*                 .024
Goose Egg                       .354**                .006
Clocks                          .124                  .312
Matches                         .224                  .088

(Correlations are marked for ease of viewing: * = P < .05, ** = P < .01.)

These are the first-pass zero-order correlations between the aggregated oCRT (labeled "CRT" in the table above) and the twelve candidate questions (Professor Shtulman's two additions plus my ten). The majority of the questions do indeed correlate with the CRT. Only a few do not correlate at all: Race, Ribboncut, Clocks, and Matches. Specifically, Clocks and Race have very high P values, at .312 and .161 respectively, while Ribboncut and Matches are somewhat closer to significance, with P = .056 and .088 respectively, possibly hinting that they would have correlated had there been more participants. Thus, Clocks and Race are removed from the analysis, while Ribboncut and Matches stay for now. This reduces the list of new questions from 12 to 10.
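The zero-order correlations in the table above can be produced item by item with an ordinary Pearson correlation; the sketch below shows the procedure, using hypothetical column names and a placeholder file rather than the actual dataset:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical layout: one row per participant, one 0/1 column per candidate question,
# plus an 'oCRT' column holding the aggregate score (0-3) on the original three items.
df = pd.read_csv("responses.csv")  # placeholder file name

candidates = ["living_room", "brothers", "pills", "months", "race", "planecrash",
              "divide_by_half", "ribbon", "shoes", "goose_egg", "clocks", "matches"]

for item in candidates:
    r, p = pearsonr(df[item], df["oCRT"])
    print(f"{item:15s} r = {r:+.3f}  p = {p:.3f}")
```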

CRT/SAT CORRELATIONS
The next pass of analysis was concerned with the correlations between the oCRT, the mCRT, and the SAT. The SAT is useful as a comparative statistic for the CRT because of the nature of SAT questions, specifically those in the reading and writing portions of the test. SAT reading comprehension questions generally give an individual a passage of text to read and then ask questions designed to measure comprehension of the passage. SAT writing questions provide a prompt and ask test-takers to write a long-form essay critically responding to it. These sorts of questions would, out of necessity, require reflection on knowledge:


either knowledge recently received (as in the reading comprehension portion) or knowledge the test-taker already has, which must be analyzed (as in the writing portion). For example, one practice SAT reading comprehension set (http://www.majortests.com/sat/reading-comprehension-test01) asks, "What is the author implying in the above text?" Another item asks which definition of a word best fits the context of the passage. Both of these questions require the activation of Stanovich and West's System 2, and the question on implication specifically requires internal reflection to decide among multiple possible readings, similar to the decisions that must be made when answering CRT questions.

The first pass of this second analysis, for SAT correlations, revealed a few important pieces of information. First, neither the oCRT nor the mCRT correlates significantly (or even near significance) with the SAT Math subscore. For the other subscores, the oCRT correlates with the SAT Reading Comprehension subtest at .441*, P = .013, and with the SAT Writing subtest at .503**, P = .002. The mCRT correlates with SAT Reading Comprehension at .330, P = .070, and with the SAT Writing subtest at .517**, P = .002. So both the oCRT and the mCRT correlate with the SAT Writing subtest at a high and very significant level, whereas the mCRT does not correlate with (but approaches correlation with) the SAT Reading Comprehension subtest. It would appear that a few questions within the mCRT do not correlate with the SAT in the same way that the questions in the oCRT do.

To determine which questions were the culprits, the aggregate scores were split and each question was compared to the SAT subscores individually. To provide a baseline for this analysis, the oCRT was first split and compared individually to the SAT, to see how the CRT questions (which are usually used only in the aggregate) compare individually to the SAT, giving a base level of comparison for the additional questions. As expected from the aggregate data, none of the oCRT questions correlate with the SAT Math score. Surprisingly, however, two of the three original questions do not correlate significantly with any SAT subscore, although both are somewhat close (P ≤ .14 for SAT Reading, P ≤ .097 for SAT Writing) and might reach significance with more data. This is important, because it shows that the oCRT is not a monolithic bloc with regard to correlations with the SAT.


This analysis was repeated with the mCRT. Four questions were found to be very far from correlating with any of the SAT subscores, with P values at approximately .8 across the board. These four questions were: Months, Ribboncut, Shoe Price, and Matches. They were removed from further analysis, lowering the size of the mCRT to 6 questions from the previous 10. Upon removal of these four questions, and after rechecking the correlations between the oCRT, mCRT, and SAT, the correlation between the mCRT and SAT Reading Comprehension now reaches P < .05, within the same range of .05 > P > .01 as the oCRT.

CRT/HEURISTICS-AND-BIASES CORRELATIONS
Frederick and Toplak both found correlations between the CRT and various measures of economic and cognitive heuristics. Attempting to replicate their results, I utilized ten such questions, which are detailed above in the introduction. At first, the data were quite muddied, and no correlations could be found. To see what was wrong, I examined the correlations between each heuristics-and-biases question and the oCRT/mCRT. What I found was that questions having to do with money, specifically the temporal discounting questions, did not perform well at Occidental. Neither the oCRT nor the mCRT correlated significantly with any of them, at odds with Frederick's study. This lack of correlation was found within the Occidental College participant pool in particular; the outside-Occidental pool was too small to yield significant data on its own. Questions such as "How much would you pay to have a book shipped to you overnight?" produced particularly strange results, with answers ranging from 0 to 120 dollars within Occidental. This might be due to the nature of the college experience and to changes in technology between the time of Frederick's study in 2005 and now. Today, a large number of people own Kindles, iPhones, iPads, and notebooks, and internet access is generally much quicker than it was 8+ years ago when Frederick was performing his study. It could be that easy access to online materials has lowered the number of books that college students are required to order, such that when a student does need to order a physical book, it is because they need it immediately for a class. That might be the only time a student orders a physical book from the internet, necessitating expedited shipping and a higher payment, and skewing the economically "correct" answer.


Another question that did not correlate, though it should have according to Frederick, asked participants whether they would prefer to be given $3400 immediately or $3800 in one month. Frederick's justification for preferring the latter is that waiting corresponds to an annual discount rate of about 280%, an increase equivalent to 280% yearly compressed into one month, higher than can be found in any official investment. However, when talking to participants who took my test, some did not see it that way: some claimed that they could easily turn $3400 into $3800 in less than a month, with some to spare, while others claimed that they needed the money now to pay pressing bills and could not wait a month. Whether this is peculiar to Occidental College, I cannot say. It should be noted that the gambler's fallacy question did make it through this pass, so not all measures of economic cognition were ineffective. Further, neither the oCRT nor the mCRT correlated with the removed items, indicating a problem with the participant pool rather than with either test. These questions were removed, creating an adjusted aggregate bias measure.

To mirror Frederick's analysis on these measures, the oCRT and mCRT scores were each split into two groups: one low, one high. Individuals were assigned to the low group if they answered fewer than N/2 questions correctly (where N is the number of questions), and to the high group if they answered more than N/2 correctly. Low-CRT individuals, on both the mCRT and the oCRT, scored lower at a significant level on the adjusted aggregate biases tasks. Conversely, high-CRT individuals on both tests scored higher at significant levels. Not only that, but the oCRT and mCRT behaved almost identically: oCRT_high correlated with the adjusted bias aggregate at .304, P = .019, and mCRT_high at .364, P = .005; oCRT_low correlated at -.512, P = .000, and mCRT_low at -.378, P = .003.
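A minimal sketch of the high/low split described above, again with hypothetical column names and a placeholder data file, might look like the following:

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("responses.csv")   # placeholder; one row per participant
n_items = 3                          # e.g. 3 for the oCRT, 9 for the expanded test

# Dummy-code group membership: below N/2 correct -> low group, above N/2 -> high group.
df["crt_low"] = (df["oCRT"] < n_items / 2).astype(int)
df["crt_high"] = (df["oCRT"] > n_items / 2).astype(int)

for group in ("crt_low", "crt_high"):
    r, p = pearsonr(df[group], df["bias_aggregate"])  # adjusted heuristics-and-biases score
    print(f"{group}: r = {r:+.3f}, p = {p:.3f}")
```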

CRT/DEMOGRAPHIC CORRELATIONS
Despite Frederick's finding of an influence of gender on test performance, with females tending to perform worse than males, no correlation with gender was found on either the oCRT or the mCRT. However, a correlation was found between both CRTs and age, a measurement not used by Frederick, Toplak, or any other CRT-based experiment that I have read. The oCRT correlates with age at .294, P = .024. The mCRT correlates at


.253, P = .053. This is above .05; however, given how close it is, the shortfall is likely a matter of test power, and I am confident that having more individuals in the pool would have allowed it to reach significance.

FINAL OCRT/MCRT CORRELATION


The previous two passes of analysis did not require the removal of any more questions from the mCRT, leaving the expanded test at 6 mCRT questions plus the 3 oCRT questions, for a total of 9 CRT questions. The mCRT and oCRT aggregates correlate with each other at .653, P < .001.

DISCUSSION
To save space, the entire list of final questions is not reprinted here; they are the airplane question, the doctor/pill question, the sister/family question, the goose egg question, the divide-30-by-½ question, and the living-room/den question, along with the original three CRT questions. All of these require at least some modicum of mathematical thought except for the airplane question, which, quite frankly, is a rather surprising candidate to have made it through the gauntlet of correlations. It is possible that its inclusion is an artifact of overfitting the questions to the data, or of Occidental's peculiarities. It is also possible that it taps into the same underlying cognitive reflection abilities as the other questions, and so deserves its spot on the list.

There are some possible problems with the data that was collected. Primarily, correlations that were expected to exist were on occasion not found, specifically between the CRT scores (oCRT and mCRT) and the SAT Math score, gender, and the economically minded heuristics-and-biases questions. However, it is important to note that neither the oCRT nor the mCRT correlated with these measures; it would have been much worse if one had correlated while the other did not, which would mean they acted very differently on one of these measures. Since both fail to correlate, and at similar levels, this may even be considered a measure of divergent validity. That being said, the expanded mCRT correlates almost identically with the original oCRT on many disparate measures, including SAT Reading and Writing scores, wide-ranging cognitive heuristics problems, and age.


Further, the individual questions comprising the expanded CRT correlate with the aggregate CRT, many of them on more than one measure, and all significantly. It seems that the mCRT is in fact a viable candidate for expansion of the CRT. More rigorous testing would of course be required, and a much larger pool of participants would be needed, but these questions preliminarily appear to travel with the CRT in a way that hints that both are tapping into the same underlying cognitive process or proclivity; specifically, that both are measuring, to varying degrees, cognitive reflection. This is desirable not only for the main reason of protecting the CRT against invalidation through proliferation, but also for creating a larger pool of questions that researchers can draw upon, each measuring slightly different aspects of cognitive reflection, which may ease further lines of study with the CRT. One possibility would be to use the CRT, and any expansions to it, to measure the origin and malleability of reflective cognition. For example, are these scores generally constant throughout life? A single three-question test could not answer that, as it would require repeating the same questions at each measurement. But a larger nine-question test, such as the expanded CRT presented here, would allow for three measurements of three questions each, giving a broader perspective on how this skill acts through time. Another possibility is to see whether CRT score can be changed through training, another repeated-measures design that would require more than the original three-question test.

WORKS CITED

Baron, J., & Hershey, J. (1988). Outcome bias in decision evaluation. Journal of Personality and Social Psychology, 54, 569-579.

Frederick, S. (2005). Cognitive Reflection and Decision Making. Journal of Economic Perspectives, 19(4), 25-42.

Heinrich, J., Heine, S., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2/3), 1-75.

Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment. In Heuristics of Intuitive Judgment: Extensions and Applications. New York: Cambridge University Press.

Kahneman, D., & Frederick, S. (2005). A model of heuristic judgment. In The Cambridge Handbook of Thinking and Reasoning, 267-293.

Kahneman, D., & Tversky, A. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.

Kahneman, D., & Tversky, A. (1982). On the study of statistical intuitions. Cognition, 11, 123-141.

Kahneman, D., & Tversky, A. (1983). Extension vs. intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90(4), 293-315.

Lehman, D., Lempert, R., & Nisbett, R. (1988). The effect of graduate training on reasoning. American Psychologist, 43, 431-442.

Stanovich, K. E. (2009b). What intelligence tests miss: The psychology of rational thought. New Haven: Yale University Press.

Toplak, M., West, R., & Stanovich, K. (2011). The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory and Cognition, 39, 1275-1289.
