By Dylan Wiliam, King's College, London
The curse of quantification and Macnamara's fallacy

For over one hundred years, policy-makers have been searching for objective indices of the quality of education. In the second decade of the twentieth century in the USA, the 'School Survey' movement sought to gather 'objective evidence' about factors influencing the educational progress of school students, but within about twenty years, educational policy-makers looked to psychology to provide a way of measuring the outputs more 'scientifically'. This desire for quantification soon dominated most aspects of public-service provision. Perhaps the best-known example of politicians' desire for simple answers to complex questions is John F Kennedy's furious reaction to the ambiguous evaluation of the impact of additional money provided for the education of socioeconomically disadvantaged students: "Do you mean that you spent a billion dollars and you don't know whether they can read or not?"

The trouble with such 'objective' approaches is that while many things can be measured, there are also many important things that cannot, and the danger is that things that can be measured easily come to be regarded as more important than those that cannot. This process is well summed up by Charles Handy's rendering of what has come to be known as the Macnamara Fallacy, named after the US Secretary of Defense, who argued that the ratio of Viet Cong/North Vietnamese Army losses to US/Army of the Republic of Vietnam losses was an important measure of military effectiveness: "Things you can count, you ought to count. Loss of life is one."

The Macnamara Fallacy: The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't easily be measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide. (Handy, 1994 p219)

We start out with the aim of making the important measurable, and end up making only the measurable important. By steadily increasing the pressure on teachers and students to achieve good results on particular tests, we can secure improvements in scores on those tests, but these improvements are secured at the expense of everything else. The tests, originally meant simply as a sample of the curriculum, come to be the whole curriculum.

This matters because we are hardly ever interested in the specific things a student has to do to pass an examination or test; after all, a test tests only what a test tests. We are generally interested in examination and test performance because these results can stand as proxies for wider achievement and potential, and in the past they did so. However, by increasing pressure to do well on the test to a ridiculous degree, we have reached a point where we cannot generalise beyond the immediate test scores. When test scores at key stage 2 improve, we cannot conclude that education in key stage 2 has improved. We cannot even conclude that performance in English, mathematics and science has improved. All we can conclude is that the narrow range of skills tested in the key stage 2 tests has improved. This provides an example of what has become known as Goodhart's law.

Goodhart's law

Goodhart's law was named after Charles Goodhart, a former chief economist at the Bank of England, and it states, quite simply, that performance indicators lose their usefulness when used as objects of policy. The example Goodhart used to illustrate this was that of the relationship between inflation and money supply. Economists had noticed that increases in the rate of inflation seemed to coincide with increases in money supply, although neither had any discernible relationship with the growth of the economy.
Since no-one knew how to control inflation, controlling money supply seemed to offer a useful policy tool for doing so, without any adverse effect on growth. The resulting monetarist policies produced the biggest slump in the economy since the 1930s. As Peter Kellner comments, "The very act of making money supply the main policy target changed the relationship between money supply and the rest of the economy" (Kellner, 1997).

If you make a particular performance indicator a policy target, and make the stakes high enough, then the people at the sharp end will do everything they can to improve their score on the performance indicator. However, because the areas in which we use performance indicators are so complex, there is always a way of improving the performance indicator without having any impact on the overall quality of whatever the performance indicator is meant to be measuring (sometimes the quality actually gets worse, even though the performance indicator is rising). So, when schools were measured by the proportion of students achieving 5 good grades at GCSE, this proportion improved, although in some cases the average grades achieved by students went down. In response to this, the average grades are now also reported, but schools are able to manipulate this index too, by channelling students towards easier subjects, or by entering students for 'vocational GCSEs', which are deemed to be equivalent to four GCSEs. The reported scores rise, but the actual level of performance may be unchanged, or even declining.

This is the essence of Goodhart's law: in all these cases, a variety of indicators is selected for their ability to represent the quality of the service, but when used as the sole index of quality, the manipulability of these indicators destroys the relationship between the indicator and the indicated.
There is no end to this process, because the people on the ground will always know more about where the loopholes are than those devising the performance indicators. Put bluntly, the clearer you are about what you want, the more likely you are to get it, but the less likely it is to mean anything.
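The GCSE example can be made concrete with a small arithmetic sketch. Everything in it is hypothetical: the three students, the 0 to 8 points scale, and the cut-off of 5 points for a 'good' grade are invented for illustration and are not real results. It simply shows that a threshold indicator can rise while the average grade falls, for instance if a school concentrates its effort on the one student who is close to the borderline.

```python
# Hypothetical illustration only: invented students, invented points scale.
# A "good" grade is assumed to be 5 points or more on a 0-8 scale.

GOOD_GRADE = 5      # assumed cut-off for a "good" grade
GRADES_NEEDED = 5   # the indicator counts students with 5 good grades


def share_with_five_good(cohort):
    """Proportion of students with at least GRADES_NEEDED good grades."""
    hits = sum(
        1
        for grades in cohort.values()
        if sum(g >= GOOD_GRADE for g in grades) >= GRADES_NEEDED
    )
    return hits / len(cohort)


def mean_grade(cohort):
    """Average points per subject entry across the whole cohort."""
    all_grades = [g for grades in cohort.values() for g in grades]
    return sum(all_grades) / len(all_grades)


# Before: one strong student, one borderline student, one weak student.
before = {
    "strong": [8, 8, 8, 8, 8],
    "borderline": [5, 5, 5, 4, 4],
    "weak": [2, 2, 2, 2, 2],
}

# After: effort is shifted to the borderline student; the others slip a little.
after = {
    "strong": [7, 7, 7, 7, 7],
    "borderline": [5, 5, 5, 5, 5],
    "weak": [1, 1, 1, 1, 1],
}

for label, cohort in (("before", before), ("after", after)):
    print(
        f"{label}: share with 5 good grades = {share_with_five_good(cohort):.2f}, "
        f"mean grade = {mean_grade(cohort):.2f}"
    )
```

In this invented cohort the headline indicator doubles while the mean grade drops, which is the pattern described above: the indicator has moved, but not because the thing it was meant to indicate has improved.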
What can be done?

Our system of tests and examinations distorts our school curricula and produces results that are of limited reliability, and of doubtful validity. In proposing alternatives, the question is not where to find them, but how radical we are prepared to be. Why, for example, do students get tested as individuals, when the world of work requires people who can work well in a team? Why do we test memory, when in the real world engineers and scientists never rely on memory? If they're stuck, they look things up. Why do we use timed tests when it's usually far more important to get things done right than to get things done quickly? There are, of course, those who claim that timed written tests give good indications of the ability to work under pressure, in which case they should produce evidence of this; I haven't seen any. But I have seen plenty of evidence of the damage that timed written tests do, and how poor they are at measuring the important outcomes of learning.

As a modest start, however, accepting the need for formalised assessments of students' achievement at the ages of 7, 11, 14, 16 and 18, I propose that all national curriculum tests (and, if the politicians have the stomach for it, GCSEs and A-levels, which is what happens in Sweden, for example) are replaced with moderated teacher assessment. By extending the assessment over the whole key stage, we would produce unprecedented levels of reliability and validity, and the rigorous procedures of moderation would not only ensure against grade drift, but would also provide a valuable focus for in-service training for teachers. This would also be likely to tackle boys' underachievement, because the current "all or nothing" test at the end of a key stage encourages boys to believe that they can make up lost ground at the last minute.

The crucial point, however, in order to prevent teaching to the test, is to disentangle the evaluation of the school from the scores that a student gets.
Instead of publishing the results of the moderated teacher assessments, schools would be held accountable by the results of special tasks taken by the students at the end of the key stage. Crucially, there would be a large number of these tasks, and not all students would take the same task. These tasks would cover the entire syllabus, and would be allocated randomly so that there would be no way of teaching to the test. Or more precisely, the only way to teach to the test would be to teach the whole curriculum to every student. Schools that taught only half the curriculum, or concentrated their resources on only the most able students, would be shown up as providing a limited education. Furthermore, the results of these tests could provide an additional check on the robustness of the moderation procedures, and would provide accurate information to policy-makers about the real state of education in our schools.
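The allocation of tasks described above could be sketched roughly as follows. This is only one possible reading of "allocated randomly", not a specification of the proposal: the function name allocate_tasks, the student and task labels, and the choice to deal shuffled tasks out in rotation (so that every task is taken by about the same number of students) are all assumptions made for illustration.

```python
import random
from collections import Counter


def allocate_tasks(students, tasks, seed=None):
    """Give each student one task, chosen unpredictably but evenly spread.

    Assumption: tasks and students are both shuffled, and tasks are then
    dealt out in rotation, so no one can predict who gets which task but
    every task is still covered by roughly the same number of students.
    """
    rng = random.Random(seed)
    task_order = list(tasks)
    rng.shuffle(task_order)
    pupils = list(students)
    rng.shuffle(pupils)
    return {
        pupil: task_order[i % len(task_order)]
        for i, pupil in enumerate(pupils)
    }


# Hypothetical school: 30 students, 10 tasks that together span the syllabus.
students = [f"student_{n:02d}" for n in range(30)]
tasks = [f"task_{n:02d}" for n in range(10)]

allocation = allocate_tasks(students, tasks, seed=2024)
coverage = Counter(allocation.values())

print(f"{len(allocation)} students allocated across {len(coverage)} tasks")
print("students per task:", sorted(coverage.values()))
```

Because no student or teacher knows in advance which task a given student will receive, preparing for any single task gains nothing; the school-level picture comes from pooling results across all the tasks, which between them cover the whole curriculum.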
Summary

Our current educational assessments are not just ineffective; they are preventing us from providing high quality education for school students, and preventing schools from producing young people with the flexible skills that will be needed in the 21st century. This is because our assessments started from the idea that the primary purpose of educational assessment is selecting and certifying the achievement of individuals (i.e. summative assessment), and have tried to make assessments originally designed for this purpose also provide information with which educational institutions can be made accountable (evaluative assessment). Educational assessment has thus become divorced from learning, and the huge contribution that assessment can make to learning (i.e. formative assessment) has been largely lost. Furthermore, as a result of this separation, formal assessment has focused just on the outcomes of learning, and because of the limited amount of time that can be justified for assessments that do not contribute to learning, has assessed only a narrow part of those outcomes. The predictability of these assessments allows teachers and learners to focus on only what is assessed, and the high stakes attached to the results create an incentive to do so. This creates a vicious spiral in which only those aspects of learning that are easily measured are regarded as important, and even these narrow outcomes are not achieved as easily as they could be, or by as many learners, were assessment regarded as an integral part of teaching. In place of this vicious spiral, I propose developing a system of summative assessment based on moderated teacher assessment. A separate system, relying on 'light sampling' of the performance of schools, would provide stable and robust information for the purposes of accountability and policy formation.

Psychometric Theory is that discipline which addresses the measurement and quantification of psychological phenomena (latent traits).
Strictly speaking, psychological phenomena are not directly observable. Typically, they must be inferred from observations taken on some behavior that may be observed and is assumed to operationally define the unobservable characteristic that is of interest. An operational definition is most useful when it delineates boundaries of behavior and differential points between those boundaries. Ideally, a "scale" comprised of independent items is developed to measure a hypothesized unidimensional trait. Data are gathered and various statistical models are then employed to determine the extent to which the scale, or measurement instrument, functioned as intended.
Theme Quotes a la Prof. L. Ludlow

"Psychometry, it is hardly necessary to say, means the art of imposing measurement and number upon operations of the mind..." F. Galton

"...that until the phenomena of any branch of knowledge have been subjected to measurement and number, it cannot assume the status and dignity of a science." F. Galton

"The Reader may here observe the Force of Numbers, which can be successfully applied even to those things, which one would imagine are subject to no Rules. There are very few things which we know, which are not capable of being reduc'd to a Mathematical Reasoning; and when they cannot, it's a sign our Knowledge of them is very small and confus'd; and where a mathematical reasoning can be had, it's a great folly to make use of any other, as to grope for a thing in the dark, when you have a Candle standing by you." John Arbuthnot

"I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be." Sir William Thomson, Lord Kelvin

"The grand, and indeed only, character of truth is its capability of enduring the test of universal experience, and coming unchanged out of every possible form of fair discussion." Sir John Herschel

"Whatever exists, exists in some amount." E. L. Thorndike

"If it exists, it can be measured; if it can't be measured, it doesn't exist." Prof. L.H. Ludlow's Challenge