Guide to Writing Objective Tests

Version 1.2
April 2007

INTRODUCTION TO SELECTED RESPONSE QUESTIONS
This guide was written as part of the E-Only project, which sought to develop SQA’s
first (fully) online qualification. A package of support materials was developed as part
of that project – and this document is part of that package.
This document has gone through a number of revisions since the first draft version
was written in July 2006. Special thanks to everyone who took the time to contribute
through SQA Academy (http://www.sqaacademy.com). The document is revised
frequently. An online forum is available to discuss its contents at the
following URL, where the latest version of the Guide can be found:
http://groups.google.com/group/objectivetesting
Although most SQA units employ “conventional” assessment, some subject areas
(mostly Science-related) have a tradition of using objective tests. For example,
Biology uses multiple choice questions at Intermediate, Higher and Advanced Higher
levels; and HNC Computing uses an objective test as part of the Graded Unit.
More recently, there has been greater emphasis on objective testing due to its
suitability for computer-based assessment; as a result, an increasing number of unit
specifications (at both National and Higher National levels) involve an element of
objective testing.
This guide will be of assistance to any SQA Officer (or appointee) involved in creating
objective tests.
While the focus of this guide is objective questions, it does not seek to promote one
type of assessment over another. Traditional forms of assessment remain as valid
today as they have ever been – but, where appropriate, objective approaches have a
role to play too. Neither does it seek to explain what you already know. Most SQA
staff have a good knowledge of objective testing - this guide simply seeks to provide
a single source of advice for busy Officers and appointees.
There are no rules for writing objective tests; there’s only advice. Do whatever you
think is right for your particular test. Assessment is an art – not a science. There is
no substitute for human judgement.
Although this document does not seek to promote one type of assessment over
another, it does aim to dispel commonly-held, but inaccurate, views about objective
testing. Some of the most common misconceptions are rehearsed below.
1. Objective tests dumb-down education; objective tests are easy. Objective tests
are as “dumb” or “smart” as you choose to make them. Many high stakes
assessments (such as university medical examinations in the UK and the SAT in
the United States) use objective questions.
2. Objective tests can only be used to assess basic knowledge. While this is largely
true in practice, there is nothing inherent in the design of objective tests to make
them unsuitable for assessing high level skills.
3. Writing an objective test is easy. While most teachers can create simple
objective tests, the construction of high quality objective questions is highly
skilled and requires significant knowledge and experience.
4. Objective tests aren’t appropriate for my subject. While objective tests have
traditionally been used in the physical and social sciences (such as Physics and
Psychology), they can be used in any subject.
QUESTION TYPES
SQA has traditionally employed a variety of question types within Unit and Course
assessment. These question types can be categorised under two headings:
constructed response questions (CRQs) and selected response questions (SRQs).
Note: Some of the terminology in this guide might not be familiar to you. It has
been used because it is widely employed in international testing literature and it
was considered best to use “industry standard” nomenclature rather than
“Scottish” terminology.
An extended response question (ERQ) is one whose answer requires the candidate
to write longer responses, normally consisting of two or more paragraphs. Examples
of ERQs include reports, essays and dissertations. There is no hard-and-fast rule
about where a restricted response question ends and an extended response
question begins.
This guide focuses on SRQs, which are becoming increasingly popular for a variety of
reasons.
ADVANTAGES OF SRQS
1. SRQs take less time to answer – reducing the amount of time that candidates
spend on assessment and increasing learning time.
2. SRQs are quick to mark – reducing the time teachers spend on assessment and
increasing teaching time.
3. SRQs are well suited to formative assessment – since candidates’ responses
can be analysed and used to provide detailed feedback.
4. SRQs are good for assessing breadth of knowledge - they are ideal for assessing
a broad range of topics in a short time.
5. SRQs are more reliable than CRQs - because they get around some of the
marking problems associated with written answers.
The low writing load of SRQs means that the focus is on the candidate’s knowledge
rather than the candidate’s writing or language skills – which is a common problem
with constructed response questions. Also, the speed of answering SRQs addresses
another common criticism of assessment – that it takes up too much time for both
students and teachers.
Research into the marking of CRQs and SRQs has shown significant differences in
the reliability of the two approaches – with objective tests proving to be significantly
more reliable than written tests. This has been the major reason for the widespread
adoption of objective tests in the United States, where testing organisations operate
in a more litigious environment.
DISADVANTAGES OF SRQS
1. SRQs are not suitable for assessing certain abilities, such as communication
skills or creativity. They are also not appropriate when candidates are required
to construct an argument or provide an original response.
2. SRQs may be less valid than CRQs and suffer from low professional credibility.
3. SRQs that assess higher order skills are difficult (and time consuming) to
produce.
4. SRQs can be wordy and require high order reading skills.
The first and second disadvantages are linked. There is nothing inherent in the
design of SRQs to make them less valid than CRQs – but because they have often
been used inappropriately (to measure skills that cannot be properly measured by
this style of question) they have established a reputation for being invalid among
some practitioners.
Most teachers are comfortable with using SRQs to assess low order skills (such as
factual recall, typified by Example 2). They are less comfortable with their use in
assessing deeper knowledge and understanding. Most currently available examples
of SRQs re-affirm this view by focussing on the assessment of surface knowledge;
even examples of SRQs that are meant to assess deeper knowledge often only
assess surface knowledge – albeit less well known surface knowledge!
Traditionally, the costs of carrying out assessment come at the end of the process –
the setting of the question paper is relatively speedy; the time consuming part
comes when the papers have to be marked. Objective tests reverse this model – the
time consuming part is the production of the questions, with marking taking very
little time.
Another criticism of SRQs is that they can atomise teaching and learning,
encouraging “teaching to the test” and surface learning. This, combined with their
efficiency in assessing large numbers of students in short periods of time, has
resulted in them acquiring a reputation as “weapons of mass instruction”, with poor
standing among many educationalists.
Historically, objective testing has been widely used for psychometric testing (testing
of intellect and attitudes) and, more recently, it has been widely applied to job
competence testing. It is also used in entry examinations for some professional
bodies (such as ACCA).
Objective tests are widely used internationally – including high stakes assessments
such as the SAT in the United States, which is used for university entry. They are also
widely used within vendor examinations (such as Microsoft’s global certification
programme). Awarding bodies in every country are focusing on computer-assisted
assessment, which has resulted in a renewed interest in objective testing. These
organisations share the view that the increasing popularity of e-learning will drive
demand for e-assessment – which will be underpinned by item banks consisting of
large numbers of selected response questions.
There are several types of selected response questions (SRQs). Although they share
some common characteristics, they each have unique features and applications. But
they all share a fundamental characteristic – they have one unambiguously correct
answer.
1. true/false questions
2. matching questions
3. multiple choice questions
4. multiple response questions
5. ranking/sequencing questions
6. assertion/reason questions
TRUE/FALSE QUESTIONS
Note: Any question that has one of two possible answers is considered a
true/false question (for example, the responses might be “yes” or “no” rather
than “true” or “false”). These questions are also known as “alternative response”
items.
MATCHING QUESTIONS
This type of question requires candidates to match an object with one or more
associated characteristics.
Match the list of storage technologies on the left with the list of memory characteristics on
the right. Match each technology (A, B, C or D) with one characteristic (1, 2, 3 or 4) only.
The objects on the left are called “stimulators” and the matching statements on the
right are called “responses”. No more than seven stimulators should be included in
any one question.
MULTIPLE CHOICE QUESTIONS

In psychiatry, holding two contradictory views about the same thing is called:
A cognitive dissonance
B dementia
C dissociative disorder
D factitious disorder
MCQs are the most common type of selected response question – and the one that
this guide focuses on in later sections.
MULTIPLE RESPONSE QUESTIONS

There are some misconceptions about MRQs. They are not necessarily more difficult
than MCQs; they are as hard or as easy as you choose to make them. There is no
need to indicate the number of correct options; this only encourages guessing. And
there is nothing wrong with making every option correct; in fact, prohibiting this
possibility reduces the reliability of MRQs.
RANKING QUESTIONS
A ranking question involves ordering the options in some defined sequence. The
sequence can be an ordered list of numbers, chronological sequence or series of
events.
Rank the following countries in order of their population densities (lowest density first).
I France
II Germany
III Spain
IV United Kingdom
ASSERTION/REASON QUESTIONS
The following assertion and reason relate to World War II. Read the assertion and
associated reason and then choose a corresponding letter (A-E) to indicate whether the
assertion and/or reason is/are true.
Assertion Japan’s lack of raw materials was a cause of World War II in Asia.
Reason Japan lacked natural raw material except for small deposits of coal and
iron.
A Assertion is true and reason is true and the reason is a correct explanation of the
assertion.
B Assertion is true and reason is true but the reason is not a correct explanation of
the assertion.
C Assertion is true but reason is false.
D Assertion is false but reason is true.
E Assertion is false and reason is false.
LIKERT QUESTIONS

This type of SRQ was named after Rensis Likert, who invented the scale in 1932. It is
widely used within questionnaires to gauge respondents’ attitudes. The classic Likert
scale consists of five possible responses:
1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree
Some psychometricians add or remove options (the neutral option – “neither agree
nor disagree” – is often removed).
A Strongly disagree.
B Disagree.
C Neither agree nor disagree.
D Agree.
E Strongly agree.
This type of SRQ is almost exclusively used for attitudinal assessments and is rarely
employed within formal SQA assessments. It is not discussed further in this guide.
BEST ANSWER QUESTIONS

A “best answer” question is one whose answer is the closest (“best”) answer
selected from a list of possible answers of which more than one may be true. Used
carefully, best answer questions can be almost as objective as standard SRQs.
A user wishes to use a search engine to look for information relating to Celtic music that
originated in Scotland. Which one of the following queries is likely to produce the best
results?
Note that more than one of the responses is correct (in fact, they are all more-or-less
correct). But only one option is the best answer (B).
The use of best answer questions is particularly appropriate to the social sciences
and arts subjects, which tend not to have a definitive body of knowledge like the
physical sciences. Best answer questions can also be used to assess some higher
order skills since they frequently require an element of judgment.
EXCEPTION QUESTIONS
An exception question is one where all of the options are correct except one of the
possible responses. This type of question effectively reverses the logic of the
standard SRQ.
Smoking is a cause of all of the following conditions EXCEPT:

A diabetes.
B heart disease.
C lung cancer.
D Parkinson’s disease.
A question that includes “not” in the stem is effectively an exception question. For
example, the above question could be re-phrased: “Which one of the following
conditions is NOT caused by smoking?”.
Exception (and negative) questions are not ideal – but should not be completely
avoided since their use can simplify questions and/or increase the number of
questions that can be asked.
VARIANTS AND CLONES
Variants and clones have significant implications for e-assessment since they
provide a quick and simple way of rapidly populating an item bank.
Each type of SRQ has its strengths and weaknesses, and each has its best uses. The
next section looks at choosing SRQs for different purposes.
TAXONOMIES OF LEARNING
One of the key determinants in the selection of SRQs is the kind of knowledge or
understanding that you are seeking to assess. For example, factual recall can be
adequately assessed using true/false questions; deeper understanding may require
more complex question types such as multiple response questions.
BLOOM’S TAXONOMY
Benjamin Bloom’s 1956 book, Taxonomy of Educational Objectives, described a
classification system that could be used to categorise cognitive abilities. The
taxonomy (which became known as Bloom’s Taxonomy) is widely used
within the educational community.
Note: Bloom’s Taxonomy is not the only way to classify academic abilities. There
are many alternative methods – some linked to Bloom’s (but more up-to-date)
and some entirely different from Bloom’s. But Bloom’s Taxonomy remains the
most widely used classification system.
The taxonomy consists of six levels:

1. Knowledge
2. Comprehension
3. Application
4. Analysis
5. Synthesis
6. Evaluation.
Knowledge
Knowledge involves the recall of specific facts and figures, or the recall of specific
methods and processes. Knowledge is the bottom of Bloom’s Taxonomy but
underpins the higher order abilities. There are three types of knowledge: knowledge
of specifics, knowledge of methods, and knowledge of universals. At the higher
levels (knowledge of methods and universals) it can be intellectually demanding.
This category includes: knowledge of terminology, knowledge of specific facts,
knowledge of conventions, knowledge of trends and sequences, knowledge of
classifications, knowledge of criteria, knowledge of methodology, knowledge of
principles and generalisations, and knowledge of theories and structures.
Comprehension
Comprehension differs from knowledge in that it relates to the mental processes of
organising and re-organising information for a particular purpose. It includes:
translation, interpretation and extrapolation. Translation relates to the ability to
translate (or decode) a communication from one format (or language) to another.
Interpretation involves the explanation or summarisation of a communication.
Whereas translation involves a mechanistic, part-for-part rendering of a
communication, interpretation involves a more holistic, re-ordering or re-
arrangement of the information. Extrapolation involves extending trends or
sequences beyond the given data to infer consequences or corollaries.
Application
This involves the use of knowledge and comprehension in specific situations. For
example, the use of knowledge of computing terminology and procedures combined
with an understanding of the principles of computer hardware and software can be
applied to the assembly of a computer system.
Analysis
Analysis involves the breakdown of a communication into its constituent parts so
that the relationship between the elements is made clear. Analysis is intended to
clarify or explain communications or processes. This cognitive skill includes the
ability to: (1) analyse elements (identification of the components of the
communication); (2) analyse relationships (the ability to check the consistency or
accuracy of a hypothesis, and skills in comprehending the inter-relationships among
different ideas or concepts); and (3) analyse organisational principles (the ability to
recognise form and pattern in a communication, and the ability to recognise general
techniques used within a subject area).
Synthesis
Synthesis involves combining the parts so as to form a whole. It involves combining
and arranging parts or pieces of a communication to create something new. It may
involve: (1) the production of a unique communication; (2) the production of a plan;
and (3) the derivation of a set of abstract relations to represent physical
phenomena.
Evaluation
Evaluation involves making judgements about the value of particular phenomena for
given purposes. Evaluation is carried out using criteria and involves qualitative and
quantitative judgements based on these criteria. The criteria may be given or
created. This includes measuring the internal consistency of the communication
using criteria such as: quality of writing, accuracy of the information contained within
it, and consistency of argument; and measuring the external consistency of the
communication, which requires the evaluator to have a detailed knowledge of the
type of phenomena under review, since it will be evaluated in terms of the general
criteria applied to phenomena of this type.
Bloom’s Taxonomy is a hierarchy in that each category builds on the one below. For
example, application depends on comprehension which in turn depends on
knowledge. Or, to put it more simply: you can’t apply something until you understand
it; and you can’t understand something until you know about it. Figure 2 illustrates
this hierarchy, with knowledge at the bottom and evaluation at the top.

Figure 2 - Bloom's hierarchy
Table 2 associates some verbs with the levels within Bloom’s Taxonomy.
Level        Verbs
Knowledge    define, describe, label, list, name, recall, show, who, when, where

Table 2 - Verbs associated with Bloom's Taxonomy
So, for example, a question that commences: “Define…” is likely to assess basic
knowledge; a question that begins “Compare…” is likely to assess analytical or
evaluative skills.
Note: SQA does not formally use a recognised taxonomy for assessments.
However, when one is employed by Officers or appointees, it is usually Bloom’s.
Some SQA question papers fall foul of Bloom’s Taxonomy, asking candidates to
“explain” something but actually awarding marks for descriptions (or vice-versa).
Describe the main processes that take place during nuclear fusion.
This question has low demand (relating to factual recall) but high difficulty because it
relates to a complex topic (nuclear fusion). Similarly, crossing the road involves
evaluation skills (Is the road clear? Is it safe to cross? How far away is that car?),
which are at the top of Bloom’s hierarchy – but is not a difficult task for most people.
So, merely climbing Bloom’s Taxonomy is no guarantee of difficulty.
The concept of difficulty and demand has important implications for question setting.
Most SQA tests employ low difficulty/low demand questions; but even the “more
demanding” questions may not be – they might simply assess knowledge in a more
difficult way (by, for example, assessing little known knowledge).
Each question type can be related to one or more levels in Bloom’s Taxonomy. While
it’s possible to use any one of the question types for almost any of Bloom’s levels,
some are better than others for specific levels as the following table describes.
True/False: While mostly used to assess knowledge, T/F questions can, in fact, be
used to assess knowledge, comprehension and application levels.

Matching: Again, mostly used to assess basic knowledge but can be used to assess
knowledge and comprehension.

MCQ: MCQs are the most flexible type of SRQ and can assess all levels; they are
particularly suitable for knowledge, comprehension, application and analysis.

MRQ: MRQs can assess the same range of levels as MCQs – but have the potential
to create more difficult questions within each category.
So, in theory, SRQs can assess all of Bloom’s levels. However, in practice, it is
uncommon to come across SRQs that assess anything other than knowledge and
comprehension. But this is not an inherent limitation in their design. Assessing
higher order skills can be done – but it is a time consuming and skilled task to do so.
As stated previously, each question type has its unique characteristics and uses. The
applications of each type are determined by its strengths and weaknesses.
MCQ/MRQ

Strengths:
• Can assess a wide range of cognitive abilities (up to analysis).
• Scenario-based questions can assess higher order skills.
• Well suited to diagnostic assessment (distractors can target learning difficulties).
• Item analysis provides detailed feedback (to assessors and candidates).
• Simple MCQs are quick and easy to construct.

Weaknesses:
• Good MCQs (at any level) are difficult and time consuming to construct.
• MCQs that assess high level abilities require skilled authors.
• Unsuitable for assessing synthesis and evaluative skills.
In practice, the main barriers to constructing high quality items are the skills and
experience of the authors. A talent for writing traditional question papers does not
necessarily translate to writing SRQs – so experienced setting teams may struggle to
create high quality item banks. Even after training, some writers don’t “get” SRQs –
while others are veritable question factories.
Note: If a unit writer wishes to use objective testing, s/he should not prescribe
the particular type of SRQ in the unit specification itself. It’s better to simply state
that selected response questions may be used – and leave the choice of SRQ to
the assessment writers (although the Support Notes may suggest specific forms
of SRQ).
Multiple choice questions are the most common type of SRQ; they’re also the most
flexible and most difficult to construct. MCQs are used in all types of objective testing
(including high stakes assessment) and are the most common form of SRQ
employed by SQA.
ANATOMY OF AN MCQ
Stem (or stimulus): the question or problem.
Key: the correct (or best) answer.
Distractors: the incorrect alternatives to the key.
There is no formula for constructing high quality items. However, there is some
guidance that aids their construction.
THE ITEM
The key to writing good items (“authoring”) is to ensure that the question directly
relates to the underlying Arrangements, is clearly presented, and is free from
unnecessary detail. A question should not be a test of reading ability; the focus
must be on the knowledge or skill that it is seeking to assess.
Assess one thing at a time (unless you intend to ask an integrative question).

The most difficult part of writing an item is to ensure that there is only one correct
answer. Having more than one potentially correct answer is the most common
complaint from teachers and candidates. It’s a challenge to write items with one
clearly correct answer – at least for non-trivial items. It’s easy to be subjective or
context dependent (i.e. the key is correct in some circumstances but not others).
One solution is to spell out the context – but this may make the item clumsy or
wordy, or give clues to the correct answer. Another option is to use words and phrases like
“best” or “most likely” in the stem (it’s easier to argue that the key is the most likely
answer rather than the only answer).
Although the initial construction of questions has to be the work of an individual, it’s
vital that items are reviewed prior to being used operationally. It’s impossible for a
single author to both write and review items independently.
STYLE GUIDE
Each item should follow an agreed house-style to provide guidance on language use.
A style guide for item writing would normally include advice about:
• spelling
• punctuation
• use of emphasis
• prose style
• language.
For example, spelling advice would include the treatment of numbers (spelled in
words or written as digits?); punctuation advice would include information on the
punctuation to use within options (should they end with a period or without any form
of punctuation?); emphasis rules would include the use of bold and italics; prose
style and language would provide general advice about the type and level of
language to be used.
THE STEM
It’s best to phrase the stem as a self-contained question rather than a partial
statement – although the latter approach is neither uncommon nor invalid.
Try to phrase the stem as a complete question (unless this is too contrived –
when an incomplete statement may be used).
Use clear, straight-forward language – suitable for the target cohort in terms of
level of language.
Avoid subjectivity e.g. “Which one of the following do you think is…” (what the
candidate “thinks” is subjective – and her response cannot be wrong).
Any words that would be repeated in each of the options should be included in the
stem. Options should not begin or end with identical words and phrases.
If the pressure of a certain amount of gas is held constant, what will happen if its volume is
increased?
If the pressure of a certain amount of gas is held constant, what will happen to the
temperature if its volume is increased?
A Decrease.
B Increase.
Avoid words like “could” and “would”. For example, asking a candidate “What would
you do…” cannot be answered incorrectly (since only the candidate can know what
she would do in any given circumstance) – instead write: “What should you do…”.
The following example illustrates a poor question.
A computer runs slowly. What could be responsible?

A Insufficient memory
B Over-heating
D Virus
The author intends D to be the correct answer – but any of the options could be
correct. Here is an improved version.
A computer suddenly runs slowly without any changes to its configuration. What is most
likely to be responsible?
A Insufficient memory
B Over-heating
D Virus
Notice the added contextual information in the stem to improve the clarity of the
question – and the replacement of “could” with “most likely”.
Specify any standards implied. If an item calls for a judgment, specify the authority
or standard upon which the correct answer is based.
According to the American Medical Association, the diet of the average American provides
vitamins in amounts that are what?
The key to good stem construction is to keep the question (or statement) as short as
possible – consistent with providing sufficient information to clearly pose the
question. But don’t be tempted to reduce the length of the stem by moving
information into each of the options; this complicates the question and increases the
candidate’s reading time.
Negative wording is not prohibited but it’s better to word a question positively when
this is possible. Double negatives should be completely avoided i.e. two negatives in
the stem or a negative in the stem and a negative in the options.
THE OPTIONS
Provide between three and five options – four options is most common.
Options should be internally consistent (e.g. all consisting of people’s names, not
three names and a measurement).
The key should not be worded in a way that would make it likely to change over
time.
Don’t use words such as “not”, “never” or “always” to make an option incorrect.
“None of the above” should be used sparingly (and when used should be the
correct answer some of the time).
The advice about pejorative language is quite subtle. Any option that uses words
such as “bad”, “low” and “ignore” is usually a distractor – authors rarely use such
words in the key.
SEQUENCING OPTIONS
Ordering of the options within an item should follow a logical order. If using numbers
or dates then they should be displayed numerically or chronologically in ascending or
descending order (normally ascending). Text answers should normally be sorted
alphabetically unless there is a “natural” sequence to the options, in which case the
natural sequence should be used in preference to alphabetical order. Do not order
the options to try to evenly distribute the answers (i.e. to ensure each option – A, B,
C and D – is used approximately the same number of times) nor attempt to avoid
clustering keys (e.g. A-B-B-B-C) since both of these strategies reduce the
randomness of the test.
The option “None of the above” should be used sparingly. It is preferable to avoid the
use of “None of the above” as well as “All of the above”; studies have shown that
they decrease item discrimination and test score reliability (see Section 6).
“None of the above” may be particularly useful in questions that require candidates
to carry out calculations, since this option effectively mops-up a large range of
potential errors. But, if it’s used, it must sometimes be the key.
Example 19 ~ Good use of “None of the above”

Which one of the following is the solution for x in the equation 5(x-1)=10?

A 0
B 2
C 4
D None of the above
The quality of distractors has a huge impact on the quality of the question.
Distractors have a particularly important role to play in formative assessment since
their careful selection can provide a wealth of diagnostic information about the
candidate’s present understanding. In summative assessment, carefully selected
distractors can catch out unprepared (or under prepared) candidates. Writing
distractors, therefore, requires as much thought as writing the key.
Correct-sounding distractors are good for catching the poorly prepared candidate.
True statements that do not answer the question are good distractors.
There is a balance to be struck between writing good distractors and trying to dupe
candidates. Distractors should not “entrap” candidates – that is, catch out
candidates through clever wording, very fine distinctions or tricks-of-the-trade. If you
want to write a difficult question then do so through the knowledge and skills
required to answer it – not by tricking the candidate into giving the wrong answer.
“Cueing” is the tendency for the stem (or the options) to give away the key. It is a common
problem with SRQs. The following question has only one option (A) which is
grammatically correct (the stem ends with “an” and only option A begins with a
vowel).
Example 20 ~ Cueing

A word that describes a noun is called an:

A adjective
B conjunction
C pronoun
D verb
The wording in the stem should not provide obvious clues to the correct answer.
Avoid grammatical clues to the correct answer by ensuring that all of the options
flow from the stem, are in the same format and tense, and are grammatically
correct.
Don’t allow the wording of the options to provide obvious clues to the correct
answer.
Avoid the use of “always” and “never” in the options since these responses are
rarely correct.
Avoid the use of “sometimes” and “often” in the options since these responses
are often correct.
Avoid using stereotypical language that could give away the answer.
Avoid pejorative wording (“bad”, “low” etc.) since these words are rarely used in
the key.
Avoid absolute language such as “always”, “never”, etc. since these are rarely
correct.
Avoid complex language in one option compared with other options (this option
tends to be the correct answer).
Avoid similar language in the stem and the options since the option with the
most similar language is most likely to be the key.
Avoid visual cueing i.e. one option being much longer or standing out in some
other way from the other options – this one is likely to be the key.
The length of options should be similar. An option that stands out from the others
can indicate to a student that it is the right answer. If different lengths are
unavoidable then use two long options adjacent to each other and two short options
adjacent to each other.
“Shakespeare wrote plays and they reflect both the depth of human emotion and the
complexity of human society.”
Which one of the following phrases improves the wording of the underlined fragment?
The question is clearly worded – although some of the language in the stem is
unnecessarily complex (words such as “fragment” could confuse candidates).
The options look homogenous, with none standing out (no visual cueing). They have
been ordered in a logical sequence (sentence length). They are all plausible to the
under-studied candidate. There is some repeated text in the options that a rewording
of the stem may avoid (but maybe not without making the question less clear). The
distractors have been chosen to reflect common misunderstandings among
candidates with respect to the use of “that”, “which” and “who”. And there is one
unambiguously correct option (B).
ASSESSING HIGHER ORDER SKILLS
Multiple choice questions (MCQs) have gained a reputation for being a quick-and-
dirty way of assessing low level knowledge. However, they can also be used to
assess higher level skills – but this requires a great deal more effort on the part of
the writer. This section explores the potential of MCQs to assess higher level skills.
As has been previously stated, MCQs can be used to assess all of the levels within
Bloom’s Taxonomy – although they are more suited to the lower levels. This section
explores a couple of techniques for writing higher order questions and exemplifies
this against each level in Bloom’s Taxonomy.
Writing MCQs to assess higher order skills frequently contradicts some of the
previous advice about writing good items. For example, such questions often involve
long stems; complex language is frequently used; standards are often omitted (or
the question becomes one of knowledge of the standard); and they often require an
element of judgement on the part of the candidate (and, as a consequence, are less
objective).
Writing higher level questions is easier in some subjects than others. Some fields,
such as mathematics, are problem solving based and in such subjects it is relatively
straight-forward to produce questions that assess more than knowledge and
comprehension (see example 3 for a straight-forward application level question in
Maths). In other subjects it’s not so easy.
However, there are a few techniques that can be used to help authors produce more
demanding MCQs. We will look at two:
1. scenario questions
2. passage-based reading.
Before we do, there is a very simple technique that can be used to transform a
simple knowledge question into one that is more demanding. Instead of asking
“What…?”, ask “Why…?”. For example, in a Geography test, instead of asking:
“Which one of the following cities is the capital of the United States?” (which assesses
basic knowledge), ask why Washington is the capital of the US (which requires an
explanation).
Example 22 ~ Upgrading questions
SCENARIO QUESTIONS
Scenarios can be used in all subjects but are particularly suitable in the social
sciences. Science subjects are inherently suited to problem solving and it is easier in
these areas to pose demanding questions without the need for lengthy scenarios.
The examples provided in this section are given without detailed comment. You are
encouraged to critically appraise each question yourself. When you do, you will
appreciate that no (non-trivial) question is without its weaknesses.
A scenario might include:

• a description of a problem
• an explanation of an event
The candidate will take more time to process a scenario question as it often requires
a high level of reading skills. This should be taken into account when determining
the duration of a test (see Section 7).
Julie is 14 years old and frequently uses an online community called MyParty, which is a
social network used by many of her friends. However, the service is open to any member
of the public. She has become very friendly with Jamie, who is another user of the service,
whom she has never met. Jamie’s profile reports that he is 16 years old and attends a
nearby school. Julie and Jamie share many common interests and Jamie has asked to
meet Julie, who wants to meet him.
This question uses a specific situation to ask a question that involves application
skills. Any question that uses a scenario that the candidate is unfamiliar with is, in
effect, assessing application skills.
A user is having problems reading files from a flashdrive. While most files work correctly,
any attempt to access a few specific files results in an operating system error message:
“Cannot read file. Storage device may be corrupt.”. Which one of the following actions is
normally the best course of action in such circumstances?
A Copy the readable files from the device and do not re-use the device.
B Copy the readable files from the device, reformat it and recopy the files to the
device.
C Ignore the error and continue to use the part of the device that is usable.
Note that this question is an example of problem solving. Note also that there are at
least two weaknesses. The key (B) “looks” correct (it is the longest and most detailed
option); at least one of the distractors is weak (C) and uses pejorative language
(“ignore”). But it has its strengths too. The key is clearly the best answer (not always
an easy task when writing demanding questions) and it’s a challenging question
(admittedly made easier by the options). And the author didn’t resort to “None of the
above” as a final option! It is a moot point whether this item can be “fixed” or
whether it has to be discarded.
The following example uses a single scenario and a number of linked questions of
increasing demand.
Raj and Sophie, who have never been married, have two children – Ben aged 8 and
Shazia aged 2. Raj and Sophie’s relationship has ended, and Sophie has married Carlton.
Raj has agreed that the children can live with Sophie and Carlton for the time being.
D Sophie only.
E Raj only.
2 If Section 8 orders are required in respect of the children, who could apply as of right
(without leave) for any Section 8 order?
3 Who would be able to apply as of right (without leave) for a residence or contact
order?
4 If Raj obtained a contact order to see the children every week, who would have
parental responsibility for the children?
PASSAGE-BASED READING
A number of linked questions could be asked about this passage. For example, a
vocabulary-in-context question could ask about the meaning of a word (or term) such
as “falsifiable” (line 3) or “cognitive therapy” (line 6); a literal comprehension
question could ask about the candidate’s understanding of this passage (such as
asking her to choose the best (one line) summary of the passage); and a number of
extended reasoning questions could be posed (such as one asking about criticisms
of Freudian psychoanalysis).
Example 27 ~ Evaluation skills
ITEM ANALYSIS
One of the major advantages of selected response questions (SRQs) is that they can
be easily analysed.
Item analysis permits a more scientific approach to assessment. If you know the
properties of each question (for example, how difficult it is or how well it separates
candidates of differing abilities) then you can construct a better test.
This section explores two classical ways of analysing items: (1) measuring their
difficulty; and (2) measuring how well they separate candidates. The next section
explains how these measures can be used to construct tests.
FACILITY VALUE
The facility value (FV) of an item is a measure of its difficulty – or, more accurately,
its “easiness”. It represents the proportion of candidates who answer the item
correctly and is expressed as a decimal fraction between zero and one.
A FV of zero means that no-one answered the question correctly; a FV of one means
that everyone answered the question correctly; and a FV of 0.6 means that 60% of
the test takers answered it correctly. The lower the FV, the more difficult the item;
the higher the FV, the easier the item (hence, it is better thought of as an “easy
index”). A very easy item might have a FV of 0.9 (meaning that 90% of candidates
are expected to answer it correctly) and a very difficult item might have a FV of 0.1
(meaning that 10% of candidates are expected to answer it correctly).
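To make the calculation concrete, here is a minimal sketch in Python (the function
and data are illustrative, not part of any SQA system):

```python
def facility_value(responses, key):
    """Return the facility value (proportion correct) for a single item.

    responses -- the option chosen by each candidate, e.g. ['B', 'A', 'B']
    key       -- the correct option for the item
    """
    if not responses:
        raise ValueError("no responses recorded for this item")
    correct = sum(1 for choice in responses if choice == key)
    return correct / len(responses)

# 60 candidates, 36 of whom chose the key: FV = 0.6 (60% answered correctly)
responses = ['B'] * 36 + ['A'] * 10 + ['C'] * 8 + ['D'] * 6
print(facility_value(responses, 'B'))  # 0.6
```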
Note: In a competency-based system (such as SQA’s), the FV measures the
probability of a minimally competent candidate answering the question correctly –
not a typical candidate.
Facility values are best assigned during pre-testing. Once a sample group of students
has attempted the item (assuming that this sample is representative of the target
cohort), an initial FV can be assigned. If pre-testing is not possible (or, more likely,
not feasible) a predicted facility value (PFV) can be assigned by the test authors.
Predicted FVs are assigned by subject matter experts (SMEs) and represent the
“best guess” of two or more SMEs. This initial estimate can be re-calibrated once the
item is used operationally.
Note that, in theory, any SRQ will have a minimum FV greater than zero. For example,
any true/false question will have a minimum FV of 0.5 (which represents the 50-50
chance of guessing the answer correctly) and any MCQ (with four options) will have a
minimum FV of 0.25 (no matter how difficult it is). However, in practice, some FVs
will be lower than this due to the way the item has been constructed – with a key
attracting more than its fair share of candidates and badly constructed distractors
attracting very few candidates.
It is recommended that items with FVs greater than 0.9 are discarded (too easy);
similarly FVs lower than 0.1 should be avoided (too difficult).
DISCRIMINATION INDEX
The discrimination index (DI) of an item is a measure of how well that item separates
candidates. It relates each candidate’s test score with his/her performance on a
specific item, and then compares the top candidates with the bottom candidates.
DI values range from +1 (all of the top candidates answered it correctly and none of
the bottom candidates) to -1 (all of the bottom candidates answered it correctly and
none of the top candidates!); a DI of zero means that the same number of top and
bottom students answered it correctly. A positive DI is essential (it shows that the
item separates stronger from weaker candidates). If an item yields a zero or negative
DI, discard it. It is recommended that an item has a DI of at least 0.2; items with DI
values of 0.4 and above are considered to have good discrimination.
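There are several ways of computing the index; a common approach, and the one
used in the worked example later in this section, splits candidates into thirds by total
test score and compares the top and bottom groups. A minimal sketch, with
illustrative names:

```python
def discrimination_index(candidates, item_id, key):
    """Compute a simple discrimination index for one item.

    candidates -- list of (total_test_score, answers) pairs, where answers
                  maps item identifiers to the option that candidate chose
    item_id    -- identifier of the item being analysed
    key        -- the correct option for that item
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    third = len(ranked) // 3

    def correct(group):
        return sum(1 for _, answers in group if answers.get(item_id) == key)

    # Compare the top third with the bottom third of candidates
    return (correct(ranked[:third]) - correct(ranked[-third:])) / third
```

With the pre-test figures quoted in the worked example below (60 candidates; 15
correct answers in the top third of 20 and three in the bottom third) this yields
(15 - 3) / 20 = 0.6.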
There is a link between a question’s facility value and its discrimination index. A
“good” question that is designed to be difficult will have a low facility value and high
discrimination. But not all questions with low FVs will have high DIs. A poorly
designed question that is difficult to answer due to lack of clarity or inappropriate
language may have a low FV and low discrimination (since few candidates can
answer it – and poor candidates are as likely to get it right as good candidates).
The following example (see over) illustrates the facility value and discrimination
index for a specific question. The item was designed to assess the mathematical
knowledge of S2 candidates. It was pre-tested on 60 candidates of whom 18
answered it correctly; 15 in the top third and three in the bottom third. This gave the
following item analysis:
FV = 0.30
DI = 0.60
If the radius of a circle is increased by 20%, which one of the following represents the
corresponding increase in the circle’s area?
A 40%
B 44%
C 120%
D 144%
This item is difficult. Given that blind guessing would produce a one-in-four chance of
answering it correctly (FV=0.25), the recorded FV of 0.30 (representing 30% of the
sample) is very low. It also discriminates well, meaning that it is likely to separate
candidates and aid grading.
It is worth noting that this item is slightly cued. “44%” appears twice in the options
(in B and D) – which might encourage some candidates to assume one of these
options is correct (which would be a correct assumption – the key is B). This could
have been avoided by selecting a different value for D (such as 160%).
OTHER METRICS
There is a range of other metrics that can be calculated for SRQs. Most are complex
and, unlike facility values and discrimination indices, have no “real” interpretation.
However, the distractor pattern provides useful information about which of the
options candidates choose. For example, the following distractor pattern illustrates
the choices made by 100 candidates for Example 25 (above).
Option    Frequency of selection
A         15
B         40
C         10
D         35
This distribution would suggest that distractors A and C are under-performing and need
to be strengthened or replaced. It might also indicate that distractor D is too strong
and may require weakening. It would appear that this question comes down to a
straight choice between options B and D for most candidates.
There isn’t a perfect distribution for the options – but options that are rarely selected
or a distractor that is more popular than the key warrant attention.
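Producing a distractor pattern is a one-line tally; a sketch using the figures from the
table above (the data are illustrative):

```python
from collections import Counter

# The choices made by 100 candidates for a four-option item (key = 'B')
responses = ['A'] * 15 + ['B'] * 40 + ['C'] * 10 + ['D'] * 35

pattern = Counter(responses)
for option in sorted(pattern):
    print(option, pattern[option])  # A 15, B 40, C 10, D 35
```

Deciding which distractors need strengthening or replacing remains a matter of
judgement for the item writer.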
CONSTRUCTING TESTS
AUTHORING TESTS
This section looks at the process of combining questions into a test.
TEST SPECIFICATION
The test specification is the document (or “blueprint”) that defines the precise
nature of the test. It is normally created by the Principal Assessor (or equivalent)
under advice from the SQA Officer. The test specification will include the following
information:
• description (the learning objectives to be measured)
• question format(s)
• number of questions
• duration
• rubric
• pass mark (and grade boundaries, if applicable)
• conditions of assessment.
The description of the test must (at a minimum) define the learning objectives that
the test is seeking to measure. In the context of SQA, this would mean the unit(s)
and outcome(s) that the assessment is testing (its “domain”).
The question format defines the type of question that the test will employ. This might
be true/false, matching, multiple choice or multiple response – or a mix of these
types. For example, a test might use 15 MCQs and 5 MRQs – the test spec’ should
spell this out.
The number of questions is self-evident but note that where more than one question
type is employed, the spec’ should specify the number of each type.
The duration of the test will depend on the number of questions and the complexity
of the questions. Simplistic formulas for the duration of a test (“two minutes per
question”) should be avoided.
The rubric defines the marking scheme and provides instructions to candidates.
Setters may adopt a simple marking scheme (one mark per question) or more
complex schemes (involving one, two or more marks for each item depending on its
importance or complexity). Simple marking schemes are recommended. This section
should also provide any special instructions for candidates.
The pass mark (or cutting score) is the minimum mark that candidates must gain in
order to achieve a pass in the test. There are a number of techniques for setting
pass marks, some of which are discussed later in this section. But pulling a figure
out of thin air is not one of them. And 50% is rarely a suitable cut score for an
objective test (due to the effects of guessing – see below).
If a test is graded (beyond the basic pass/fail threshold), the grade boundaries must
be defined. The grade boundaries define the marks required to gain an A or B or C
pass. For example, a C pass might require a total score between 60% and 74%, B
between 75% and 89%, and an A pass 90% or more.
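Expressed as a simple lookup (using the illustrative boundaries from the example
above):

```python
def grade(percentage):
    """Map a percentage score to a grade using the example boundaries above."""
    if percentage >= 90:
        return 'A'
    if percentage >= 75:
        return 'B'
    if percentage >= 60:
        return 'C'
    return 'Fail'

print(grade(82))  # B
```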
Finally, the test spec’ should describe any special conditions that have not already
been described elsewhere in the specification. Examples include: access to
reference material (Is the assessment open book? Or open web?) and permitted
materials (such as calculators or special instruments).
The test team is responsible for constructing the test, using the test specification as
a blueprint. This team will normally consist of an SQA Officer and a number of setters
– or, in testing terminology, a test expert (the SQA Officer) and a number of subject
matter experts (the setters). The SMEs should have prior knowledge and experience
of writing SRQs. The size of the team will depend on a number of factors such as the
number of items required and the time available to write them. The more items
required and less time available, the greater the number of SMEs needed.
Subject matter experts may need training in the construction of selected response
questions. This can be done at the authoring event (see below) or prior to this event,
at a specific training event.
AUTHORING EVENT
Due to the collaborative nature of item writing, it is recommended that questions are
produced over a short period of intensive activity rather than the more traditional
SQA approach to question setting. For example, a team of four SMEs might be asked
to produce 200 items over an intensive working weekend. A suggested workflow
during the authoring event is provided below.
Allocate learning outcomes to SMEs.
Authors need to be crystal clear about the learning objectives (outcomes) that they
are to assess. Where more than one outcome is to be covered by an individual SME,
the number of questions for each outcome should be agreed. Each author’s targets
should also include the types of question and number of each type of question (for
example: “Twenty multiple choice questions and 10 multiple response questions”),
the average facility value for their set of questions (see below), and the expected
productivity rate (for example, five items per hour).
Writing items is a solitary activity. Although authors may seek advice when they write
questions, the act of putting pen to paper (or, more likely, finger to keyboard) is an
individual task. Authors should be provided with a question template before
commencing. This template (which is normally a Word document) defines the precise
format of the question and will include metadata about the item (such as the
associated keywords and its predicted facility value). A sample template is provided
in the appendices.
If the items are being written for a test with a known pass mark, authors will require
to know the target facility value (FV) to aim for. For example, if the writers are
producing items for a test with a pass mark of 15/20 then the target FV will be 0.75
and each author should ensure that each batch of questions has an average FV of
0.75 (so that the overall item bank has a “correct” FV).
The output from the authoring event will be an item bank of approved and calibrated
items. The SQA officer will play a crucial role in maintaining workflow and ensuring a
productive event. Target setting and regular milestones will play an important part in
ensuring a successful outcome. At various points during the event, the officer should
convene review meetings when progress can be measured, and problems or
bottlenecks can be collectively identified and addressed.
A high stakes test needs to be more reliable than a low stakes test – and therefore
needs to be longer. However, the improvement in reliability levels off over a certain
number of questions.
The number of learning objectives being assessed also has a bearing on the size of
the test. A test that assesses several outcomes (or one large outcome) will obviously
require more items than one that assesses fewer outcomes (or smaller outcomes).
However, even a test that assesses a single outcome may require lots of questions if
that outcome covers a broad range of knowledge and skills.
And, finally, the time available needs to be considered. There is no point in designing
a test with 60 questions, requiring two hours to complete, if this is disruptive to
centres. For example, most Scottish schools operate a 50 minute period and tests
that last longer than this can be difficult to administer.
There is no formula for test length. Criticality, domain size and practical
considerations need to be balanced. However, in most instances of unit assessment
it is best to keep tests as short as possible to reduce the assessment burden on
centres (and candidates).
There are a number of ways to set a pass mark. We will look at three methods:
1. Informed judgement
2. Angoff method
3. Contrasting groups

Some are more “scientific” than others but, no matter which method is used, none of
them replaces the need for human judgement.
INFORMED JUDGEMENT
This technique involves the most human judgement and, as a consequence, is the
most subjective way of setting pass marks (it is also the method most similar to the
way that SQA sets cut-scores).
At its most basic level, informed judgement involves the opinion of the members of
the setting team. These subject matter experts (SMEs) agree a sensible pass mark
based on their expert judgement and the following considerations:
No matter how little a candidate knows, s/he is unlikely to score zero marks in an
objective test due to the effects of guessing. For example, in an objective test
consisting of 100 multiple choice items, each with four options, blind guessing
should produce a minimum mark of 25% (representing the one in four chance of
guessing the correct answer to each question). For this reason, the pass mark in an
objective test is usually higher than 50%.
The importance of the assessment also has a bearing on the pass mark. For
example, an assessment that grants a licence to practise as a surgeon is more
important than an assessment that confers a pass in a unit. Where it is critical that
candidates possess particular competences both the test duration (see above) and
the pass mark (see below) should be increased.
If there is an existing item bank, the difficulty of the items in the bank can be used to
determine the pass mark. For example, if we know that an item bank contains
difficult questions then that would result in a lower pass mark; conversely, a simple
item bank would lead to a higher pass mark. Associated with this is the complexity of
the subject domain. For example, a test on nuclear physics might have a lower pass
mark than one on multiplication tables – although this is dependent on the age and
stage of the candidates.
The initial judgement may be refined after further consultation or pre-testing. For
example, practising teachers may be asked for their views on the proposed pass
mark; and/or the assessment may be field-tested and the pass mark adjusted
in the light of the resulting scores.
ANGOFF METHOD
This method of determining the pass mark is less subjective than the informed
judgement approach. It involves aggregating the facility values (FVs) for each item
and estimating the pass mark based on this figure. The following example illustrates
this method.
Question    FV
1           0.8
2           0.6
3           0.6
4           0.3
5           0.4
Total       2.7
Recall that the facility value is a measure of the probability (between 0 and 1) of
minimally competent candidates answering the question correctly. For example,
based on the above table, there is an 80% probability that candidates will answer
question one correctly (FV=0.8). Adding the FVs for each question, therefore,
provides an indication of the total score that a minimally competent candidate
should achieve (in this case 2.7). Subject matter experts would then either round
this value down or up using their professional judgements (in this case the aggregate
FV was rounded up). The resulting pass mark for this test is three out of five.
In practice, pass marks are defined in the test specification, and the task, therefore,
becomes one of selecting questions with FVs that aggregate to this pass mark. We
effectively reverse engineer the Angoff method. For example, if the test specification
defines a pass mark of 7/10 then the test should consist of questions whose FVs
add to seven (give or take a decimal place). This is a very simple task for a computer.
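A minimal sketch of that selection, assuming an item bank that stores a facility value
for each item. At each step it takes the unused item whose FV is closest to the
average still required, so the running total homes in on the target; a real item
banking system would also balance outcomes and question types:

```python
def select_items(bank, n_questions, target_total_fv):
    """Pick n_questions whose facility values sum close to the target.

    bank -- dict mapping item identifiers to facility values
    """
    remaining = dict(bank)
    chosen, total = [], 0.0
    for picks_left in range(n_questions, 0, -1):
        # Average FV still needed per remaining pick
        needed = (target_total_fv - total) / picks_left
        item = min(remaining, key=lambda i: abs(remaining[i] - needed))
        chosen.append(item)
        total += remaining.pop(item)
    return chosen, total

# A test with a pass mark of 7/10 needs ten items whose FVs sum to about 7
bank = {f"Q{i}": fv for i, fv in enumerate(
    [0.9, 0.85, 0.8, 0.75, 0.7, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4])}
items, total = select_items(bank, 10, 7.0)
print(items, round(total, 2))
```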
CONTRASTING GROUPS
This method, unlike the previous ones, requires pre-testing. The test is issued to two
groups of students – one group who are expected to pass and one group who are
expected to fail. The actual scores are then plotted on a chart and the intersection of
the graphs provides an initial pass mark. This initial pass mark is then refined using
the SMEs’ expert judgement.
The graph below illustrates the results for two groups of students – one group (the
blue line) expected to fail and one group (the red line) expected to pass.
Figure - Number of candidates (y-axis) plotted against marks, 0-100 (x-axis), for the two groups
The initial cut score would be around 55% (the approximate intersection of the two
lines). Raising this to 60% would reduce the number of “incompetent” students who
would pass the test – but increase the number of “competent” students who would
fail. Conversely, decreasing the pass mark to 50% would reduce the number of
“false fails” but increase the number of “false passes”. The final decision is based
on the professional judgement of the SMEs.
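A sketch of deriving the initial cut score from pre-test data follows; the score
distributions below are invented for illustration:

```python
def initial_cut_score(fail_counts, pass_counts, max_mark=100):
    """Return the lowest mark at which the 'expected to pass' group
    outnumbers the 'expected to fail' group - roughly where the two
    frequency curves intersect.

    fail_counts, pass_counts -- dicts mapping a mark to the number of
    candidates in each group who achieved that mark
    """
    for mark in range(max_mark + 1):
        if pass_counts.get(mark, 0) > fail_counts.get(mark, 0) > 0:
            return mark
    return None

# Invented distributions peaking at 40 and 70 marks
fail_counts = {30: 10, 40: 25, 50: 15, 55: 8, 60: 4}
pass_counts = {50: 5, 55: 9, 60: 18, 70: 26, 80: 12}
print(initial_cut_score(fail_counts, pass_counts))  # 55
```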
These methods can be used alone or in combination. They all provide some scientific
basis to the process of setting the pass mark. The alternative – pulling a pass mark
from thin air – is not an option.
GUESSING

Guessing is often cited as a major problem with selected response questions and it
is true that blind guessing can produce relatively high marks for candidates in an
objective test. For example, blind guessing in a true/false test should produce a
result of approximately 50%. However, there are well established ways of dealing
with guessing. These are: pass mark setting, negative marking and correction-for-
guessing.
The simplest way of dealing with guessing is to adjust the pass mark accordingly.
Instead of the “traditional” 50% pass mark, the cut score can be made higher to
compensate for the effects of guessing. For example, a multiple choice test that has
a pass mark of 75% is unlikely to be passed by blind guessing. We have already seen
three ways of determining the pass mark for an objective test (informed judgement,
Angoff method and contrasting groups). Any of these methods will eliminate (or
greatly reduce) the effects of guessing.
NEGATIVE MARKING
Negative marking involves deducting marks for incorrect answers. For example, the
following table illustrates a candidate’s scoring pattern in a five item test where one
mark is awarded for a correct answer, zero marks where a question is not answered,
and one mark deducted for an incorrect answer.
Question Mark
1 1
2 1
3 0
4 -1
5 1
Total 2
The main problem with negative marking is that it penalises partial knowledge.
Selecting a “good” distractor is better than choosing a “bad” distractor – but both
choices will result in the loss of a mark.
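A sketch of this scoring rule (the answers and keys below are invented for
illustration):

```python
def negative_marking_score(answers, keys):
    """Score a test under simple negative marking:
    +1 for a correct answer, 0 for a blank, -1 for a wrong answer.

    answers -- option chosen for each question (None if unanswered)
    keys    -- the correct option for each question
    """
    score = 0
    for answer, key in zip(answers, keys):
        if answer is None:
            continue  # unanswered: no marks gained or lost
        score += 1 if answer == key else -1
    return score

# The pattern from the table above: correct, correct, blank, wrong, correct
print(negative_marking_score(['A', 'C', None, 'B', 'D'],
                             ['A', 'C', 'B', 'D', 'D']))  # 2
```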
CORRECTION-FOR-GUESSING
This technique involves deducting a certain number of marks from every candidate
to compensate for the effects of guessing. The number of marks deducted can be
worked out in a number of ways, ranging from the crude (a fixed number of marks
deducted from every candidate) to the more sophisticated (when the number of
marks deducted is not fixed and is based on an estimate of how many guesses each
candidate has made). An example of the second approach follows.
In a 50 item test, where each item is a multiple choice question consisting of four
options (a key and three distractors), a candidate scores 38/50. The proportion of
marks deducted is based on the number of incorrect answers (which are assumed to
be guesses) and is worked out as follows:

marks deducted = number of incorrect answers / (number of options - 1)

In this case:

marks deducted = 12 / (4 - 1) = 4
So, four marks would be deducted from this candidate giving her an adjusted score
of 34.
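A sketch of the calculation (the names are illustrative; unanswered questions attract
no deduction):

```python
def corrected_score(correct, attempted, options_per_item):
    """Deduct wrong/(k-1) marks, where k is the number of options per item
    and every incorrect answer is assumed to be a guess."""
    wrong = attempted - correct
    return correct - wrong / (options_per_item - 1)

# The worked example above: 38/50 correct, all 50 questions attempted, 4 options
print(corrected_score(38, 50, 4))  # 34.0
```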
While less crude than negative marking, this method suffers from similar problems –
it penalises partial knowledge as much as no knowledge, and it disproportionately
affects risk-averse candidates, who may choose not to attempt a question rather
than answer it for fear of losing marks, resulting in many unanswered questions and
deflated marks.
SEQUENCING QUESTIONS
When deciding the order of items in a test, it should be borne in mind that tests
should begin with relatively simple questions and progress to more complex
questions. It is also advisable to group item types together – for example, all
true/false items and all MCQs. So, in most cases, it is advisable to begin with
straight-forward, low difficulty true/false questions and progress to more complex,
higher difficulty MRQ or assertion/reason items.
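Where items carry type and difficulty metadata, this ordering can be produced with a
simple sort (the type ranking and field names below are illustrative):

```python
# Illustrative ranking: simpler question types first
TYPE_ORDER = {'true/false': 0, 'matching': 1, 'mcq': 2, 'mrq': 3,
              'assertion/reason': 4}

def sequence_items(items):
    """Sort items so simple types come first and, within each type,
    easier items (higher facility value) precede harder ones."""
    return sorted(items, key=lambda item: (TYPE_ORDER[item['type']],
                                           -item['fv']))

test = sequence_items([
    {'id': 'Q1', 'type': 'mcq', 'fv': 0.4},
    {'id': 'Q2', 'type': 'true/false', 'fv': 0.8},
    {'id': 'Q3', 'type': 'mcq', 'fv': 0.7},
])
print([q['id'] for q in test])  # ['Q2', 'Q3', 'Q1']
```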
Source unit
Provide details about unit, outcomes and performance criteria that the test is assessing.

Title: Internet Safety    Ref. No.: 10 1234    SCQF level: 4

Outcome(s) and Performance Criteria:
Outcome    Performance criteria    No. of questions
1          All                     9
2          d, e                    9
3          a, b, c                 7

Test details
Provide details about the test.

No. of questions: 25
Duration: 50 min.
Question format(s): MCQ 20 (4 options for each question); MRQ 5 (4 options for each question)
Pass mark(s): 16/25 (including grading thresholds where applicable)

Rubric
Marking instructions, instructions to candidates, assessment conditions etc.

Marking instructions: One mark per question.
Assessment conditions (such as reference materials, location, authentication): No
access to reference material (paper or web). Candidate authentication is required.
Instructions to candidates: No special instructions.
SAMPLE QUESTION TEMPLATE

Item
Stem
Options
Key: 2
Distractors
Metadata
Outcome
PC(s)
PFV
Tags
Workflow
ITEM
The question relates to learning outcome(s) and performance criteria.
Cueing is avoided.
STEM
The stem is phrased as a question.