

Can item response theory reduce patient burden when measuring health status in neurological disorders? Results from Rasch analysis of the SF-36 physical functioning scale (PF-10)


BACKGROUND Indices of physical function may have a hierarchy of items. In cases where this can be demonstrated it may be possible to reduce patient burden by asking them to complete only those items which relate directly to their own level of ability.

OBJECTIVES To determine whether statistical procedures, operationalising what is known as item response theory (IRT), can be used to assess the unidimensionality of the 10 item physical functioning domain of the SF-36 in patients with Parkinson's disease and motor neuron disease, and, secondly, to determine whether it would be possible to administer subsets of items to certain patients, on the basis of their replies to other items in the scale, thereby reducing patient burden.

METHODS Rasch analysis, a form of IRT methodology, of the 10 item physical functioning domain (PF-10) in two neurological patient samples was undertaken and the results compared with results of a Rasch analysis of data gained from a population survey (the third Oxford healthy lifestyles survey).

RESULTS Evidence from the analyses suggests that the PF-10 does not form a perfect hierarchy on a unidimensional scale. However, certain items seem to form a hierarchy, and responses to some of them are contingent on responses to the other items.

CONCLUSIONS Rasch analysis of the PF-10 in neurological patients has indicated that certain items of the scale are hierarchically ordered, and consequently not all respondents would need to complete them all: indeed, those most severely ill would be required to complete fewer items than those with only limited disabilities. The implications of this are discussed.


The ultimate goal of outcomes research is to provide meaningful, accurate assessments of health, which can inform treatment decisions and regimes.1 Within the field of neurology, evaluation of the patient has been largely by means of clinical assessment, but in recent years patient self report questionnaires have come to occupy an increasingly central position.2 However, a possible criticism of such measurement is that it puts a considerable burden on chronically ill patients, who may have seriously disabling conditions. Ideally, therefore, questionnaires should be simple to understand and as brief as possible, yet still satisfy the requirements of validity and reliability which are central to all measurement. The most common procedure for creating shorter form instruments is to undertake a statistical analysis of the original measure that simply reduces the number of items.3 4 However, although this will lead to brevity it also increases the error term in measurement, reducing accuracy and precision.5 Another method of reducing item burden is to request respondents to complete only the items that are of direct relevance to them. For example, someone who is unable to walk at all need not be requested to complete items on ambulation. Questionnaires developed using classic psychometric techniques request patients to complete all the items, even though some may be inappropriate for their level of ability. However, more recent psychometric methods can reduce the number of items any subject may have to complete while retaining the same original item pool. If questions on a measure form a hierarchy then respondents need only complete those that assess their own level of ability. Determining a hierarchy on a questionnaire can be undertaken using item response theory (IRT) scoring procedures.6 7

Often referred to as an item characteristic curve technique, IRT methodology begins with the assumption that any item will pose differing degrees of difficulty to different people in any given population. Furthermore, different items pose differing degrees of difficulty. These basic claims lead to two assumptions: firstly, that the items constitute a hierarchical structure on a unidimensional concept, and, secondly, that reproducibility of the item hierarchy can be achieved on different groups and across test occasions.8 9 If these two assumptions are satisfied then it is reasonable to assume that certain questions will be answered in a predictable manner on the basis of answers to other questions. For example, someone who answers a question such as “Can you walk at all?” in the negative is highly unlikely to affirm the statement “I can run long distances.” If a hierarchy of statements can be found, then those who are most disabled by their condition need complete fewer items than those with less severe forms of the condition. This is a desirable solution as patient burden is considerably reduced for the most severely ill. Such measurement is sometimes referred to as “test free”, in that people can be compared to one another on a trait or ability even if they have completed different questions, or a different number of questions. Rasch analysis is the most commonly used form of IRT methodology.10

The purpose of this paper is twofold: firstly, to determine whether IRT scoring criteria, using Rasch analysis, would be appropriate for the 10 item physical functioning domain (PF-10) of the 36 item short form health survey (SF-36) in patients with Parkinson's disease and motor neuron disease by assessing the unidimensionality of the scales, model fit, and comparability with data gained from a general population sample; and secondly, to determine whether it would be possible to administer subsets of items to certain patients, on the basis of their replies to other items in the scale, thereby reducing patient burden. The PF-10 is reproduced in figure 1.

Figure 1

Physical function dimension of the SF-36.


Three data sets were analysed for this paper: a normative dataset of the general population, patients who are members of the United Kingdom Parkinson's Disease Society, and a sample of patients with motor neuron disease drawn from across European countries. Recruitment into the studies is explained in full elsewhere,11-13 although a brief outline of the methodology of each of the surveys is given below.

Normative data were gained from the Oxford healthy lifestyles survey (OHLS III). Questionnaires containing questions on lifestyle, as well as a copy of the SF-36, were mailed to randomly selected people in Oxfordshire, Northamptonshire, Berkshire, and Buckinghamshire. Completed questionnaires were obtained from 8889 of 13 800 people originally contacted, a response rate of 64.4%. Of those who returned questionnaires, 8801 (99.0%) answered the question relating to sex, of whom 3863 (43.4%) were men and 4938 (55.6%) were women. The mean age of the sample was 41.6 years (SD 12.6; range 18 to 65).

Patients with Parkinson's disease were recruited from a postal survey of members of local Parkinson's Disease Society branches. Four hundred and five patients who were registered with five branches of the Parkinson's Disease Society were contacted. Fifteen people were subsequently removed from the denominator as they could not be traced, were deceased, or did not have Parkinson's disease. A total of 227 questionnaires were returned, yielding a response rate of 58.2%. The mean age of this sample was 70.3 years (SD 9.0; range 40.9 to 87.7); 57.4% were men and 42.6% were women. The mean number of years since diagnosis was 8.6 (SD 6.7) (n=218).

Patients with motor neuron disease were recruited via the amyotrophic lateral sclerosis health profile study (ALS-HPS), a pan-European survey of motor neuron disease patient experiences and health status. Patients are recruited into the study when visiting their doctor and then return the questionnaires by post. Five hundred and fifty one patients have been recruited into the ALS-HPS, of whom 451 (81.85%) responded. The mean age of respondents was 59.91 years (SD 11.24; range 24.3–88.8). Two hundred and fifty three (56.1%) patients were men, 197 (43.7%) were women, and one did not reply to the question. The mean number of years since diagnosis was 1.39 (SD 1.88) (n=420), and the mean number of years since first symptoms were noticed by the patient was 2.18 (SD 2.46) (n=388).
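The three response rates quoted above can be reproduced from the counts given in the text. The short Python sketch below is only an arithmetic check; the function name `response_rate` and its `excluded` parameter are illustrative and do not come from the original studies. It assumes, as the Parkinson's disease paragraph states, that untraceable, deceased, or ineligible people are removed from the denominator.

```python
def response_rate(returned: int, contacted: int, excluded: int = 0) -> float:
    """Percentage of eligible people who returned a questionnaire.

    `excluded` is the number of people removed from the denominator
    (for example untraceable, deceased, or without the condition).
    """
    eligible = contacted - excluded
    return 100.0 * returned / eligible


# Counts taken from the text of the three survey descriptions.
ohls = response_rate(8889, 13800)
parkinsons = response_rate(227, 405, excluded=15)
als = response_rate(451, 551)

print(f"OHLS III: {ohls:.1f}%")
print(f"Parkinson's disease: {parkinsons:.1f}%")
print(f"ALS-HPS: {als:.2f}%")
```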


Rasch analyses were performed on the three data sets outlined above. Two claims are tested with the Rasch rating scale model: firstly, the more capable a person is in physical functioning, the less likely that person is to have limitations on any given item, and, secondly, the easier the item, the more likely the person will report no limitations. The Rasch model provides item locations along a hypothesised common measurement continuum. These calibrations define the hierarchical order of the items along the continuum. Calibrations for each item are expressed in logits, a logit being the natural log of an odds ratio. In this instance, an odds ratio is the level of performance of an item in relation to the performance (in terms of difficulty) of the total set of items. Logits typically range from –4 to +4, with logits of greater positive magnitude representing increasing item difficulty.
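The two claims can be illustrated with a minimal Python sketch of the simplest (dichotomous) Rasch model. This is a deliberate simplification: the analyses reported here use the rating scale model, which handles the three response categories of each PF-10 item. The function names and the ability and difficulty values below are hypothetical and chosen only for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of reporting no limitation on an item.

    Dichotomous Rasch model: the log odds of the favourable response
    equal person ability minus item difficulty, both in logits.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(probability: float) -> float:
    """Natural log of the odds, i.e. the value in logits."""
    return math.log(probability / (1.0 - probability))


# Hypothetical person ability and item difficulties (in logits).
ability = 1.0
for difficulty in (-2.0, 0.0, 2.0):
    p = rasch_probability(ability, difficulty)
    print(f"difficulty {difficulty:+.1f} logits: P(no limitation) = {p:.3f}")
```

For a fixed person the probability falls as item difficulty rises, and for a fixed item it rises with ability, which is the ordering that the two claims describe.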

Unidimensionality was assessed using the information weighted fit statistic (infit). This fit statistic is standardised such that it takes the approximate form of a t distribution. Values lying outside the range –2.0 to +2.0 indicate that data for that item may not fit the model.
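As a rough illustration of what an information weighted fit statistic measures, the sketch below computes the unstandardised infit mean square for one item under the dichotomous Rasch model: the sum of squared residuals divided by the sum of the model variances. It is an assumption-laden simplification, not the procedure used for the analyses reported here, which rely on the rating scale model and on a standardised form of the statistic. The function name, abilities, and responses are hypothetical.

```python
import math


def infit_mean_square(responses, abilities, difficulty):
    """Information weighted mean square for a single dichotomous item.

    responses  : 0/1 answers, one per person
    abilities  : person abilities in logits, in the same order
    difficulty : item difficulty in logits
    """
    squared_residuals = 0.0
    information = 0.0
    for observed, ability in zip(responses, abilities):
        expected = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
        squared_residuals += (observed - expected) ** 2
        information += expected * (1.0 - expected)  # model variance
    return squared_residuals / information


# Hypothetical abilities for five people and one item of difficulty 0.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
orderly = [0, 0, 0, 1, 1]  # more able people succeed, as the model expects
erratic = [1, 1, 0, 0, 0]  # less able people succeed, against the model

print(infit_mean_square(orderly, abilities, 0.0))
print(infit_mean_square(erratic, abilities, 0.0))
```

Responses that are more predictable than the model expects give a mean square below 1, and erratic responses give a value above 1. The standardised statistic is a transformation of this quantity onto an approximately t distributed scale.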


Rasch analysis of the OHLS III normative dataset is reported in table 1.

Table 1

Mean (SEM) item calibrations and strata for PF-10 based on the third Oxford health and lifestyles survey (OHLS III, n=8853)

Close inspection of the results suggests that the Rasch model does not provide a perfect fit for the items on the physical function domain of the SF-36. The unidimensionality of a multi-item index for a given sample is partly determined by goodness of fit statistics, which index how well the item calibrations (expressed in logits) fit the data for all of the subjects in the sample who did not score all items at the floor or, alternatively, at the ceiling. Infit statistics are reported, which are standardised to approximate a mean of zero and an SD of 1. As noted above, high infit statistics (>2.0) may indicate that an item does not fit the model well and is not closely related to the overall construct. Low infit statistics (<–2.0) indicate that items measure redundant or overlapping content areas.14 The items bending, kneeling, and stooping and bathing and dressing have very large infit statistics, indicating that they do not fit the model at all well. On the other hand, moderate activities, walking half a mile, and climbing one flight of stairs have fairly large negative infit statistics, indicating that one or more of the items are redundant as they are measuring overlapping areas.

The results also suggest that the hierarchical nature of the items on the PF-10 is not completely satisfied. To determine the spacing of the item calibrations (expressed in logits), an associated SE estimate is calculated for each calibration and used to define distinct strata along the measurement continuum. The spacing of items can be described by the number of distinct strata that can be identified in the scale. Strata can be defined as a separation of at least ± 0.15 logits.15 Nine distinct strata can be found for the PF-10 on the OHLS III data, with the items bathing and dressing and walking 100 yards having very similar item calibrations. Overall, the PF-10 data from the OHLS III indicate a hierarchy of items, but with some possible redundancy.
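The counting of strata can be sketched in Python as follows. This is one possible reading of the separation rule, assuming that items are sorted by calibration and that a new stratum begins whenever an item lies at least 0.15 logits above the previous item. The function name and the calibration values are hypothetical and are not taken from tables 1 to 3.

```python
def count_strata(calibrations, min_separation=0.15):
    """Count distinct strata among item calibrations (in logits).

    Assumption: a new stratum starts when the gap between adjacent
    items, after sorting, is at least `min_separation` logits.
    """
    ordered = sorted(calibrations)
    if not ordered:
        return 0
    strata = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous >= min_separation:
            strata += 1
    return strata


# Hypothetical calibrations: the first two items are only 0.05 logits
# apart, so they share a stratum and five items yield four strata.
example = [-1.90, -1.85, -0.60, 0.40, 2.10]
print(count_strata(example))
```

Under this reading, two items with very similar calibrations, such as bathing and dressing and walking 100 yards in the OHLS III data, fall into the same stratum, which is how ten items can give nine strata.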

Table 2 provides results of a Rasch analysis of the Parkinson's disease data. Once again the items do not form a perfect fit with the model, and once again the item bathing and dressing has a large infit statistic. However, in general the fit of items is better than that for the general population, although the number of strata is smaller because of limited differences between some items in terms of the infit statistics. It is particularly interesting to note that the order of items is not exactly the same as that for the OHLS III data, which further suggests that the data are not truly unidimensional. This result is also borne out for the ALS-HPS dataset, where once again the bathing and dressing item has a large infit statistic, the number of strata is also seven, and the hierarchy is not the same as that of either the OHLS III or the Parkinson's disease dataset (table 3).

Table 2

Mean (SEM) calibrations and strata for PF-10 based on the Parkinson's disease sample (n=227)

Table 3

Mean (SEM) calibrations and strata for PF-10 based on the ALS-HPS sample (n=446)

Although the scale as a whole does not seem to fulfil the requirements of unidimensionality, there are none the less hierarchies of items within the scale. Thus the following items form the same hierarchy in both patient groups as well as the OHLS III dataset:

  • Vigorous activities

  • Walking more than a mile

  • Walking half a mile

  • Climbing one flight of stairs

  • Walking 100 yards

as do

  • Climbing several flights of stairs

  • Climbing one flight of stairs.

These items also conform to the requirements of the infit statistics, and never overlap in terms of logits (never within ± 0.2 of each other), even if they sometimes overlap with other items. It could be argued that severely ill patients who indicate severe disability on the easiest of the items in these groups do not have to complete the other item or items. For example, people who indicate that they are limited a lot in walking 100 yards need not complete the items on climbing one flight of stairs, walking half a mile, walking more than a mile, or vigorous activities, and consequently need not complete the item climbing several flights of stairs. Indeed this is borne out by the data. For example, 80 patients with Parkinson's disease indicated they were limited a lot in their ability to walk 100 yards, and at least 97.5% also indicated they were limited a lot in walking half a mile, walking a mile, or vigorous activities. Similarly, 200 patients with ALS claimed that they were limited a lot in their ability to walk 100 yards. At least 98% of this group also indicated they were limited a lot in walking half a mile, walking a mile, or vigorous activities.
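The skip rule described in this paragraph can be sketched in Python. The two chains below follow the two item hierarchies listed above, ordered from easiest to hardest. The data structures, function name, and response labels are hypothetical, and the rule that a skipped item triggers further skipping is an assumption that follows the "and consequently" step in the text.

```python
LIMITED_A_LOT = "limited a lot"

# Each chain runs from the easiest item to the hardest.
CHAINS = [
    [
        "walking 100 yards",
        "climbing one flight of stairs",
        "walking half a mile",
        "walking more than a mile",
        "vigorous activities",
    ],
    [
        "climbing one flight of stairs",
        "climbing several flights of stairs",
    ],
]


def items_to_skip(answers):
    """Return the items a respondent need not complete.

    answers maps item name to response label. Once an item in a chain is
    answered "limited a lot" (or has itself been skipped), every harder
    item in that chain is skipped. Repeats until nothing changes so that
    skips carry over from one chain to the other.
    """
    skipped = set()
    changed = True
    while changed:
        changed = False
        for chain in CHAINS:
            for position, item in enumerate(chain):
                if answers.get(item) == LIMITED_A_LOT or item in skipped:
                    harder = set(chain[position + 1:]) - skipped
                    if harder:
                        skipped |= harder
                        changed = True
                    break
    return skipped


print(sorted(items_to_skip({"walking 100 yards": LIMITED_A_LOT})))
```

A respondent who is limited a lot in walking 100 yards is therefore spared five of the remaining nine items, whereas a respondent who is only limited a little on that item is asked everything.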


Health status measures must fulfil several requirements to be useful. Typically, the attributes that are most discussed centre on reliability, validity, sensitivity to change, and interpretability. However, data from such measures are likely to be compromised if patients find completing instruments a burden. Instruments with what may seem a modest number of items to someone in perfect health can present a considerable challenge for people who find writing and movement difficult, or, indeed, impossible. Consequently, brevity is to be sought whenever possible in the design and implementation of health status measures. Rasch analysis of data from the SF-36 physical functioning domain suggests that some items need not be administered to the most severely ill patients. Consequently, the most severely impaired are most likely to benefit from instruments the length of which is determined by Rasch methods. However, this potential advantage cannot be tested in the current study. The reduction of questionnaire length by Rasch analysis is also likely to be of considerable value when questionnaires are completed on computers. Such on line questionnaires may be completed by the patient, or with help from someone else, and may be used in hospitals and doctors' surgeries. Such computer programmes have already been developed in the United States; for example, the dynamic health assessment system (DynHA™) developed by QualityMetric16 uses a pool of items from widely used health surveys, including general and disease specific DynHA™ designs. Only those items relevant to a person's health state are used. By scoring all responses on a standard metric, results can be compared for those who answer different questions. The brevity of the assessment means that the DynHA™ system determines scores at a fraction of the burden of traditional health assessments. The developers of this system go further and claim that DynHA™ is the first system to provide results in user friendly, real time reports that are precise enough for monitoring and managing care. Such computer adaptive testing is likely to become increasingly popular in the 21st century17 as technology advances and becomes more efficient, easy to use, and economically attractive.
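One common way in which computer adaptive testing selects items can be sketched in a few lines of Python. This is a generic illustration and makes no claim about how DynHA™ or any other commercial system works. It assumes a calibrated item bank and a current estimate of the respondent's ability, and it picks the unanswered item whose difficulty is closest to that estimate, which under a Rasch model is the most informative item. The function name, item names, and difficulty values are hypothetical.

```python
def next_item(ability_estimate, difficulties, administered):
    """Pick the unanswered item closest in difficulty to the estimate.

    difficulties : dict mapping item name to difficulty in logits
    administered : set of item names already presented
    Returns None when every item has been administered.
    """
    candidates = {
        name: difficulty
        for name, difficulty in difficulties.items()
        if name not in administered
    }
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda name: abs(candidates[name] - ability_estimate),
    )


# Hypothetical three item bank, difficulties in logits.
bank = {"easy item": -2.0, "medium item": 0.0, "hard item": 2.0}
print(next_item(-1.5, bank, administered=set()))
print(next_item(-1.5, bank, administered={"easy item"}))
```

A severely impaired respondent, whose estimated ability is low, is steered towards the easier items first and may never be shown the hardest ones.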

The analysis presented here outlines the potential benefits of item response theory to those using and developing questionnaires for neurological patients who are likely to be severely ill. Not all questionnaires or domains in questionnaires will be appropriate for this form of analysis, as not all questionnaires are designed to cover unidimensional concepts with a hierarchy of items. However, we present some indication of the possible use of this technique. Research is currently under way as to the usefulness of this technique with disease specific questionnaires designed for use within the sphere of neurological disorders.


The studies reported here were funded by the NHS Executive–South East, the Directors of Public Health for Berkshire, Buckinghamshire, Northamptonshire, and Oxfordshire Health Authorities (for OHLS data), the Parkinson's Disease Society (for Parkinson's disease data) and Aventis Pharma (for data from the ALS-HPS). More information on all of these studies can be gained from CJ.
