Background: The 39 item Parkinson’s disease questionnaire (PDQ-39) is the most widely used patient reported rating scale in Parkinson’s disease. However, several fundamental measurement assumptions necessary for confident use and interpretation of the eight PDQ-39 scales have not been fully addressed.
Methods: Postal survey PDQ-39 data from 202 people with Parkinson’s disease (54% men; mean age 70 years) were analysed regarding psychometric properties using traditional and Rasch measurement methods.
Results: Data quality was good (mean missing item responses, 2%) and there was general support for the legitimacy of summing items within scales without weighting or standardisation. Score reliabilities were adequate (Cronbach’s alpha 0.72–0.95; test–retest 0.76–0.93). The validity of the current grouping of items into scales was not supported by scaling success rates (mean 56.2%), or factor and Rasch analyses. All scales represented more severe health problems than those experienced by the sample (mean floor effect 15%) and showed compromised score precision towards the less severe end.
Conclusions: Our results provide general support for the acceptability and reliability of the PDQ-39. However, they also demonstrate limitations that have implications for the use of the PDQ-39 in clinical research. The grouping of items into scales appears overly complex and the meaning of scale scores is unclear, which hampers their interpretation. Suboptimal targeting limits measurement precision and, therefore, probably also responsiveness. These observations have implications for the role of the PDQ-39 in clinical trials and evidence based medicine. PDQ-39 derived endpoints should be interpreted and selected cautiously, particularly regarding small but clinically important effects among people with less severe problems.
The past decade has seen two major developments in clinical Parkinson’s disease (PD) research: an increasing focus on evidence based medicine and a growing emphasis on the importance of patient reported outcomes.1 2 It is therefore reasonable to expect the effectiveness of therapy to increasingly be judged on the basis of patient completed rating scales. A prerequisite for valid interpretation of clinical findings, and hence evidence based medicine, is that rating scales can be interpreted with confidence.3–6 The need for high quality patient reported rating scales in PD and the fundamental role of evidence based measurement in clinical research is thus apparent.
The 39 item PD questionnaire (PDQ-39)7 is the most widely used disease specific patient completed rating scale in PD.8 However, several important measurement properties of the PDQ-39 have not been fully addressed. For example, basic requirements (scaling assumptions) that determine the legitimacy of summing PDQ-39 item scores without weighting or standardisation have not been examined, and studies addressing the validity of grouping items into its eight scales (dimensionality) have shown inconclusive or discouraging results.9–12 This poses limitations on the possibility to interpret study outcomes as it may be unclear what scores represent.4 There have also been indications that the PDQ-39 may not target respondents adequately, which could affect its ability to detect clinically relevant changes.10 Re-evaluation of the PDQ-39 therefore appears warranted to help inform its use and role in clinical trials and evidence based medicine.
With this in mind, we assessed the scaling assumptions, reliability, dimensionality and targeting of the eight PDQ-39 scales. Whereas the PDQ-39 was developed within the traditional test theory framework, modern test theory (particularly the Rasch model) is increasingly considered advantageous in scale development and evaluation.3 13–16 The PDQ-39 was therefore analysed using both traditional and Rasch measurement methods.
Patients and data collection
A total of 451 people with clinically diagnosed PD17 seen at a South Swedish university hospital over 1 year were considered for inclusion. Participants in other recent or ongoing questionnaire studies (n = 164) were excluded, as well as those deceased or in terminal care (n = 30). The remaining 257 people were sent a questionnaire booklet including the Swedish version of the PDQ-39.10 18 19 Two weeks later a second copy was administered, including a question asking if their health had changed (according to a 5 grade scale, “much better”, “better”, “unchanged”, “worse”, “much worse”) since the first mailing. Reminders were sent to non-responders 1 week after each mailing. Survey response was interpreted as consent to participate. The study was approved by the local research ethics committee.
The first mailing had a response rate of 81% (n = 209). Those indicating that they had not answered the survey themselves (n = 7) were excluded from further analyses, leaving 202 eligible cases (table 1). All but seven patients received levodopa with or without adjunct antiparkinsonian drugs, 18 had undergone neurosurgical interventions for their PD, three were only on PD drugs other than levodopa and four were not yet on any medical therapy. Of 173 responses to the second mailing (response rate 67%), five had not responded themselves and 31 reported change in their health status since the first occasion.
The PDQ-39 is a PD specific health status questionnaire comprising 39 items proposed to represent eight domains (scales) consisting of 3–10 items each (table 2).7 Respondents are requested to affirm one of five response categories according to how often (from never to always), because of their PD, they have experienced the problem defined by each item during the past month. The eight PDQ-39 scale scores are generated by Likert’s21 method of summated ratings (ie, item responses are summed without weighting or standardisation). Scores are then transformed to a common range of 0–100 (100 = maximum level of problems).
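The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring algorithm. It assumes that the five response categories are coded 0 (never) to 4 (always) and that every item in the scale was answered; the function name `pdq39_scale_score` is hypothetical.

```python
def pdq39_scale_score(responses):
    """Summated-ratings score for one scale, transformed to 0-100.

    Assumption: each response is coded 0 (never) .. 4 (always) and
    no item is missing. 100 = maximum level of problems.
    """
    if not responses:
        raise ValueError("a scale needs at least one item response")
    if any(r not in (0, 1, 2, 3, 4) for r in responses):
        raise ValueError("responses must be integers coded 0..4")
    raw_sum = sum(responses)              # unweighted sum of item scores
    max_sum = 4 * len(responses)          # highest possible raw sum
    return 100.0 * raw_sum / max_sum      # rescale to the common 0-100 range
```

For example, a respondent answering the middle category on all ten items of a ten item scale obtains a score of 50.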
Data quality, scaling assumptions and reliability
Firstly, data quality (per cent missing data) was examined. We then examined the scaling assumptions (ie, the legitimacy of adding up items to generate scores without weighting or standardisation).21 Briefly, these require that within each scale, item scores should have roughly similar means and variances, and that the corrected item-total correlation (ie, the correlation between each item and the total score of the remaining items in that scale) should exceed 0.4.22 Internal consistency reliability was assessed by Cronbach’s alpha.23 Test–retest reliability between data from the first and second mailings among respondents who reported stable health (n = 137) was assessed by the intraclass correlation coefficient. Reliability estimates should not be below 0.7 and preferably ⩾0.8.24 25
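The two item level statistics named above can be illustrated as follows. This is a sketch under simple assumptions (complete data, one list of item scores per respondent), not the software used in the study; the function names are hypothetical.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def corrected_item_total(rows, j):
    """Correlation between item j and the sum of the remaining items."""
    item = [row[j] for row in rows]
    rest = [sum(row) - row[j] for row in rows]
    return pearson(item, rest)

def cronbach_alpha(rows):
    """Cronbach's alpha for a scale; rows = one list of item scores per person."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Under the criteria cited above, an item would be flagged if `corrected_item_total` fell at or below 0.4, and a scale if `cronbach_alpha` fell below 0.7.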
Four approaches were used to test whether the proposed grouping of items into eight scales was empirically supported. Firstly, scaling success rates were examined. Scaling success is supported when items correlate significantly stronger with the total score of the other items in their proposed scale (corrected item-total correlations) than with other scales, as determined by 95% confidence intervals.22 Scaling failure is implied if an item correlates stronger with a scale other than its proposed one.
Items were then subjected to exploratory factor analysis with varimax rotation. Results were first interpreted by the criterion originally used to define the eight PDQ-39 scales7 (ie, by retaining factors (scales) with eigenvalues exceeding 1). However, because this criterion tends to overestimate the number of factors, parallel analysis was also used.26 One thousand parallel sets of random PDQ-39 data were thus generated and factor analysed, and each consecutive empirical factor with an eigenvalue exceeding the 95th percentile of random data eigenvalues was considered a useful factor.27
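The logic of parallel analysis can be sketched as below. The sketch makes assumptions that the text does not specify: eigenvalues are taken from the item correlation matrix, random data are drawn from a standard normal distribution, and NumPy is available. The function name `parallel_analysis` and its parameters are hypothetical.

```python
import numpy as np

def parallel_analysis(data, n_sets=1000, percentile=95, seed=0):
    """Return (n_retained, empirical_eigenvalues, random_thresholds).

    data: 2-D array, one row per respondent and one column per item.
    Assumption: random comparison data are standard normal draws of
    the same shape as the empirical data.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, p = data.shape

    # Eigenvalues of the empirical correlation matrix, largest first.
    empirical = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    # Eigenvalues of each random data set, largest first.
    random_eigs = np.empty((n_sets, p))
    for i in range(n_sets):
        sim = rng.standard_normal((n, p))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]

    # Position-wise percentile of the random eigenvalues.
    thresholds = np.percentile(random_eigs, percentile, axis=0)

    # Count consecutive empirical factors that exceed their threshold.
    retained = 0
    for emp, thr in zip(empirical, thresholds):
        if emp <= thr:
            break
        retained += 1
    return retained, empirical, thresholds
```

Counting stops at the first factor that fails the comparison, which mirrors the "each consecutive empirical factor" wording above.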
Thirdly, the extent by which observed data fitted the hypothesised items-to-scales structure was explored using confirmatory factor analysis. This technique is generally recommended over exploratory factor analysis when there is an a priori hypothesis regarding dimensionality, as it allows for testing whether empirical data fit an assumed structure.28
Finally, each of the eight proposed PDQ-39 scales was individually examined by means of the Rasch measurement model.29 According to this model, the probability of a certain item response is a logistic function of the difference between the level of the measured construct represented by the item and that possessed by the person. The model separately locates persons and items on a common logit (log-odds units) metric, which measures at the interval level and ranges from minus infinity to plus infinity (with mean item location set at zero). A fundamental Rasch model assumption is that all items in a scale work in harmony to define a common unidimensional construct. This assumption was tested for each of the eight PDQ-39 scales through assessment of overall scale and item level model fit by examining the accordance between expected and observed responses.30 Differential item functioning (DIF) is an additional aspect of fit to the Rasch model and an important facet of valid measurement.13 30 DIF occurs when items have different meanings and statistical properties across sample subsets. The presence of DIF challenges the validity of comparing data across such subgroups, and threatens unidimensionality. DIF was assessed by comparing item response functions between genders and age groups (as defined by the median, <72 vs ⩾72 years old) across various locations on the measured constructs.13 30
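For orientation, the logistic relationship described above can be written out in its simplest (dichotomous) form. This is a simplification: PDQ-39 items have five response categories, so the analyses would have required a polytomous extension with category thresholds, which is not shown here. The symbols θ (person location) and δ (item location) are notation introduced for this illustration.

```latex
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\qquad
\ln \frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n - \delta_i
```

The second expression shows why person and item locations share a common logit metric: the log-odds of affirming an item equal the difference between the two locations.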
To assess how well the eight PDQ-39 scales7 accord with the levels of health problems experienced by the sample, we first examined the amounts of floor and ceiling effects (ie, the percentage of respondents obtaining the lowest and highest possible scores, respectively) which should not exceed 15%.31 In addition, the relationships between the locations of persons and items, as determined by Rasch analyses, were examined. If scales are well targeted to the sample, the mean sample location should approximate the mean item location (ie, zero).
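Floor and ceiling effects as defined above reduce to simple proportions. The following sketch assumes scale scores already transformed to the 0–100 range; the function name `floor_ceiling_effects` is hypothetical.

```python
def floor_ceiling_effects(scores, lowest=0.0, highest=100.0):
    """Return (floor %, ceiling %) for a list of scale scores.

    Floor effect: percentage of respondents with the lowest possible
    score; ceiling effect: percentage with the highest possible score.
    """
    if not scores:
        raise ValueError("no scores supplied")
    n = len(scores)
    floor_pct = 100.0 * sum(1 for s in scores if s == lowest) / n
    ceiling_pct = 100.0 * sum(1 for s in scores if s == highest) / n
    return floor_pct, ceiling_pct
```

Either percentage would then be compared against the 15% limit cited above.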
Analyses were performed using SPSS 12 (SPSS Inc., Chicago, Illinois, USA), ScoreRel CI,32 AMOS 5 (SmallWaters Corp., Chicago, Illinois, USA) and RUMM2020 (Rumm Laboratory Pty Ltd, Perth, Australia). All p values were two-tailed and considered significant when <0.05.
Data quality, scaling assumptions and reliability
Data quality was good with an overall mean of 2% missing item responses (range 0.5–22.3%) (table 2). We found general support for the legitimacy of summing items without weighting or standardisation, as illustrated by roughly similar item mean scores and SDs within most scales and corrected item-total correlations above the recommended criterion of 0.4 for all items (table 2). All reliability coefficients exceeded the recommended minimum of 0.70, and all but five exceeded the preferred value of 0.80. However, the minimum reliability criterion of 0.7 was not reached in four instances (three scales) when taking the 95% confidence intervals into account (table 3).
We found indications challenging whether the eight PDQ-39 scales represent the best grouping of items. Scaling success rates averaged 56.2% and did not reach 100% for any of the scales (table 3). Only one of the eight PDQ-39 scales (social support (SOC)) showed signs (9.5%) of scaling failure.
Exploratory factor analysis yielded eight factors according to the criterion used by Peto et al.7 However, the grouping of items did not accord with the assumed PDQ-39 scales, and eigenvalues of several factors only marginally exceeded 1 (fig 1). Parallel analysis identified four factors that were stronger than those produced by random data (fig 1). Among these first four factors, two of the proposed scales (emotional well being (EMO) and communication (COM)) were intact (factors 2 and 4, respectively). Factor 1 consisted of the 10 mobility (MOB) items and four activities of daily living (ADL) items, and factor 3 included the four stigma (STI) items and one SOC item (fig 1). Confirmatory factor analysis showed poor fit (χ2, 1885.85; p<0.0001) of the observed data to the proposed items-to-scales relationships, thus arguing against the assumed structure (see supplementary fig S1; supplementary fig S1 can be viewed on the J Neurol Neurosurg Psychiatry website at http://www.jnnp.com/supplemental).
Rasch analyses revealed four scales (MOB, ADL, SOC and COM) with signs of overall lack of fit (χ2, 16.7–41.0; p⩽0.01) to the measurement model (see supplementary table S1; supplementary table S1 can be viewed on the J Neurol Neurosurg Psychiatry website at http://www.jnnp.com/supplemental). Individual item fit to the respective scales is reported in table 4. A total of nine items, representing all scales but EMO, displayed signs of misfit. This suggests that these items do not work in harmony with the other items in their respective scales. Assessment of DIF identified significant DIF by gender for items 1 (MOB), 19 (EMO) and 24 (STI), and by age for item 24 (STI) (for examples, see supplementary fig S2; supplementary fig S2 can be viewed on the J Neurol Neurosurg Psychiatry website at http://www.jnnp.com/supplemental).
Ceiling effects were absent or negligible whereas all scales displayed floor effects (mean across the eight scales, 15%) and three scales exceeded the recommended maximum of 15% (table 3). This pattern became particularly evident in the Rasch analyses of the relationship between the distributions of persons relative to items. All scales thus tended to measure at a level corresponding to more severe health problems than those experienced by the sample (fig 2A). Figure 2B exemplifies this pattern for the EMO scale by displaying the distributions of person and item locations on their common logit metric. Superimposed on the person distribution graph is the information function curve (fig 2B). This curve can be interpreted as an inverse of the standard error of measurement and indicates at what locations people are measured with good precision and little error. In addition, as illustrated in fig 2B and by the item locations in table 4, items within each scale tended to represent a relatively narrow range of health problems.
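The link between item locations, information and measurement error can be illustrated with a small sketch. It uses the dichotomous Rasch model as a simplifying assumption (PDQ-39 items are polytomous, so actual information curves would differ in shape), and all function names are hypothetical.

```python
import math

def rasch_probability(theta, delta):
    """Probability of affirming a dichotomous item at person location theta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def scale_information(theta, item_locations):
    """Sum of item information p * (1 - p) at person location theta."""
    total = 0.0
    for delta in item_locations:
        p = rasch_probability(theta, delta)
        total += p * (1.0 - p)
    return total

def standard_error(theta, item_locations):
    """Standard error of the person estimate: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(scale_information(theta, item_locations))
```

With items clustered in a narrow range, for instance at -0.5, 0 and 0.5 logits, `standard_error` is larger for a person located at -3 logits than for one located at 0, which is the pattern of reduced precision away from the items described above.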
This study assessed the measurement assumptions and properties of the PDQ-39 using traditional and Rasch measurement methods. Because study design cannot compensate for ambiguous measurement properties,25 such assessments are essential to guide use and interpretation of scales in clinical research. We found generally good data quality and reliability, as well as general support for the legitimacy of summing PDQ-39 items without weighting or standardisation within the respective scales. However, violations of the assumption of unidimensionality, which is a fundamental requirement for summed rating scales, argue against the validity of summing PDQ-39 items into their suggested scales. All PDQ-39 scales exhibited a relative measurement bias towards more severe health problems. These results have implications for the role of the PDQ-39 in evidence based medicine, as well as for future developments towards improved outcome measurement in PD. This is discussed below together with some possible explanations for the current observations.
Score reliability of the eight PDQ-39 scales was found acceptable, although it was suboptimal for three scales (SOC, COG and BOD). While this is encouraging, investigators should be aware that reliability is central in planning clinical studies, particularly when using rating scales as clinical trial endpoints. Compromised reliability, even if exceeding the minimal acceptable criteria, adversely impacts sample size requirements and needs to be taken into account as power calculations do not assume any measurement error.25
Whereas reliability is fundamental to evidence based measurement, it does not tell us what scores represent. This is a matter of validity, to which scale dimensionality is central. We found that it is unclear what the eight PDQ-39 scales represent and that they therefore should be interpreted with caution. While this appears to be the first independent study to assess the assumed grouping of PDQ-39 items with a sample size that is reasonable for analyses such as factor analysis,28 our results largely agree with previous observations. For example, Tsang and colleagues12 found an average scaling success rate of 58.6%; authors using exploratory factor analyses have failed to reproduce the eight assumed PDQ-39 scales9 11; and our own initial observations suggested deviations from unidimensionality in four PDQ-39 scales.10 Ambiguous meaning of scores is considered a main limitation of currently available health status questionnaires in PD,4 and clear support regarding what scores represent is now called for in order to support claims based on patient reported outcomes in clinical trials.5 Available evidence suggests that it is unlikely that the eight PDQ-39 scales can be considered to meet such requirements. The apparent instability of the assumed PDQ-39 dimensionality may relate to the reliance on exploratory factor analysis to select and group items into scales when the instrument was developed.7 In addition to the tendency of the eigenvalue >1 criterion to overestimate the number of factors (scales),26 item level exploratory factor analysis tends to produce spurious factors that reflect endorsement patterns rather than dimensionality. That is, items tend to cluster together because of their distributional properties even if they measure the same construct as other items.24 Future scale developments would probably benefit from applying the Rasch measurement framework instead as this approach is not based on correlations and requires conceptualisation of the measured constructs.14–16
Analyses of targeting suggest that the PDQ-39 does not conceptualise health problems at a level that is congruent with that experienced by people with PD. This became particularly evident in the Rasch analyses of the person and item distributions. As targeting relates to the characteristics of the investigated sample, our observations could be due to sampling effects. However, the people studied here presented with a wide range of disease severity and duration, and their characteristics and PDQ-39 scores were similar to those previously reported from community based and randomised samples.33 34 Our observations regarding floor effects are also in general agreement with previous reports.12 35–37 The levels of health problems that items represent relate to their contents. In addition to the use of exploratory factor analysis to select items (see above), targeting problems may therefore reflect characteristics of the people surveyed to generate and select the PDQ-39 items. However, no clinical information (eg, stages or duration of PD) has been reported for the sample originally interviewed to generate PDQ-39 items.7
In addition to a general bias towards more severe problems, we also found relatively narrow Rasch derived item locations, indicating that items represent fairly comparable levels of health problems. Similar observations were made by Ito and colleagues,38 who failed in their attempt to develop PDQ-39 short forms targeted to different levels of PD severity because items covered very similar ranges. As a consequence of suboptimal targeting and clustering of items in the PDQ-39, and the relatively small number of items in several scales,14 39 a considerable proportion of people are measured with relatively low degrees of confidence. This poses some limitations on the PDQ-39, particularly for clinical trials aimed to detect small but clinically important effects among people with less severe problems. For example, a recent randomised double blind clinical trial comparing levodopa and entacapone with levodopa alone in mild to moderate PD found inconsistent results.40 While clinician reported motor and ADL scores favoured the levodopa–entacapone group, no differences were detected by PDQ-39 scales assumed to tap the same or similar constructs. This may, at least in part, have been because of suboptimal targeting and measurement precision of the PDQ-39.40
The findings reported here could be due to cultural differences or deficiencies with the Swedish version of the PDQ-39. However, there are reasons to believe that these are not major explanations. Firstly, many of the issues identified here have also been implied in previous studies from various countries (see above). Secondly, the Swedish PDQ-39 has been carefully evaluated regarding linguistic validity.18 19 However, empirical studies are needed to address these possibilities. In particular, studies addressing the presence of DIF by languages/countries are warranted to assess the validity of pooling and comparing PDQ-39 data in international clinical trials.13 Our sample may also pose some limitations to the generalisability of results. However, the primary purpose of the study was not to provide PDQ-39 scores representative of the general PD population, but to assess its measurement properties. Importantly, the sample represented a wide range of disease severity, duration and ages, and the distribution of most PDQ-39 scale scores spanned the full 0–100 range. There are also reasons to believe that our sample was fairly representative, given similarities with previously reported international population based studies using the PDQ-39 (see above).33 34 However, some subgroups (eg, the oldest and most severely disabled) are probably under represented. Furthermore, this study has not assessed the PDQ-39 summary index or its 8 item short form, PDQ-8. These will need to be thoroughly assessed in separate studies, preferably by methods such as those used here as this appears to be lacking. Finally, a number of PD specific health status questionnaires are currently available. While the PDQ-39 appears to be the most widely accessible and well documented alternative,8 this study does not provide any information on its relative merits compared with other available instruments. As such studies currently appear to be lacking, comprehensive head-to-head psychometric comparisons are warranted to help determine the best available alternative for a given situation.
Our observations bear a number of implications to guide the use of the PDQ-39. While the eight scale scores appear reliable, clinicians should be aware that score interpretations are hampered by ambiguities regarding their meaning. Our observations suggest that the assumed eight dimensional PDQ-39 structure may be overly complex (ie, too many scales with too few items per scale). This is not only likely to impact on the meaning of the scores, but may also compromise other measurement properties adversely.3 14 39 One remedy could be to redefine the questionnaire according to a more readily understood theoretical framework, for example by linking items to domains of the International Classification of Functioning, Disability and Health.41 Techniques for doing this have recently been proposed and results from linking generic scales to the International Classification of Functioning have shown promise.42 Such work may not only help improve interpretation of scores but also, in combination with quantitative techniques such as Rasch analysis, provide a basis for item reduction, which could lessen respondent burden.19
Caution should be exercised when interpreting PDQ-39 trial data that fail to detect differences or changes over time (particularly improvements), as compromised responsiveness is a likely consequence of suboptimal targeting and measurement precision. In order to rectify this, new items that conceptualise less severe problems are probably needed. Indeed, expanding the item pool could serve both to increase measurement precision and to decrease respondent burden, if conducted by means of so called item banking.3 14 43 This technique allows for selection of study specific, or even personally tailored, subsets of items without substantial loss of measurement precision or validity.44
The PDQ-39 has made, and will continue to make, significant contributions to our understanding of the impact of PD. However, this does not preclude seeking to improve the scale. Rating scale properties are relative and their adequacy relates, in part, to the purpose and context of their use. In this study, the eight PDQ-39 scales were assessed primarily from the perspective of their use as clinical trial endpoints. Unambiguous and valid inferences regarding the effectiveness of treatments require high quality outcome measures that meet rigorous scientific standards.3–6 14 Our observations suggest that the ability of the PDQ-39 to meet such standards can be challenged. In order to further clarify the role of the PDQ-39, we encourage others to examine their data and recommend that measurement properties should be reported in studies using PDQ-39 endpoints.
The authors wish to thank all participating patients for their cooperation, Jan Reimer for assistance with data collection and Elisabeth Rasmusson for secretarial assistance.
Funding The study was supported by the Swedish Research Council, the Skane County Council Research and Development Foundation, Rådet för hälso-och sjukvårdsforskning (HSF) and the Department of Nursing. CN was supported by the Section of Occupational Therapy and Gerontology, Lund University, Lund, Sweden.
Competing interests: None.
- ADL: activities of daily living
- BOD: bodily discomfort
- DIF: differential item functioning
- EMO: emotional well being
- PD: Parkinson’s disease
- PDQ-39: 39 item Parkinson’s disease questionnaire
- SOC: social support