
Using visual rating to diagnose dementia: a critical evaluation of MRI atrophy scales
  1. Lorna Harper1,
  2. Frederik Barkhof2,
  3. Nick C Fox1,
  4. Jonathan M Schott1
  1. 1Dementia Research Centre, University College London Institute of Neurology, London, UK
  2. 2Department of Radiology, VU University Medical Centre, Amsterdam, The Netherlands
  1. Correspondence to Lorna Harper, Dementia Research Centre, University College London Institute of Neurology, 8-11 Queen Square, London WC1N 3BG, UK; lorna.harper.11{at}


Visual rating scales, developed to assess atrophy in patients with cognitive impairment, offer a cost-effective diagnostic tool that is ideally suited for implementation in clinical practice. By focusing attention on brain regions susceptible to change in dementia and enforcing structured reporting of these findings, visual rating can improve the sensitivity, reliability and diagnostic value of radiological image interpretation. Brain imaging is recommended in all current diagnostic guidelines relating to dementia, and recent guidelines have also recommended the application of medial temporal lobe atrophy rating. Despite these recommendations, and the ease with which rating scales can be applied, there is still relatively low uptake in routine clinical assessments. Careful consideration of atrophy rating scales is needed to verify their diagnostic potential and encourage uptake among clinicians. Determining the added value of combining scores from visual rating in different brain regions may also increase the diagnostic value of these tools.

  • MRI



The diagnostic value of structural neuroimaging is reflected in its inclusion in the diagnostic guidelines for a number of dementias, including Alzheimer's disease (AD),1 vascular dementia2 and frontotemporal dementia (FTD).3 ,4 Certain imaging features are suggestive of underlying pathology, such as the symmetrical and early medial temporal lobe (MTL) atrophy frequently seen in typical AD, the prominent parietal lobe atrophy associated with early onset AD, or the asymmetrical atrophy that is often evident in patients with FTDs,5 and quantification of these features can enhance their diagnostic value. Research studies have used a variety of imaging techniques to help distinguish patients with dementia from normal control participants, as well as the more clinically relevant problem of distinguishing between causes of dementia. However, volumetric analysis tools and image classifier algorithms, which are commonly used for this purpose, require specialist software and expertise, are often time consuming, cannot be applied to all image types, and are, therefore, seldom used in clinical practice. Conversely, visual rating scales, principally developed for research purposes, can be applied directly to clinically acquired images without the use of additional software, and with suitable training, can easily be used as an adjunct to standard clinical radiology reports.

Visual rating scales focus attention on brain regions particularly susceptible to change in specific dementias, most notably the MTLs, and offer a means of quantifying (on an ordinal scale) change at the individual patient level. Moreover, they can be used to enforce structured image reporting and provide radiologists and non-radiology clinicians with a framework for interpreting imaging findings, making visual assessment more consistent and potentially more sensitive. However, despite great diagnostic potential and widespread use in research studies, visual rating scales have not been widely adopted into routine clinical practice. In this review, we examine cerebral atrophy rating scales that have been developed for use in dementia to highlight the diagnostic potential of these clinically applicable tools. Since MRI is the imaging modality of choice in dementia, we focus on the scales designed for this purpose and consider their diagnostic utility in terms of the sensitivity and specificity for disease state, and reproducibility of the results. As an indication of the impact of each scale, the number of published studies that have subsequently applied each scale is provided, as well as an indication of their inclusion in multicentre studies or clinical trials (table 1). A full description of each scale is included in online supplementary appendix-1.

Table 1

Reliability measures and imaging parameters

Visual rating of global cortical atrophy

Scale: Pasquier et al12
Display: T2-weighted axial
Reliability: Inter >0.6, intra >0.7 (Cohen's κ)

The Pasquier scale, also known as the global cortical atrophy (GCA) scale, was developed to evaluate atrophy in 13 brain regions, including frontal, parieto-occipital and temporal sulcal dilation and dilation of the ventricles.11 Regions are assessed separately in each hemisphere and the final score is the sum of the scores in all 13 regions. In the original work, the scale was used to quantify atrophy in patients with stroke, based on the hypothesis that a greater degree of atrophy is present in patients with stroke with dementia than in those without dementia. The hypothesis was not validated in the original paper; however, subsequent studies have demonstrated that the scale may provide added value as a composite diagnostic marker for dementia, and in particular, may be positively associated with vascular burden.12

Scale: O'Donovan et al13
Display: T1-weighted axial
Reliability: Inter: 0.9, intra: 0.92 (intraclass correlation coefficient)

As part of a study examining the discriminative power of established visual rating scales for distinguishing between AD and dementia with Lewy bodies (DLB), O'Donovan et al13 developed a rating scale to assess ventricular enlargement (VEn) as a marker of GCA. Each hemisphere was rated separately for enlargement of the lateral ventricles and the scores were summed to give an overall value. VEn scores were found to be significantly higher in patients with AD and DLB than in control participants (p<0.003) but similar between patients with AD and DLB. Sensitivity and specificity of VEn were 94% and 40%, respectively, for AD or DLB versus controls, and 36% and 74% for AD versus DLB.

Overview of GCA scales

Axial slices provide the best general overview of brain atrophy; however, specifying regions of interest (ROIs) to quantify such a large, generalised area is challenging. Using the ventricles produces excellent reliability; however, sensitivity and specificity estimates indicate that the VEn scale is less useful in terms of differential diagnosis, perhaps due to the considerable variation in ventricular size among healthy individuals. With 13 brain regions, the Pasquier scale is more extensive in its coverage, although this comes at the expense of scale reliability, which was further confounded by the inclusion of regions susceptible to partial volume effects. Simplification of the Pasquier scale to provide a more general impression of atrophy throughout the brain (figure 1) resulted in increased uptake among the scientific community (table 1); however, the scale has been primarily used as a component part of a larger diagnostic assessment.12 Owing to the large brain area assessed by GCA scales, they are likely to be more severely confounded by age than other atrophy rating scales, although their diagnostic value may be improved by using age-specific cut-offs.14

Figure 1

Example of the four-step (generalised) Pasquier scale for global cortical atrophy.

Visual rating of frontotemporal atrophy

Davies et al16 devised a scale based on a postmortem staging scheme used to rate atrophy in FTD brains.15 Rating is performed at the level of the anterior temporal lobe and the lateral geniculate nucleus, and the highest recorded score is taken overall (figure 2). The scale was applied to a study population of patients clinically diagnosed with behavioural variant FTD (bvFTD), based on the hypothesis that patients with lower atrophy scores have better prognosis and prolonged survival.16 Favourable prognosis was defined as patients still living independently 3 years after diagnosis and unfavourable prognosis as those who had died or were in institutional care within the same time period. Sensitivity and specificity for favourable prognosis were 80% and 75%, respectively. Discriminant analysis also found atrophy to be the sole variable with significant power to predict prognosis (other variables were sex, age, symptom duration and various clinical scores).

Scale: Davies (2013)/Kipps et al17 (modified from Broe et al15)
Display: T1-weighted coronal
Reliability: Inter/intra: 0.62/0.82 (frontal), 0.71/0.83 (anterior temporal), 0.64/0.79 (posterior temporal) (Cohen's κ)

Figure 2

Example of the five-step Kipps/Davies scale for frontal atrophy. The posterior temporal lobe reference images were included in the Kipps study only.

Kipps et al17 extended this scale further to include rating of the posterior temporal lobe (figure 2) and described slice selection in greater detail. The extended scale was applied to a large group of patients with FTD plus some control participants to assess the relationship of focal brain atrophy to clinical data. All control participants were rated <2; therefore, the scale was dichotomised with scores of 0–1 indicating normal scan appearance and 2–4 indicating a degree of cerebral atrophy. Sensitivity was 100% for semantic dementia (SD), 73% for progressive non-fluent aphasia and 53% for bvFTD, based on clinical diagnosis of FTD syndrome.

Scale: Davies et al18
Display: T1-weighted coronal
Reliability: Inter: 0.71, intra: 0.76 (Cohen's weighted κ)

Davies et al18 later developed a more extensive scale, which included 15 frontotemporal brain regions contained within four landmark identifiable slices. Specific scale criteria were adopted in the basal ganglia and hippocampal region (anterior, mid, posterior), and the best slice was determined individually for each hemisphere to account for variation in brain orientation. The scale is intended for use in diagnosis and localisation of function in neurodegenerative diseases and other postoperative or postencephalitic brain abnormalities. Discriminant analysis indicated rating of the anterior fusiform distinguished SD from controls, while the insula was vital to distinguishing bvFTD. Multiple regions were reported to be relevant in discriminating AD from controls (insula, anterior hippocampus, orbitofrontal gyri and temporal pole), perhaps reflecting the more diffuse pattern of atrophy associated with AD. In a subsequent study, Hornberger et al19 reported rating of the orbitofrontal cortex (OFC) as a good discriminator between AD and bvFTD, with logistic regression analysis demonstrating correct classification in 71.3% of patients. Devenney et al20 also used the scale to demonstrate a lack of atrophy in C9ORF72 mutation carriers.

Scale: Ambikairajah et al21
Display: T1-weighted coronal
Reliability: Inter: 0.91 (unknown κ)

Ambikairajah et al21 adapted the Davies/Kipps scales16–18 and applied the adapted scale to patients on an amyotrophic lateral sclerosis (ALS)-FTD continuum. They scored four regions: OFC, anterior cingulate cortex (ACC), motor cortex and the anterior temporal pole. Using three landmark identifiable slices, scoring was performed separately in each hemisphere and then averaged. The study hypothesis was that there would be a gradient of cortical atrophy increasing in severity from ALS to ALS-FTD to bvFTD, and that patients with ALS and ALS-FTD would be best distinguished based on atrophy in the anterior temporal lobe and the ACC. This was partly validated, with patients with bvFTD scoring significantly higher than patients with ALS in all regions, and patients with ALS-FTD scoring significantly higher than patients with ALS in all regions except the OFC. Correct classification, calculated using a logistic regression model with all scored regions entered as independent variables, was estimated to be 83.6% for bvFTD versus ALS and 75% for ALS-FTD versus ALS. No significant differences in atrophy rating were found between patients with ALS-FTD and bvFTD, with correct classification calculated as 78.8% between these two groups.

Scale: Chow et al22
Display: T1-weighted axial, sagittal and coronal
Reliability: Inter: 0.06–0.07 (LAC), 0.2 (LAT) (Kendall's W)

Based on previous findings from volumetric analysis, Chow et al22 adapted the five-point scale of Davies et al18 to assess atrophy in the left anterior cingulate (LAC) and left anterior temporal (LAT) regions. Rating was performed on five slices (2 axial slices, 2 sagittal slices, 1 coronal slice) by four raters. The scale was applied to a study population of normal controls, AD participants and participants with a clinical diagnosis of FTD (FTD diagnosis was not further categorised). Raters were asked to give a diagnosis immediately after rating. Based on the given diagnosis, raters averaged 63% accuracy in correctly distinguishing AD from FTD and 59.5% accuracy in distinguishing FTD from controls.

Overview of frontotemporal atrophy scales

Frontotemporal atrophy scales may be useful in the differential diagnosis of FTD syndromes, and the scales developed around these regions have been designed and validated specifically for this purpose. In particular, the Davies, Kipps and Ambikairajah scales all stem from the same postmortem staging scheme, providing a reliable basis for region selection. Furthermore, slice selection is described in detail and reference images are provided, which probably contributes to the consistently high reliability among these scales (table 1). From a usability perspective, reference images may be more useful when the ROI is demarcated with a bounding box, as in the Ambikairajah study. The style of reference image provided with the second Davies scale, while informative, is perhaps somewhat complicated for use in routine practice. Close examination of all reference images suggests that the spectrum of atrophy represented by each scale is not always uniformly distributed between scale increments. In some cases, such as the anterior temporal region,17 ,16 the scales may benefit from being condensed to four points rather than five. Asymmetric atrophy is often associated with FTD5 and can help to distinguish it from other causes of cognitive impairment; therefore, the decision by Chow et al to concentrate on only one hemisphere is a potential limitation. In the Chow study, raters were also encouraged to concentrate on LAC and LAT regions and ignore hippocampal and parietal atrophy, as atrophy in the latter regions nudged the rater towards a diagnosis of AD. This suggests there may be reporting bias, which could be attributed to improper slice selection; moreover, this advice may be less appropriate in a non-pathologically confirmed study population.

Visual rating of the MTL

The De Leon scale, designed to rate hippocampal fissure dilation, was one of the first described imaging markers of AD23 and was subsequently used in a study based on both CT and MRI.24 Cross-modality agreement of individual hippocampal rating scale values ranged between φ-κ values of 0.87–0.89. Using a study cohort of patients with AD, patients with mild cognitive impairment (MCI) and healthy control participants, this study found sensitivities of 85% for mild AD, 96% for moderate to severe AD and 78% for MCI, and a specificity of 71% based on the presence of hippocampal atrophy (HA) in the control participants. The authors also reported that HA was associated with increasing ventricular size in all but the mild AD group, and that increasing HA due to age was confined to the control group.

Scale: De Leon et al23 ,24
Display: T1-weighted axial
Reliability: Inter: 0.72 (unknown κ)

Scale: Scheltens et al29
Display: T1-weighted coronal
Reliability: Inter: 0.72–0.84, intra: 0.83–0.94 (Cohen's weighted κ)

The Scheltens scale focuses on three key features of MTL atrophy, namely: the width of the choroid fissure, the width of the temporal horn and the height of the hippocampus25 (figure 3). The degree of atrophy in each of these regions is combined to produce a score reflecting overall MTL atrophy (see online supplementary appendix-2). Both sides of the MTL are assessed separately and in the case of asymmetry the highest score is reported. In order to assess sensitivity and specificity, the scale is dichotomised, with scores of 0–1 indicating the absence of AD, and scores of 2–4 indicating the presence of AD. A sensitivity of 81% and specificity of 67% for AD was reported in the original study, based on a clinical diagnosis of ‘probable’ AD according to the 1984 NINCDS-ADRDA criteria26 versus age-matched control participants. Since it was introduced, the Scheltens scale has been included in over 100 studies with several reporting improved sensitivity, specificity and reliability over the original study, even when used in a clinical setting.27 The reliability of the scale has been reported to be robust to the clinical experience of the rater28 but increases as the rater gains more experience with the scale itself.29 Improved performance of the scale may be due to advances in image acquisition and display, such as improved scanner hardware, higher field strengths (the original study was performed at 0.5 T and 0.6 T) and reporting of the images from digital display over hard copy film images. Better understanding of the pathological phenomenon measured by the scale has led to modification of the dichotomised scale to account for atrophy due to ageing,25 ,30 which has also helped to improve performance. The Scheltens scale is included in the research criteria for the diagnosis of AD.31
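
The dichotomisation described above can be illustrated with a short worked example. The Python sketch below is illustrative only and is not taken from the original study: the function names and the toy scores are assumptions, and the default cut-off of 2 simply mirrors the 0–1 versus 2–4 split described in the text.

```python
from typing import Sequence, Tuple


def overall_mta(left: int, right: int) -> int:
    """Return the overall score: the higher of the two hemisphere scores (0-4)."""
    for score in (left, right):
        if not 0 <= score <= 4:
            raise ValueError("hemisphere scores must lie between 0 and 4")
    return max(left, right)


def sensitivity_specificity(
    scores: Sequence[int],
    has_disease: Sequence[bool],
    cutoff: int = 2,
) -> Tuple[float, float]:
    """Dichotomise scores at `cutoff` and compare with the reference diagnosis."""
    if len(scores) != len(has_disease):
        raise ValueError("scores and diagnoses must have the same length")
    tp = fn = tn = fp = 0
    for score, diseased in zip(scores, has_disease):
        rated_abnormal = score >= cutoff
        if diseased and rated_abnormal:
            tp += 1
        elif diseased:
            fn += 1
        elif rated_abnormal:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one patient and one control")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): (left, right) scores for four patients and four controls.
pairs = [(3, 2), (1, 2), (0, 1), (4, 4), (1, 0), (2, 1), (0, 0), (1, 1)]
diagnosis = [True, True, True, True, False, False, False, False]
toy_scores = [overall_mta(left, right) for left, right in pairs]
sens, spec = sensitivity_specificity(toy_scores, diagnosis)
```

With these toy values, one patient scores below the cut-off and one control scores above it, so both sensitivity and specificity come out at 0.75. Note that the reference diagnosis in most of the studies reviewed here is clinical rather than pathological, so estimates obtained this way inherit any misclassification in the reference standard.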

Scale: Galton et al32
Increments: 4 (novel aspect)
Display: T1-weighted coronal
Reliability: Inter: 0.36–0.49, intra: 0.52–0.69 (Cohen's κ)

Figure 3

Example of the five-step Scheltens scale for medial temporal atrophy (images from The Radiology Assistant website).

Galton et al32 extended the Scheltens scale to incorporate non-hippocampal structures. The complete scale is therefore split into two parts, the first part using the Scheltens scale and the second part designed to rate the anterior, non-hippocampal medial (parahippocampal gyrus) and lateral temporal structures (see online supplementary appendix-2). Each hemisphere was assessed separately. To assess classification, the complete scale was dichotomised into normal or minimal atrophy (0–1) and moderate or severe atrophy (>1). Eleven per cent of controls demonstrated atrophy in the hippocampal region but no significant atrophy in any other regions. Only in the region of the hippocampus was atrophy significantly greater in the AD group than the control group, although only 50% of AD cases had moderate or severe HA. The SD group showed significantly greater atrophy in all regions bilaterally. The frontal variant FTD (fvFTD) group demonstrated significantly greater atrophy than controls in the temporal poles, hippocampi and right parahippocampal gyrus. Significantly greater atrophy in the temporal pole region and the left parahippocampal gyrus of the SD group helped to distinguish them from the AD and fvFTD groups. The SD group demonstrated significantly more atrophy than the AD group in all regions except the right hippocampus. There were no significant differences between the AD and fvFTD groups.

Scale: Urs et al33/Duara et al34
Display: T1-weighted coronal
Reliability: Inter: 0.75–0.94, intra: 0.84–0.93 (unknown κ)

Urs et al33 also developed a visual rating system (VRS) intended to improve on the utility of the Scheltens scale through better standardisation of the technique and its application. However, Duara et al34 published a study applying the system first and are often credited with its development. The VRS focuses on a single landmark identifiable slice at the level of the mamillary bodies (MB). This slice includes the head of the hippocampus, the entorhinal cortex (ERC) and the perirhinal cortex. Using the VRS software, the MB slice is displayed along with reference images depicting the five levels of atrophy, with each of the ROIs outlined to demonstrate the anatomical boundaries of each of the structures. The study population consisted of clinically classified patients with AD, patients with MCI and normal control participants. Based on mean VRS score, AD participants had significantly (p<0.05) higher atrophy rating in all regions compared with normal control participants. MCI participants had significantly (p<0.05) greater atrophy scores in the right hippocampus and ERC bilaterally. Patients with AD were not distinguishable from patients with MCI from visual rating scores in any region. Logistic regression analysis determined that the percentage of correct classification was 70.2% for MCI versus controls and 72.9% for AD versus normal controls. Duara et al34 reported sensitivities and specificities of 71% and 88% for normal controls versus patients with amnestic MCI, and 81% and 88% for normal controls versus patients with probable AD.

Scale: Kaneko et al35
Display: short TI inversion time coronal
Reliability: Inter: 0.68, intra: 0.79 (unknown κ)

Kaneko et al35 developed a scale for the evaluation of MTL atrophy on a single coronal slice in which the cerebral peduncles appear widest. The scale compares the shape and size of the hippocampus with the surrounding cerebrospinal fluid (CSF) space. Perpendicular lines were drawn on both sides of the hippocampus to divide the CSF space into three parts: an outer part (temporal horn), an upper part (choroidal fissure) and an inner part (ambient cistern). Raters were instructed “to put the hippocampus into each part of CSF space while keeping its original shape and size, as with a jigsaw puzzle piece”. The reference images demonstrate a hippocampal ROI being manipulated over the image, but it is not clear from the text if this is simply to illustrate the point or if this is representative of how the scale was applied in practice. It is also not clear how the ROI was generated. Sensitivity and specificity for patients with AD versus non-demented patients with psychiatric disorders was 88.2% and 78.9%, respectively.

Scale: Kim et al36
Display: T1-weighted axial
Reliability: Inter: 0.64, intra: 0.62–0.95 (unknown κ)

Kim et al36 adapted the Scheltens scale to rate MTL atrophy in the axial plane, similar to the older CT-based scale of De Leon et al.23 The study was motivated by limited acquisition of coronal images in some centres. The three main ROIs were transposed from the coronal scale into the axial plane, resulting in rating of: the width of the MTL, the perimesencephalic cistern gap (measured by the width between the brainstem and the MTL), and the width of the anterior temporal horn of the lateral ventricle. By using a score of 2 or above to indicate HA, the sensitivity and specificity of the scale based on the area under the curve (AUC) were calculated to be 76% and 80%, respectively. However, it is not clear what the gold standard indicator of HA was in this case.

Lye et al37 rated hippocampal size on 12 slices through the hippocampus. Unlike other published scales, a higher score indicates a larger hippocampus. The scale is not described in detail, although reference images are provided. The scale was used to investigate the relationship between hippocampal size and memory performance in people over 80 years of age. It was not assessed for reliability or sensitivity/specificity for disease state.

Scale: Lye et al37
Display: T1-weighted coronal

Overview of MTL atrophy scales

MTL rating scales were first developed for use with CT imaging but are now predominantly designed for use with MRI, as the current imaging modality of choice in the diagnosis of dementia. The Scheltens scale has had the biggest impact on the field (table 1), and formed the basis of the Galton and Urs/Duara scales, which have also been used widely, as well as the recently developed scale by Kim et al. While the Scheltens scale focuses on the hippocampus and the surrounding CSF space, the Galton scale was designed to capture additional information from the surrounding sulci. However, the between-rater reliability of the scale in these regions was poor (table 1), therefore, limiting the differential diagnostic gain. The software package used by Urs/Duara helps to operationalise the Scheltens scale by limiting rating to a single consistent slice and providing detailed reference information. Adopting this approach provides excellent reliability (table 1) and is well suited to use in research studies; however, the additional software overhead, and image preprocessing that is likely to be involved, make it less suitable to use in clinical practice. Similarly, the Kaneko scale also appears to use additional software making it impractical for use clinically; this is further compounded by the decision to validate the scale on a non-standard MRI pulse sequence. Like the original CT scales, the Kim scale is applied in the axial plane, providing reasonably good reliability (table 1); however, further validation of its discriminatory power is required in a clinically relevant study population.

Visual rating of posterior atrophy

Scale: Koedam et al38
Display: T1-weighted sagittal, coronal; T2-FLAIR axial
Reliability: Inter: 0.65–0.84, intra: 0.93–0.95 (Cohen's weighted κ)

The Koedam scale focuses on the posterior cingulate sulcus, precuneus, parieto-occipital sulcus and the cortex of the parietal lobes.38 The left and right hemispheres are assessed separately and a separate score is given in each imaging plane. In the case of different scores in different planes, the highest score is taken. To assess sensitivity and specificity, the scale was dichotomised with scores >1 considered an abnormal finding. Based on a study population of clinically diagnosed late and early onset AD, other dementias and patients with subjective memory problems (without cognitive impairment), the sensitivity and specificity of the scale for AD was 58% and 95%, respectively. In a follow-up study of postmortem-confirmed dementias, the diagnostic accuracy of the posterior atrophy (PA) scale for distinguishing between the study population groups (AD, frontotemporal lobar degeneration (FTLD) and controls) was assessed by estimating the area under the receiver operating characteristic curve (AUC), based on the average rating between two raters.39 An AUC value of 0.74 was achieved between AD and control participants, 0.61 between FTLD and controls, and 0.66 between FTLD and AD. To our knowledge, this scale is the only scale designed to quantify posterior atrophy.
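
To make the AUC figures above more concrete, the following Python sketch shows one common way of estimating the area under the curve from ordinal ratings averaged across two raters. It is a minimal illustration under stated assumptions: the function names and toy ratings are hypothetical, and the rank-based (Mann-Whitney) estimate used here is a standard equivalent of the area under the ROC curve rather than the documented procedure of the cited study.

```python
from typing import List, Sequence


def average_ratings(rater_1: Sequence[float], rater_2: Sequence[float]) -> List[float]:
    """Average the scores given by two raters to the same scans."""
    if len(rater_1) != len(rater_2):
        raise ValueError("both raters must score the same scans")
    return [(a + b) / 2 for a, b in zip(rater_1, rater_2)]


def auc_from_ratings(patients: Sequence[float], controls: Sequence[float]) -> float:
    """Probability that a random patient scores higher than a random control.

    Tied pairs count as half, which matters for coarse ordinal scales.
    """
    if not patients or not controls:
        raise ValueError("both groups must contain at least one rating")
    wins = 0.0
    for p in patients:
        for c in controls:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patients) * len(controls))


# Toy ratings (hypothetical), already averaged across two raters.
toy_patients = average_ratings([2, 3, 1], [2, 3, 2])   # [2.0, 3.0, 1.5]
toy_controls = average_ratings([0, 1, 1], [0, 1, 2])   # [0.0, 1.0, 1.5]
toy_auc = auc_from_ratings(toy_patients, toy_controls)
```

An AUC of 0.5 corresponds to ratings that do not separate the groups at all and 1.0 to complete separation, which helps to put values such as 0.61 and 0.74 into context. Because visual rating scales have only a handful of increments, ties between patients and controls are frequent, and how they are handled can noticeably affect the estimate.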

Visual rating design, methodology and validation

As table 1 illustrates, several visual rating scales have made little or no impact, while others have been replicated and cited extensively, with some of the most successful scales also included in multicentre studies and clinical trials. As previously mentioned, the Scheltens’ MTL atrophy scale is also recommended in recent diagnostic guidelines for AD.31 The methodology employed by the more successful scales typically focuses not only on establishing the diagnostic value of the scale in a clinically relevant population, but also on the test-retest reliability of the scale and variability between raters. In general, the most successful scales provide a clear description of the rating procedure, allowing them to be easily replicated in other studies. Although not discussed in detail in this review, many studies that employ these scales demonstrate their correlation with clinical measures of cognition, adding further validation to their clinical relevance, and suggesting their potential as a marker of disease severity. Below we discuss in greater detail a number of factors implicit in the design of visual rating scales which may determine their successful adoption in research, and ultimately their ability to penetrate into clinical practice. We summarise these methodological considerations and their clinical implications in table 2.

Table 2

Summary of key design decisions, associated methodological considerations and clinical implications

Defining and displaying ROIs

The brain regions selected for visual rating have the greatest impact on the usefulness of the scale. Regions should be selected based on established findings from volumetric image analysis and/or macroscopic pathological assessment of the disease population of interest. The number of regions to be rated, the number of imaging planes to assess (axial, sagittal or coronal) and the number of slices used is likely to impact on the reliability of the scale, with reliability decreasing with increasing scale complexity. Specifying landmark identifiable slices for rating helps to ensure consistency between raters. There is good rationale for including focal regions, such as the MTL, which are typically preferentially involved in certain conditions, for example, AD, and have been shown to correlate with clinical measures of disease severity such as mini mental state examination (MMSE).25 Choice of MR pulse sequence affects both the appearance of atrophy and the visible extent of white matter changes and should also be specified. T1-weighted images offer good grey-white matter and CSF contrast, with high resolution three-dimensional volume acquisitions (that can be reconstructed in all three planes) offering the greatest utility for rating atrophy. T2-weighted images are less reliable, since the amount of CSF can be overestimated if T2-weighting is too strong. Image quality will also affect the reliability of the scale, with rating less reliable on scans that are subject to artefacts. Consistent image slice positioning will also help to improve the reliability of the scale.

Scale increments

The number of scale increments influences the level of detail captured by the scale. A balance must be struck between detailed quantification and the degree of change that can be reliably differentiated by visual inspection. In terms of structural neuroimaging, a four-point or five-point scale is most commonly used. The scale is typically dichotomised to classify normal and abnormal scan appearance. In both four-point and five-point scales, scale points 0 and 1 typically represent the degree of variation within the normal population, with points 2 and above describing more obvious pathological change. Four-point scales force the rater to make a more definite choice of disease state (presence or absence), therefore, increasing specificity at the expense of sensitivity. Five-point scales on the other hand may be more sensitive to earlier stages of disease but may also increase the number of false-positive results. In terms of the scales developed for use in the diagnosis of dementia, five-point scales may be particularly sensitive to the effects of ageing. Using age-specific cut-offs may help to improve scale accuracy.30 ,40
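
The idea of an age-specific cut-off can be sketched in a few lines of Python. The example below is purely illustrative: the age bands and thresholds in EXAMPLE_CUTOFFS are placeholder values chosen for demonstration, not values taken from the cited studies, and the function name is likewise an assumption. Any real application would need to substitute cut-offs validated for the scale and population in question.

```python
from typing import Sequence, Tuple

# Each entry is (upper age bound, exclusive; lowest score counted as abnormal).
# HYPOTHETICAL placeholder values for illustration only.
EXAMPLE_CUTOFFS: Tuple[Tuple[float, int], ...] = (
    (75.0, 2),
    (float("inf"), 3),
)


def abnormal_for_age(
    score: int,
    age: float,
    cutoffs: Sequence[Tuple[float, int]] = EXAMPLE_CUTOFFS,
) -> bool:
    """Dichotomise a rating using the cut-off of the first age band containing `age`."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper_age, lowest_abnormal in cutoffs:
        if age < upper_age:
            return score >= lowest_abnormal
    raise ValueError("age is not covered by the cut-off table")


# The same score can be classed differently depending on the age band.
younger_result = abnormal_for_age(2, 70)
older_result = abnormal_for_age(2, 80)
```

Under the placeholder table, a score of 2 is classed as abnormal at age 70 but not at age 80, which is the behaviour an age-adjusted dichotomisation is meant to capture. Keeping the cut-off table as a parameter rather than hard-coding it makes it straightforward to compare alternative thresholds on the same data.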

The effect of training

Training can have a significant effect on the performance of the scale. Reference images providing examples of each scale point are particularly useful and are likely to impact positively on the reliability of the scale. Reference images which include delineation of ROIs, such as those provided by Urs et al, could also help to improve reliability, particularly among less experienced raters or raters without radiology expertise. Detailed descriptions of the expected appearance for each point on the scale can also be helpful to guide raters and improve consistency. Training sets representative of the clinical or study population, prerated by ‘expert’ raters, would help to ensure high observer agreement before implementation into clinical practice or research protocols. Training sets can also be used to audit rater reliability at defined intervals or after a period of absence.2


If rating scales are used as a method of measurement to make inferences about disease state, it is important that both the measurement technique and its validation are rigorous. Test-retest studies are essential to determine the (inter-rater/intra-rater) reliability of the scale. Appropriate statistical procedures should be applied and fully reported to allow clear interpretation of the results and fair comparison with other studies. However, with routine use, training and rater experience are likely to improve the reliability of the scale. Correlations with clinical measures of cognition17 ,25 ,38 and volumetric measurements18 ,21 ,27 ,41 are also useful to help validate the scale. Diagnostic tests should also be validated against an established ‘gold standard’ measurement technique. Currently, with the exception of individuals with genetic mutations, postmortem examination of brain tissue is the only definitive means of establishing diagnosis in neurodegenerative dementia. In most scales described here, classification of disease groups, and therefore measures of scale sensitivity and specificity, are based on clinical diagnosis of the study population.
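
Several of the reliability figures in table 1 are reported as Cohen's κ or weighted κ, and a number of studies do not state which variant was used. As a hedged illustration of why the distinction matters, the Python sketch below computes unweighted, linearly weighted and quadratically weighted κ for two sets of ordinal ratings. The function name and toy ratings are assumptions made for this example, and the code is not the procedure used by any of the cited studies.

```python
from typing import Optional, Sequence


def cohen_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    n_categories: int,
    weights: Optional[str] = None,
) -> float:
    """Cohen's kappa for ratings coded 0..n_categories-1.

    weights: None (unweighted), "linear" or "quadratic".
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of scans")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    if weights not in (None, "linear", "quadratic"):
        raise ValueError("weights must be None, 'linear' or 'quadratic'")
    n, k = len(rater_a), n_categories
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        if not (0 <= a < k and 0 <= b < k):
            raise ValueError("rating outside the range of the scale")
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i: int, j: int) -> float:
        # Penalty for a pair of ratings; larger gaps cost more when weighted.
        if weights is None:
            return 0.0 if i == j else 1.0
        gap = abs(i - j) / (k - 1)
        return gap if weights == "linear" else gap * gap

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed_dis = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    expected_dis = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    if expected_dis == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed_dis / expected_dis


# Toy ratings (hypothetical) on a five-point scale; the raters differ by one
# point on a single scan.
first_rater = [0, 1, 2, 3, 4, 2]
second_rater = [0, 1, 2, 3, 4, 3]
kappa_plain = cohen_kappa(first_rater, second_rater, 5)
kappa_linear = cohen_kappa(first_rater, second_rater, 5, weights="linear")
```

In this toy case the weighted statistic is higher than the unweighted one, because a one-point disagreement is penalised less than it would be if all disagreements counted equally. This is one reason why reliability values reported with an unspecified κ are difficult to compare across scales.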


Visual rating scales have been developed specifically to rate several brain regions sensitive to atrophy in dementia. They can be used to provide semiquantitative measures of the degree of atrophy in these regions, and combining scores from several rating scales can also improve classification accuracy.39 Unlike quantitative volumetric measures (manual or automatic), visual rating scales do not require specialist software or expertise, are quick to apply and are designed specifically for use with routine MRI. Moreover, unlike many diagnostic tests, they are not financially prohibitive, with brain imaging already recommended for all patients being investigated for dementia.1–4 Given the proven utility of some of these scales in clinical trials,6–9 the relative ease with which they can be applied and, in the case of the Scheltens MTL score, their incorporation into new diagnostic criteria,31 it is perhaps surprising that visual rating scales have not had greater uptake in routine radiological assessments. Further validation of the real-life utility of these rating scales to improve diagnosis in a multicentre study population with postmortem-confirmed diagnosis, combined with the provision of a widely available training scan set, might help to transition these potentially useful diagnostic tools into wider use in routine clinical practice. In addition, clinical studies looking at the impact of visual rating on diagnosis and patient management are required to determine their true potential as diagnostic tools.

Search strategy

References for this review were identified by searches of PubMed and references from relevant articles. The search terms ‘Dementia’ or ‘Alzheimer's’, ‘visual rating’, ‘visual assessment’, ‘atrophy rating’, ‘reproducibility’ or ‘rater’, ‘MRI’, ‘magnetic resonance imaging’, or ‘T1-weighted’ were used. There were no date or language restrictions. The final reference list was generated on the basis of relevance to the topics covered in this review.


The Dementia Research Centre is an Alzheimer's Research UK coordinating centre. The authors acknowledge the support of Alzheimer's Research UK, the NIHR Queen Square Dementia Biomedical Research Unit, UCL/H Biomedical Research Centre, and Leonard Wolfson Experimental Neurology Centre. LH is supported by funding from Alzheimer's Research UK and a UCL Impact Studentship.


  • Contributors JMS and LH devised the original concept of the article. LH drafted the manuscript. All authors revised and approved the final version to be published.

  • Funding University College London.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.
