Exome sequencing: how to understand it
  1. M J Keogh1,2,
  2. D Daud3,
  3. P F Chinnery2
  1. 1James Cook University Hospital, Middlesbrough, UK
  2. 2Wellcome Centre for Mitochondrial Research, Institute of Genetic Medicine, International Centre for Life, Newcastle University, Newcastle upon Tyne, UK
  3. 3Newcastle University, Newcastle, UK
  1. Correspondence to Prof P F Chinnery, Wellcome Centre for Mitochondrial Research, Institute of Genetic Medicine, International Centre for Life, Newcastle University, Newcastle Upon Tyne NE1 3BZ, UK; p.f.chinnery{at}newcastle.ac.uk


Introduction

Glossary of terms

Calling. The process of determining the DNA bases or regions in sequenced subjects that differ from the reference exome or genome.

Coverage. The number of times a single nucleotide in a sequence has been sequenced or read.

Exons. The protein coding regions of genes.

Exome. The portion of the genome coding for proteins, ie, collectively, all of the exons.

Filtering. The process of removing mutations with the aim of leaving potential pathogenic variants only.

Genome. The complete DNA sequence of an organism.

Homozygosity mapping. A gene mapping method used in rare recessive disorders, often in consanguineous families. It is based on the assumption that the disease allele derives from a common ancestor, and looks for regions of inherited DNA shared between affected subjects.

Meiosis. Cell division that results in four daughter cells, each with half the chromosome number of the parent cell. An essential step in the formation of the gametes.

Mendelian disorders. Genetic disorders that occur due to alterations or mutations in a single gene. Their pattern of inheritance can be recessive, dominant, or X-linked.

Next generation sequencing. Sequencing technology able to sequence thousands or millions of DNA regions at once—also referred to as second generation sequencing.

Penetrance. The proportion of individuals carrying a particular variant who also express an associated trait (phenotype).

Primer. A strand of DNA which serves as the starting point for DNA synthesis.

Single nucleotide polymorphism. A single variation of one base at a particular site of DNA.

Single nucleotide polymorphism array. A type of DNA microarray used to detect specific polymorphisms within a population and variation between genomes.

Third generation sequencing. Various methods of DNA sequencing which in general sequence single molecules of DNA, often without the need for enzymatic reactions or DNA amplification—they may reduce cost and limit the potential biases and errors that occurred with previous techniques.

The terms ‘next generation sequencing’, ‘second generation sequencing’, ‘exome sequencing’, and ‘whole genome sequencing’ are gradually migrating from the echelons of esoteric molecular genetics journals into mainstream neurological practice. These techniques, in their technological infancy only 5 years ago, have become widely adopted into clinical research over the past 2–3 years, greatly advancing our understanding of Mendelian disorders. We are beginning to see their first applications to sporadic neurological disease; exponential growth in this area is inevitable. This article aims to enable the neurologist to understand what whole exome sequencing is, how it works, when and how it is useful in neurology, and the potential benefits and limitations of the techniques.

In practice

Over the past two decades, advances in molecular genetic diagnostics have dramatically expanded the field of neurogenetics, which now includes many ostensibly sporadic diseases presenting in routine neurological practice. Although neurogenetic disorders vary widely clinically, the overall approach to diagnosis is consistent.

The clinician should first fully exclude non-genetic causes. This involves careful phenotyping using the tried and tested neurological approach: history, examination, and well-considered clinical tests. This may identify sporadic diseases that might be amenable to treatment—for example, vasculitis in suspected axonal Charcot–Marie–Tooth disease, or vitamin E deficiency in juvenile ataxia.

The next step—at least in the UK's National Health Service—involves targeted mutation screening focused on specific genetic causes of a particular phenotype. A detailed discussion of the different tests on offer is beyond the scope of this review, but there is a comprehensive list at the UK Genetic Testing Network website (UKGTN, http://www.ukgtn.nhs.uk/gtn/Home). Tests not available through the UKGTN can be accessed through European or even North American providers (eg, http://eddnal.com/), but these often incur substantial cost. Other testing is carried out through research laboratories with a special interest in the disease, but these are not subject to the same quality control as a certified laboratory, and their results must be interpreted with caution.

If these tests come back negative—and assuming a genetic cause remains most likely—then there are three possibilities:

  1. The correct gene was not tested. This could be because the patient has an atypical phenotype, so the relevant test was not considered or performed.

  2. The correct gene was tested, but the mutation was not ‘picked up’. This could be because the diagnostic test is not comprehensive—for example, deletions may not be routinely tested for, or the mutation could be a regulatory variant upstream of the coding region, and thus not picked up using standard sequencing protocols.

  3. The disease gene is not known.

In any of these scenarios, it is like trying to find a needle in a haystack—or, to be precise, a mutation in a genome of three billion nucleotide base pairs. Until recently, the approach to this problem was limited.

Next generation sequencing

Many of the molecular diagnostic assays in clinical practice involve ‘first generation’ or ‘Sanger sequencing’. This is used, for example, to test for a NOTCH3 mutation in CADASIL (cerebral autosomal dominant arteriopathy with subcortical infarctions and leukoencephalopathy). Sanger sequencing can read a DNA sequence of up to around 1000 bases. However, this can only be within the predefined region or gene of interest, for example, a defined region of the NOTCH3 gene, which could be limited to mutation hotspots. Inevitably this approach is limited, and it makes multi-gene panels extremely cumbersome and expensive. Other methods can screen for mutations with higher throughput, but these also rely on the prior definition of a limited number of mutations or regions of interest. The key aspect of second generation sequencing—because third generation sequencing is just around the corner (see glossary)—is that vast quantities of DNA, up to and including the whole genome, can be sequenced in a short time. This has delivered a step-change in genetic analysis. For example, it cost US$2.7 billion (£1.7 billion, €2 billion) and took 13 years to read the 3.2 billion bases in the first human genome project, performed by massive laboratories on both sides of the Atlantic.1–3 In contrast, the human genome can now be sequenced in 2 weeks for ∼£4000 (€4700, US$6200) in many laboratories. This paradigm shift now forms the cornerstone of Mendelian gene discovery. The precise chemistries have been reviewed elsewhere.4

While sequencing the ‘whole genome’ may sound enticing, most initial second generation sequencing studies performed ‘whole exome’ sequencing for two main reasons. First, although the human exome comprises only 1–2% of the whole genome, it is this portion that codes for proteins, and was thought to contain 85% of all pathogenic mutations causing Mendelian disorders5—although it is fair to say that this proportion is probably an overestimate, based on now outdated technology. Secondly, being much smaller, the exome can be sequenced in only 48 h at roughly a third of the cost, with far fewer data to store and analyse. The analysis, as we will discuss, is the major challenge; so limiting this to relevant segments of the genetic code is a major advantage, not least because the bioinformatic processes needed to analyse whole genome data are still evolving.

Case 1

How can we use second generation sequencing to determine pathogenic mutations in families such as in Figure 1?

Figure 1

Patients A, B, and C had already attended the neurology department with an autosomal dominant movement disorder but without a molecular genetic diagnosis. Until recently, linkage analysis was used in such families to try to define a ‘region of interest’ in which a potential pathogenic mutation might be found. This process worked by following genetic ‘markers’ passed on between affected members of a family, on the principle that the disease gene is likely to lie close to markers that co-segregate with the disease on the chromosomes. Sanger sequencing was then used to find the mutation; this depended upon having large, well defined pedigrees, and was hampered by the cost and time taken to sequence a large number of genes. Hence, progress was slow.

First, note that sequencing the exome of a single affected individual, particularly in an autosomal dominant disease, is unlikely in isolation to lead to the discovery of a novel disease gene, although there are rare circumstances where this has happened. For simplicity, however, we begin by discussing the sequencing of a single exome (subject A) in the pedigree shown in Figure 1, before discussing additional family members.

DNA is extracted from a 4–5 ml blood sample before being sequenced (Figures 2 and 3) to give the raw DNA sequence. This, figuratively speaking, is the ‘easy’ bit, and is often outsourced to a commercial sequence provider. The main bulk of the work is the subsequent bioinformatic processing, which involves a ‘pipeline’, taking the sequenced data through a series of steps to develop a list of possible causative mutations. This requires expensive computer workstations, and takes up considerable processing time—although cloud-based solutions are developing on the web.

Figure 2

An example of two different DNA enrichment approaches, though there are numerous available methods. DNA is randomly fragmented before adaptors are ligated to each end of the DNA sequences. (1) Solid phase hybridisation occurs on a DNA microarray. A collection of DNA spots/bait probes bind to DNA regions from the exome, but not intronic fragments of DNA, which are washed away. Thereafter the exomic DNA is eluted. (2) Liquid phase hybridisation—DNA probes attach to DNA fragments from the exome. Thereafter, streptavidin beads (black circles) are added to allow physical separation of the exome regions by binding to the DNA probes, which have attached to the exome.

Figure 3

Two common forms of second generation DNA sequencing. The figure shows the process for one DNA fragment only for clarity. (ai) Top panel: DNA adaptors on a DNA fragment bind to a complementary primer on a bead (bead, grey circle; primer, red line) which represents the 454 technique. (aii) Bottom panel: DNA fragments are passed over a lawn of primers where they attach; this is the Illumina (Solexa) technique. DNA is amplified many times so that multiple copies of the fragment are on the bead (top panel) or in a cluster (bottom panel). (b) The fragments are copied numerous times on each bead before being filtered. The DNA is denatured to a single strand and placed into a specific well (for the 454 technique), or remains in a specific channel on a slide for the Illumina method before sequencing primers, polymerase and nucleotides are added to the mix. (c) Each nucleotide is fluorescently labelled (eg, all adenine nucleotides have a specific label, guanine a specific label, etc), and as it is added to the DNA chain, a laser activates its fluorescence and sensors detect the colour change from each well or cluster to determine the sequence of DNA.

The first step in the bioinformatic pipeline is to align the raw sequence data to the human genome reference sequence of DNA (currently hg19). For any individual this identifies roughly 20 000–25 000 single nucleotide polymorphisms and several thousand base insertions or deletions, highlighting the intrinsic variability of the genome. The real challenge is to determine which, if any, of these mutations is responsible for the disease (Figure 4).

Figure 4

Example of the bioinformatics processes needed with second generation sequencing. The pipelines are written in programming script and can be varied via interchanging programmes, or re-ordering some stages/steps. The captured DNA sequences (genome or exome) are in the form of millions of short reads of DNA. They are firstly aligned to a reference genome or exome before variants are filtered via databases and modes of inheritance before being quality controlled and finally annotated.

For example, the DNA sequence of subject A in Figure 1 would be filtered against lists of previously defined variants, compiled from unaffected control subjects both nationally and internationally.6 ,7 Any mutations which subject A shared with this cohort are probably not disease-causing and are removed. However, this process is not infallible, and there are well recognised pathogenic mutations listed on international databases as polymorphisms. In most institutions, subject A's data would also be specifically filtered against other local patients with other distinct diseases who act as ‘disease controls’. Local data greatly assist in reducing variant lists, as one of the fascinating findings of next generation sequencing studies is that the genome appears to have far more regional variation than previously anticipated. For example, in the UK, one local population may have frequent non-disease-causing polymorphisms, which are rare in other UK populations. The combination of these filtering steps reduces the list of potential candidates to several hundred variants (or fewer) in an individual.
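The filtering steps described above can be pictured as successive set subtractions. A minimal sketch follows; the variant positions and database contents are invented for illustration and do not come from the article.

```python
# Illustrative sketch of variant filtering against control databases.
# All variant names, positions, and database contents are invented.

# Variants called in subject A: (chromosome, position, alternate base)
subject_a = {
    ("chr1", 10177, "A"),
    ("chr4", 55141, "T"),
    ("chr9", 90210, "G"),
}

# Variants seen in national/international control cohorts (assumed benign)
population_controls = {("chr1", 10177, "A")}

# Variants seen in local 'disease controls' with unrelated conditions
local_controls = {("chr9", 90210, "G")}

def filter_variants(variants, *control_sets):
    """Remove any variant present in any of the control sets."""
    remaining = set(variants)
    for controls in control_sets:
        remaining -= controls
    return remaining

candidates = filter_variants(subject_a, population_controls, local_controls)
# Only the variant absent from every control set remains a candidate.
```

In practice this is done with dedicated tools over millions of sites, but the logic, removing anything seen in presumed-unaffected cohorts, is the same.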

The next step involves removing mutations that do not fit the pattern of inheritance. Thus, for the pedigree in Figure 1, autosomal recessive and X-linked mutations would be excluded. Finally, ‘low quality’ variants are removed. This ‘quality’ is determined by how many times each individual base/mutation has been read by the sequencer (referred to as ‘fold-coverage’). Variants with low coverage (<5-fold is an arbitrary but commonly used threshold) are removed because they are more likely to be misreads than true mutations.
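The coverage-based quality filter amounts to a simple threshold. A toy sketch, with invented coverage values and the arbitrary <5-fold cut-off described above:

```python
# Sketch of quality filtering by fold-coverage.
# Coverage values are invented; the threshold mirrors the arbitrary
# <5-fold cut-off mentioned in the text.

MIN_COVERAGE = 5

variants = [
    {"id": "var1", "coverage": 42},
    {"id": "var2", "coverage": 3},   # too few supporting reads: likely a misread
    {"id": "var3", "coverage": 11},
]

# Keep only variants read at least MIN_COVERAGE times
high_quality = [v for v in variants if v["coverage"] >= MIN_COVERAGE]
```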

Finally, the remaining variants are ‘annotated’. This means that different computer software programs ‘predict’ whether the variants are likely to have a biological effect. For example, is the variant likely to change the amino acid, or simply be a non-functioning polymorphism? Or is it likely to be disease-causing, based on calculations of the likely effect on protein structure? Or is it linked to evolutionary conservation of the amino acid residue across several species, implying that differences at that site are not tolerated? Several software packages can perform this; although each does broadly the same thing, different programs often give conflicting predictions. Thus, these predictions help but are only a ‘rough guide’ to likely pathogenicity. The ultimate proof of pathogenicity comes from identifying the same variant segregating with the disease in several unrelated families, and experimental data showing that the variant affects protein function in a way that is likely to cause disease.
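Because different annotation tools often disagree, pipelines sometimes take a rough consensus across predictors. The sketch below uses generic placeholder tool names, not real software output:

```python
# Toy sketch of combining conflicting in-silico pathogenicity predictions
# into a rough consensus. Tool names are generic placeholders.

predictions = {
    "tool_a": "damaging",
    "tool_b": "benign",
    "tool_c": "damaging",
}

n_damaging = sum(1 for p in predictions.values() if p == "damaging")

# Majority vote: treat as 'likely damaging' only if most tools agree
consensus = "likely damaging" if n_damaging > len(predictions) / 2 else "uncertain"
```

Even a majority vote remains only a ‘rough guide’, as the text notes; segregation and functional data are the real arbiters.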

Performing the above process for subject A in Figure 1 generates a final list of potential variants, which are: (1) not seen in any of the control databases, (2) inherited in an autosomal dominant fashion, and (3) adequately ‘covered’ from a technical perspective. For subject A, this would yield a list of tens to potentially hundreds of apparently novel mutations. However, these variants clearly cannot all be pathogenic in a monogenic Mendelian disease. Determining whether any of these mutations causes the neurological disorder is therefore unlikely to come from a single individual.

The next step generally is to sequence the exome of another definitely affected or definitely unaffected family member, and to cross-reference the results. This is often done at the same time as sequencing the original subject. For example, in Figure 1, the pathogenic mutation should be in subjects A and B, or A and C, or alternatively present in A, but not D. Choosing which family members should be sequenced is partly based on whose DNA is obtainable; in general, for autosomal dominant conditions, it is best to choose more distantly related members. This is because more distantly related individuals share less of their genome, having undergone more meioses — ‘shuffling the genetic pack’, so to speak. We all have numerous spurious non-disease causing variants, but more distantly related clinically affected individuals share fewer spurious variants, increasing the likelihood that shared variants are indeed pathogenic.
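This cross-referencing reduces to set operations on each person's candidate list: for a dominant disorder, keep variants shared by the affected relatives and absent from the unaffected one. The variants below are invented:

```python
# Sketch of cross-referencing candidate variants between family members
# in a dominant pedigree (subjects A, B affected; D unaffected).
# All variant labels are invented.

affected_a = {"GENE1:c.100A>G", "GENE2:c.55C>T", "GENE3:c.9G>A"}
affected_b = {"GENE2:c.55C>T", "GENE4:c.12T>C", "GENE3:c.9G>A"}
unaffected_d = {"GENE3:c.9G>A"}

# Candidates: shared by both affected members, absent from the unaffected one
candidates = (affected_a & affected_b) - unaffected_d
```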

This approach was used recently to determine that a mutation in the TITIN gene caused an autosomal dominant myopathy in three families in the northeast of England. Exome sequencing of three patients led to the identification of five potentially pathogenic mutations shared between patients,8 which were subsequently confirmed using Sanger sequencing. In addition, Sanger sequencing confirmed the segregation of mutations with disease in the rest of the family.8 This highlights how Sanger sequencing is almost always needed to confirm second generation sequencing results. There are two main reasons for this. Firstly, second generation sequencing techniques often produce DNA sequences that cannot be mapped unambiguously to one region of the genome or exome.9 Sanger sequencing ensures that the mutation is definitely found in the identified region. Secondly, Sanger sequencing also confirms that the mutation itself (ie, the single nucleotide substitution) is ‘real’ and has not been ‘mis-called’ by the sequencing algorithms, especially if the region had poor coverage.

Sanger sequencing also has a crucial role in checking whether any mutation is present or not in other affected and unaffected members of a family respectively. As an aside, neither sequencing technique is perfect. Mutations missed on Sanger sequencing can also be detected by next generation sequencing methods.

Case 2

Figure 5 shows a consanguineous family with an infantile neurometabolic brain iron accumulation disorder with apparently unique features.10 Sanger sequencing was undertaken for known plausible disease genes, but the results were negative.

Figure 5

A pedigree of a consanguineous family. Clear squares, unaffected males; clear circles, unaffected females; black squares, affected males; black circles, affected females.

In this case, homozygosity mapping (see glossary) identified the genome regions shared by the two affected siblings, using a single nucleotide polymorphism array (see glossary). Restricting interest to the homozygous regions drastically reduced the number of likely pathogenic variants, and focusing on homozygous rare variants led to the likely disease gene. The final list of candidate mutations, if small enough, could be checked with Sanger sequencing to see which segregates with the disease; if the list is large, additional exome samples from within the family can be added (from unaffected or affected individuals) and cross referenced.
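Restricting attention to homozygous variants inside the shared homozygous regions can be sketched as a simple interval check. The regions and variants below are invented for illustration:

```python
# Sketch of homozygosity mapping: keep homozygous variants that fall
# within genomic regions shared by both affected siblings.
# Regions and variants are invented.

# Homozygous regions shared by the affected siblings: (chrom, start, end)
shared_homozygous_regions = [("chr2", 1_000_000, 5_000_000)]

variants = [
    {"id": "varA", "chrom": "chr2", "pos": 2_500_000, "genotype": "hom"},
    {"id": "varB", "chrom": "chr2", "pos": 2_600_000, "genotype": "het"},
    {"id": "varC", "chrom": "chr7", "pos": 300_000,   "genotype": "hom"},
]

def in_shared_region(v):
    """True if the variant lies inside any shared homozygous region."""
    return any(v["chrom"] == c and start <= v["pos"] <= end
               for c, start, end in shared_homozygous_regions)

candidates = [v["id"] for v in variants
              if v["genotype"] == "hom" and in_shared_region(v)]
```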

Case 3

When primary cases of neurogenetic disease occur within one generation of a family, this does not necessarily mean the disorder must be a recessive condition. Depending on the phenotype, the condition could be due to:

  1. a spontaneous de novo dominant mutation: this is common—for example, up to 30% of Becker's muscular dystrophy cases are due to de novo point mutations in the dystrophin gene;

  2. a compound heterozygous mutation: that is, two different recessive alleles in the same gene; or

  3. a low penetrance heterozygous dominant allele, such as LRRK2 mutations in Parkinson's disease. In these cases, sequencing parent–child ‘trios’ (both parents and the affected child) may narrow the number of candidate mutations significantly due to the rarity of de novo mutations within the exome11 ,12 (Figure 6). This approach can also identify rare de novo mutations in a family, but proving pathogenicity in this context can be challenging.

Figure 6

A parental trio. One affected child with the condition believed to be due to a sporadic mutation. Clear square, unaffected male; clear circle, unaffected female; black circle, affected female.
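The trio analysis described in Case 3 amounts to keeping variants present in the affected child but absent from both parents. A minimal sketch, with invented variant calls:

```python
# Sketch of trio analysis for de novo mutations: variants in the affected
# child that appear in neither parent. All variant labels are invented.

child = {"GENE5:c.7G>T", "GENE6:c.88A>C", "GENE7:c.3delA"}
mother = {"GENE6:c.88A>C"}
father = {"GENE7:c.3delA"}

# Candidate de novo variants: in the child, absent from both parents
de_novo = child - (mother | father)
```

Because true de novo coding mutations are rare (on the order of one per exome per generation), this subtraction typically leaves very few candidates.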

Case 4

Sporadic cases do not always need to be treated in isolation. Many conditions—for example, spinocerebellar ataxia, atypical ataxias, atypical Charcot–Marie–Tooth disease—regularly occur as isolated cases in different unrelated families, but a similar phenotype suggests a similar genetic basis in the different families (Figure 7). Unlike the previous cases, they do not necessarily need to have exactly the same mutation. In this situation, all individual patients with a defined phenotype will be sequenced in the same manner as before, looking for recessive mutations, or possibly a rare low penetrance dominant mutation. Thereafter, rather than cross referencing for an identical mutation in each patient—that is, looking for the same variant in the same base of a gene—the samples will be cross referenced for the genes in which mutations are seen. It is reassuring to detect mutations implicating the same gene in different families, either with the same potential pathogenic mutation or different potential mutations within the same gene, implying a common underlying mechanism of disease. Determining subsequent causality involves tracking the mutation in different families to determine whether the segregation pattern fits with the proposed mode of inheritance, and a mixture of RNA, biochemical and cellular studies on patients to define the disease mechanism and establish causality. Animal models can be very useful in this context (zebrafish, Drosophila, and mice).
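Cross-referencing at the gene level, rather than the variant level, can be sketched by grouping each family's candidate variants by gene and keeping genes hit in more than one family. Gene and variant names below are invented:

```python
# Sketch of gene-level cross-referencing across unrelated families with a
# shared phenotype. Genes and variants are invented.
from collections import defaultdict

# Candidate variants per family, keyed as (gene, variant)
families = {
    "family1": {("GENEX", "c.10A>G"), ("GENEY", "c.5C>T")},
    "family2": {("GENEX", "c.200del"), ("GENEZ", "c.9G>A")},
    "family3": {("GENEX", "c.10A>G"), ("GENEW", "c.1T>C")},
}

# Map each gene to the set of families in which it carries a candidate variant
gene_hits = defaultdict(set)
for family, variants in families.items():
    for gene, variant in variants:
        gene_hits[gene].add(family)

# Genes implicated (by the same or different mutations) in >1 family
recurrent_genes = {g for g, fams in gene_hits.items() if len(fams) > 1}
```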

Figure 7

Four pedigrees from separate unrelated families. All affected patients have a similar phenotype and are the only cases within a family. Clear squares, unaffected males; clear circles, unaffected females; black squares, affected males; black circles, affected females.

Further examples

Further benefits of second generation sequencing are the expansion of the phenotypes associated with known disease genes. For example, exome sequencing in patients with an unexplained peripheral neuropathy and phenotype similar to Charcot–Marie–Tooth disease found novel mutations in the SACS gene, previously known to cause autosomal recessive spastic ataxia of Charlevoix–Saguenay (ARSACS).13 Sanger sequencing of the SACS gene was not originally performed because ARSACS is associated with a childhood onset spastic ataxia, and not Charcot–Marie–Tooth disease.

Sporadic neurological disease

Many neurological disorders have a polygenic basis, with variants in certain genes making people more susceptible, or more ‘at risk’—for example, APOE4 genotype in Alzheimer's disease. The usual approach to identify these ‘risk’ genes involves candidate gene or genome-wide association studies. These studies take a ‘mapping overview’ of the whole genome, and thus help to determine genomic regions of potential biological importance associated with the disease. Occasionally this approach can define common polymorphisms—that is, present at >1% frequency in the population—which may confer disease risk, but do not directly cause the disease. In general, each polymorphism has only a small effect on disease risk, each typically conferring a <2-fold change. On the other hand, second generation sequencing may find rare variants within the genome or exome (<1% frequency in the population) which have a much stronger effect and may even cause the disease, thus uncovering new genes able to cause monogenic forms of a disease. Large scale exome and whole genome studies such as the Epi4K14 and EpiPGX studies in epilepsy are underway, together with similar studies in other areas of neurology. However, early smaller scale studies show its potential application15 and there will probably be second generation sequencing studies in sporadic neurological disease reported within the next 6–12 months. The added benefit of the sequencing approach is that it further clarifies the precise genetic variants responsible for a region of genetic association identified by the genome-wide association approach.

Limitations

Second generation sequencing is not always effective (Table 1). From the technical perspective, and particularly important for neurology, it has only a limited ability to detect repeat expansion disorders and DNA rearrangements (large and small). This was highlighted by the recent discovery of the C9orf72 mutation in amyotrophic lateral sclerosis-frontotemporal dementia (ALS-FTD)16–18 where, despite using next generation sequencing in pedigrees of familial forms of the condition, the repeat expansion was only detectable using additional experimental approaches. Secondly, as most groups are working on exome sequencing, rather than genome sequencing, intronic mutations causing Mendelian disease are not being defined at the same pace. As numerous neurological diseases are caused by intronic mutations19–21 together with trinucleotide expansions,22 these limitations may prove to be significant weaknesses of the technique in its present form. In time, whole genome sequencing may address some of these concerns.

Table 1

Current and potential applications of second generation sequencing, and the areas in which it continues to have limitations.

Additional limitations include incomplete coverage of the intended regions, although this is improving. Exome studies generally cover around 95–98% of the target, with some recurrent systematic deficiencies related to local sequence features. This still leaves the possibility of missing mutations that lie in the 2–5% of coding regions not adequately covered. An additional factor is that, although bioinformatic processes are constantly evolving, small insertions or deletions can still be misaligned and the corresponding mutations missed. Finally, some alleles causing recessive conditions occur relatively commonly in the population. Such mutations may therefore be present in reference databases, considered as ‘non-disease causing’ variants, and potentially ‘overlooked’ by filtering. Mutations causing late onset conditions, or mutations in genes with incomplete penetrance, could likewise be dismissed as ‘non-disease causing’ because they appear in ‘normal’ controls in whom the condition has not yet manifested.

Research potential

Second generation sequencing projects are not only advancing the discovery of pathogenic mutations within small pedigrees with Mendelian traits. Projects such as the 1000 Genomes Project are rapidly advancing our understanding of the genetic variation of the human genome, which is vital for our ability to discover pathogenic mutations.7 International collaborative projects such as the DECIPHER (Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources) project provide an interactive database where clinicians can share genotypic and phenotypic data collated from a variety of bioinformatic sources to help clarify and determine pathogenic mutations and discover new syndromes.23 Subsequent projects in specific areas of neurology, such as Deciphering Developmental Disorders, have evolved with the aim of developing an online catalogue of genetic changes linked to clinical features that will enable clinicians to diagnose developmental disorders, develop diagnostic assays, and advance our understanding of the biological mechanisms of disease.24 In the near future, similar projects may develop in other sub-specialties of neurology.

Ready for clinical use?

At present, second generation sequencing is used predominantly for research. There are several reasons for this, from ethical concerns about detecting ‘unwanted and unsuspected’ mutations in genes for diseases that have no treatment, through to the limited informatics capacity in UK NHS diagnostic laboratories. For example, how do we deal with unexpected mutations known to predispose to cancer, such as BRCA1 or BRCA2, or with presenilin mutations that predict the inevitable onset of Alzheimer's disease years in advance? This is frequently debated in medical genetics with no current clear consensus.25 ,26 A possible approach is to use customised next generation sequencing arrays, harnessing the power of second generation sequencing while limiting the bioinformatic and ethical challenges. This would enable ‘panels’ of genetic tests to be performed for specific neurogenetic phenotypes with pronounced genetic heterogeneity. These arrays could, for example, specifically select and sequence all 30+ genes known to cause Charcot–Marie–Tooth disease. This approach offers a swift and relatively cost effective sequencing option and may be particularly enticing for clinical laboratories. However, it would limit the potential for finding new disease genes and the arrays would need updating if and when new disease genes were found.

No matter which approach is taken, few would deny the potential power of this technology, streamlining neurogenetic investigation and also reducing costs. It is therefore highly likely that we will see exome analysis offered in the near future—because all of these issues are surmountable. Whole exome sequencing based on patient consent for limited bioinformatics analysis is one way forward.

Exome sequencing: key points

  • The cost of DNA sequencing has fallen dramatically in recent years, allowing the entire exome—all protein coding regions (exons)—to be sequenced in a single individual for less than £1000 (€1200, $1500); this has led to a resurgence of interest in identifying new disease genes in patients with unexplained neurological disorders.

  • Previous strategies required large families for genetic linkage studies; whole exome and whole genome sequencing can identify new disease genes in a small number of affected individuals, sometimes even in a single case.

  • Although still mainly a research tool, whole exome sequencing is having a major impact on the diagnosis of neurogenetic disorders, with new genes identified almost weekly; these results can be confirmed in a diagnostic laboratory.

  • There are plans shortly to sequence 100 000 genomes within the UK NHS; this will undoubtedly shape neurological clinical practice.


Footnotes

  • Contributors MJK and PFC designed the review. MJK and DD wrote the initial draft of the paper, PFC reviewed and amended subsequent drafts.

  • Competing interests None.

  • Provenance and peer review Commissioned; externally peer reviewed. This paper was reviewed by Norman Delanty, Dublin, Republic of Ireland and Katharine Harding, Cardiff, UK.
