Introduction

Sequencing the DNA or mRNA of multiple individuals of one or more species (that is, population-scale sequencing) aims to identify genetic variation at a population level to address questions in the fields of evolutionary, agricultural and medical research. Previous population studies, including genome-wide association studies (GWAS), have not been able to exhaustively characterize the genetic factors underlying human traits and diseases1. There has been much speculation about the source of this ‘missing heritability’, often pointing to both structural variants (SVs) and rare variants2,3. SVs account for a greater total number of nucleotide changes in human genomes than the far more numerous single-nucleotide variants (SNVs)4. To date, such population studies have relied mostly on high-throughput short-read sequencing technologies, which produce reads ranging from 25 bp to 400 bp in length5. However, short reads have important limitations in characterizing repetitive regions6,7. DNA repeats act as the genomic substrate to facilitate SV formation8 while also hampering SV discovery owing to read alignment inaccuracies. Even in a non-repetitive genome, variations such as insertions (especially for alleles longer than the read length7) or other modifications (for example, methylation) would be missed by an approach relying solely on short reads.

Long-read sequencing has emerged as superior to short-read sequencing and other methods (for example, arrays) for the identification of structural variation, as shown by the Genome in a Bottle (GIAB) and Human Genome Structural Variation (HGSV) consortia, which combined multiple technologies to comprehensively characterize structural variation in human genomes9,10. These studies highlighted that a substantial proportion of hidden variation can be discovered with long-read sequencing. Indeed, recent long-read sequencing studies of Icelandic and Chinese populations have already identified previously undetected variants associated with height, cholesterol level and anaemia11,12. Analysis of 26 maize genomes13 revealed SVs involved in disease resistance and other agronomically important traits. In addition, long-read sequencing improves the continuity, accuracy and range of variant phasing14,15,16, aids the assessment of complex small variants17 and has been applied to find disease-associated alleles18,19,20. For de novo assemblies, multiple methods have been published over recent years to promote the use of long reads21,22,23,24,25.

Ongoing advances in sequencing technology and bioinformatics have paved the way to achieving long-read sequencing on a population scale26. The two main competitors driving innovation in the field are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). PacBio high fidelity (HiFi) reads are generated by their Sequel II system; HiFi reads are both long (15–20 kbp) and highly accurate27. The ONT PromethION platform can produce much longer reads (up to 4 Mbp28) and has a higher throughput at lower cost, but produces less accurate reads than the Sequel II system. Recent comparisons show an equivalent performance for SV calling with the two platforms29,30 (an in-depth technical review and further comparison of long-read sequencing platforms are available elsewhere31). Within the past 2 years, multiple studies have applied long-read sequencing to answer various questions in a range of organisms32,33,34,35 (Fig. 1; Table 1). The largest human-focused long-read sequencing study to date investigated the genomic diversity of 3,622 Icelandic genomes11, with many other studies to follow, such as the NIH All of Us research programme and the NIH Center for Alzheimer’s and Related Dementias (CARD) in the USA and similar efforts in China, Abu Dhabi and Qatar. Long-read sequencing of a global diversity cohort is also being carried out as part of the Human Pangenome project36. Aside from human studies, long-read sequencing has been applied on a population scale to discover structural variation associated with phenotypes in crops32,33, fruitflies34 and songbirds35, and increasingly has a role in metagenomic studies (Box 1). Here, we restrict our discussion to eukaryotic organisms, as long-read sequencing studies of bacteria and other prokaryotes require specific laboratory and bioinformatics approaches, and the challenges are inherently different.

Fig. 1: Overview of population-scale studies using long-read sequencing.
figure 1

Studies published in 2019–2021 in which five or more samples were sequenced are included. The genome size of the study organisms is divided into three categories (<500 Mbp, 500–2,000 Mbp and >2,000 Mbp), and the methodological approach taken to investigate genetic variation (comparison of assemblies, read mapping against a reference or both) is indicated by the different colours. For further details, see Table 1.

Table 1 An overview of long-read-based population studies

In this Review, we discuss the approach of long-read, population-scale, whole-genome sequencing and highlight its advantages, point out challenges and provide an overview of different experimental setups. We define population-scale sequencing here as sequencing of more than five genomes, although in the case of more limited genomic diversity in some organisms, a lower number of individual genomes may be sufficient. We focus on technologies that produce continuous sequence reads and do not address other long-range technologies, such as linked reads or optical mapping (for example, Bionano Genomics). However, both these technologies may be useful and applicable in a population setting37,38. When sequencing of the highest number of samples is required, targeted sequencing may be a cost-efficient alternative to whole-genome approaches (Box 2). Similarly to most population-scale sequencing projects, we focus on germline variants, as somatic variants require higher genome coverage and access to the relevant tissues.

Project strategies

The total number of sequenced individuals (or rather chromosomes) should in general be as high as possible. However, the different underlying questions that motivate population-scale sequencing studies have vastly different sample size requirements. Although estimating the degree of genetic differentiation or ancestral population size is already possible with a sample size as low as ten chromosomes (five individuals of a diploid organism)39, the identification of rare variants (and potentially associated diseases) in a population usually requires sample sizes that are many orders of magnitude higher40. Regardless of the approach taken, it is crucial to keep track of metadata and control for covariates in the cohort selection.

There are multiple commonly applied strategies with specific budget requirements to be considered at the beginning of a large population-scale sequencing project (Fig. 2a). Here, we discuss three main strategies that allow for different scaling and budgeting and thus have an impact on the level of resolution in detecting genetic variation. Across virtually all sequencing technologies, the cost per sequenced base pair is consistently decreasing. To be able to compare the strategies discussed below, we use the required long-read sequencing output as a proxy for costs (Supplementary table 1). Although we assume a diploid genome with a size similar to the haploid human genome (3.2 Gbp), we note that for genomes with higher ploidy (for example, hexaploid plants), the overall coverage must be adapted to the ploidy of the organism (that is, the number of homologous chromosomes). Furthermore, we assume a sample size of ~2,500 individuals, similar to that of the 1000 Genomes project41. At the time of writing (early 2021), the least expensive option to generate long-read data is the ONT PromethION platform, with a yield of roughly 100–150 Gbp per flow cell at a price between US$650 and US$2,100, depending on the discount obtained when multiple flow cells are purchased simultaneously. Of note, PacBio HiFi reads are of adequate length and high accuracy, and although not formally assessed, it is reasonable to expect that lower coverage would be sufficient with this technology. However, at the time of writing (early 2021) this still equates to a higher cost than with the ONT PromethION platform, as one PacBio single-molecule real-time (SMRT) cell costs ~US$1,300 and yields ~500 Gbp (continuous long reads) or ~30 Gbp (HiFi) of data.
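As a rough planning aid, the per-flow-cell figures above can be turned into a back-of-the-envelope estimate of sequencing consumables per sample. The short Python sketch below assumes the approximate PromethION yields and prices quoted in this paragraph (early 2021) and a 3.2 Gbp genome; it deliberately ignores library preparation, compute and storage costs, which also scale with cohort size.

```python
# Rough per-sample flow-cell and consumables estimate for an ONT PromethION run,
# using the approximate figures quoted in this section (early 2021); actual yields
# and prices vary with chemistry, discounts and sample quality.

GENOME_SIZE_GBP = 3.2             # haploid human-sized genome, in Gbp
FLOWCELL_YIELD_GBP = (100, 150)   # conservative and optimistic yield per flow cell
FLOWCELL_PRICE_USD = (650, 2100)  # bulk-discount and list price


def flowcells_per_sample(coverage: float, yield_gbp: float) -> float:
    """Number of flow cells needed to reach a given coverage for one sample."""
    required_gbp = coverage * GENOME_SIZE_GBP
    return required_gbp / yield_gbp


if __name__ == "__main__":
    for coverage in (8, 20, 30):
        best = flowcells_per_sample(coverage, FLOWCELL_YIELD_GBP[1])   # best case
        worst = flowcells_per_sample(coverage, FLOWCELL_YIELD_GBP[0])  # worst case
        cost_low = best * FLOWCELL_PRICE_USD[0]
        cost_high = worst * FLOWCELL_PRICE_USD[1]
        print(f"{coverage:>2}x: {best:.2f}-{worst:.2f} flow cells, "
              f"~US${cost_low:,.0f}-{cost_high:,.0f} in consumables per sample")
```

Under these assumptions, a single sample sequenced to 20-fold coverage fits comfortably within the output of one flow cell.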

Fig. 2: Overview of long-read population study design.
figure 2

a | The experimental design of three different approaches is outlined. In the first strategy (left), all samples are sequenced at medium to high coverage by long-read sequencing. In the second approach (middle), a proportion of the samples are sequenced with medium to high coverage and the remainder using low coverage by long-read sequencing (similar to the initial 1000 Genomes project). In the third approach (right), a proportion of the samples are sequenced at medium to high coverage by long-read sequencing and the remainder by short-read sequencing. The decision of which approach to take will affect the ability to detect common (red symbols) or rare (grey symbols) events in the population. The decision also depends on the available budget, existing data and the sample DNA availability. b | Overview of currently established sequencing technologies based on CHM13 sequencing data79: Illumina, Pacific Biosciences (PacBio) high fidelity (HiFi) reads or ultra-long reads from Oxford Nanopore Technologies (ONT). The N50 read length and average read accuracy are highlighted in orange. Although each technology has advantages and disadvantages, HiFi and ONT are the most promising for future applications. c | Overview of analysis strategies. Although multiple approaches are available, the main decision is whether to use an alignment-based approach or a de novo assembly-based approach, which has implications for sequencing requirements as well as for the approaches, resolution and comprehensiveness of downstream computational analysis.

A full coverage approach

Although it is the most expensive of the three approaches, a strategy that aims to sequence every sample of the population at medium to high coverage (a ‘full coverage’ approach; Fig. 2a) provides the highest level of resolution. The main criterion for deciding on the coverage required per sample is whether a de novo assembly (>40-fold coverage required) or a reference-based alignment approach (>12-fold coverage required42) is planned. The advantages of this strategy are its comprehensiveness, the simplicity of the study design and the relatively straightforward computational workflow. Furthermore, samples receive similar coverage and are therefore equally well studied, and rare variants in each sample can be easily detected. Sequencing all 2,500 individuals at 20-fold coverage requires 150 Tbp of sequencing data.

A mixed coverage approach

In the ‘mixed coverage’ approach (Fig. 2a), a subset of samples that are representative of the subgroups in the cohort (for example, ethnicities or subpopulations) are sequenced at high coverage (for example, 30-fold) and the remaining samples at low coverage (for example, >5-fold). Although this approach is generally less expensive than the full coverage approach, it still achieves high overall detection sensitivity and is thus particularly suitable for studies with a high number of individuals or a limited budget. However, several analytical challenges remain, especially in achieving high accuracy of genotypes across multiple samples or differentiating somatic from heterozygous germline variants, which is further complicated by regions exhibiting recurrent mutations. In addition, there will certainly be a bias towards common alleles with this mixed coverage approach, as many rare alleles can be missed, especially if a locus is heterozygous and the alternative allele is thus sparsely covered. Assuming that in this second strategy 200 individuals are sequenced at 30-fold coverage and the remainder of the cohort at 8-fold coverage, this approach requires 73 Tbp of data and is thus potentially half as expensive as the full coverage strategy.
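The cohort-level totals quoted for these two strategies follow directly from samples × coverage × genome size. The minimal sketch below reproduces the rounded figures given in the text; an effective genome size of roughly 3 Gbp matches the quoted 150 Tbp and 73 Tbp, whereas using 3.2 Gbp gives slightly higher totals.

```python
# Back-of-the-envelope sequencing totals for the full and mixed coverage strategies.
# An effective genome size of ~3 Gbp reproduces the rounded figures quoted in the
# text (150 Tbp and 73 Tbp); using 3.2 Gbp gives slightly higher totals.

GENOME_GBP = 3.0
COHORT = 2500


def tbp(samples: int, coverage: float, genome_gbp: float = GENOME_GBP) -> float:
    """Total sequencing output in Tbp for `samples` sequenced at `coverage`-fold."""
    return samples * coverage * genome_gbp / 1000


full_coverage = tbp(COHORT, 20)                       # everyone at 20x
mixed_coverage = tbp(200, 30) + tbp(COHORT - 200, 8)  # 200 at 30x, the rest at 8x

print(f"full coverage:  {full_coverage:.0f} Tbp")   # ~150 Tbp
print(f"mixed coverage: {mixed_coverage:.0f} Tbp")  # ~73 Tbp
```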

A mixed sequencing approach

The ‘mixed sequencing’ approach (Fig. 2a) involves long-read sequencing of just a few samples (for example, 10–20% of all samples) and short-read sequencing of the remaining samples to genotype variants that are discovered by long-read sequencing. The rationale behind this approach, similar to the selection of individuals for high coverage in the mixed coverage strategy, is to identify a small subset of samples (either randomly or by known diversity43, ethnicity or phenotype) and sequence only these to higher coverage. This mixed sequencing approach was effective in elucidating germline SVs that predispose to cancer, whereby short-read sequencing was used to identify evidence of SVs, followed by long-read sequencing of selected samples44. Phylogenetic analysis of variants detected by short-read sequencing has also been used to select a representative set of soybean accessions for long-read sequencing and de novo assembly33. Other studies have used SVCollector43, which iteratively selects the most diverged sample and re-ranks the remaining samples based on the variation not yet selected, to automatically choose samples for long-read sequencing that complements existing short-read sequencing data25,32. Once a subset of samples has been sequenced with long-read technologies, yielding a set of identified SVs, the breakpoint coordinates of these SVs (for example, insertions) can be genotyped across the short-read sequence data sets. In this way, robust allele frequencies for the identified variants can be obtained, albeit with a bias towards variants identified by long-read sequencing, which means that rare variants contained in other samples may be missed. It may not be possible to directly genotype all types of SV using short reads, especially in repetitive regions, but knowledge of the haplotypes on which the SVs of interest are found will enable imputation of these variants based on short-read SNV genotypes11. This strategy has already been applied using diversity panels of human SVs to discover novel expression quantitative trait loci (eQTLs)45,46 and signatures of evolutionary adaptation47. If no additional short-read data need to be generated for this strategy, it is likely to be the most affordable of the three approaches, as sequencing 200 of the 2,500 individuals to 30-fold coverage requires only 18 Tbp of data.
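The iterative selection implemented in SVCollector can be viewed as a greedy maximum-coverage heuristic: at each step, the sample contributing the largest number of variants not yet represented in the selected set is chosen, and the remaining samples are re-ranked. The sketch below is a simplified, hypothetical illustration of this idea rather than the SVCollector implementation itself; the sample names and variant identifiers are purely illustrative.

```python
# Simplified greedy selection of samples for long-read sequencing, in the spirit of
# the iterative strategy described above: repeatedly pick the sample that adds the
# most not-yet-selected variants, then re-rank the remainder. This is an
# illustrative sketch, not the SVCollector implementation.

def greedy_select(genotypes: dict[str, set[str]], n_pick: int) -> list[str]:
    """genotypes maps sample name -> set of variant IDs carried by that sample."""
    selected: list[str] = []
    covered: set[str] = set()
    candidates = dict(genotypes)
    while candidates and len(selected) < n_pick:
        # Rank remaining samples by how many new variants they would contribute.
        best = max(candidates, key=lambda s: len(candidates[s] - covered))
        selected.append(best)
        covered |= candidates.pop(best)
    return selected


# Toy example with hypothetical samples and variant IDs.
toy = {
    "sampleA": {"sv1", "sv2", "sv3"},
    "sampleB": {"sv2", "sv3"},
    "sampleC": {"sv4", "sv5"},
    "sampleD": {"sv1", "sv5", "sv6"},
}
print(greedy_select(toy, n_pick=2))  # ['sampleA', 'sampleC'] under these toy data
```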

Sequencing logistics

Efficiently operating long-read sequencers at scale, from logistics to sample preparation, loading optimization and run monitoring, is not a trivial task. ONT and PacBio each have advantages and challenges at almost every step of this process, given their different flow cell and sequencing instrument designs (Fig. 2b). The per-sample sequencing process and the characteristics of each technology are reviewed elsewhere31.

A substantial amount of highly pure, high molecular weight (HMW) input DNA is of crucial importance for these methods. Achieving this DNA quality requires specific extraction methods and is often challenging for samples for which only limited or degraded material is available (for example, non-contemporary samples or samples from very small organisms). Amplification-free low-input DNA kits exist for both the PacBio48 and ONT (https://nanoporetech.com/products/kits) sequencing platforms, with a minimum input DNA amount of 150 ng and 400 ng, respectively. However, both platforms frequently require much more DNA to produce optimal sequencing yields. At the time of writing, it is often necessary to perform a nuclease flush and library reloading on an ONT flow cell to recover blocked pores and obtain the highest yield, which is an additional preparation step that is not necessary for PacBio cells. Importantly, ONT flow cells and PacBio SMRT cells have a limited shelf life, which is logistically challenging when sequencing many samples. Depending on the organism and its features, such as its physical size, the presence of a cell wall or secondary metabolites, high-quality DNA extraction can be a major constraint. Variability in DNA quality and molecular weight is a common issue, and pre-sequencing quality control is necessary to ensure that inadequate samples are omitted and that technical covariates are recorded so they can be taken into account in downstream statistical analysis.

ONT sequencers store the raw data as HDF5 files (in the fast5 format), requiring base calling to obtain the more commonly used and much smaller fastq and BAM formats. Currently, incremental updates to the ONT base-calling algorithm regularly improve read accuracy49, which means that repeating the base calling of older data can be worthwhile. This reanalysis requires long-term storage of the fast5 files, which can be up to 1.5 TB for a single PromethION flow cell, although further compression is possible50. By contrast, the PacBio base-calling process is highly mature, and BAM files containing unaligned reads are produced directly by the sequencing machine. For HiFi reads, post-processing of the subreads is essential to collapse the consecutive passes over each circular DNA molecule into a high-quality consensus sequence; this step can also be performed on board the latest version of the instrument (Sequel IIe system), which greatly reduces the overall data storage requirement.

Analytical considerations

Arguably the main challenge in population-level studies is a scalable and streamlined analysis. Multiple recent reviews have discussed approaches at the single-sample level6,7,21. Table 2 lists computational tools that are commonly used in long-read sequencing projects, and these are reviewed in depth elsewhere6,7. Of note, in this very rapidly developing area of genomics, new tools are introduced constantly while established ones quickly become outdated. As we do not assume that matching short-read sequencing data are available for every individual, the integration of long-read and short-read data is not discussed. Nevertheless, we highlight the important role of short reads for polishing long reads51 and assemblies52 and for the fine-scale resolution of SV breakpoints11. These applications may lose their relevance as the accuracy of long-read sequencing improves, as is already the case for PacBio HiFi data.

Table 2 An overview of software tools for analysing long-read sequencing data

For population-scale projects, the choice of analytical tools often involves balancing sensitivity and computational efficiency. Before downstream analysis, it is crucial to perform quality control of experimental factors that directly affect the performance of assembly, SV detection and read phasing, such as DNA fragment length and sequencing yield. Multiple tools have been developed for this purpose53,54. Changes in sequencing chemistry or technical equipment during the project may lead to artefacts in the analysis and can thus potentially affect the findings. As such, it is important to randomly assign samples to batches (for example, sequencing runs) to reduce technical covariates.
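Fragment length and yield can be summarized directly from the base-called reads before any downstream analysis. The sketch below is a minimal example that computes total yield and read-length N50 from a FASTQ file, assuming the pysam library is installed; dedicated QC tools (Table 2) report these and many additional metrics.

```python
# Minimal read-length QC: total yield and read-length N50 from a FASTQ file.
# Assumes the pysam library is installed; dedicated tools report far more metrics.
import sys
import pysam


def read_length_n50(fastq_path: str) -> tuple[int, int]:
    """Return (total bases, N50 read length) for a FASTQ/FASTA file."""
    lengths = []
    with pysam.FastxFile(fastq_path) as fh:
        for entry in fh:
            lengths.append(len(entry.sequence))
    lengths.sort(reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= total / 2:
            return total, length
    return total, 0


if __name__ == "__main__":
    total_bp, n50 = read_length_n50(sys.argv[1])
    print(f"yield: {total_bp / 1e9:.2f} Gbp, read N50: {n50:,} bp")
```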

Two main strategies for downstream analysis are available: aligning reads from individual samples to a single reference genome or comparing de novo assemblies (Fig. 2c). These two approaches are very different in their computational and coverage requirements, which in turn depend to a large extent on genome size and complexity. For both approaches, the goal is to apply the same set and versions of methods to all samples. The results need to be generated in a consistent way using correct version control and reproducible pipelines to avoid additional artefacts in the analysis. In the following sections, we discuss alignment-based and de novo assembly approaches and graph genome-based methods.

Read alignment-based analysis

Alignment-based approaches are often the method of choice for population-scale studies, as they facilitate the comparison of all samples against a common coordinate system (that is, the reference genome), which is illustrated by the fact that more than half of population studies (Fig. 1; Table 1) employ these approaches. Furthermore, these approaches are often less computationally demanding and require substantially less coverage than assembly-based methods. Alignment-based approaches rely on matching sequencing reads with a reference genome, the overall correctness of which will affect the analysis of read data7. If the reference genome is incomplete, incorrect, fragmented or too divergent from the focal sample, it will lead to biases in the downstream analysis55,56.

The software for long-read sequence data analysis is under constant development, and alignment methods in particular have become much faster in recent years (Table 2). The NGMLR42 and LAST57 methods speed up the alignment process and improve the accuracy of long-read alignment. The minimap2 aligner is considerably faster than its competitors while often delivering similar results, and thus it is currently the most popular, widely accepted long-read aligner58. Two noteworthy recent innovations are Winnowmap, which improves alignments (specifically in repetitive regions)59, and lra, which improves the alignment in the presence of SVs60.
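As a concrete example of a typical alignment step, the sketch below wraps minimap2 and samtools in a small Python function. The preset names (map-ont for ONT reads, map-hifi for PacBio HiFi reads) follow the minimap2 documentation, but exact flags, presets and resource settings should be verified against the installed versions before running at scale.

```python
# Example wrapper around a typical long-read alignment step: minimap2 producing SAM,
# sorted and indexed with samtools. Preset names follow the minimap2 documentation
# (map-ont for ONT reads, map-hifi for PacBio HiFi); check flags against the
# installed versions before running at scale.
import subprocess


def align_sample(ref: str, reads: str, out_bam: str, preset: str = "map-ont",
                 threads: int = 8) -> None:
    minimap2 = subprocess.Popen(
        ["minimap2", "-a", "-x", preset, "-t", str(threads), ref, reads],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"],
        stdin=minimap2.stdout, check=True,
    )
    minimap2.stdout.close()
    if minimap2.wait() != 0:
        raise RuntimeError("minimap2 failed")
    subprocess.run(["samtools", "index", out_bam], check=True)


# Example invocation (paths are placeholders):
# align_sample("reference.fa", "sample1.fastq.gz", "sample1.bam", preset="map-hifi")
```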

The choice of tools for the detection of genetic variation is arguably of equal importance. For SVs, several tools are currently available, such as Sniffles42, SVIM61, PBHoney62, CuteSV63 and pbsv (Table 2). One of the remaining challenges is the accurate representation of SV breakpoints, which is particularly difficult in the context of more complex events involving multiple variants in repetitive regions, such as segmental duplications or large tandem repeat arrays (SV detection methods are comprehensively reviewed elsewhere7,64). Recently developed tools are removing the need for high sequencing coverage by enabling SV calling42,65 and genotyping42,66 at lower coverage, although the associated risk of incomplete or erroneous SV detection and genotyping cannot be ignored.
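To illustrate the kind of alignment signatures on which these callers rely, the toy sketch below scans a BAM file for insertions and deletions of at least 50 bp encoded within individual read alignments (the CIGAR string). Production SV callers such as those listed above additionally use split and supplementary alignments, read depth and clustering of signatures across reads, so this should be read as a didactic simplification only.

```python
# Toy illustration of intra-alignment SV signatures: report insertions and deletions
# of >=50 bp encoded in the CIGAR strings of a BAM file. Real SV callers additionally
# use split/supplementary alignments, read depth and per-variant clustering.
import pysam

MIN_SV_LEN = 50
CIGAR_INS, CIGAR_DEL = 1, 2  # pysam CIGAR operation codes


def cigar_sv_signatures(bam_path: str):
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary:
                continue
            ref_pos = read.reference_start
            for op, length in read.cigartuples:
                if op == CIGAR_INS and length >= MIN_SV_LEN:
                    yield (read.reference_name, ref_pos, "INS", length, read.query_name)
                elif op == CIGAR_DEL and length >= MIN_SV_LEN:
                    yield (read.reference_name, ref_pos, "DEL", length, read.query_name)
                # operations that consume the reference advance the coordinate
                if op in (0, 2, 3, 7, 8):  # M, D, N, =, X
                    ref_pos += length


# for chrom, pos, svtype, length, read_name in cigar_sv_signatures("sample1.bam"):
#     print(chrom, pos, svtype, length, read_name)
```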

Owing to the different error profiles of long reads, naive pile-up approaches or SNV and small insertion–deletion (indel) calling methods that were developed for short-read sequencing are usually inadequate or suboptimal for long reads. Over the past few years, multiple strategies have been developed to improve the detection of small variants with sophisticated machine learning models for each of the long-read sequencing technologies (Table 2). Current methods include, for example, DeepVariant67, Pepper68 and Clair69 (which use deep learning) and LongShot70 (which explicitly requires alleles to be concordant with the haplotype structure); such long-read approaches can also outperform Illumina-based SNV calling71. PacBio HiFi, in contrast to ONT, is also competitive with Illumina for small indels.

Expansions and contractions of tandem repeat arrays are a highly challenging and frequent type of variation72. As these repetitive DNAs, which include short tandem repeats (1–6 bp repeat unit) and minisatellites (>6 bp repeat unit), are known to contain disease-causing alleles, accurate characterization of them is crucial73. Some tools have been developed specifically for this purpose74, such as tandem-genotypes75 and TRiCoLOR76. Similar challenges remain for accurate characterization of other repeats. For example, the LPA locus (encoding apolipoprotein(a)) consists of 8 kbp tandem repeat units (encoding kringle IV domains) that are repeated 5–10 times in human genomes77, making it notoriously difficult to assess.
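As a simple illustration of why reads that span an entire repeat array make such loci tractable, the naive sketch below counts motif occurrences in reads that fully span a (hypothetical) repeat locus. Dedicated tools such as tandem-genotypes and TRiCoLOR handle alignment, sequencing errors, motif impurity and diploid genotype clustering far more rigorously.

```python
# Naive tandem repeat copy-number estimate from reads that fully span a repeat locus.
# Dedicated tools (for example, tandem-genotypes or TRiCoLOR) handle sequencing errors,
# motif impurity and diploid genotypes much more rigorously; this is a didactic sketch.
# Requires a coordinate-sorted, indexed BAM file.
import pysam


def spanning_copy_counts(bam_path: str, chrom: str, start: int, end: int,
                         motif: str, flank: int = 100):
    """Count motif occurrences in each read spanning [start, end) plus flanks."""
    counts = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if read.reference_end is None:
                continue
            if read.reference_start > start - flank or read.reference_end < end + flank:
                continue  # keep only reads spanning the whole array plus flanks
            seq = read.query_sequence
            if seq:
                counts.append(seq.upper().count(motif.upper()))
    return counts


# Hypothetical locus coordinates and motif; a bimodal distribution of counts would
# suggest two distinct alleles in a diploid sample.
# print(spanning_copy_counts("sample1.bam", "chr1", 1_000_000, 1_000_200, "CAG"))
```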

To date, most reference genomes consist of a haplotype-collapsed representation, in which two or more chromosomal haplotypes are collapsed during assembly to a single artificial consensus sequence. Phased genome assemblies, in which the haplotype structure of each chromosome is fully resolved, have the potential to more accurately represent the genome. The human Telomere-to-Telomere (T2T) consortium effort aims to produce the first full chromosome assembly of the human genome from the essentially haploid complete hydatidiform mole (CHM13) genome and has already completed assembly of chromosome 8 (ref.78) and chromosome X (ref.79). In another example, a single haplotype from a haplotype-resolved de novo assembly was used as the reference for read alignment in a population genetic study in crows35.

Population-scale de novo assemblies

Many reference genomes based on short-read sequencing are incomplete or highly fragmented with many gaps80. Furthermore, hundreds of megabases of population- and individual-specific sequences are absent from the human reference genome81. These missing sequences are often repetitive, but also include coding sequences. As a consequence, a fraction of reads derived from a sample cannot be aligned to the reference genome or they align to paralogous sequences, leading to tens of thousands of false-positive and false-negative variants for each individual82. Therefore, creating and comparing de novo assemblies is desirable (Fig. 1).

The increased availability and affordability of long-read sequencing data have led to an explosion of faster and more accurate genome assembly tools (Table 2); a haplotype-resolved de novo assembly is commonly considered the most comprehensive representation of a genome. This competition to produce improved de novo assembly methods has led to the rapid development of new tools, usually focusing on computational demand, contiguity, completeness or correctness, indicating that genome assembly represents (at present) a trade-off between these key parameters. De novo assembly-based approaches are often more sensitive and better for reconstructing highly diverse regions of the genome than alignment-based approaches, but can also lead to a collapse of highly similar segmental duplications83. For such duplicated regions, specific algorithms have been developed that leverage SNVs that differentiate multiple copies of repeats and can thereby recover medically relevant duplicated genes84,85. The dependence of de novo assembly on high read coverage and more computationally demanding methods has historically made it very challenging for large population-scale sequencing. However, the ever-increasing yield of sequencing technologies will enable the sequencing of each sample to sufficient coverage to obtain a high-quality de novo assembly86 (Fig. 1; Table 1).

Single-genome projects iteratively test multiple parameters or different methods to optimize a de novo assembly, which is neither realistic nor desirable in a population context. Multiple projects have integrated proximity-ligation or strand-specific short-read sequencing methods to substantially improve the contiguity of their assemblies25,46, but such approaches do not scale well to large populations. De novo assembly-based approaches are typically also more computationally demanding, which becomes especially relevant for large numbers of samples. Large cloud computing and storage infrastructures might improve the scalability, but the computing cost will rise substantially. The recent development of less computationally demanding assemblers may be able to mitigate this limitation25.

Another important consideration is the scalability of the downstream computational approaches. Although the process of genome assembly already requires considerable computational resources, these demands increase linearly with the addition of more individuals. To infer genomic variation, de novo assemblies are usually compared with a chosen reference genome, yielding a standard variant call format (VCF) file. Currently, genomic alignment tools and dedicated variant callers (such as MUMmer87, Assemblytics88, minimap2 or dipcall89 and SVIM-asm61) are designed to provide a pairwise comparison of two genomes, such as an assembled genome and a reference genome (Table 2). However, in a project with multiple (diploid) genomes, this is clearly not ideal, as a whole-genome alignment-based approach likely suffers from the same biases as a read alignment-based approach. For example, novel sequence insertions identified in samples relative to a single reference genome will often be challenging to compare across all samples of the population (Fig. 3a). This issue might be further amplified by gaps in the reference assembly, which potentially reduce the number of regions that can be compared across the population. Although troublesome for comparisons across samples, assembly-based SV calling is more likely to correctly represent complex SVs that are longer than the read length and therefore harder to identify correctly with alignment-based methods (Fig. 3b). The most comprehensive option would likely be a compare-all-with-all approach (Fig. 3a), in which the number of unique pairwise comparisons increases quadratically, meaning that with 100 samples there are already 4,950 possible ways to compare samples with each other. Clearly, such an approach would currently not be feasible for most projects, and alternative strategies have to be developed. Most recently, the introduction of progressive Cactus90, a tool that progressively aligns assemblies along a guide tree and reconstructs ancestral genomes at its internal nodes, has enabled comparison across multiple genomes. However, to date this tool has mainly been tested across species and not between individuals of a species.

Fig. 3: Potential problems for different genome comparison approaches.
figure 3

a | Schematic depiction of a potential problem in a de novo assembly-based approach. The presence of a novel segment N1 at different locations in two de novo assemblies, and even more so of a sequence variant within it (red x), poses a challenge for correct reporting by current state-of-the-art methods and variation formats. b | Similar representation of the N1 problem in an alignment-based approach, in which the coordinates of N1 are shared but identification of the single-nucleotide variant (SNV) or of the entire N1 sequence remains challenging. c | A graph-based representation of N1, which enables a clearer comparison of the variant across the samples, illustrating the potential benefits of graph genomes. R1–R3 represent the backbone of the graph genome, and N1 and its SNV represent novel sequence for a given sample set.

Another, perhaps even greater, challenge in de novo assembly approaches is the correct representation of ploidy. Many organisms have diploid genomes (for example, humans and many animals) and even higher ploidies exist, such as in some crops. Tools optimized for diploid (that is, haplotype-aware) de novo assembly are available to reconstruct both haplotypes22. This reconstruction is essential to recover all heterozygous variation, as two different haplotypes may otherwise be collapsed to a single artificial and incorrect representation of the chromosome. However, haplotype-resolved de novo assemblies often require higher coverage and incur greater computational cost. The correct genotyping of both heterozygous and homozygous variants is of utmost importance for subsequent population genetic analysis. A recent solution is to first create an unphased assembly, then identify variants and partition reads into haplotypes before creating phased contigs86,91.

Even if a complete and accurate haplotype-resolved assembly is achieved, SV calling from assembly-to-assembly comparison might not be straightforward in highly complex regions. For example, the human LPA77 and SMN1 and SMN2 (ref.92) loci, with their highly repetitive structure, lead to problems in genomic alignments. As such, the main challenge may shift to genomic alignments and methods to interpret the detected differences between multiple assemblies.

Graph genome methods

Both read alignment and de novo assembly approaches can have systematic issues with complex structural variation, inserted sequences missing from the reference genome, repeat variability and highly polymorphic loci (Fig. 3). Linear reference genomes represent only one allele at each locus and thus do not incorporate the polymorphism and complexity of a population. Reference pan-genome approaches, which combine genomes from multiple individuals within a species, are a better fit to represent genomic diversity93,94 (Fig. 3c). Ongoing projects are building variant catalogues for pan-genome structures by using high-quality haplotype-resolved assemblies of diversity panels for variant discovery46. Alignment bias against non-reference alleles is reduced by explicitly taking known population variants into account in the read alignment step. As such, the analysis does not rely on a single reference genome. This goal is realized by graph genome-based tools and their associated data formats, as a way to represent a collection of possible (alternative) sequences95. Examples of tools for this purpose include vg96, minigraph97, the SevenBridges Graph Genome Pipeline98, the DRAGEN Graph Mapper and PanGenie99. These implementations provide tools to build graphs based on the linear reference genome and a collection of known variants, or alternatively use (haplotype-resolved) assembled contigs. Although a detailed discussion of the methods to construct such pan-genome graphs is beyond the scope of this Review, we note that there are important differences in implementation and data format with regard to compatibility with coordinates on the linear reference genome and storing information of the individual haplotypes that contributed to the included variation97. An additional benefit of graph genome methods is that they enable a more correct representation of nested variation, such as smaller variants within inserted sequences94.

A major benefit of graph genomes is the genotyping of SVs using short reads. Multiple tools, such as GraphTyper2100, Paragraph101 and tools from the vg package45,96, have been developed specifically for the alignment of short-read sequencing data to graph genome structures. SNVs, small indels or SVs within a sample are genotyped by identifying reads that follow a certain path (‘walk’) through the pan-genome graph96,101 (Fig. 4a). Graph genotyping methods enable the assessment of variants that remain undetected by the current state-of-the-art short-read SV discovery methods46. In the next step, variants that were not yet explicitly encoded in the graph can be identified, with the option to incrementally augment the graph structure with the newly found variation to further improve accuracy98,102. Graph genome methods are reviewed in greater depth elsewhere94,95,103.
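To make the notion of genotyping as path support concrete, the toy sketch below encodes a tiny sequence graph with a reference walk through nodes R1–R3 (as in Fig. 3c) and an alternative walk carrying an insertion node N1, and counts how many reads support each walk. The node sequences and reads are invented for illustration; production graph genotypers such as vg, GraphTyper2 or Paragraph align reads to the graph rather than relying on exact string matching.

```python
# Toy pan-genome graph: a reference walk R1->R2->R3 and an alternative walk carrying
# an insertion node N1 (as in Fig. 3c). Reads are 'genotyped' by counting which walk
# their sequence matches; real graph genotypers align reads to the graph instead of
# using exact string matching.

NODES = {
    "R1": "ACGTACGT",
    "R2": "TTGACCA",
    "R3": "GGCATCGA",
    "N1": "AAAAATTTTT",   # insertion allele between R1 and R2
}
WALKS = {
    "ref": ["R1", "R2", "R3"],
    "ins": ["R1", "N1", "R2", "R3"],
}


def walk_sequence(walk: list[str]) -> str:
    return "".join(NODES[node] for node in walk)


def genotype(reads: list[str]) -> dict[str, int]:
    """Count reads whose sequence is contained in each walk's spelled-out sequence."""
    support = {name: 0 for name in WALKS}
    for read in reads:
        for name, walk in WALKS.items():
            if read in walk_sequence(walk):
                support[name] += 1
    return support


# Two reads spanning the R1/R2 junction: one without and one with the insertion.
reads = ["ACGTTTGACC", "CGTAAAAATTTTTTTGA"]
print(genotype(reads))  # {'ref': 1, 'ins': 1} -> evidence for a heterozygous insertion
```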

Fig. 4: Genotyping of SVs and SNVs across a population set.
figure 4

a | Graph genome-based genotyping of a region with multiple alleles between two genome segments (green and pink). Insertions of different sizes (yellow) can be genotyped at the same locus using spanning reads (blue and purple) to identify the presence of two different alleles. b | An example of structural variants (SVs) and single-nucleotide variants (SNVs) across different unique and repeat regions being correctly or incorrectly genotyped based on read length. c | A phylogenetically informed filtering approach for SVs. Assuming that after a sufficiently long time (4Ne generations, where Ne is the effective population size) most or all genetic variation should be fully sorted between two clades, variants that do not adhere to this assumption and are polymorphic across clades (for example, variant 3) can be removed. Although this approach is certainly very conservative and ignores the fact that some types of variation exhibit repeated mutations at the same locus, it can be considered a first step towards more reliable genotyping of SVs.

With such graph-based approaches, the often discussed dichotomy of either using an existing reference genome for alignment or constructing a novel reference genome through de novo assembly can potentially be avoided for population studies, as downstream of this step all sequences have to be compared with a single (reference) assembly or with the backbone of a pan-genome graph for the identification of variation, annotation and statistical evaluation. However, these approaches are less straightforward in practice than the use of a linear reference genome and are not entirely mature, with competing implementations and data formats. Although graph genome methods are good candidates to resolve biases when assessing (structural) genomic diversity, it remains unclear whether these methods will become mainstream in clinical or diagnostic applications, in which a single reference genome is an attractive simplification.

Variant validation and genotyping

To determine whether any given variant reflects biological reality and is not just an artefact, it is important to perform validation. Ideally, this is done using orthogonal approaches to capitalize on the strengths of different technologies. Traditionally, PCR validation of variants has been the method of choice104; however, for complex SVs that contain highly repetitive regions, other, non-sequencing-based methods such as optical mapping might be more suitable46. Visual inspection of alignments and subsequent manual curation of variant sets are arguably a very accurate validation approach but certainly not feasible for more than a few hundred variants. A semi-automated pipeline, SV-plaudit, has been developed to enable rapid, streamlined and efficient curation of thousands of SVs105.

Of similar importance is variant genotyping, which we define as determining the presence and zygosity of a variant. Although the initial discovery of variation is relatively straightforward, obtaining reliable genotypes for a given variant across a population is usually much more difficult. However, knowing the alleles (that is, the genotypes) of variants for a given sample is particularly important in population genetic and evolutionary studies, in which population size estimation and measures of genetic differentiation (such as the fixation index FST) rely on obtaining accurate allele frequencies of variants106. In particular, variants in repetitive regions are more readily genotyped using long reads than using short reads (Fig. 4b). For SNVs, sophisticated genotyping approaches have been developed that consider important parameters such as mutation dynamics (for example, transition to transversion ratios) and information about non-variant sites to improve genotype accuracy107. The concept of a genomic variant call format (gVCF) has been implemented in applications such as freebayes108 and GATK109, which has improved the efficiency of the comparison and made multiple rounds of genotyping obsolete. Another approach is to completely abandon genotype calling and instead calculate posterior probabilities of genotypes to directly incorporate uncertainty in the downstream analysis (for example, ANGSD110). Merging SNVs is typically done with tools such as bcftools111 and RTGTools112.

For SVs, the situation is much more complicated, as establishing homology of variants between samples is not straightforward. One of the first approaches to be developed is based on 50% reciprocal overlap, which allows two SVs to be merged if they overlap substantially. Although this works well for large copy number variation events, there may be limitations for smaller SVs (for example, 50 bp to 1 kbp) with more localized breakpoints. Another approach is to require the breakpoints from each individual to be approximately in agreement to establish that a variant in two samples is indeed homologous (for example, SURVIVOR merge113). In some cases, such as when two insertions are homologous but their sequences slightly deviate, an approach based on breakpoints may be too conservative, and some tools attempt to address this issue (for example, Truvari9, SVanalyzer and Jasmine32). However, no universal standards for these thresholds are currently available; thus, approaches rely on arbitrary cut-offs for breakpoint distance and sequence similarity. Deletions are arguably the most straightforward type of variation to genotype, but calling heterozygotes for even this seemingly simple type of SV can be difficult114. Tools such as Sniffles and SVJedi are capable of genotyping SVs based on a candidate VCF file, following an initial step of SV discovery based on the long-read alignments66.
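The 50% reciprocal overlap criterion mentioned above reduces to a simple interval computation. A minimal sketch for two SVs represented as half-open (start, end) intervals on the same chromosome is shown below; insertions and breakpoint-only calls require different criteria, such as breakpoint distance and sequence similarity.

```python
# Minimal reciprocal-overlap check for two SVs on the same chromosome, represented as
# half-open (start, end) intervals. Two calls are merged if the shared span covers at
# least `threshold` of BOTH events; insertions and breakpoint-only calls need
# different criteria (for example, breakpoint distance and sequence similarity).

def reciprocal_overlap(a: tuple[int, int], b: tuple[int, int],
                       threshold: float = 0.5) -> bool:
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    return (overlap / (a[1] - a[0]) >= threshold and
            overlap / (b[1] - b[0]) >= threshold)


print(reciprocal_overlap((1000, 2000), (1400, 2100)))  # True:  >=50% of both events
print(reciprocal_overlap((1000, 2000), (1900, 5000)))  # False: only ~3% of the larger
```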

Another potentially very powerful approach to improve SV genotypes is to harness the information contained in a sampling scheme consisting of phylogenetically distant populations (Fig. 4c). In this approach, basic population genetic assumptions are made to reduce the number of false positives among genotyped SVs. After a sufficient number of generations (4Ne, where Ne is the effective population size), variation is likely to be fully sorted and no polymorphisms should occur across lineages any more, assuming that there are no repeated mutations at the same locus (that is, the infinite sites model)115. Any variants exhibiting polymorphic genotypes across the divergent lineages are excluded. Although this approach neglects the fact that certain types of SV have much higher mutation rates and thus indeed have the potential for repeated mutations (for example, variation within tandem repeat arrays), it provides a first step towards more reliable SV genotyping. This approach has recently been successfully applied in corvids (crows and jackdaws)35.
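In code, this filter amounts to discarding any variant that is still segregating within each of two deeply diverged clades. The simplified sketch below assumes genotypes encoded as 0 (homozygous reference), 1 (heterozygous) and 2 (homozygous alternative), with missing calls as None; the clade labels and variant identifiers are illustrative.

```python
# Phylogenetically informed SV filter: under the infinite sites assumption, a variant
# that is still polymorphic within BOTH of two deeply diverged clades (>4Ne generations
# apart) is treated as a likely genotyping artefact and removed. Genotypes are encoded
# as 0 (hom. reference), 1 (het.) and 2 (hom. alternative); missing calls are None.

def polymorphic(genotypes: list[int | None]) -> bool:
    called = {g for g in genotypes if g is not None}
    return len(called) > 1 or 1 in called  # both alleles observed within the clade


def phylo_filter(variants: dict[str, dict[str, list[int | None]]]) -> list[str]:
    """Keep variant IDs that are NOT polymorphic in both clades simultaneously."""
    kept = []
    for var_id, clades in variants.items():
        if not all(polymorphic(genos) for genos in clades.values()):
            kept.append(var_id)
    return kept


toy = {
    "var_a": {"cladeA": [2, 2, 2], "cladeB": [0, 0, 0]},  # fixed difference: keep
    "var_b": {"cladeA": [0, 1, 2], "cladeB": [1, 0, 2]},  # polymorphic in both: remove
    "var_c": {"cladeA": [0, 0, 1], "cladeB": [0, 0, 0]},  # polymorphic in one clade: keep
}
print(phylo_filter(toy))  # ['var_a', 'var_c']
```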

Prediction of functional impact

The mathematical framework for the analysis of (small) genetic variants predates the advent of high-throughput sequencing by almost a century and is therefore well established. Large-scale single-nucleotide polymorphism (SNP)-array-based GWAS projects enabled the interrogation of thousands of variants and haplotypes for their association with disease. Although quality assessment steps such as principal component analysis and testing for Hardy–Weinberg equilibrium still hold for SVs (that is, variants >50 bp), these models do not necessarily cover all types of SV, for example, in the case of a continuous spectrum of repeat lengths116. A solution, albeit with loss of resolution, would be to binarize the distribution into ‘reference’ and ‘expanded’ alleles, but historically it has been difficult to unambiguously establish a cut-off length. Association testing of the role of partially overlapping variants for a certain trait requires an approach conceptually similar to that used for burden analysis in rare variant association studies.

Whereas the classification of the impact of small variants on protein function (as synonymous, missense or loss-of-function) is relatively mature with tools such as the Ensembl VEP117, it is less straightforward to judge the impact of SVs on the expression of nearby genes. This is mainly because it is unclear how the length of an SV affects the surrounding genomic region, and it is often hard to obtain robust allele frequencies for SVs114. For functional annotation and pathogenicity prediction, approaches using joint linear models118, supervised learning119 and existing databases120 have been developed, and there are promising examples demonstrating that SVs are indeed associated with important traits of interest118,119.

Conclusions

Significant ongoing technological improvements have paved the way to applying long-read sequencing to population-scale sequencing projects and demonstrate that this sequencing approach is here to stay. This process already started with the first larger data sets generated by targeted sequencing of certain genes (Box 2) and continues with an increasing number of projects that leverage long reads at scale (Fig. 1; Table 1). The analysis of population-scale long-read sequencing data sets remains challenging, with the read alignment-based approach currently being the most feasible. Nevertheless, we anticipate this to change to the alignment of either haplotype-resolved de novo assemblies or individual sequencing reads to graph genome structures. This development will have a profound impact on the field and holds the promise of an improved representation of variants and of the complexity of the underlying biology, but it will require a paradigm shift from a linear to a more complex version of the reference genome.

PacBio and ONT lead the current development of long-read sequencing for multiple applications. However, other companies (for example, Base4, Quantapore and Omniome) are developing novel long-read technologies, the viability of which remains to be evaluated in the coming years. Although not discussed here, improved DNA extraction, conservation and library preparation are also adding to the rapid growth of long-read sequencing population studies31. Among the biggest achievements in recent years is the generation of sequence reads of 4 Mbp and longer, although this is not yet routinely possible without compromising yield28. Once sequencing reads routinely approach chromosome length, the process of de novo assembly may become obsolete; however, whether such reads can be used directly in a framework that is based on de novo assemblies instead of read alignment remains to be seen.

Future directions

The future of long-read population-scale sequencing holds many opportunities for multiple types of omics assays. For example, both the PacBio and ONT platforms are able to simultaneously detect the nucleotide sequence and modifications of DNA such as 5-methylcytosine121. The identification of such modifications has unprecedented implications for epigenetics and the analysis of DNA damage. More recent versions of the ONT base callers are trained to detect common nucleotide modifications, which, together with the plateauing accuracy, potentially alleviates the need to store raw data. Several studies have shown excellent reproducibility and correlation with bisulfite sequencing, suggesting that nanopore sequencing could become the gold standard for detecting methylation patterns122. Although methods tailored to short-read bisulfite sequencing exist, there is a lack of statistical methods for differential methylation assessment that leverage the unique ability of long reads to phase modifications onto parental haplotypes over large distances. Detection of nucleotide modifications further opens up a wealth of opportunities for specialized assays such as chromatin accessibility profiling123 and replication fork detection124.

Complementary to DNA-based population sequencing, long-read sequencing of mRNA and complementary DNA (cDNA) also enables the identification of isoform diversity125. Multiple pipelines have been developed to investigate known and novel isoforms, but the field is far from mature. A survey of multiple tissues has already been undertaken125, and an extension of this to the population scale, as in the short-read GTEx project, is highly likely to yield valuable information about transcript structure and the influence of regulatory (structural) variation. Long-read sequencing approaches have also been extended to the direct sequencing of proteins126 and to single-cell transcriptomics127. Although these applications are likely to lead to biologically fascinating insights, the implications for population studies remain unclear127.

Alongside the technological improvements in long-read sequencing, computational analysis has also improved, which is key to enabling population-scale projects. Analyses that took weeks to months to accomplish a year ago can now be completed within a day to a week and at a lower cost24,86,128. However, some conceptual challenges remain, such as the representation of nested and highly complex variation97. Recent advances, such as pan-genome graphs, have the potential to address this challenge97. Furthermore, the use of pan-genome graphs could improve the analysis itself, as they overcome the problem of linear reference bias by including different alleles96,100,101. Another related computational challenge is the accurate and rapid genotyping of complex alleles. Here, graph genomes have already shown significant benefits, although the process of obtaining a fully genotyped population-level VCF is still far from trivial. This is partly due to the lack of a gVCF equivalent for SV representation that would capture information not only about the alternative alleles (that is, the SVs) but also about the reference alleles; for SNVs, such a representation allows the easy comparison of samples, and it will be a requirement for future SV studies.

Despite significant advances in long-read sequencing, several challenges remain to be addressed. The frequently discussed issue of limited precision and sensitivity in identifying SNVs and small indels, especially those involving homopolymers, is likely to be resolved by advancements in sequencing accuracy27,68. However, difficulties remain in assessing variation in complex regions such as segmental duplications, ribosomal DNA (rDNA) tandem arrays, telomeres or centromeres. Spurred by the efforts led by the T2T consortium, which aims to provide the full linear nucleotide sequence of all human chromosomes, new software tools are being developed that specifically aim to resolve these large tandem arrays and to assess the allelic variation within them. However, whether this solves the problem completely remains to be determined, as at the time of writing even the T2T reference genome has a few gaps remaining and represents only one ethnicity.

In this Review, we provide a snapshot of the present state of large-scale long-read sequencing and discuss the exciting developments in biotechnology and bioinformatics. Despite its challenges, we argue that long-read sequencing has contributed immensely to the advancement of genomics in humans, model organisms and beyond, and that this is the way forward for population-scale studies.