Main

Structural variation in the genome refers to cytogenetically visible and (more commonly) submicroscopic variants, including deletions, insertions, duplications and large-scale copy number variants — collectively termed copy number variations (CNVs) — as well as inversions and translocations (Box 1)1,2,3. Genome scanning technologies are now commonplace in many laboratories, allowing new structural variation to be recognized from general population surveys4,5,6,7,8,9,10,11,12 or studies of diseases13,14,15,16,17,18,19,20,21. In fact, the Database of Genomic Variants4,22 (see list of databases in Table 1) already contains entries (mainly CNVs) covering some 538 Mb (18.8% of the euchromatic genome) derived from the study of fewer than 1,000 genomes from individuals with no obvious disease phenotype.

Table 1 Databases

This first round of observations came from several studies, each using a different technology platform and data processing algorithms, with different degrees of pre- and postexperimental standardization and validation. As a result, the data vary in quality and often have both high false-positive and false-negative rates. There is the very real possibility of the entire human genome soon being presented as 'structurally variant' in one form or another, based solely on studies of nondisease samples, which would be a distortion. It will be important for all future applications of structural variation information that the scope and detail of variants in the general population be accurately cataloged. In particular, medical genetics research — investigating structural variation profiles in individuals or clinical cohorts — will need a reliable foundation against which to interpret possible pathogenic findings in cytogenomic (Fig. 1), linkage and genome-wide association studies21,23,24,25.

Figure 1: Lexicon of genomic variation.

Descriptors of variation began in the realm of cytogenetics, followed by those from the field of molecular genetics and, most recently, by technologies such as those described in this perspective, which bridge the gap for detection of genomic variants (sometimes called cytogenomics55). The designation of the category '1 kb to submicroscopic' is somewhat arbitrary at both ends, but is used for operational definition. In a broad sense, structural variation has been used to refer to genomic segments both smaller and larger than the narrower operational definition, as illustrated by the large bracket. The focus of recent discoveries has been the subgroup in the midrange (indicated with strong highlighting), but the gradation of shading illustrates that the biological boundaries may really encompass some forms of variation previously recognized from either cytogenetic or molecular genetic approaches. At the molecular level, SNPs can be identified that are representative of the underlying haplotype structure (tagSNPs). As structural variation becomes better integrated with the existing SNP-based linkage disequilibrium maps, it is likely that presence or absence of many structural variants will simply be inferred by typing selected SNPs11,25,73.

The field of genomic structural variation, however, is on the cusp of change. Pioneering approaches, often fragmented or fraught with technical limitations, are being supplanted by new technologies that afford much higher resolution screening of the genome at lower cost. We anticipate that, in the next year, the quantity of structural variation data will increase by orders of magnitude owing to microarray-based experiments alone, not to mention the plethora soon to flow from clone-end6,26 or whole-genome sequencing experiments27,28,29,30. Many of these studies will survey nondisease samples for structural variation discovery to create control databases. Moreover, in little more than two years from the first description of global CNV distribution4,5, the field is poised to make structural variation analyses standard in the design of all studies of the genetic basis of phenotypic variation. At this inflection point, we examine what is known about genomic structural variation, and consider perspectives and simple standards designed to safeguard integrity and maximize data utility for the immediate future.

Challenges in characterizing structural variants

Research into structural variation is currently at a state of development comparable to that of the earliest SNP studies. Initiatives to discover and characterize simpler structural variants — such as small insertions, deletions (indels) and balanced inversions — are likely to yield results in proportion to investment, as was the case for SNPs31,32,33. However, for larger and particularly for more complex structural variants, there are additional confounding factors. To provide a framework for discussion of prospective standards, we group into five categories the major issues currently curbing progress in this field. Data quality, which bears on all of these issues, is discussed in the subsequent subsection. The majority of the discussion pertains to the variants classed as CNVs, as these represent the predominant form studied to date. Our comments also mostly target issues related to whole-genome discovery surveys.

Terminology. The newly recognized domain of structural variation is blurring the distinction between traditional cytogenetic and molecular analyses, as it fills the (albeit narrowing) gap between the limits of resolution of these earlier approaches to genetic variation (Fig. 1). Terminology established within each camp is sometimes unwieldy in the crossover (Box 1). Moreover, there is no standard nomenclature for structural variants that fall between those that can be classified by naming systems established from the cytogenetic34,35 or mutation literature36 (for example, indels). For some terms, such as CNV, there is an added complication: they are used regularly as descriptors in both control and disease studies, but with different meanings. Different classes of CNVs are described in Redon et al.11 and in Supplementary Figure 1 online. Nomenclature for genes encompassed by structural variants also needs to be considered, but no rules have yet been established.

Annotating complex structural variants. Many structural variants are large, and are flanked by or encompass complex repetitive DNA sequences. They may be unbalanced in content or highly polymorphic, characteristics that pose significant challenges for detection and analysis. There are many complexities associated with classifying and characterizing CNVs (Supplementary Figs. 1, 2 and 3 online). As the precise rearrangement breakpoints are usually not resolved (because of coincidence with large repeats or because of low-resolution assay coverage), it is typically not possible to determine whether the underlying variants are identical by descent or represent independent events in close proximity to one another. Regions of high sequence identity may also cause cross-hybridization on comparative genome hybridization (CGH) platforms, leading to CNV calls in regions that are not actually variable (Supplementary Fig. 3). Determining the meiotic and mitotic characteristics of these variants — such as the de novo mutation rate, stability and level of mosaicism — can also be confounded not only by the complex nature of the underlying sequences but also by technical and comparative limitations, including the source of the DNA (described below).

Technological limitations. At present, no single approach identifies all types of structural variation. Current scans of genome-wide structural variation are screening or discovery assays, and not definitive tests. In our hands, the testing of a single sample by different platforms and 'call' algorithms can lead to substantially different CNV call rates, owing to differing sensitivity, specificity, probe density and type of probe used (Table 2 and Supplementary Table 1 online). This matter is underscored by the relatively small degree of overlap among published datasets2,37, even when assessing identical samples7,9,10,11. The progress on CNV discovery to date is largely due to the availability of numerous microarray platforms, which detect quantitative imbalances. In contrast, there is currently no high-throughput, cost-effective method to scan the genome for inversions or translocations. Short of comparing 'finished' sequence assemblies from independent sources38,39, it can take a multitude of approaches to identify, validate and sequence the compendium of structural variation comprehensively (Table 3 and Supplementary Table 2 online). Other issues, such as relative costs of arrays and reagents and availability of specialized equipment, often limit access to the most appropriate experiments.

Table 2 Copy number variants called on the same test sample (NA15510) using different experimental platforms and algorithms
Table 3 Summary of 12 published surveys (2004–2007) of structural variation content in human genomes

Characteristics of reference and test samples. Identification of variation requires comparison to either a reference DNA source4,5,11,40,41, a reference dataset11 or a reference genome sequence6,39,42, which has implications for experimental design and interpretation of results43. For example, at present, no standardized 'reference' control DNA has been adopted for laboratory experiments, and in some cases, 'pools' of samples or datasets are used to represent an averaged genome (Table 2). This lack of standard reference genomes can complicate both the designation of relative copy-number differences among samples from different projects and the standardization of databases (Table 1) that contain information about structural variants. Specifically, if in a single experiment it is impossible to distinguish a loss in the test sample from a gain in the reference sample, then two different studies may report the same CNV as a relative gain or loss (duplication or deletion), respectively. Moreover, using pools of DNA or their intensity outputs as hybridization controls or in comparative intensity analysis (Table 2) may lead to a decreased power to detect variants in highly polymorphic regions of the genome. In these regions, the pool will represent an intermediate between the polymorphic and nonpolymorphic states, resulting in a smaller relative difference in intensity than a single nonpolymorphic reference would yield. In terms of annotating variants, the relative nature of CNV determination can pose a problem, as it leads to an overestimation of regions with both apparent gains and losses.
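To make the dilution effect concrete, consider a minimal numeric sketch (the copy numbers and pool composition below are hypothetical, chosen only to illustrate the arithmetic): a single-copy deletion compared against one nonpolymorphic reference yields the full log2 ratio, whereas the same deletion compared against a pool that partly carries the variant yields a weaker signal.

```python
import math

def log2_ratio(test_copies, reference_copies):
    """Expected array log2 intensity ratio at a locus, given diploid copy counts."""
    return math.log2(test_copies / reference_copies)

# Hypothetical deletion locus: the test sample carries 1 copy instead of 2.
single_reference = 2                      # one nonpolymorphic reference genome
pooled_reference = (7 * 2 + 3 * 1) / 10   # pool of 10 genomes, 3 of which also carry the deletion

print(log2_ratio(1, single_reference))    # -1.00: full single-copy loss signal
print(log2_ratio(1, pooled_reference))    # about -0.77: signal diluted by the polymorphic pool
```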

Ultimately, the underlying sequence characteristics of any newly identified structural variant will be compared to the human genome reference assembly. The latest release from the US National Center for Biotechnology Information (NCBI), called Build 36, is a mosaic of some 708 different sources1, and covers mainly the euchromatic portion of the genome, with some 302 known gaps (http://www.ncbi.nlm.nih.gov/). Where incomplete or falsely merged regions of the reference assembly coincide with the positions of structural variants, comparisons of one against the other can be confounded44,45. Moreover, as many technologies use the NCBI reference sequence to guide product development, structural variants residing in the unannotated segments of the human genome may be missed (Supplementary Fig. 2). Test samples can also come from a mix of untransformed or transformed tissues, all of which affects interpretation11,46. Finally, samples used to discover structural variants from control populations may have little or no genetic (for example, parent of origin) information or phenotypic assessment protocols attached to them. So, despite common presumptions, any variant described by such studies is not necessarily either neutral or benign.

Database issues. The main sources of information for human structural variation are the Database of Genomic Variants and the Human Structural Variation Database. Both are currently limited, in that variants are simply represented as they are described in publications and overlaid on the current reference assembly, without precise location of most breakpoints. There are some unpublished data at these sites, but so far there is no active effort to standardize CNV calling or characteristics through reexamination of the original primary data. Moreover, as the human reference assembly is updated in subsequent assemblies, sites of apparent structural variation can disappear and reappear, presenting a challenge for database management. Although Ensembl and UCSC Genome Browser display data from the Database of Genomic Variants, there is currently no standard requirement to submit published structural variants to any database. Further, there is no system for naming structural variants with unique accession numbers, and surprisingly, only a proportion of studies post their raw or underlying data, and full method of interpretation, for public access.

There are also many challenges in the layout and visualization of the data. For example, it is current practice to display structural variants using estimates of start- and end-points when the breakpoint(s) are suboptimally resolved. When there are two or more overlapping variants originating from the same study, they are sometimes grouped together even if they are not identical11, and misgrouping can occur, particularly near segmental duplications. Moreover, as the number of surveys continues to grow, the CNVs discovered will become more redundant.

Presenting structural variation data in relation to the reference assembly can also be problematic1,39 because the standard browsers were not designed to display these data. This issue notwithstanding, smaller variants (usually <10 kb) are present in NCBI's dbSNP, and a goal of the Human Structural Variation Database is to integrate structural variation data, such as fosmid paired-end sequences6, with the NCBI human reference sequence (including those regions not represented in the current assembly)26. The Database of Genomic Variants will continue to display structural variation data originating from nondisease-defined samples, but stricter criteria for inclusion, as well as assessment and annotation of the quality standards described below, will become critical aspects of the curatorial process.

Content and quality of early studies of structural variants

To assess current practices in collection and validation of discovery data, we review and comment on 12 experimentally diverse and highly cited studies, each undertaken to search for structural variation in the human genome. In Table 3 and Supplementary Table 2, we summarize selected parameters and the strengths and weaknesses of these studies.

Genomes surveyed and reference samples. The number of genomes investigated in each study ranged from one (in sequence comparisons to reference assemblies6,39) to 270 (in three studies of the HapMap collection9,10,11). Appropriate attention was given to samples being from unrelated individuals or from families, and ethnic diversity was usually noted. Tissue sources of DNA were heterogeneous, and whether or not they were transformed or cultured was inconsistently documented. Phenotypic information would generally have been unknown, or assumed to be unremarkable (from 'healthy volunteers'), although Iafrate et al. included samples with known karyotypic abnormalities as controls4, and Wong et al. used some material from cancer programs41. Each study used different reference sample(s) for genome comparison. One used pooled DNA4, three compared to the reference human genome assembly6,39,42, one made a variety of comparisons5 and the other CGH approaches each used a different single male reference sample. Future studies will increase the variety of genomes surveyed, and these would benefit from a consensus standard of documented information about their sources. In contrast, converging on a smaller number of reference samples would facilitate collective documentation.

Primary discovery methods. Table 3 is organized according to the methods used to search for structural variants. The upper portion includes seven studies that employed CGH, each with a different array platform, encompassing a range of probe size, complexity and resolution. One approach9,40 targeted regions associated with segmental duplications, but the rest spanned the genome, with arrays carrying from 2,000 clones up to about 26,000 clones in genome tiling-path designs11,41. Redon et al.11 added a second complementary screening strategy based on relative fluorescence intensities with arrays designed originally for SNP genotyping. The lower portion of Table 3 summarizes five studies with completely different strategies, based on genomic sequence comparisons. These studies used existing data from either the reference human genome sequence6,39,42 or the HapMap project7,10 to mine for deletions and other relatively small structural rearrangements. The fosmid-based approach6 and sequence comparison39 were able to discern orientational as well as quantitative variants.

Experimental quality controls. Before structural variants can be revealed by genome comparisons, positive signals arising from other biological or technical causes need to be filtered out. Biological differences that were variously accounted for among these studies include (i) male-female X and Y chromosome dosage differences9,11,40, (ii) somatic rearrangements of the immunoglobulin genes5,11, (iii) cell-culture artifacts such as mosaic trisomies46 and (iv) results of genomic instability of virus-transformed cell lines11. Similarly, any variation relative to a reference human genome sequence in the computational approaches must be interpreted in light of the known gaps and potential assembly artifacts1,6,39.
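As a simple illustration of the first of these corrections, the baseline shift expected from sex chromosome dosage can be computed and set aside before calling. The sketch below is a common-sense calculation under the stated assumptions, not a normalization procedure taken from any of the cited studies.

```python
import math

def expected_sex_chromosome_shift(test_sex, reference_sex, chromosome):
    """Baseline log2 shift expected purely from X/Y dosage differences between
    a test and a reference sample of different sexes; this shift should not be
    interpreted as a CNV."""
    copies = {"female": {"X": 2, "Y": 0}, "male": {"X": 1, "Y": 1}}
    if chromosome not in ("X", "Y"):
        return 0.0
    test, ref = copies[test_sex][chromosome], copies[reference_sex][chromosome]
    if test == 0 or ref == 0:
        return float("nan")   # e.g. chromosome Y in a female sample: no meaningful ratio
    return math.log2(test / ref)

print(expected_sex_chromosome_shift("female", "male", "X"))  # +1.0, a dosage effect, not a gain
```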

As these screening strategies are themselves biological, with associated technical artifacts, replication is the most important experimental tool for assessing the validity of observations, and it took many forms among these studies. Within each CGH array, clones were typically in duplicate or triplicate. Interexperimental replication involved ostensibly the same conditions and/or an experimental alternate, such as 'dye-swap' of the two fluorochrome labels between the test and reference samples. The means of dealing with discordant replicates was inconsistent among the studies, and sometimes difficult to discern from the publications. In most studies4,9,11,40, discordant dye-swap results were eliminated, but in Wong et al.41, only 20% of samples were assayed in both orientations. Within each study, experiments also showed variable background 'noise', and some studies repeated and/or deleted individual assays that did not meet a defined quality threshold. When sources of 'noise' are nonrandom, replication alone will reproducibly yield false positive calls, which argues for replication by diverse methods.
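A minimal sketch of one way discordant dye-swap replicates might be filtered follows; the threshold value and the function itself are illustrative assumptions, not the procedure used in any particular study.

```python
def dye_swap_concordant(log2_forward, log2_swapped, threshold=0.3):
    """Keep a clone only if the forward and dye-swapped hybridizations agree.
    The swapped ratio is sign-inverted because the fluorochrome labels are exchanged."""
    corrected = -log2_swapped
    same_direction = log2_forward * corrected > 0
    both_exceed = abs(log2_forward) >= threshold and abs(corrected) >= threshold
    return same_direction and both_exceed

# Hypothetical clone measurements (log2 test/reference ratios).
print(dye_swap_concordant(0.60, -0.55))  # True: consistent gain call in both orientations
print(dye_swap_concordant(0.60, 0.40))   # False: discordant replicates, eliminated
```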

Other controls showed the effectiveness of the respective screening methods. Self-versus-self hybridization was used4,5,9,40 to estimate somatic effects and/or numbers of false positive calls. Two studies assayed samples with previously characterized imbalances4,40. Sharp et al.40 showed the enhanced (11-fold) effectiveness of their targeted 'hot spot' array relative to a genome-wide assay. Redon et al.11 evaluated concordance between their two primary platforms and undertook numerous technical replicates.

Each study defined its own algorithm for 'calling' differences between sample and reference as putative structural variants. As for all screening assays, they were driven to optimize both sensitivity and specificity of the ascertainment, but approaches to this balance differed. Redon et al.11 set parameters in their algorithm to allow fewer than 5% false positive 'calls' per experiment. Other studies set thresholds and assessed numbers of false positives retrospectively. Some reported these type I errors in relation to the number of clones in the array4,40,41 and others relative to the proportion of positive calls5,7, prohibiting a direct comparison of specificity among the various studies. Sensitivity was harder to assess, and arguably impossible without knowledge of the true (or at least gold standard–based data) underlying numbers of structural variants. Estimates ranged from 5% false negatives9 to 50% power to detect 25-kb deletions7, but sensitivity was generally compromised in favor of specificity.
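The logic of threshold setting can be sketched as follows, using self-versus-self hybridizations, in which any call is by definition a false positive, to calibrate the calling threshold. The numbers and the three-standard-deviation rule are illustrative assumptions rather than any published algorithm.

```python
import statistics

def calibrate_threshold(self_self_ratios, n_sd=3.0):
    """Derive calling thresholds from self-versus-self hybridization ratios,
    where every excursion beyond the threshold would be a false positive."""
    mu = statistics.mean(self_self_ratios)
    sd = statistics.stdev(self_self_ratios)
    return mu + n_sd * sd, mu - n_sd * sd

def call_cnvs(test_ratios, upper, lower):
    """Flag probes whose log2 ratios exceed the calibrated thresholds."""
    return [(i, "gain" if r > upper else "loss")
            for i, r in enumerate(test_ratios) if r > upper or r < lower]

# Hypothetical self-self replicate ratios and one test hybridization.
upper, lower = calibrate_threshold([0.02, -0.05, 0.04, -0.01, 0.03, -0.02])
print(call_cnvs([0.01, 0.45, -0.60, 0.02], upper, lower))  # probes 1 (gain) and 2 (loss) called
```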

Structural variants identified. Assay design had a strong impact on the type and size of structural variants detected (Fig. 1, Supplementary Fig. 2 and Table 2). All revealed quantitative variation (gains or losses), but three recognized only deletions7,8,10, and two could also detect evidence of inversions6,39. Sizes of variant segments could be as small as 1 bp with computational alignments39,42 (though many of these were smaller than our defining size threshold of 1 kb1). Small deletions were detected through haploid hybridization (70 bp–10 kb)8 or oligonucleotide (SNP) footprints (1–404 kb)7 (1–745 kb)10, and the fosmid approach revealed variants in the range of library inserts (40 kb)6. Array methods approached the larger end of the spectrum for CNVs (collectively, about 50 kb–1 Mb)4,5,9,11,40,41. BAC clone probes tend to initially overestimate the apparent size of variants, as the clones may be large relative to the variant segment(s) they harbor, and the more sensitive the platform, the greater the overestimation11,47. Oligonucleotide arrays, on the other hand, approach the boundaries of variable segments from within, and should provide more accurate size estimates as long as the region has sufficient probe density.
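A small sketch makes the BAC size overestimation concrete (coordinates are hypothetical): a variant reported by BAC CGH inherits the extent of the responding clone rather than its own boundaries.

```python
def bac_reported_span(variant_start, variant_end, clones):
    """Return the span a BAC array would report: the union of all clones
    overlapping the variant, which can greatly exceed the variant itself."""
    hits = [(s, e) for s, e in clones if s < variant_end and e > variant_start]
    return min(s for s, _ in hits), max(e for _, e in hits)

# Hypothetical 60-kb deletion falling inside a 170-kb BAC clone.
clones = [(1_000_000, 1_170_000), (1_150_000, 1_320_000)]
print(bac_reported_span(1_050_000, 1_110_000, clones))
# (1000000, 1170000): a 170-kb reported span for a 60-kb variant
```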

The architecture of a variant region can influence its apparent size. Independently discrete genomic segments whose borders overlap can form a variable region characterized as much larger than its component variants, or as containing complex rearrangements of smaller independently variable elements (Supplementary Figs. 1 and 3). As a result, the basis for definitions of overlap, variants, variant regions, merged variants, locations and so forth has been discretionary and varied. The field is probably ready for functional consensus in this area.
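One discretionary merging rule, simple interval union across samples, is sketched below to show how overlapping calls inflate the apparent size of a variable region; this is an illustration of the problem, not a proposed consensus definition.

```python
def merge_into_regions(calls):
    """Collapse overlapping per-sample CNV calls into copy number variable
    regions (CNVRs) by interval union, one of several possible rules."""
    merged = []
    for start, end in sorted(calls):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]

# Hypothetical calls from three individuals on one chromosome.
print(merge_into_regions([(100_000, 250_000), (230_000, 400_000), (900_000, 1_050_000)]))
# [(100000, 400000), (900000, 1050000)]: the first CNVR exceeds any single call
```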

The earliest surveys reported about 100 variants or regions4,5; more recently, Wong et al. reported a disproportionate 3,654 CNVs, of which only 800 were considered 'high frequency' and more likely to be true positives41. Sequence comparisons flagged many more thousands of sites39,42, albeit ones that were much smaller and often reflected sequence assembly artifacts. In each of the 12 studies in Table 3, the majority of reported variant loci were apparently new, though as the catalog of genomic structural variants accumulates, the number of such new additions will eventually plateau.

Validation of putative structural variants. We reemphasize that the discovery strategies in Table 3 are screening tests, which draw attention to genome segments with an increased probability of harboring true structural variation. Eventually, comprehensive sequence data will document the breadth and detail of each variable region and individual variant, as illustrated by fosmid insert sequence data6 and direct sequence assembly comparisons39. In the meantime, various validation strategies have been applied to subsets of putative variants in each of the discovery reports. These included (i) FISH of metaphase, interphase or fiber chromosomes using various clones or PCR-amplified molecules; (ii) PCR or quantitative PCR (qPCR) for allele loss or quantitative variation; (iii) multiple ascertainment, whereby considerable weight was given to whether or not a putative variant was seen in more than one individual or had been reported in previous studies; (iv) array CGH to validate computational screening results6,7 or for finer resolution of BAC-screening results by oligonucleotide arrays9,41; (v) sequence analysis of fosmid inserts to confirm calls and to assess some discordant ones6,9; (vi) allele-specific fluorescence intensities10 and (vii) familial clustering41.

These assays were variously applied to subsets of data, and outcomes were used effectively in some studies7,10,11 to further evaluate the sensitivity and specificity and/or error rates of the primary screening methods. The proportion of putative variant loci that have been individually validated by means other than multiple ascertainments remains small, presumably due to the technical challenges of the confirmatory tests. All studies provided some information about the frequency of each putative structural variant or region, both as an argument for validation and to characterize the findings. A growing consensus in the field is for more validation of variants using two or more technologies.

Recommendations for standards

Based on our enumeration of the challenges facing this new field and a thorough review of published experimental designs, we provide four broad guidelines that follow the natural progression of experimentation as an initial step toward the development of standards. As the field matures, these guidelines should serve as precursors to stricter standards that undergo regular and comprehensive vetting by the community48. We are struck by the resemblance to issues raised by the MIAME (minimum information about a microarray experiment) standards49, as well as by Lander and Kruglyak50, with recommendations to find the right balance of stringency and value judgment to avoid as much error as possible without delaying discovery. The latter paper's recommendations for modifiers (suggestive, significant, highly significant and confirmed) might well be adapted for the statistical annotation of structural variants in databases.

In their current form, the recommended standards could also serve as a checklist for reviewers and editors as they assess manuscripts that report structural variation data. Moreover, as more structural variation data are reported and the nature of the variants becomes better understood, curators of databases would be at greater liberty to accept or reject complete or partial datasets according to established quality thresholds.

1. Describing the sample. The study should report the origin of each sample (for example, new or from a repository) and all of its characteristics, including the source (for example, blood, cell line, tissue) and karyotypic status, as well as the age, sex, ethnicity and phenotype (disease or nondisease features) of the donor. For surveys aiming to capture structural variation from the general population for control databases, there should be particular emphasis on detailing the extent of phenotype investigation. The study should also accurately document the genetic relationship of samples and any manipulation of the samples such as cell-culturing conditions or whole genome amplification, including protocols for extracting and labeling samples. Previous publications using the sample and all associated aliases should be listed.
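As an illustration only, the sample descriptors listed above could be captured in a structured record along the following lines; the field names and the example values are hypothetical, not an established reporting schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SampleRecord:
    """Minimal per-sample metadata record covering the descriptors listed above."""
    sample_id: str
    origin: str                              # new collection or repository accession
    source_tissue: str                       # blood, cell line, tissue
    karyotype: Optional[str]
    age: Optional[int]
    sex: Optional[str]
    ethnicity: Optional[str]
    phenotype: str                           # disease or nondisease features investigated
    relationships: List[str] = field(default_factory=list)   # related sample identifiers
    manipulations: List[str] = field(default_factory=list)   # culturing, amplification, labeling
    prior_aliases: List[str] = field(default_factory=list)   # earlier names used in publications

example = SampleRecord("EX-0001", "new collection", "blood", karyotype=None,
                       age=None, sex=None, ethnicity=None,
                       phenotype="no phenotype assessment recorded")
```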

2. Reporting experiments. Upon publication, the researchers must declare all aspects of the experimental design and results, including the experimental platform (for example, all clone or sequence identifiers used in arrays), technical procedures, data extraction and processing protocols, the version of the reference genome sequence used for comparison or annotation, and all validation results. The information must be made available in a format that enables unambiguous interpretation, replication of the experiment and the opportunity for other researchers to reanalyze the data to verify the conclusions48,49. For example, many array CGH experiments are performed using different test and reference samples, a variable number of spot replicates and differential use of dye-swap replicates. These methodological details affect the interpretation of the data and inferences regarding the presence or absence of a particular structural variant. Most new structural variation data are being generated using microarrays; therefore, suitable repositories include the Gene Expression Omnibus (GEO)51, ArrayExpress52 and CIBEX53 databases. As more sequence data emerge in structural-variation discovery initiatives, it is important that the underlying sequences and traces be made publicly available. Similarly, methodological differences exist in alignment algorithms; in addition to simple lists of sequence differences between assemblies or traces, the underlying alignments from which these events were called should be available.

3. Quality control. All studies should apply stringent criteria to ensure an accurate empirical estimation of the performance of the detection protocol used. Ideally, the parameters of the detection should be calibrated using a limited set of test data to achieve an acceptable level of false positives among the called regions. There are several metrics for this estimation, for example, the false discovery rate54. Parameters should be set to maximize screening specificity (minimize false positive calls) without undue compromise to sensitivity. To simplify this process, we recommend that all studies include at least one (and preferably more) standard control sample to be used as a reference for comparison. Initially, we propose sample NA15510 from the US National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository, as it has already been characterized using a number of platforms (Table 2), and is also now being sequenced. A second reference sample could be NA10851, as it has also been characterized extensively11.
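When a characterized control sample such as NA15510 is rerun on a new platform, a crude concordance check against its previously published call set is one way to gauge specificity; the 50% reciprocal-overlap criterion below is an assumed convention for illustration, not a community standard.

```python
def reciprocal_overlap(a, b, fraction=0.5):
    """True if intervals a and b each share at least `fraction` of their length."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    overlap = max(0, end - start)
    return overlap >= fraction * (a[1] - a[0]) and overlap >= fraction * (b[1] - b[0])

def concordance(new_calls, reference_calls, fraction=0.5):
    """Fraction of new calls supported by a previously published call set on the
    same control sample; a rough proxy for specificity, not a formal error rate."""
    supported = sum(any(reciprocal_overlap(c, r, fraction) for r in reference_calls)
                    for c in new_calls)
    return supported / len(new_calls) if new_calls else 0.0
```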

In addition to calibrating the parameters used for CNV calling, the quality of the total set of variants called across the entire sample set should be assessed. This requires unbiased sampling of the putative variants to be validated: that is, not just assessing those called most frequently, but ensuring representation of the entire frequency distribution. Good examples from the different experimental approaches outlined in Table 3 include validation of singleton and nonsingleton error rates11, estimation of fosmid read-pair error rates by sequencing the fosmid6 and estimation of error rates using a secondary technology such as oligonucleotide arrays7. It should no longer be considered sufficient to estimate the error rates by extrapolating from self-self experiments, without confirming that the estimated error rates were indeed correct and investigating how individual experimental error rates translate into study-wide error rates.
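A sketch of frequency-stratified selection of validation candidates is given below; the strata boundaries and the number drawn per stratum are arbitrary illustrative choices.

```python
import random
from collections import defaultdict

def stratified_validation_sample(variants, per_stratum=10, seed=0):
    """Pick validation candidates from every call-frequency stratum, including
    singletons, rather than only the most frequently observed variants.
    `variants` is a list of (variant_id, number_of_samples_called) pairs."""
    random.seed(seed)
    strata = defaultdict(list)
    for variant_id, n_called in variants:
        if n_called == 1:
            stratum = "singleton"
        elif n_called <= 5:
            stratum = "low frequency (2-5)"
        else:
            stratum = "recurrent (>5)"
        strata[stratum].append(variant_id)
    return {s: random.sample(ids, min(per_stratum, len(ids))) for s, ids in strata.items()}
```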

4. Describing structural variants. The study should thoroughly report characteristics of the structural variants, including sequence content (start and end points or complete sequence content with appropriate annotation), and population frequency and distribution (if known), including samples and assays used to determine these parameters. A future challenge will be to develop standards for defining CNV regions (CNVRs)—merging data from different individuals and different surveys into a single set of CNVRs. The ideal situation would be that each 'called' CNVR has an audit trail of both the experimental data and the processing of the data to the final call. Robust documentation of standardized CNVRs in databases will require specific rules to be established; although defining them is beyond the scope of this Perspective, we expect this discussion to stimulate their development. For CNVs and CNVRs, the definitions and criteria used by Redon et al.11 offer a good framework to build on (also see Supplementary Fig. 1). The current limitations in breakpoint resolution make it difficult to assign specific accession numbers to CNVs. However, once structural variants are described with boundaries mapped at nucleotide resolution, identifiers should be assigned using a nomenclature similar to that currently used for SNPs.
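An audit trail of the kind envisaged above might minimally record the following for each called CNVR; the field names are hypothetical and intended only to show the sort of provenance worth preserving.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CnvrCall:
    """Illustrative provenance record for one called CNV region."""
    region: str                  # coordinates on a stated assembly, e.g. 'chr7:1050000-1110000'
    assembly: str                # reference build used for the coordinates
    samples: List[str]           # samples in which the variant was observed
    platform: str                # discovery platform and probe set
    raw_data_accession: str      # pointer to deposited primary data (e.g. a GEO accession)
    calling_algorithm: str       # algorithm and parameter settings that produced the call
    validation: List[str]        # secondary assays applied, if any
```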

Summary and the future

Many of the issues confronting the field of structural variation will be resolved as advances in technology allow robust and economical analysis of structural variants at the nucleotide level in multiple genomes. Such techniques will include 'tiling path'-coverage oligonucleotide arrays, paired-end sequence relationship comparisons, and partial or complete sequence assembly comparisons. The ultimate standard will be sequence resolution of all structural variation in a defined set of reference individuals to establish a benchmark for genotyping platforms. We do not foresee that any one approach will capture all genetic variation reliably, nor, for at least a few more years, will a single strategy predominate over microarray-based approaches. Therefore, the main challenges from this point onward will surely include managing a huge data volume, integrating information from various discovery platforms and discerning phenotypic implications. New issues will arise, such as how to best annotate structural variation data in individual diploid genome assemblies (arising from personalized sequencing projects), as well as how to put haplotypes of structural variants (with or without SNPs) into context with respect to the latest human reference sequence. Structural variation data should also assist SNP, linkage disequilibrium and gene expression determination, but new database tools will be required to fully interpret the data.

Structural variation discoveries offer the potential to bridge a longstanding gap between cytogenetic and sequence-based investigations, and to unify our understanding of genetic variation. Interestingly, at the outset of writing, we tried to sidestep the topic of terminology (and nomenclature), but kept returning to it in one way or another as we worked to define and distill the breadth of issues before us. In fact, it was the issue of terminology that highlighted the extreme heterogeneity in the data being published, with the related strengths, caveats and differences in the studies being attributable in part to the different backgrounds of the researchers involved.

An equally intricate issue for data integration in the future will be categorizing structural variants in terms of whether they are 'normal', 'disease-causing' or 'phenotype-associated', as these designations can be part of a continuous range1,24,55,56. In Table 4, we put forward ideas of annotation modifiers that will assist in maximizing the utility of structural variation information. Molecular cytogeneticists have always been faced with this dilemma and its particular implications in the prenatal or diagnostic setting. Now, with the ability to readily recognize submicroscopic and sequence-level variation, the question of how to differentiate benign and disease-associated structural changes will be increasingly important. There are already well defined examples in which the presence of a structural variant correlates directly with a syndrome or phenotype, such as the many dosage-related microdeletions and duplications that cause genomic disorders57,58,59,60,61,62,63 (also see the DECIPHER database). Family-based studies can demonstrate whether a change is de novo or has been inherited and, in the latter case, whether there are likely to be associated phenotypic consequences (noting there are numerous examples of variable expression of phenotype and disease in inherited chromosomal rearrangements)1,21,55. Otherwise, large population studies and control and disease reference databases will provide the best source of information about a structural variant's frequency and likelihood of causing a phenotypic outcome.

Table 4 Classification of modifiers used for the description of structural variation

Notwithstanding the challenges, we believe that the recommendations presented here offer necessary first steps toward standardization of many of the variables that, if ignored, will impede progress. At the same time, we recognize that consensus is important, and that standards require time to mature before adoption and implementation48. With some ground rules now set, it is also our intention to continue discussions with the genomic structural variation research community at the most relevant meeting opportunities.

Note: Supplementary information is available on the Nature Genetics website.