Fig. 1: Comparison of tree sequences with standard methods for storing genetic variation data. | Nature Genetics

From: Inferring whole-genome histories in large population datasets

a, The variant matrix underlying conventional storage methods for genetic variation data. b, A genealogical encoding of the data; if we know the tree, we can store each variant site in constant space. c, Estimated sizes of files required to store the genetic variation data for a simulated human-like chromosome (100 Mb) for up to 10 billion haploid (5 billion diploid) samples. Simulations were run for between 10^1 and 10^7 haplotypes using msprime (ref. 28), and the sizes of the resulting files are plotted (points). In each case, we show the original tree sequence file both uncompressed and compressed using tszip (with the --variants-only option, which retains only the topological information needed to represent genotypes). We also show the corresponding variation data encoded in the VCF (ref. 35) and PBWT (ref. 36) formats, along with their gzip-compressed equivalents. The VCF files for 10^7 samples were too large and time-consuming to process. The projected file sizes for VCF and compressed VCF files are based on fitting a simple exponential model. Projected file sizes for tree sequences are based on fitting a model derived from the theoretical growth of tree sequences (ref. 28). In cases in which we extrapolated the data, the largest data point was withheld from fitting to assess the model fit. We do not extrapolate for the PBWT files, as there is no theoretical model to predict their size.
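The constant-space idea in panel b can be illustrated with a minimal sketch. It assumes a single hypothetical tree over four haploid samples, exactly one mutation per site, and a 0 (ancestral) / 1 (derived) allele coding. The node numbering, the `parent` dictionary, and the function names are all invented for illustration and do not reflect the tree sequence file format or any library API.

```python
# Hypothetical toy tree over four haploid samples (leaves 0..3).
# parent[node] gives the parent of each node; the root (node 6) has no entry.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
samples = [0, 1, 2, 3]


def descends_from(node, ancestor, parent):
    """Return True if `ancestor` is `node` itself or lies on its path to the root."""
    while node is not None:
        if node == ancestor:
            return True
        node = parent.get(node)
    return False


def genotypes(mutation_node, samples, parent):
    """Expand one stored mutation node into a full row of 0/1 genotypes."""
    return [int(descends_from(s, mutation_node, parent)) for s in samples]


# Assumption: each site is stored as one node ID (the branch carrying the
# mutation), so storage per site does not grow with the number of samples.
sites = {"site_A": 4, "site_B": 3, "site_C": 5}

# Rebuild the variant matrix of panel a from the tree plus one node per site.
matrix = {name: genotypes(node, samples, parent) for name, node in sites.items()}

for name, row in matrix.items():
    print(name, row)
```

In the variant matrix, each site costs one entry per sample, whereas here each site costs one node ID once the tree is known. The tree itself is a shared cost across all sites it covers.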
