Our view of genetic polymorphism is shaped by methods that provide a limited and reference-biased picture. Long-read sequencing, which is starting to provide nearly complete genome sequences for population samples, should solve the problem, except that characterizing and making sense of non-SNP variation is difficult even with perfect sequence data. Here, we analyze 27 genomes of Arabidopsis thaliana in an attempt to address these issues.
The genomes range from 135 to 155 Mb in size, and we show that this variation is almost entirely due to centromeric and rDNA repeats, which we do not attempt to assemble. The completely assembled chromosome arms comprise roughly 120 Mb in all accessions but are full of structural variants, mostly associated with transposons. Analysis of these variants reveals an incompletely annotated mobile-ome. A pan-genome coordinate system that includes the resulting variation ends up being one-third larger than any single genome. In contrast, the gene-ome is highly conserved. By annotating each genome using accession-specific transcriptome data, we identify 2,647 previously unannotated non-TE-like genes, many of which turn out to be ancestral genes that are missing from the reference genome. Finally, we show that previous SNP data missed over 40% of SNPs, mostly in regions where short reads could not be mapped reliably. We demonstrate that SNP-calling errors can be biased by the choice of reference genome, and that RNA-Seq and BS-Seq results can be strongly affected by mapping reads to a reference genome rather than to the genome of the assayed individual.