BMC Genomics. 2019 Apr 8;20(1):275.
doi: 10.1186/s12864-019-5642-0.

A hybrid de novo genome assembly of the honeybee, Apis mellifera, with chromosome-length scaffolds

Andreas Wallberg et al. BMC Genomics. 2019.

Abstract

Background: The ability to generate long sequencing reads and access long-range linkage information is revolutionizing the quality and completeness of genome assemblies. Here we use a hybrid approach that combines data from four genome sequencing and mapping technologies to generate a new genome assembly of the honeybee Apis mellifera. We first generated contigs based on PacBio sequencing libraries, which were then merged with linked-read 10x Chromium data followed by scaffolding using a BioNano optical genome map and a Hi-C chromatin interaction map, complemented by a genetic linkage map.

Results: Each of the assembly steps reduced the number of gaps and incorporated a substantial amount of additional sequence into scaffolds. The new assembly (Amel_HAv3) is significantly more contiguous and complete than the previous one (Amel_4.5), based mainly on Sanger sequencing reads. N50 of contigs is 120-fold higher (5.381 Mbp compared to 0.053 Mbp) and we anchor > 98% of the sequence to chromosomes. All of the 16 chromosomes are represented as single scaffolds with an average of three sequence gaps per chromosome. The improvements are largely due to the inclusion of repetitive sequence that was unplaced in previous assemblies. In particular, our assembly is highly contiguous across centromeres and telomeres and includes hundreds of AvaI and AluI repeats associated with these features.
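
The contig N50 quoted above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that standard definition follows; the function name `n50` and the toy lengths are illustrative assumptions, not taken from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Standard definition: sort lengths from longest to shortest and walk
    down the list until the running total reaches at least half of the
    summed assembly length; the length at which that happens is the N50.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in bp), not data from the paper.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_contigs)
```

For the toy list the lengths sum to 54, and the three longest contigs (10, 9, 8) reach 27, so the N50 is 8.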

Conclusions: The improved assembly will be of utility for refining gene models, studying genome function, mapping functional genetic variation, identification of structural variants, and comparative genomics.

Keywords: Centromeres; Genome assembly; Hi-C; Linked-read sequencing; Optical mapping; Single-molecule real-time (SMRT) sequencing; Telomeres.

Conflict of interest statement

Ethics approval and consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Comparison between assemblies. a Stacked contigs from the previous honeybee genome assembly Amel_4.5 [34] and the long-read sequencing technologies used in this project. Sequences are sorted by length (x-axis) and the cumulative proportion of each assembly that is covered by the contigs is displayed on the y-axis. The dashed line indicates the contig with length equal to the N50. From the left: Amel_4.5, 10x Chromium-only (assembled using Supernova), PacBio-only (assembled using FALCON), Amel_HAv1 (PacBio contigs + 10x scaffolding, see Methods) and Amel_HAv3 (Amel_HAv1 scaffolded using BioNano to produce Amel_HAv2, followed by Hi-C scaffolding). For 10x Chromium sequences, the full-length linked-read scaffolds are shown (i.e. including gaps). b Stacks from A superimposed over the Amel_HAv3 scaffolds (i.e. including gaps). These scaffolds are chromosome-length and contain 51 gaps
Fig. 2
Assembly overview. An overview of the 16 linkage groups or chromosomes of Amel_HAv3 after anchoring and orienting the contigs according to the genetic map [38]. Grey shades indicate the intervals of each contig. Dots above each chromosome indicate the locations of genetic map markers (black = markers that are congruent with the assembly; red = markers that are incongruent, i.e. interleaved or reversed; blue = ambiguous markers, i.e. overlapping or widely separated primer sites). Genome-wide GC-content is indicated with a white dashed line and local %GC is mapped across all chromosomes (10 kbp non-overlapping windows; light-blue curve on y1-axis). The density of telomeric TTAGG/CCTAA repeats is shown (10 kbp non-overlapping windows; dark-blue curve on y2-axis; filled circles shown for values > 10%). Extended low-GC regions indicating putative centromere regions are shown above chromosomes (bounded by adjacent 100 kbp windows < genome-wide %GC; light-blue), whereas experimental centromere mappings from [31] are indicated below chromosomes (boxes bounded by genetic map markers; extended upstream to the tip of the chromosome as dots when the area started at the first genetic map marker; light-yellow). The locations of centromeric AvaI (green) and telomeric AluI (black) clusters, respectively, are marked along chromosomes. Miniature chromosome models are redrawn from [30] and indicate experimental detection of AvaI and AluI arrays
Fig. 3
Interspersed and tandem repeats detected with RepeatMasker. a The proportion of different repeat classes across the Amel_HAv3 assembly in: i) all contigs; ii) anchored contigs; and iii) unplaced contigs. The total length and proportion of each repeat is given below each class. b Comparison of repeat frequencies in anchored sequence and unplaced sequence between Amel_4.5 and Amel_HAv3. c Overall enrichment of repeats in Amel_HAv3 compared to Amel_4.5
Fig. 4
The longest tandem arrays of AluI and AvaI repeats. a Location of the longest AluI cluster. Genome-wide GC-content is indicated with a white dashed line and local %GC is shown across 1 kbp non-overlapping windows (light-blue curve on y1-axis). Grey curve indicates the proportion of simple repeats (1 kbp non-overlapping windows; y2-axis). b Location of the longest AvaI cluster. Other statistics as in A
Fig. 5
Properties of sequences classified from whole-genome alignments between Amel_HAv3 and Amel_4.5 using Satsuma. a The proportions of the Amel_HAv3 assembly with or without matching sequence in Amel_4.5 are displayed at the top. The first four categories (left-to-right) refer to anchored sequence: blue = alignments between sequences that occur on the same chromosome in both assemblies; green = alignments between sequences that are anchored to chromosomes in Amel_HAv3 but were unplaced in Amel_4.5; yellow = alignments between sequences that have switched chromosomes; grey = unaligned Amel_HAv3 sequence without detected matches in Amel_4.5. The last two categories refer to unplaced sequence: light-grey = alignments between sequences that were not anchored to chromosomes in either assembly; dark-grey = unanchored and unaligned Amel_HAv3 sequence. The amount and proportion of simple repeats and the different classes of interspersed repeats according to the alignment regions in A are shown below. b The average mappability, %GC and density of simple and interspersed repeats/low complexity sequence according to the regions in A (95% confidence intervals generated from 2000 bootstrap replicates of 1 kbp non-overlapping windows)
Fig. 6
Model and properties of distal telomeres. a A model of the subtelomeric and telomeric regions as inferred from alignment and sequence analysis of the distal ends of 14 chromosomes (two telomere sequences from chromosome 1). All statistics are computed across 100-bp windows using the distal telomere on chromosome 8 as backbone. A 3-kbp subtelomeric region is indicated with a white box, together with conserved and GC-rich sub-regions within it. A shared repeat element is indicated at the subtelomere-telomere junction. A > 10-kbp telomeric region is indicated in the last box and the proportions of the canonical TTAGG repeat and variants are indicated for every 100-bp window. b Number of subtelomere/telomere sequences extending across the alignment. c The average density of TTAGGs and variants along the region. 95% confidence intervals for each window were computed from 2000 bootstrap replicates. d The average pairwise sequence divergence between chromosomes. Confidence intervals computed as in C. e Average GC-content along the region. Confidence intervals computed as in C
Fig. 7
Features around centromeric AvaI repeats. a Average GC-content was computed from 1 kbp windows located within intervals at different distances from AvaI clusters with at least 3 repeats (0–20 kbp; 20–40 kbp; 40–80 kbp; 80–160 kbp; 160–320 kbp; 320–640 kbp; 640–1280 kbp; 1280–2560 kbp; 2560–5120 kbp). 95% confidence intervals were computed from 2000 bootstrap replicates of each interval. b As in A, but tracing the density of simple repeats/low complexity sequence. c As in A, but tracing the density of DNA transposons, the dominant interspersed repeat class in the honeybee genome
Fig. 8
Recombination rates in different genomic regions. Recombination rates were computed from the genetic and physical distances between genetic map markers scattered across the whole genome or located within putative centromere regions. 95% confidence intervals were computed from bootstrapping marker-to-marker pairs (2000 replicates)
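
The Fig. 8 caption describes recombination rates derived from genetic and physical distances between map markers, with 95% confidence intervals from 2000 bootstrap replicates of marker-to-marker pairs. The sketch below illustrates one way such a calculation could look. It is an assumption-laden illustration, not the authors' code: the function names, the ratio-of-sums estimator (total cM divided by total Mbp), the percentile interval, and the toy marker pairs are all hypothetical.

```python
import random


def recombination_rate(pairs):
    """Rate in cM/Mbp for (genetic_cM, physical_bp) marker-to-marker pairs.

    Assumption: the regional rate is total genetic distance divided by
    total physical distance, rather than a mean of per-pair rates.
    """
    total_cm = sum(cm for cm, _ in pairs)
    total_mbp = sum(bp for _, bp in pairs) / 1e6
    return total_cm / total_mbp


def bootstrap_ci(pairs, replicates=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling pairs with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = sorted(
        recombination_rate(rng.choices(pairs, k=len(pairs)))
        for _ in range(replicates)
    )
    lower = estimates[round(alpha / 2 * replicates)]
    upper = estimates[round((1 - alpha / 2) * replicates) - 1]
    return lower, upper


# Hypothetical marker pairs: (genetic distance in cM, physical distance in bp).
toy_pairs = [(1.0, 50_000), (3.0, 150_000), (0.5, 40_000), (2.2, 90_000)]
rate = recombination_rate(toy_pairs)
ci_low, ci_high = bootstrap_ci(toy_pairs)
```

Resampling whole marker-to-marker pairs keeps each genetic distance tied to its own physical distance, which is what allows the interval to reflect variation between map intervals.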

References

    1. Worley KC, Richards S, Rogers J. The value of new genome references. Exp Cell Res. 2017;358:433–438. doi: 10.1016/j.yexcr.2016.12.014.
    2. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46. doi: 10.1038/nrg2626.
    3. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2011;13:36–46. doi: 10.1038/nrg3117.
    4. Chénais B, Caruso A, Hiard S, Casse N. The impact of transposable elements on eukaryotic genomes: from genome size increase to genetic adaptation to stressful environments. Gene. 2012;509:7–15. doi: 10.1016/j.gene.2012.07.042.
    5. Weischenfeldt J, Symmons O, Spitz F, Korbel JO. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet. 2013;14:125–138. doi: 10.1038/nrg3373.