The recent successes of new sequencing technologies have allowed increasingly large genomes to be sequenced at reduced cost. Transposable elements (TEs) constitute the most structurally dynamic components, and typically the largest portion, of the nuclear sequences of these large genomes. For example, 85% of the maize genome (Schnable et al. 2009) and 88% of the wheat genome (Choulet et al. 2014) are composed of TEs. TE annotation should therefore be considered a major task in these genome projects. However, it remains a significant computational challenge, and this crucial step is now a bottleneck for many whole-genome analyses. We scaled up a repeat detection and annotation package called REPET (Flutre et al. 2011), now at its v2.2 release (http://urgi.versailles.inra.fr/Tools/REPET).
We improved it in the following ways:
- Structural TE detection is now implemented. LTRharvest (Ellinghaus et al. 2008) is used to search for LTR retrotransposons based on the structural features of this TE category. Potential TEs are then classified to reduce false positives.
- Classification has been improved with the development of PASTEC (Hoede et al. 2014). It tests all possible TE classifications, weighting each result according to the evidence found. In addition to similarities to known TEs in Repbase Update and the search for repeated structures, it uses HMM profiles to classify TEs and to detect host genes.
- We created a new pipeline based on Tallymer (Kurtz et al. 2008), called TallymerPipe, as a pre-processing tool for fast detection of repeated regions.
- We created SegDup, a pipeline that detects segmental duplications while taking TEs into account, based on our previous work (Fiston-Lavier et al. 2007).
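To illustrate the idea behind k-mer-based pre-processing, the following is a minimal Python sketch that counts k-mers in a sequence and reports the regions covered by frequent ones. It is not the TallymerPipe implementation and does not reproduce Tallymer's indexing method: it uses a plain in-memory counter that would not scale to a large plant genome, and the function names `count_kmers` and `repeated_regions`, as well as the `k` and `min_count` parameters, are hypothetical names chosen for this example.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeated_regions(seq, k, min_count):
    """Return merged half-open (start, end) intervals covered by k-mers
    that occur at least min_count times in seq.

    Illustrative assumption: a simple frequency threshold is used to
    decide whether a position is repeated.
    """
    seq = seq.upper()
    counts = count_kmers(seq, k)
    regions = []
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            start, end = i, i + k
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Toy sequence in which the 5-mer ACGTA occurs twice.
    print(repeated_regions("ACGTAGGCTTACGTA", k=5, min_count=2))
```

In such a scheme, only the flagged intervals would need to be passed on to the more expensive detection steps, which is the motivation for running a fast k-mer pass first.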
Using these pipelines and the tools from the REPET package, we applied a new strategy to cope with very large genomes such as that of wheat (Choulet et al. 2014; Daron et al. 2014).
Choulet F, Alberti A, Theil S, Glover N, Barbe V, Daron J, Pingault L, Sourdille P, Couloux A, Paux E, Leroy P, Mangenot S, Guilhot N, Le Gouis J, Balfourier F, Alaux M, Jamilloux V, Poulain J, Durand C, Bellec A, Gaspin C, Safar J, Dolezel J, Rogers J, Vandepoele K, Aury JM, Mayer K, Berges H, Quesneville H, Wincker P, Feuillet C. Structural and functional partitioning of bread wheat chromosome 3B. Science. 2014 Jul 18;345(6194):1249721. doi: 10.1126/science.1249721. PMID: 25035497
Daron J, Glover N, Pingault L, Theil S, Jamilloux V, Paux E, Barbe V, Mangenot S, Alberti A, Wincker P, Quesneville H, Feuillet C, Choulet F. Organization and evolution of transposable elements along the bread wheat chromosome 3B. Genome Biol. 2014;15(12):546. PMID: 25476263
Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18. doi: 10.1186/1471-2105-9-18
Fiston-Lavier AS, Anxolabehere D, Quesneville H. A model of segmental duplication formation in Drosophila melanogaster. Genome Research. 2007;17(10):1458-70. doi: 10.1101/gr.6208307
Flutre T, Duprat E, Feuillet C, Quesneville H. Considering transposable element diversification in de novo annotation approaches. PLoS One. 2011;6(1):e16526. doi: 10.1371/journal.pone.0016526
Hoede C, Arnoux S, Moisset M, Chaumier T, Inizan O, Jamilloux V, Quesneville H. PASTEC: an automatic transposable element classification tool. PLoS One. 2014 May 2;9(5):e91929. doi: 10.1371/journal.pone.0091929. eCollection 2014. PMID: 24786468
Kurtz S, Narechania A, Stein JC, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008;9(1):517. doi: 10.1186/1471-2164-9-517
Schnable PS, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009 Nov 20;326(5956):1112-5. doi: 10.1126/science.1178534. Erratum in: Science. 2012 Aug 31;337(6098):1040. PMID: 19965430