Zepu Miao^, Yifan Ren^, Andrea Tarabini^, Ludong Yang, Huihui Li, Chang Ye, Gianni Liti, Gilles Fischer, Jing Li*, Jia-Xing Yue*. (2024) ScRAPdb: an integrated pan-omics database for the Saccharomyces cerevisiae reference assembly panel. Nucleic Acids Research, (in press; doi: 10.1093/nar/gkae955)
Jia-Xing Yue (yuejiaxing[at]gmail[dot]com)
[Q: question; R: reply]
Q1: How were the genome assembly sequences collected?
R1: An exhaustive search was conducted to collect currently available long-read-based high-quality genome assemblies (for both nuclear and mitochondrial genomes) and their associated metadata for S. cerevisiae and its close relatives from the Saccharomyces species complex. These outgroup species include S. paradoxus, S. mikatae, S. kudriavzevii, S. arboricola, S. eubayanus, S. uvarum, as well as the recently described S. jurei and S. chiloensis. For each included Saccharomyces species, a combination of manual literature curation, GenBank queries, and automatic web crawling was used. For the GenBank queries, all chromosome-level Saccharomyces genome assemblies uploaded to GenBank since 2015 were retrieved, and those sequenced with long-read sequencing technologies were kept for downstream quality control. For automatic web crawling, we developed a Python script using Biopython’s Entrez module to enable an extensive and in-depth literature-based search for long-read genomic datasets.
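As an illustration of the kind of Entrez-based search described in R1, the following is a minimal sketch and not the actual ScRAPdb script. The helper names (build_entrez_term, search_ids), the keyword list, the default database, and the placeholder e-mail address are all assumptions made for this example. Field tags such as [Organism] and [PDAT] differ between NCBI databases, so the term may need adjusting for the database being queried.

```python
def build_entrez_term(species, min_year, keywords=("nanopore", "pacbio", "long-read")):
    """Compose an Entrez search term for one species (hypothetical helper).

    The field tags and the open-ended date range are assumptions; check them
    against the NCBI database you actually query.
    """
    keyword_clause = " OR ".join(f'"{k}"' for k in keywords)
    return f'"{species}"[Organism] AND ({keyword_clause}) AND {min_year}:3000[PDAT]'


def search_ids(term, db="nucleotide", email="you@example.org", retmax=100):
    """Run the search through Biopython's Entrez module (needs network access)."""
    from Bio import Entrez  # imported lazily so the term builder works without Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db=db, term=term, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])


if __name__ == "__main__":
    # Only builds and prints the term; call search_ids(term) to contact NCBI.
    print(build_entrez_term("Saccharomyces cerevisiae", 2015))
```

The term builder is kept separate from the network call so that the query can be inspected and adjusted before anything is sent to NCBI.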
Q2: How were these genome assemblies processed and annotated?
R2: All collected genome assemblies were processed with our previously developed LRSDAY pipeline[1] for assembly curation, chromosome-level scaffolding, and genomic feature annotation. The annotated features include centromeres, genes, tRNAs, Ty transposable elements, X elements, and Y’ elements. Careful curation was performed at both the assembly and annotation levels to eliminate potentially problematic assemblies.
Q3: How was the Average Nucleotide Identity (ANI) calculated in ScRAPdb?
R3: For nuclear and mitochondrial genomes, pairwise genome comparison and Average Nucleotide Identity (ANI) calculation were conducted by OrthoANI (v0.50)[2].
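To give a feel for what an ANI value summarizes, here is a simplified sketch of the reciprocal-best-hit averaging idea. It is not the OrthoANI implementation: the fragmenting and alignment steps are omitted, no alignment-length filter is applied, and the function name and input layout (dictionaries mapping each fragment to its best hit and percent identity) are assumptions made for this example.

```python
def average_nucleotide_identity(forward_hits, reverse_hits):
    """Average percent identity over reciprocal best hits.

    forward_hits: {fragment_of_A: (best_fragment_of_B, percent_identity)}
    reverse_hits: {fragment_of_B: (best_fragment_of_A, percent_identity)}
    Only pairs that are each other's best hit are counted (illustrative layout).
    """
    identities = []
    for a_frag, (b_frag, fwd_identity) in forward_hits.items():
        back = reverse_hits.get(b_frag)
        if back is not None and back[0] == a_frag:
            # Average the two directions for this reciprocal pair.
            identities.append((fwd_identity + back[1]) / 2)
    if not identities:
        raise ValueError("no reciprocal best hits found")
    return sum(identities) / len(identities)


if __name__ == "__main__":
    forward = {"a1": ("b1", 99.0), "a2": ("b2", 97.0), "a3": ("b9", 80.0)}
    reverse = {"b1": ("a1", 98.0), "b2": ("a2", 97.0), "b9": ("a7", 85.0)}
    print(average_nucleotide_identity(forward, reverse))
```

In this toy input the a3/b9 pair is ignored because b9's best hit is a different fragment, so only two reciprocal pairs contribute to the average.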
Q4: How were SNVs, INDELs, and SVs called and processed in ScRAPdb?
R4: For S. cerevisiae assemblies, the full spectrum of genomic variants, including single nucleotide variants (SNVs), insertions/deletions (INDELs), and structural variants (SVs), was detected using PAV (v2.3.4)[3] against the S. cerevisiae reference genome (SGDref). The called variants were further processed with VEP (v109.3)[4] for variant effect prediction.
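For readers working with the resulting variant calls, the sketch below shows one common way to sort VCF-style REF/ALT alleles into the three classes named in R4. It is an illustration only and does not reproduce PAV's internal logic. The function name and the 50 bp boundary between INDELs and SVs are assumptions here (50 bp is a widely used convention).

```python
def classify_variant(ref, alt, sv_min_length=50):
    """Label a variant as SNV, INDEL, SV, or OTHER from its REF/ALT alleles.

    sv_min_length is an assumed size cutoff, not a value taken from ScRAPdb.
    """
    if alt.startswith("<"):
        # Symbolic alleles such as <INV> or <DEL> describe structural variants.
        return "SV"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Equal-length multi-base substitutions fit none of the three classes.
        return "OTHER"
    return "SV" if size >= sv_min_length else "INDEL"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "ATG"), ("A", "A" + "T" * 60), ("A", "<INV>")]:
        print(ref, alt[:10], classify_variant(ref, alt))
```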
Q5: What are the original sources of the pan-omics datasets presented in ScRAPdb?
R5: The pan-omics datasets used in this database come from the published studies listed below:
The pangenome dataset: [5] [LINK]
The pantranscriptome dataset: [6] [LINK]
The panproteome dataset 1: [7] [LINK]
The panproteome dataset 2: [8] [LINK]
The panphenome dataset 1: [5] [LINK]
The panphenome dataset 2: [9] [LINK]
Q6: What omics data could I submit to ScRAPdb and what standardization procedure is needed for that?
R6: If you have published omics data (e.g., genome, transcriptome, proteome, phenome, etc.) for one or more strains of S. cerevisiae or its close relatives, you are welcome to contact us about hosting your data at ScRAPdb. Datasets covering more strains, environments, or traits will have higher priority. For genome data, a long-read-based genome assembly is required. Additional quality control and annotation will be performed on our side using our LRSDAY pipeline [1]. For transcriptome data, raw read counts need to be transformed to the widely adopted TPM (transcripts per million) values for standardization. For proteome data, the raw mass spectrometry data need to be processed by DIA-NN [10] to obtain normalized protein abundances. For phenome data, the doubling time or yield values will be normalized relative to the corresponding values observed in YPD medium.
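The two numerical standardizations mentioned in R6 can be sketched as follows. This is a minimal illustration, not ScRAPdb's own code: the function names and input layouts are hypothetical, the TPM calculation uses the standard definition (length-normalized counts rescaled to sum to one million), and expressing phenotypes as a simple ratio to the YPD value is an assumption about how "normalized relative to YPD" is computed.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM given transcript lengths in base pairs."""
    # Reads per kilobase for each gene.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so that the values of one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def normalize_to_ypd(values_by_condition, reference="YPD"):
    """Express each condition's value as a ratio to the YPD value (assumed form)."""
    ref_value = values_by_condition[reference]
    return {cond: value / ref_value for cond, value in values_by_condition.items()}


if __name__ == "__main__":
    print(counts_to_tpm({"g1": 10, "g2": 20}, {"g1": 1000, "g2": 2000}))
    print(normalize_to_ypd({"YPD": 2.0, "YPD+NaCl": 3.0}))
```

TPM requires transcript (or gene) lengths in addition to the raw counts, so a length table for the annotated features should accompany any count matrix.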
1. Yue J-X, Liti G. (2018) Long-read sequencing data analysis for yeasts. Nature Protocols. 13:1213–1231. doi:10.1038/nprot.2018.025
2. Lee I, Ouk Kim Y, Park S-C, Chun J. (2016) OrthoANI: An improved algorithm and software for calculating average nucleotide identity. International Journal of Systematic and Evolutionary Microbiology. 66:1100–1103. doi:10.1099/ijsem.0.000760
3. Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, et al. (2021) Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 372:eabf7117. doi:10.1126/science.abf7117
4. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. (2016) The Ensembl variant effect predictor. Genome Biology. 17:122. doi:10.1186/s13059-016-0974-4
5. Peter J, De Chiara M, Friedrich A, Yue J-X, Pflieger D, Bergström A, et al. (2018) Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature. 556:339–344. doi:10.1038/s41586-018-0030-5
6. Caudal É, Loegler V, Dutreux F, Vakirlis N, Teyssonnière É, Caradec C, et al. (2024) Pan-transcriptome reveals a large accessory genome contribution to gene expression variation in yeast. Nature Genetics. 56:1278–1287. doi:10.1038/s41588-024-01769-9
7. Teyssonnière EM, Trébulle P, Muenzner J, Loegler V, Ludwig D, Amari F, et al. (2024) Species-wide quantitative transcriptomes and proteomes reveal distinct genetic control of gene expression variation in yeast. Proceedings of the National Academy of Sciences. 121(19). doi:10.1073/pnas.2319211121
8. Muenzner J, Trébulle P, Agostini F, Zauber H, Messner CB, Steger M, et al. (2024) Natural proteome diversity links aneuploidy tolerance to protein turnover. Nature. 630:149–157. doi:10.1038/s41586-024-07442-9
9. De Chiara M, Barré BP, Persson K, Irizar A, Vischioni C, Khaiwal S, et al. (2022) Domestication reprogrammed the budding yeast life cycle. Nature Ecology and Evolution. 6:448–460. doi:10.1038/s41559-022-01671-9
10. Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M. (2020) DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods. 17:41–44. doi:10.1038/s41592-019-0638-x