Monday, September 12, 2022

A curated data resource of 214K metagenomes for characterization of the global antimicrobial resistome


Introduction

The vast amount of genomic data available in public data repositories is a unique and potentially important resource for research and genomic surveillance of antimicrobial resistance (AMR). Using these datasets, collected from locations all over the world across different years and from various sampling sources, could further aid our understanding of the emergence and distribution of antimicrobial resistance genes (ARGs).

Sharing genomic sequence data in one of the available repositories is today a vital and often mandatory step for publication in peer-reviewed journals. Several repositories have been created for this purpose by the members of the International Nucleotide Sequence Database Collaboration (INSDC) [1], including the European Nucleotide Archive (ENA) [2]. The amount of sequencing data available at ENA continues to increase, with an estimated doubling time of 18 months (https://www.ebi.ac.uk/ena/browser/about/statistics; accessed 2022-03-08).

Several approaches for analyzing genomic data, depending on the sample type, are already well established.

However, the exploration of these resources is often limited to a few research groups, since both sufficient skills in bioinformatics and access to high-performance computing resources are needed to handle the large volume of available data.

Existing collections of analyzed datasets tend to focus either on specific sample sources, such as humans [3,4], marine environments [5], or urban sewage [6,7], or on specific genera [8]. The COVID-19 pandemic in particular has highlighted the value of data sharing for tracing the spread and evolution of the virus [9]. Despite attempts to standardize the analysis workflows of these databases, they are limited in their ability to generalize across environments and locations. A recent study [10] shared a searchable collection of 661K bacterial genomes for exploring global bacterial diversity across different origins, providing an easy-to-access resource for genomic research. While this is an impressive data-sharing effort, the authors did not include metagenomic samples in their pipeline. Metagenomic methods aim to sequence all DNA in a sample and can be used to characterize the microbiome in different environments [11,12], discover novel organisms [13], monitor disease [14,15], and track specific genes, such as ARGs [5,6,16].

Here, we present a large-scale metagenomic analysis of 214,095 metagenomic samples retrieved from ENA. We used an assembly-free approach, aligning sequencing reads against ARGs and 16S/18S ribosomal RNA genes. We have previously published an in-depth analysis of the distribution of mobilized colistin resistance [17] based on these data. Now we both share the complete collection of mapping results and showcase how to characterize the global resistome and microbiome with this dataset. The curated metadata and mapping results are available at https://doi.org/10.5281/zenodo.6919377, with documentation at https://hmmartiny.github.io/mARG/Tables.html.

Materials and methods

Retrieval of metagenomes

We retrieved metagenomic datasets from ENA [2] that were uploaded between 2010-01-01 and 2020-01-01 and had a library source of “METAGENOMIC” and a library strategy of “WGS.” We collected 214,095 sequencing runs from 146,732 samples from 6,307 projects, corresponding to 442 Tbp of raw reads taking up 300 TB of storage. The associated metadata for each sample were also retrieved.
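The selection criteria above can be expressed as a single search request. The following is a minimal sketch that only builds the request URL, assuming the ENA Portal API search endpoint and its `library_source`, `library_strategy`, and `first_public` query fields; the choice of `first_public` as the "upload" date and the list of returned fields are assumptions made for illustration, not the authors' documented retrieval code.

```python
from urllib.parse import urlencode

# Assumed endpoint and field names for the ENA Portal API; check them
# against the current ENA documentation before relying on this sketch.
ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_ena_query(start="2010-01-01", end="2020-01-01"):
    """Return a search URL for WGS metagenomic runs made public in [start, end]."""
    query = (
        'library_source="METAGENOMIC" AND library_strategy="WGS" '
        f"AND first_public>={start} AND first_public<={end}"
    )
    params = {
        "result": "read_run",  # one row per sequencing run
        "query": query,
        "fields": "run_accession,sample_accession,study_accession,base_count,fastq_ftp",
        "format": "tsv",
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"


url = build_ena_query()
print(url)
```

Building the URL separately from fetching it keeps the filter easy to inspect before any of the several hundred terabytes of reads are requested.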

Preprocessing and mapping of sequencing reads

The retrieved raw FASTQ reads were trimmed and aligned against reference sequences, as described in Martiny et al. (2022) [17]. In brief, we used FastQC v.0.11.15 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for read quality checking and BBduk2 v.36.49 [18] for trimming the raw sequencing reads. With the k-mer-based alignment tool KMA 1.2.21 [19], the trimmed reads were mapped against reference sequences from 2 different databases: the AMR gene database ResFinder [20] (downloaded 2020-01-25), which contained 3,085 sequences of acquired ARGs, and the ribosomal RNA gene database Silva [21] (version 138, downloaded 2020-01-16), which had 2,225,272 reference sequences, more than 88% of them being 16S/18S rRNA genes. For KMA, we used the following alignment parameters: 1, -2, -3, and -1 for a match, mismatch, gap opening, and gap extension, respectively. For read pairing, we used a value of 7 and a minimum relative alignment score of 0.75. Data retrieval, quality checking, trimming, and read alignments were performed on the Danish National Supercomputer for Life Sciences (https://www.computerome.dk/).
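To make the scoring parameters concrete, here is a small illustrative sketch that scores an already-aligned read against a template with the stated values (1, -2, -3, -1) and applies the 0.75 threshold. It is not KMA's implementation: the affine-gap convention (the first gap position costs the opening penalty, later positions the extension penalty) and the definition of the relative score as score divided by alignment length are assumptions made for this example.

```python
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 1, -2, -3, -1
MIN_RELATIVE_SCORE = 0.75


def alignment_score(query, template):
    """Score two equal-length gapped strings ('-' marks a gap)."""
    if len(query) != len(template):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap = None  # which string held the previous gap character, if any
    for q, t in zip(query, template):
        if q == "-" or t == "-":
            side = "query" if q == "-" else "template"
            # Assumption: opening a gap costs GAP_OPEN, continuing it GAP_EXTEND.
            score += GAP_EXTEND if side == prev_gap else GAP_OPEN
            prev_gap = side
        else:
            score += MATCH if q == t else MISMATCH
            prev_gap = None
    return score


def passes_threshold(query, template):
    """Assumed reading: relative score = score / alignment length."""
    return alignment_score(query, template) / len(query) >= MIN_RELATIVE_SCORE
```

Under these assumptions, a single mismatch in an 8-base alignment already drops the relative score to 5/8 = 0.625, below the threshold, whereas one mismatch in a 20-base alignment gives 17/20 = 0.85 and passes.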

Standardization of metadata

The following attributes were standardized for each metagenome: sampling location, sampling host or environment (referred to as host below), and sampling date.

To standardize the label for sampling locations, we looked at the values entered in the two fields “country” and “location.” First, the latitude and longitude coordinates were mapped to a country using the Python library Shapely 1.7.1 [22] to find the matching area defined in one of the 3 public domain map datasets (countries, marine, and lakes) available in the Natural Earth Data collection. If the lookup failed or the coordinates were not given, the second step was to match the text in the country label to ISO 3166 country codes with a fuzzy search using the Python library PyCountry 20.7.3 (https://github.com/flyingcircusio/pycountry). Finally, if neither lookup yielded a match, we did a manual lookup of the country labels to standardize the text.

For the standardization of host labels, we mapped the taxonomic identifier given by the attribute “host_tax_id” to the NCBI Taxonomy database [23]; if that attribute was missing, the “tax_id” was used instead.

Since the only way to curate entered collection dates is to manually look up suspicious dates in published studies, which was deemed too time-intensive, we decided to replace dates later than 2020-01-01 in the sample attribute field “collection_date” with the missing value NULL.
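The three-step fallback for locations can be sketched with the standard library alone. This example deliberately swaps in a ray-casting point-in-polygon test for Shapely and `difflib` for PyCountry's fuzzy search, so that it runs without extra dependencies; the toy rectangles, country list, manual table, and the 0.8 similarity cutoff are all hypothetical stand-ins for the Natural Earth shapes and ISO 3166 data.

```python
import difflib

# Hypothetical toy data: rough rectangles as (lon, lat) vertices and a short
# country list. The real workflow used Natural Earth shapes and ISO 3166.
REGIONS = {
    "Denmark": [(8.0, 54.5), (13.0, 54.5), (13.0, 58.0), (8.0, 58.0)],
    "Iceland": [(-25.0, 63.0), (-13.0, 63.0), (-13.0, 67.0), (-25.0, 67.0)],
}
COUNTRIES = ["Denmark", "Iceland", "Germany", "United States"]
MANUAL = {"usa": "United States"}  # hand-curated fixes for stubborn labels


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count polygon edges crossed by a ray going east."""
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def standardize_location(lat=None, lon=None, country_text=None):
    # Step 1: map coordinates to a region.
    if lat is not None and lon is not None:
        for name, polygon in REGIONS.items():
            if point_in_polygon(lon, lat, polygon):
                return name
    if country_text:
        # Step 2: fuzzy match of the free-text country label.
        hits = difflib.get_close_matches(
            country_text.strip().title(), COUNTRIES, n=1, cutoff=0.8
        )
        if hits:
            return hits[0]
        # Step 3: fall back to the manual lookup table.
        return MANUAL.get(country_text.strip().lower())
    return None
```

For instance, `standardize_location(lat=55.7, lon=12.5)` resolves through the coordinates, while a misspelled label such as "Denmrk" is only caught by the fuzzy step.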
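The fallback from “host_tax_id” to “tax_id” amounts to a two-key lookup. The sketch below assumes each sample is a dictionary of ENA attributes and uses a tiny, hand-typed excerpt of taxonomy identifiers in place of the full NCBI Taxonomy database.

```python
# Illustrative excerpt of an NCBI Taxonomy id -> name table; the real
# mapping would come from the full NCBI Taxonomy database.
TAXONOMY = {
    9606: "Homo sapiens",
    10090: "Mus musculus",
    9031: "Gallus gallus",
    408172: "marine metagenome",
}


def host_label(sample):
    """Prefer 'host_tax_id'; fall back to 'tax_id' when it is missing or empty."""
    tax_id = sample.get("host_tax_id") or sample.get("tax_id")
    if tax_id is None:
        return None
    return TAXONOMY.get(int(tax_id))
```

Samples without either identifier, or with an identifier absent from the table, come back as `None` so that they can be reviewed separately.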
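The date rule is simple enough to state as code. This sketch assumes ISO-formatted date strings; treating unparsable values as missing is an additional assumption of the example, since the text only describes the cutoff.

```python
from datetime import date

CUTOFF = date(2020, 1, 1)


def clean_collection_date(raw):
    """Return an ISO date string, or None (NULL) if unparsable or after CUTOFF."""
    try:
        parsed = date.fromisoformat(raw)
    except (TypeError, ValueError):
        return None  # assumption: unparsable entries are also treated as missing
    return None if parsed > CUTOFF else parsed.isoformat()
```

A date equal to the cutoff is kept, because only dates later than 2020-01-01 are replaced.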

Results

Here, we present a large-scale mapping of 442 Tbp of raw reads from 214,095 metagenomic samples, suitable for analyzing the distribution of acquired antimicrobial resistance genes and 16S/18S rRNA genes. Furthermore, we have spent considerable effort standardizing 3 essential sample attributes: sampling date, location, and source. To facilitate easy access and use, we have shared the mapping results and corrected metadata in 3 different data formats (TSV, HDF, and MySQL dumps). We also provide tutorials with code examples in R and Python on using the data in different scenarios. All data files are available at https://doi.org/10.5281/zenodo.6919377.

By collecting the sequencing reads from ENA, we could also verify the inherent bias of specific sample types or sources being overrepresented simply because of their availability in the public repository. While the 214,095 metagenomic datasets were collected from 797 different hosts, most were of either human or marine origin (Fig 1A). A similarly skewed geographical distribution towards European and North American countries was observed in the sampling locations (Fig 1B). The distribution of samples according to sampling year shows that a considerable number were collected between 2010 and 2020 (Fig 1C).

Fig 1. Distribution of metagenomes shows the overrepresentation of samples from specific sources.

(a) Number of samples grouped per sampling host, where only hosts with more than 1,000 samples are plotted. (b) Sample locations for metagenomes with available GPS coordinates; each marker is a sample. A total of 83,903 samples did not have coordinates available. (c) Year in which a sample was collected. A total of 84,238 of the samples did not have a valid sampling date recorded. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377, and the base layer map was created with data from https://www.naturalearthdata.com/.


https://doi.org/10.1371/journal.pbio.3001792.g001

Of the more than 1.8 × 10^12 raw sequencing reads, corresponding to 442.1 Tbp, 93% were generated using Illumina sequencing technologies (S1 Fig). We mapped over 1.69 × 10^12 trimmed read fragments, with a median of 784,748 fragments per sample (range 1 to 916,901,400) (Fig 2A). Approximately 0.04% of all read fragments could be aligned to ARGs, and 0.19% to rRNA genes. Overall, the count of aligned read fragments increased with the amount of sequencing reads and bases available (S3 Fig). The number of aligned ARG fragments increased with the number of aligned rRNA fragments, although for 34% of the samples, we did not find any ARGs despite having read fragments aligning to 16S rRNA genes (Fig 2B). The microbial differences between the sampling origins were highlighted in the number of aligned fragments (S4 Fig).

The global abundance of antimicrobial resistance

To measure the global distribution of ARGs and the composition of the resistome, we calculated the abundance of ARGs as the log-ratio of ARG fragments over summed rRNA sequence fragments. Almost all of the reference sequences from the ResFinder database had at least 1 fragment aligned, and only 94 ARGs had no hits (S2 Fig). The median observed resistance load per metagenomic sample was 11.74 (log range: −1.45 to 23.52) (Fig 3A), which appeared to depend primarily on the geographic origin and environment (Fig 3B–3D) and not on the year the sample was taken. For example, samples originating from locations within Europe showed similar abundance levels for most of the samples but with several outliers, whereas samples from locations in the Oceania region had a wider load distribution with few outliers (Fig 3C).
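As a rough illustration of the abundance measure, the sketch below computes a per-sample log-ratio of summed ARG fragments over summed rRNA fragments. The natural logarithm and the absence of any gene-length normalization or scaling constant are assumptions of this example, so it should not be expected to reproduce the numerical scale reported above; the gene and reference names are placeholders.

```python
import math


def arg_abundance(arg_fragments, rrna_fragments):
    """Natural-log ratio of summed ARG fragments to summed rRNA fragments."""
    arg_total = sum(arg_fragments.values())
    rrna_total = sum(rrna_fragments.values())
    if arg_total == 0 or rrna_total == 0:
        return None  # the ratio is undefined without fragments on both sides
    return math.log(arg_total / rrna_total)


# Hypothetical fragment counts for one sample.
sample_args = {"tet(Q)": 30, "blaTEM-1": 10}
sample_rrna = {"16S_ref_a": 150, "16S_ref_b": 50}
print(arg_abundance(sample_args, sample_rrna))
```

Returning `None` for samples without ARG fragments keeps them distinguishable from samples with a genuinely low ratio, which matters given that about a third of the samples had rRNA hits but no ARG hits.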

Fig 3. Boxplots of ARG abundances in metagenomic samples show that levels vary across different origins.

(a) Distribution of ARG abundance per sample. (b) Distribution of sample-wise ARG abundance grouped by sampling year. (c) Sample-wise ARG abundance per sampling location. (d) Sample-wise ARG abundance grouped by hosts. Only hosts with more than 1,000 metagenomes analyzed are shown. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.


https://doi.org/10.1371/journal.pbio.3001792.g003

While the distribution of sample-wise resistance loads illustrates the high variability in this data collection (Fig 3), we observed that when we stratified the relative ARG read proportions per resistance class and sample type, there were clear separations between different groups (Fig 4). For the sampling years with a considerable number of samples available (2004 to 2019), the relative proportion of classes was fairly consistent, with Tetracycline reads being the most common, except for a spike of Beta-lactam reads in 2017 (Fig 4A). Across the continents and large water bodies, we observed that ARGs conferring resistance to Aminoglycosides or Beta-lactam antimicrobials were more frequent in water environments, whereas mainland regions had a more diverse distribution (Fig 4B). Once we stratified by sampling host or source, the distribution of resistance classes was highly dependent on the group, as seen by the high proportion of read fragments aligned to, for example, Phenicol for marine and soil samples and by Tetracycline reads being highly prevalent in mouse (Mus musculus) samples (Fig 4C).

Linking the microbiome diversity with resistance diversity

The relationship between the diversity of the microbiome and of the resistance genes was quantified by calculating the species richness and 2 alpha diversity measures (Shannon and Gini–Simpson) at the ARG level and at the phylum and genus taxonomic levels. Without considering the sample origin, we observed that a majority of the samples had both high microbial diversity and high ARG diversity (Figs 5 and S5). However, the relationship between the genus and ARG diversity indexes differed between sampling sources, with several groups containing samples that did not follow the assumption that the 2 diversity measures track each other. This suggests that increased diversity of microbes in, for example, soil samples does not necessarily lead to a higher diversity of resistance genes. Conversely, the chicken (Gallus gallus) samples showed elevated ARG diversity despite having lower microbial diversity (Fig 5).
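The stratification described above is a grouped normalization: fragment counts are summed per group and resistance class, then divided by the group total. A minimal sketch, assuming the input is a list of (group, resistance class, fragment count) records with made-up counts:

```python
from collections import defaultdict


def class_proportions(records):
    """records: iterable of (group, resistance_class, fragment_count)."""
    totals = defaultdict(float)
    per_class = defaultdict(lambda: defaultdict(float))
    for group, res_class, count in records:
        totals[group] += count
        per_class[group][res_class] += count
    # Divide each class sum by its group total; skip groups with no fragments.
    return {
        group: {c: n / totals[group] for c, n in classes.items()}
        for group, classes in per_class.items()
        if totals[group] > 0
    }


# Hypothetical counts, chosen only to illustrate the calculation.
records = [
    ("Mus musculus", "Tetracycline", 80),
    ("Mus musculus", "Beta-lactam", 20),
    ("marine", "Phenicol", 30),
    ("marine", "Aminoglycoside", 10),
]
proportions = class_proportions(records)
print(proportions["Mus musculus"]["Tetracycline"])
```

The group key can be a sampling year, a region, or a host, which gives the three views of the same counts.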
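For reference, the three measures can be written in a few lines. With p_i the proportion of fragments assigned to feature i (an ARG, genus, or phylum), richness is the number of features with a nonzero count, the Shannon index is H = -Σ p_i ln p_i, and the Gini–Simpson index is 1 - Σ p_i². The sketch below uses the natural logarithm, which is an assumption, as the base is not stated here.

```python
import math


def richness(counts):
    """Number of features observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over features with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def gini_simpson(counts):
    """Gini-Simpson index 1 - sum(p^2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)
```

Four equally abundant features give a richness of 4, a Shannon index of ln 4, and a Gini–Simpson index of 0.75; a sample dominated by a single feature scores 0 on both diversity indexes.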

Discussion

Global surveillance of AMR based on genomics continues to become more accessible because of advances in NGS technologies and the practice of sharing raw sequencing data in public repositories. Standardized pipelines and databases are needed to use these large data volumes for tracking the dissemination of AMR. We have uniformly processed the sequencing reads of 214,095 metagenomes for the abundance analysis of ARGs.

Our data sharing efforts enable users to perform abundance analyses of individual ARGs, the resistome, and the microbiome across different environments, geographic locations, and sampling years.

We have given a brief characterization of the distribution of ARGs across the collection of metagenomes. However, in-depth analyses remain to be carried out to investigate the influence of temporal, geographical, and environmental origins on the dissemination and evolution of antimicrobial resistance. For example, analyzing the spread of specific ARGs across locations and different environments could reveal new transmission routes of resistance and guide the design of intervention strategies to stop the spread. We have previously published a study focusing on the distribution of mobilized colistin resistance (mcr) genes using this data resource, showing how widely disseminated the genes were [17]. Another use of the data collection could be to explore how changes in microbial abundances affect and are affected by the resistome. Furthermore, our coverage statistics of reads aligned to ARGs could be used to investigate the rate of new variants occurring in different reservoirs. Although we have focused on the threat of antimicrobial resistance, another potential application of this resource is to look at the effects of, for example, climate change on microbial compositions. Our observed read fragment counts could also be linked with other types of genomic data, for example to evaluate the risk of ARG mobility, accessibility, and pathogenicity in assembled genomes [27,28] and to verify observations from clinical data [29].

We recommend that potential users consider all the confounders present in this data collection in their statistical tests and modeling workflows, keeping in mind that the experimental methods and sequencing platforms dictate the obtained sequencing reads and that metadata for a sample can be mislabeled, despite our efforts to minimize these kinds of errors. Furthermore, it is essential to consider the compositional nature of microbiomes [30]. The total number of reads obtained does not reflect the amount of genetic material in the sample but the capacity of the sequencing platform [24,31]. Various statistical methods already exist that take compositionality into account [24,32,33]. Finally, it is important to highlight that the results we have presented here include fragment counts of 1 for the sake of transparency, but we recommend that potential users consider appropriate filters in their analysis.

The sequencing data in public repositories have continued to grow, giving us plenty of opportunities to expand our data collection even further. To establish a truly global surveillance program of AMR, sequencing data should be analyzed as soon as they are published in these archives. Although this would require access to even more computational resources, we hope to achieve this soon and to compare our approach with other methods, such as AMRFinderPlus [34] and CARD [35]. As new sequencing technologies become more widely used, the settings for our alignment procedure should also be tuned to better take advantage of, and account for the issues of, different sequencing platforms.

With this data resource, we have taken a step towards enabling the scientific community to use the wealth of information in these metagenomic samples to expand our understanding of the dissemination of antimicrobial resistance and of changes in microbiomes at both local and global scales across time and environments.
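One way a user might act on these two recommendations is to filter out low-support features and then apply a centered log-ratio (CLR) transform, a common starting point for compositional data. The sketch below is only an example: the minimum of 2 fragments and the pseudocount of 0.5 are arbitrary illustrative choices, not values recommended by the authors, and the gene names and counts are made up.

```python
import math


def filter_counts(counts, min_fragments=2):
    """Drop features supported by fewer than min_fragments fragments."""
    return {k: v for k, v in counts.items() if v >= min_fragments}


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform: log count minus the mean log count."""
    if not counts:
        return {}
    logs = {k: math.log(v + pseudocount) for k, v in counts.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {k: v - mean_log for k, v in logs.items()}


# Hypothetical per-sample ARG fragment counts.
counts = {"tet(Q)": 30, "blaTEM-1": 9, "mcr-1": 1}
filtered = filter_counts(counts)
transformed = clr(filtered)
print(transformed)
```

CLR values sum to zero within a sample, so they describe each feature relative to the sample's geometric mean rather than to the sequencing depth.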

References

  1. Arita M., Karsch-Mizrachi I., Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res. (2021) 49, D121. pmid:33166387
  2. Leinonen R. et al. The European nucleotide archive. Nucleic Acids Res. (2011) 39, 44–47.
  3. Shao L., Liao J., Qian J., Chen W., Fan X. MetaGeneBank: a standardized database to study deep sequenced metagenomic data from human fecal specimen. BMC Microbiol. (2021) 21, 1–12.
  4. Almeida A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. (2021) 39, 105–114. pmid:32690973
  5. Cuadrat R. R. C., Sorokina M., Andrade B. G., Goris T., Dávila A. M. R. Global ocean resistome revealed: Exploring antibiotic resistance gene abundance and distribution in TARA Oceans samples. Gigascience. (2020) 9, 1–12. pmid:32391909
  6. Hendriksen R. S. et al. Global monitoring of antimicrobial resistance based on metagenomics analyses of urban sewage. Nat Commun. (2019) 10. pmid:30850636
  7. Fresia P. et al. Urban metagenomics uncover antibiotic resistance reservoirs in coastal beach and sewage waters. Microbiome. (2019) 7, 1–9.
  8. Zhou Z., Alikhan N. F., Mohamed K., Fan Y., Achtman M. The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Res. (2020) 30, 138–152. pmid:31809257
  9. Khare S. et al. GISAID’s Role in Pandemic Response. China CDC Wkly. (2021) 3, 1049–1051. pmid:34934514
  10. Blackwell G. A. et al. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol. (2021) 19, e3001421. pmid:34752446
  11. Fierer N. et al. Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proc Natl Acad Sci U S A. (2012) 109, 21390–21395. pmid:23236140
  12. Gill S. R. et al. Metagenomic analysis of the human distal gut microbiome. Science. (2006) 312, 1355–1359. pmid:16741115
  13. Al-Shayeb B. et al. Clades of huge phages from across Earth’s ecosystems. Nature. (2020) 578, 425–431. pmid:32051592
  14. Nieuwenhuijse D. F. et al. Setting a baseline for global urban virome surveillance in sewage. Sci Rep. (2020) 10, 1–13.
  15. Liu P., Chen W., Chen J. P. Viral Metagenomics Revealed Sendai Virus and Coronavirus Infection of Malayan Pangolins (Manis javanica). Viruses. (2019) 11, 979. pmid:31652964
  16. Forsberg K. J. et al. Bacterial phylogeny structures soil resistomes across habitats. Nature. (2014) 509, 612–616. pmid:24847883
  17. Martiny H.-M. et al. Global distribution of mcr gene variants in 214,095 metagenomic samples. mSystems. (2022). pmid:35343801
  18. Bushnell B. BBMap. (2014).
  19. Clausen P. T. L. C., Aarestrup F. M., Lund O. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics. (2018) 19, 1–8.
  20. Zankari E. et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. (2012) 67, 2640–2644. pmid:22782487
  21. Quast C. et al. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res. (2013) 41, 590–596. pmid:23193283
  22. Gillies S. et al. Shapely: manipulation and analysis of geometric objects. (2007).
  23. Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. (2012) 40, D136–D143. pmid:22139910
  24. Gloor G. B., Macklaim J. M., Pawlowsky-Glahn V., Egozcue J. J. Microbiome datasets are compositional: And this is not optional. Front Microbiol. (2017) 8, 1–6.
  25. Shannon C. E. A mathematical theory of communication. Bell Syst Tech J. (1948) 27, 379–423.
  26. Jost L. Entropy and diversity. Oikos. (2006) 113, 363–375.
  27. Zhang A. N. et al. An omics-based framework for assessing the health risk of antimicrobial resistance genes. Nat Commun. (2021) 12, 1–11.
  28. Zhang Z. et al. Assessment of global health risk of antibiotic resistance genes. Nat Commun. (2022) 13. pmid:35322038
  29. Karkman A., Berglund F., Flach C. F., Kristiansson E., Larsson D. G. J. Predicting clinical resistance prevalence using sewage metagenomic data. Commun Biol. (2020) 3, 1–10.
  30. Aitchison J. The Statistical Analysis of Compositional Data. J R Stat Soc Ser B. (1982) 44, 139–160.
  31. Quinn T. P. et al. A field guide for the compositional analysis of any-omics data. Gigascience. (2019) 8, 1–14. pmid:31544212
  32. Fernandes A. D., Macklaim J. M., Linn T. G., Reid G., Gloor G. B. ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq. PLoS ONE. (2013) 8. pmid:23843979
  33. Friedman J., Alm E. J. Inferring Correlation Networks from Genomic Survey Data. PLoS Comput Biol. (2012) 8, 1–11.
  34. Feldgarden M. et al. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. (2021) 11.
  35. Alcock B. P. et al. CARD 2020: Antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. (2020) 48, D517–D525. pmid:31665441