A curated data resource of 214K metagenomes for characterization of the global antimicrobial resistome

September 12, 2022

Summary

The growing threat of antimicrobial resistance (AMR) calls for new epidemiological surveillance methods, as well as a deeper understanding of how antimicrobial resistance genes (ARGs) have been transmitted around the world. The large pool of sequencing data available in public repositories provides an excellent resource for monitoring the temporal and spatial dissemination of AMR in different ecological settings. However, only a limited number of research groups globally have the computational resources to analyze such data. We retrieved 442 Tbp of sequencing reads from 214,095 metagenomic samples from the European Nucleotide Archive (ENA) and aligned them using a uniform approach against ARGs and 16S/18S rRNA genes. Here, we present the results of this extensive computational analysis and share the counts of reads aligned. Over 6.76∙10⁸ read fragments were assigned to ARGs and 3.21∙10⁹ to rRNA genes, where we observed distinct differences in both the abundance of ARGs and the link between microbiome and resistome compositions across various sampling types. This collection is another step towards establishing global surveillance of AMR and can serve as a resource for further research into the environmental spread and dynamic changes of ARGs.

Citation: Martiny H-M, Munk P, Brinch C, Aarestrup FM, Petersen TN (2022) A curated data resource of 214K metagenomes for characterization of the global antimicrobial resistome. PLoS Biol 20(9): e3001792.

https://doi.org/10.1371/journal.pbio.3001792

Academic Editor: Tobias Bollenbach, Universität zu Köln, GERMANY

Received: May 12, 2022; Accepted: August 9, 2022; Published: September 6, 2022

Copyright: © 2022 Martiny et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The code to produce the figures is available at https://github.com/hmmartiny/mARG. The data have been deposited at https://doi.org/10.5281/zenodo.6919377, and documentation of the various tables can be accessed at https://hmmartiny.github.io/mARG.

Funding: This work was supported by the European Union’s Horizon H2020 grant VEO (874735) and the Novo Nordisk Foundation (grant NNF16OC0021856: Global Surveillance of Antimicrobial Resistance). HMM, PM, CB, TNP, and FMA were all supported by both grants. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: AMR, antimicrobial resistance; ARG, antimicrobial resistance gene; ENA, European Nucleotide Archive; INSDC, International Nucleotide Sequence Database Collaboration; mcr, mobilized colistin resistance

Introduction

The vast amount of genomic data available in public data repositories is a unique and potentially important resource for research and genomic surveillance of antimicrobial resistance (AMR). Using these datasets, collected from locations all over the world across different years and from various sampling sources, could further aid our understanding of the emergence and distribution of antimicrobial resistance genes (ARGs).

Sharing genomic sequence data through one of the available repositories is today a vital and often mandatory step for publication in peer-reviewed journals, for which several repositories were created by the members of the International Nucleotide Sequence Database Collaboration (INSDC) [1], including the European Nucleotide Archive (ENA) [2]. The amount of sequencing data available at ENA continues to increase with an estimated doubling time of 18 months (https://www.ebi.ac.uk/ena/browser/about/statistics; accessed 2022-03-08).

Several approaches for analyzing genomic data, depending on the sample types, are already well established.

However, the exploration of these resources is often limited to a few research groups, since both sufficient skills in bioinformatics and access to high-performance computing resources are needed to handle the large amount of available data.

Existing collections of analyzed datasets tend to focus on either specific sample sources, such as humans [3,4], marine [5], or urban sewage [6,7], or on specific genera [8]. The COVID-19 pandemic in particular has highlighted the value of data sharing to trace the spread and evolution of the virus [9]. Despite the attempts to standardize the analysis workflows of these databases, they are limited in their ability to generalize across environments and locations. A recent study [10] has shared a searchable collection of 661K bacterial genomes for exploring the global bacterial diversity across different origins, providing an easy-to-access resource for genomic research. While this is an impressive data-sharing effort, the authors did not include metagenomic samples in their pipeline. Metagenomic techniques aim to sequence all DNA in a sample and can be used to characterize the microbiome in different environments [11,12], discover novel organisms [13], and monitor disease [14,15] and specific genes, such as ARGs [5,6,16].

Here, we present a large-scale metagenomic analysis of 214,095 metagenomic samples retrieved from ENA. We have performed an assembly-free approach by aligning sequencing reads against ARGs and 16S/18S ribosomal RNA genes. We have previously published an in-depth analysis of the distribution of mobilized colistin resistance [17] based on these data. Now we both share the entire collection of mapping results and showcase how to characterize the global resistome and microbiome with this dataset. The curated metadata and mapping results are available at https://doi.org/10.5281/zenodo.6919377 and documentation at https://hmmartiny.github.io/mARG/Tables.html.

Materials and methods

Retrieval of metagenomes

We retrieved metagenomic datasets from ENA [2] uploaded between 2010-01-01 and 2020-01-01 that had library source “METAGENOMIC” and library strategy “WGS.” We collected 214,095 sequencing runs from 146,732 samples from 6,307 projects, corresponding to 442 Tbp of raw reads taking up 300 TB of storage. The associated metadata for each sample was also retrieved.
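The selection criteria above can be illustrated with a small filter over run metadata. This is a minimal sketch, not the retrieval code used for the collection: the record layout, the field name `uploaded`, the run identifiers, and the inclusive treatment of both boundary dates are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical run records; the field names are assumptions about how
# the metadata might be laid out, not the actual ENA schema.
RUNS = [
    {"run": "RUN_A", "library_source": "METAGENOMIC", "library_strategy": "WGS", "uploaded": date(2015, 6, 1)},
    {"run": "RUN_B", "library_source": "GENOMIC", "library_strategy": "WGS", "uploaded": date(2016, 3, 9)},
    {"run": "RUN_C", "library_source": "METAGENOMIC", "library_strategy": "AMPLICON", "uploaded": date(2018, 1, 2)},
    {"run": "RUN_D", "library_source": "METAGENOMIC", "library_strategy": "WGS", "uploaded": date(2021, 4, 4)},
]

START = date(2010, 1, 1)  # start of the upload window stated in the text
END = date(2020, 1, 1)    # end of the upload window stated in the text


def select_runs(runs, start=START, end=END):
    """Keep WGS metagenomic runs uploaded within [start, end] (inclusive, by assumption)."""
    return [
        r for r in runs
        if r["library_source"] == "METAGENOMIC"
        and r["library_strategy"] == "WGS"
        and start <= r["uploaded"] <= end
    ]


selected = [r["run"] for r in select_runs(RUNS)]
print(selected)  # ['RUN_A']
```

Only the first record passes all three conditions; the others fail on library source, library strategy, or upload date, respectively.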

Preprocessing and mapping of sequencing reads

The retrieved raw FASTQ reads were trimmed and aligned against reference sequences, as outlined in Martiny (2022) [17]. In brief, we used FASTQC v.0.11.15 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for read quality checking and BBduk2 v.36.49 [18] for trimming the raw sequencing reads. With the k-mer-based alignment tool KMA 1.2.21 [19], the trimmed reads were mapped against reference sequences from 2 different databases: the AMR gene database ResFinder [20] (downloaded 2020-01-25), which contained 3,085 sequences of acquired ARGs, and the rRNA gene database Silva [21] (version 138, downloaded 2020-01-16), which had 2,225,272 reference sequences, more than 88% of them being 16S/18S rRNA genes. For KMA, we used the following alignment parameters: 1, −2, −3, and −1 for a match, mismatch, gap opening, and gap extension, respectively. For read pairing, we used a penalty of 7 and a minimum relative alignment score of 0.75. Data retrieval, quality checking, trimming, and read alignments were done using the Danish National Supercomputer for Life Sciences (https://www.computerome.dk/).

Standardization of metadata

The following attributes for each metagenome were standardized: sampling location, sampling host or environment (referred to as host below), and sampling date.

To standardize the label for sampling locations, we looked at the values entered in the two fields “country” and “location.” First, the latitude and longitude coordinates were mapped to a country using the Python library Shapely 1.7.1 [22] to find the matching area defined in one of the 3 public domain map datasets (countries, marine, and lakes) available in the Natural Earth Data collection. If the lookup failed or the coordinates were not given, the second step was to match the text attribute in the country label to ISO 3166 country codes with a fuzzy search using the Python library PyCountry 20.7.3 (https://github.com/flyingcircusio/pycountry). Finally, if the 2 lookup searches did not yield a match, we did a manual lookup of the country labels to standardize the text.
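The three-step fallback described above (coordinates, then fuzzy text match, then manual lookup) can be sketched with the standard library alone. This is an illustration of the control flow only, under stated assumptions: the bounding box stands in for the Shapely polygon lookup, `difflib` stands in for the PyCountry fuzzy search, and the tiny country table, the manual corrections, and the similarity cutoff are all hypothetical.

```python
import difflib

# Tiny illustrative table; a real run would use the full ISO 3166 list.
ISO_COUNTRIES = {"Denmark": "DK", "Germany": "DE", "United States": "US", "Kenya": "KE"}

# Hypothetical manual corrections for labels the fuzzy match cannot resolve.
MANUAL_FIXES = {"USA": "US"}


def country_from_coordinates(lat, lon):
    """Stand-in for the polygon lookup; returns None when no area matches."""
    # Crude bounding box used instead of a real country polygon (assumption).
    if lat is not None and lon is not None and 54.5 <= lat <= 57.8 and 8.0 <= lon <= 12.7:
        return "DK"
    return None


def country_from_text(label, cutoff=0.8):
    """Fuzzy-match a free-text label against the country names."""
    if not label:
        return None
    hits = difflib.get_close_matches(label.strip().title(), list(ISO_COUNTRIES), n=1, cutoff=cutoff)
    return ISO_COUNTRIES[hits[0]] if hits else None


def standardize_location(lat=None, lon=None, label=None):
    """Try coordinates first, then the fuzzy text match, then the manual table."""
    return (
        country_from_coordinates(lat, lon)
        or country_from_text(label)
        or MANUAL_FIXES.get((label or "").strip())
    )


print(standardize_location(lat=55.7, lon=12.5))  # resolved from coordinates
print(standardize_location(label="Germny"))      # resolved by fuzzy matching
print(standardize_location(label="USA"))         # resolved by the manual table
```

Ordering the steps from most to least reliable means a free-text label is consulted only when coordinates are absent or fall outside every known area.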

For the standardization of host labels, we mapped the taxonomic identifier given by the attribute “host_tax_id” to the NCBI Taxonomy database [23], or, if the attribute was missing, the “tax_id” was used instead.

Since the only way to curate entered collection dates is to look up suspicious dates in published studies manually, and that was deemed too time-intensive, we decided to replace dates entered as later than 2020-01-01 in the sample attribute field “collection_date” with the missing value NULL.
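The date rule above amounts to a single comparison against the cutoff. A minimal sketch follows, assuming dates arrive as ISO 8601 strings (YYYY-MM-DD) and that `None` plays the role of NULL; treating unparsable entries as missing is an additional assumption, not something the text states.

```python
from datetime import date

CUTOFF = date(2020, 1, 1)  # dates later than this are replaced with NULL


def curate_collection_date(value):
    """Return a date, or None (NULL) if the entry is missing, unparsable, or after the cutoff."""
    if not value:
        return None
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return None  # assumption: unparsable entries are also treated as missing
    return None if parsed > CUTOFF else parsed


print(curate_collection_date("2019-05-17"))  # kept
print(curate_collection_date("2031-01-01"))  # replaced with None
```

A date equal to the cutoff is kept, since the rule only replaces dates entered as later than 2020-01-01.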

Measuring the abundance of ARGs

Since we report the fragment count aligned to each reference gene, the mapping results are compositional and should be treated as such [24]. In the simplest form, the ARG abundance for a sample or sample group can be calculated as the log-ratio of the count of reads, n_i, aligned to each ARG i over the total sum of rRNA read fragments n_B:

where D is the number of ARGs and D_B is the number of read fragments aligned to rRNA genes. Each ARG count n_i has been adjusted by the length of the gene in kilobases.
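A log-ratio of this kind can be computed as sketched below. This is an illustration built from the verbal description only: the natural logarithm, the absence of any additional scaling constant, and the gene names and counts are assumptions, so the values it produces are not directly comparable to the loads reported in the results.

```python
import math


def length_adjusted(count, gene_length_bp):
    """Divide a fragment count by the gene length expressed in kilobases."""
    return count / (gene_length_bp / 1000.0)


def arg_log_ratio(arg_counts, arg_lengths_bp, rrna_counts):
    """Log-ratio of summed length-adjusted ARG counts over summed rRNA counts."""
    n_arg = sum(length_adjusted(arg_counts[g], arg_lengths_bp[g]) for g in arg_counts)
    n_rrna = sum(rrna_counts)
    if n_arg <= 0 or n_rrna <= 0:
        return None  # the log-ratio is undefined without counts in both parts
    return math.log(n_arg / n_rrna)


# Hypothetical sample: two ARGs and two rRNA reference sequences.
arg_counts = {"geneA": 30, "geneB": 10}
arg_lengths_bp = {"geneA": 1500, "geneB": 500}
rrna_counts = [1000, 3000]

load = arg_log_ratio(arg_counts, arg_lengths_bp, rrna_counts)
print(load)
```

Passing a dictionary with a single gene gives the per-gene log-ratio, while passing all genes gives a sample-level value.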

The relative abundances of resistance classes were calculated as the proportion of ARG reads assigned to the different classes, scaled with κ = 100:
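The class proportions described above can be sketched as a closure to a constant sum of κ. The gene names, the gene-to-class mapping, and the counts are hypothetical; whether counts are length-adjusted before this step is not specified here, so the sketch takes whatever counts it is given.

```python
from collections import defaultdict

KAPPA = 100  # scaling constant named in the text


def class_relative_abundance(arg_counts, arg_to_class, kappa=KAPPA):
    """Sum ARG counts per resistance class and scale the proportions to kappa."""
    totals = defaultdict(float)
    for gene, count in arg_counts.items():
        totals[arg_to_class[gene]] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no ARG reads, so no proportions can be formed
    return {cls: kappa * n / grand_total for cls, n in totals.items()}


# Hypothetical counts and class assignments.
arg_counts = {"geneA": 50, "geneB": 30, "geneC": 20}
arg_to_class = {"geneA": "Tetracycline", "geneB": "Tetracycline", "geneC": "Beta-lactam"}

proportions = class_relative_abundance(arg_counts, arg_to_class)
print(proportions)  # {'Tetracycline': 80.0, 'Beta-lactam': 20.0}
```

With κ = 100 the values read as percentages and sum to 100 within each sample or sample group.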

Diversity measurements

Besides the read abundance values, we report the species richness, Shannon diversity index [25], and the Gini–Simpson [26] diversity index of read counts of ARGs, genera, and phyla per sample. Species richness is the number of different genes or taxonomic groups present in the sample with at least 1 read fragment aligned.

The Shannon index (H′) was calculated using the proportions of reads:

whereas the Gini–Simpson index (GS) was calculated using the read counts n = [n₁,…,n_D], where N = ∑n is the total count of reads for the group:

In addition to these 2 indices, we also report the sample-wise unique number of reference sequences or taxonomic groups matched.
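The three diversity measures can be sketched from their textbook definitions. The choices below are assumptions rather than confirmed details of the published calculation: the Shannon index uses the natural logarithm, and the Gini–Simpson index uses the count-based form 1 − ∑ n_i(n_i − 1) / (N(N − 1)), chosen because the text describes it in terms of read counts and their total N.

```python
import math


def richness(counts):
    """Number of genes or taxa with at least 1 read fragment aligned."""
    return sum(1 for n in counts if n >= 1)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over the non-zero proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def gini_simpson(counts):
    """Count-based Gini-Simpson index: 1 - sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    total = sum(counts)
    if total < 2:
        return 0.0  # the count-based form needs at least 2 reads
    return 1.0 - sum(n * (n - 1) for n in counts) / (total * (total - 1))


# Hypothetical read counts for four reference sequences in one sample.
counts = [5, 5, 0, 10]
print(richness(counts), shannon(counts), gini_simpson(counts))
```

For these counts the richness is 3, since the reference with zero reads is not counted, and both indices increase as the reads are spread more evenly across references.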

Results

Here, we present a large-scale mapping of 442 Tbp of raw reads from 214,095 metagenomic samples, suitable for analyzing the distribution of acquired antimicrobial resistance genes and 16S/18S rRNA genes. Furthermore, we have spent considerable effort standardizing 3 essential sample attributes: sampling date, location, and source. To facilitate easy access and usage, we have shared the mapping results and corrected metadata in 3 different data formats (TSV, HDF, and MySQL dumps). We also provide tutorials with code examples in R and Python on using the data in different scenarios. Data files are all available at https://doi.org/10.5281/zenodo.6919377.

By collecting the sequencing reads from ENA, we could also verify the inherent bias of specific sample types or sources being overrepresented simply due to their availability in the public repository. While the 214,095 metagenomic datasets were collected from 797 different hosts, most were of either human or marine origin (Fig 1A). A similarly skewed geographical distribution towards European and North American countries was observed in the sampling locations (Fig 1B). The distribution of samples according to sampling year shows that a considerable number were collected between 2010 and 2020 (Fig 1C).

Fig 1. Distribution of metagenomes shows the overrepresentation of samples from specific sources.

(a) Number of samples grouped per sampling host, where only hosts with more than 1,000 samples are plotted. (b) Sample locations for metagenomes with available GPS coordinates; each marker is a sample. A total of 83,903 samples did not have coordinates available. (c) Year in which a sample was collected. A total of 84,238 of the samples did not have a valid sampling date recorded. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377, and the base layer map was created with data from https://www.naturalearthdata.com/.

https://doi.org/10.1371/journal.pbio.3001792.g001

Of the more than 1.8∙10¹² raw sequencing reads, corresponding to 442.1 Tbp, 93% were generated using Illumina sequencing technologies (S1 Fig). We mapped over 1.69∙10¹² trimmed read fragments, with a median of 784,748 fragments per sample (range 1 to 916,901,400) (Fig 2A). Approximately 0.04% of all read fragments could be aligned to ARGs, and 0.19% to rRNA genes. Overall, the amount of sequencing reads and bases available did increase the count of aligned read fragments (S3 Fig). The number of ARG fragments aligned increased with the number of aligned rRNA fragments, although for 34% of the samples, we did not find any ARGs despite having read fragments aligning to 16S rRNA genes (Fig 2B). The microbial differences between the sampling origins were highlighted in the number of aligned fragments (S4 Fig).

The global abundance of antimicrobial resistance

To measure the global distribution of ARGs and the composition of the resistome, we calculated the abundance of ARGs as the log-ratio of ARG fragments over summed rRNA sequence fragments. Almost all of the reference sequences from the ResFinder database had at least 1 fragment aligned, and only 94 ARGs had no hits (S2 Fig). The median observed resistance load per metagenomic sample was 11.74 (log range: −1.45 to 23.52) (Fig 3A), which appeared to be mainly dependent on the geographic origin and environment (Fig 3B–3D) and not on which year the sample was taken. For example, samples originating from locations within Europe showed similar abundance levels for most of the samples but with several outliers, whereas several samples from locations in the Oceania region had a wider load distribution with few outliers (Fig 3C).

Fig 3. Boxplots of ARG abundances in metagenomic samples show that levels vary across different origins.

(a) Distribution of ARG abundance per sample. (b) Distribution of sample-wise ARG abundance grouped by sampling year. (c) Sample-wise ARG abundance per sampling location. (d) Sample-wise ARG abundance grouped by hosts. Only hosts with more than 1,000 metagenomes analyzed are shown. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

https://doi.org/10.1371/journal.pbio.3001792.g003

While the distribution of sample-wise resistance loads illustrates the high variability in this data collection (Fig 3), we observed that when we stratified the relative ARG read proportions per resistance class and sample type, there were clear separations between different groups (Fig 4). For the sampling years with a considerable number of samples available (2004 to 2019), the relative proportion of classes was relatively consistent, with Tetracycline reads being the most common, except for a spike of Beta-lactam reads in 2017 (Fig 4A). Across the continents and large water bodies, we observed that ARGs conferring resistance to Aminoglycosides or Beta-lactam antimicrobials were more common in water environments, whereas mainland areas had a more diverse distribution (Fig 4B). Once we stratified by sampling host or source, the distribution of resistance classes was very dependent on the group, as seen by the high proportion of read fragments aligned to, for example, Phenicol for marine and soil samples and Tetracycline reads being highly prevalent in mouse (Mus musculus) samples (Fig 4C).

Linking the microbiome diversity with resistance diversity

The relationship between the diversity of the microbiome and the resistance genes was quantified by calculating the species richness and 2 alpha diversity measurements (Shannon and Gini–Simpson) at the ARG level and at the phylum and genus taxonomic levels. Without looking at the sample origin, we observed that a majority of the samples had both high microbial diversity and ARG diversity (Figs 5 and S5). However, the relationship between genera and ARG diversity indexes differed between sampling sources, with several groups containing samples that did not follow the assumption of the 2 diversity measurements tracking each other, suggesting that increased diversity of microbes in, for example, soil samples does not necessarily lead to a higher diversity of resistance genes. Conversely, the chicken (Gallus gallus) samples showed elevated ARG diversity despite having lower microbial diversity (Fig 5).

Discussion

Global surveillance of AMR based on genomics continues to become more accessible due to the advancement in NGS technologies and the practice of sharing raw sequencing data in public repositories. Standardized pipelines and databases are needed to utilize these large data volumes for tracking the dissemination of AMR. We have uniformly processed the sequencing reads of 214,095 metagenomes for the abundance analysis of ARGs.

Our data sharing efforts enable users to perform abundance analyses of individual ARGs, the resistome, and the microbiome across different environments, geographic locations, and sampling years.

We have given a brief characterization of the distribution of ARGs according to the collection of metagenomes. However, in-depth analyses remain to be carried out to investigate the influence of temporal, geographical, and environmental origins on the dissemination and evolution of antimicrobial resistance. For example, analyzing the spread of specific ARGs across locations and different environments could reveal new transmission routes of resistance and guide the design of intervention strategies to stop the spread. We have previously published a study focusing on the distribution of mobilized colistin resistance (mcr) genes using this data resource, showing how widely disseminated the genes were [17]. Another use of the data collection could be to explore how changes in microbial abundances affect and are affected by the resistome. Furthermore, our coverage statistics of reads aligned to ARGs could be used to investigate the rate of new variants occurring in different reservoirs. Although we have focused on the threat of antimicrobial resistance, another potential application of this resource is to look at the effects of, for example, climate change on microbial compositions. Our observed read fragment counts could also be linked with other types of genomic data, for example to evaluate the risk of ARG mobility, accessibility, and pathogenicity in assembled genomes [27,28] and to verify observations from clinical data [29].

We recommend that potential users consider all the confounders present in this data collection in their statistical tests and modeling workflows, emphasizing that the experimental methods and sequencing platforms dictate the obtained sequencing reads and that metadata for a sample can be mislabeled, despite our efforts to minimize these kinds of errors. Furthermore, it is essential to consider the compositional nature of microbiomes [30]. The number of reads does not depend on the distribution of genetic material in the sample but on the capacity of the sequencing platform [24,31]. Various statistical methods already exist that consider the compositionality [24,32,33]. Finally, it is important to highlight that the results we have presented here include fragment counts of 1 for the sake of transparency, but we also recommend that potential users consider appropriate filters in their analysis.
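As an illustration of the two recommendations above, the sketch below first drops references supported by very few fragments and then applies a centered log-ratio (CLR) transform to the remaining counts. Both choices are assumptions made for illustration: the threshold of 2 is arbitrary, and CLR is only one common compositional transform, not a method prescribed here. Gene names and counts are hypothetical.

```python
import math


def filter_low_counts(fragment_counts, min_count=2):
    """Drop references supported by fewer than min_count fragments."""
    return {ref: n for ref, n in fragment_counts.items() if n >= min_count}


def clr(fragment_counts):
    """Centered log-ratio: log count minus the mean log count of the sample."""
    if not fragment_counts:
        return {}
    logs = {ref: math.log(n) for ref, n in fragment_counts.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {ref: value - mean_log for ref, value in logs.items()}


# Hypothetical fragment counts for one sample.
sample = {"geneA": 1, "geneB": 8, "geneC": 2, "geneD": 4}

kept = filter_low_counts(sample)  # geneA is removed by the threshold
transformed = clr(kept)
print(kept)
print(transformed)
```

Filtering before the transform also removes the need for a pseudocount in this sketch, because every remaining count is strictly positive; CLR values sum to zero within a sample by construction.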

The sequencing data in public repositories has continued to grow, giving us plenty of opportunities to expand our data collection even further. To establish a truly global surveillance program of AMR, sequencing data should be analyzed as soon as it is published in these archives. Although this would require access to even more computational resources, we hope to achieve this soon and compare our approach with other methods, such as AMRFinderPlus [34] and CARD [35]. As new sequencing technologies become more widely used, the settings for our alignment procedure should also be tuned to better take advantage of, and be aware of the issues with, different sequencing platforms.

With this data resource, we have taken a step towards enabling the scientific community to utilize the wealth of information in these metagenomic samples to expand our understanding of the dissemination of antimicrobial resistance and changes in microbiomes at both local and global scales through time and environments.