Friday, March 23, 2018

Pan-genome of three genera

Here is another example of using Circos diagrams to visualize biological data. This time the subject is the bacterial pan-genome.

These days sequencing has become very cheap, and data on bacterial genomes is accumulating particularly fast. As of today, GenBank contains information on 131801 bacterial genomes, and thousands more are sequenced every day without being published. With so much information at hand, researchers have been able to look deeper into intra-species genome organisation. Having so many genomes lets you ask questions like: which part of the genome is common to a particular species? What combination of those common features makes a species?

Compared to animal genomes, bacterial genomes are more dynamic as a result of horizontal gene transfer (HGT) between individual cells. HGT often occurs between different strains of the same species, but it can sometimes go further and pass genes to other taxonomic groups (families and classes).

If we take, for example, a genus of bacteria (it doesn't matter which) and compare all the genomes we have, three kinds of regions emerge on the global scale. Some regions are unique, because they are usually found in only a few species of the genus. These unique regions are often the result of HGT and may show more similarity to species from other genera or even families. Other regions, called variable, are common to almost all species in the genus but prone to mutation. Finally, the third group of regions is specific to a particular taxonomic group; these regions are called the core genome. Together, the variable and core genome regions make up the pan-genome. Pan-genome regions create a signature, a typical genome for a particular species or genus: the genome that is shared among all members of a particular taxon.

In the outer circle, green shows the core regions in the pan-genomes of the three genera: those genes that are shared among more than 90% of all strains used in the analysis.
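To make the three categories concrete, here is a minimal Python sketch that labels gene clusters by how many strains carry them. Only the 90% core threshold comes from the analysis described here. The `presence` dictionary, the function name `classify_clusters` and the 10% cutoff for unique clusters are assumptions made for illustration, and so is the idea of defining variable regions by presence frequency alone; the real analysis may well have used different rules.

```python
def classify_clusters(presence, n_strains, core_cutoff=0.90, unique_cutoff=0.10):
    """Label each gene cluster as 'core', 'variable' or 'unique'.

    presence: dict mapping a cluster id to the set of strains carrying it.
    core_cutoff: clusters found in more than this fraction of strains are core
                 (the 90% value is the one quoted in the text).
    unique_cutoff: hypothetical cutoff for "found only in a few" genomes.
    """
    labels = {}
    for cluster, strains in presence.items():
        fraction = len(strains) / n_strains
        if fraction > core_cutoff:
            labels[cluster] = "core"
        elif fraction <= unique_cutoff:
            labels[cluster] = "unique"
        else:
            labels[cluster] = "variable"
    return labels


# Toy data: 20 made-up strains and three made-up gene clusters.
strains = [f"strain_{i}" for i in range(20)]
presence = {
    "geneA": set(strains),       # present in all 20 strains
    "geneB": set(strains[:12]),  # present in 12 of 20 strains
    "geneC": set(strains[:1]),   # present in a single strain
}
labels = classify_clusters(presence, n_strains=len(strains))
print(labels)
```

With this toy input, geneA comes out as core, geneB as variable and geneC as unique. Moving the cutoffs changes how much of the genome lands in each group, so they are worth reporting alongside any pan-genome figure.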

The second circle shows GC content (GC%), which simply means the percentage of G and C bases in a particular gene. This matters because the genomes that make up a genus usually have similar GC%. In this case, for example, Arcobacter usually has 27%, similar to Campylobacter. In the diagram the threshold is set to 28% GC, and everything above it is highlighted red (in the light blue area). You can see that Helicobacter has a different GC% from the other two. It doesn't mean something is wrong, it's just different.

Another important point of the GC% graph is that it lets you see foreign elements. If some genes or genome regions have a GC% significantly different from the rest of the genome, they most likely appeared there recently through HGT. If you look at the Helicobacter diagram, you can easily see that some regions of the GC% circle are not colored red. More interestingly, if you look at the pan-genome circle above those regions, you won't find green or grey stripes. These are the most likely candidates for unique regions.
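The GC% idea is simple enough to sketch in a few lines of Python. The example below computes GC% for a sequence and flags fixed-size windows whose GC% differs strongly from the genome-wide value. The window size, the `max_delta` tolerance, the function names and the toy sequence are all assumptions chosen for illustration. This is not the procedure used to draw the diagram, and real HGT detection relies on more than GC% alone.

```python
def gc_percent(seq):
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_windows(genome, window=100, max_delta=5.0):
    """List (start, end, gc) for non-overlapping windows whose GC% differs
    from the whole-genome GC% by more than max_delta percentage points."""
    overall = gc_percent(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, window):
        gc = gc_percent(genome[start:start + window])
        if abs(gc - overall) > max_delta:
            flagged.append((start, start + window, gc))
    return flagged


# Toy genome: a 30% GC background with one 80% GC insert in the middle.
background = "AATTAGCTGA" * 90   # 900 bases, 30% GC
insert = "GCGCGCATGC" * 10       # 100 bases, 80% GC
genome = background + insert + background

flagged = flag_atypical_windows(genome)
print(flagged)  # [(900, 1000, 80.0)]
```

Only the window covering the GC-rich insert is reported, which mirrors how a region with unusual GC% stands out in the second circle of the diagram.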


How virulence factors relate to the pan-genome

Figure caption: Connection and distribution of virulence factors between the pan-genomes of three genera, based on all available complete genome sequences. Circle 1 (outer) shows region similarity, ranging from 90-100% (dark green bars) to 80-90% and 70-80% (light green and gray bars). Circle 2 shows GC content, where the upper (light blue) and lower (light red) boundaries are set to 40% and 20%, respectively. Circle 3 shows a histogram of the distribution frequency of variable and core genes, where red bars indicate the number of strains sharing the genes in each cluster and blue bars represent the heterogeneity in the number of strains for that cluster. Circle 4 shows virulence, antibiotic resistance and toxin genes identified in the pan-genome of each genus, labelled with GenBank identifiers (GIs) from the Virulence Factor Database (VFDB) (black), the Comprehensive Antibiotic Resistance Database (CARD) (green) and the Toxin-Antitoxin Database (TADB) (blue). IDs shown in red are found in the pan-genomes of all three genera; the lines in the center connect them and show homologous virulence factor (blue), antibiotic resistance (green) and toxin (purple) genes.