Saturday, May 11, 2013

DNA G+C Content and Survival Value

One of biology's big open questions is why organisms differ so much with regard to the relative amounts of GC and AT in their DNA. You'd think that if there are only two kinds of DNA base pairs (see diagram) they'd be more-or-less equally abundant. Not so. There are organisms with DNA that's mostly GC (and/or CG) pairs; there are organisms with very-AT-rich DNA; and within the chromosomes of higher organisms you find large GC-rich regions (isochores) in the midst of great swaths of AT-rich DNA.
DNA contains adenine and thymine in equal amounts, and guanine and cytosine in equal amounts, but it does not usually contain GC pairs and AT pairs in equal amounts. And it doesn't seem as if there is an "optimum" GC:AT ratio. The GC:AT ratio varies by species. Within a species, it's constant.

There are two really odd facts at work here:

1. The GC content of DNA varies by species, and it varies a lot.

2. Evolution doesn't seem to trend toward an "optimum GC:AT ratio" of any kind.

If there were such a thing as an optimum GC:AT ratio for DNA, surely microorganisms would've figured it out by now. Instead, we find huge diversity: There are bacteria at every point on the GC% spectrum, running from 16% GC for the DNA of Candidatus Carsonella ruddii (a symbiont of the jumping plant louse) to 75% for Anaeromyxobacter dehalogenans 2CP-C (a soil bacterium). At each end of the spectrum you find aerobes and anaerobes; extremophiles and blandophiles; pathogens and non-pathogens. About the only generalization you can make is that the smaller an organism's genome is, the more likely it is to be rich in A+T (low GC%).

Genome size correlates loosely with GC content. The very smallest bacteria tend to have AT-rich (low GC%) DNA.

The huge diversity in GC:AT ratios among bacteria is impressive. But does it simply represent a random walk all over the possibility-space of DNA? Or do the various points on the spectrum constitute special niches with important advantages? What advantage could there be for having high-GC% DNA? Or high-AT% DNA?

Some subtle clues tell us that this is not just random deviation from the mean. First, suppose we agree for the sake of argument that lateral gene transfer (LGT) is common in the microbial world (a point of view I happen to hold). Over the course of millions of years, with pieces of DNA of all kinds (high GC%, low GC%) flying back and forth, LGT should force a regression to the mean: It should make genomes tend toward a 50-50 GC:AT ratio. That clearly hasn't happened.

And then there are ordinary mutational pressures. It's beginning to be fairly well accepted (see Hershberg and Petrov, "Evidence That Mutation Is Universally Biased Toward AT in Bacteria," PLoS Genetics, 2010, 6:9, e1001115, full version here) that natural mutation is strongly biased in the direction of AT, by virtue of the fact that deamination of cytosine and methylcytosine (which occurs spontaneously at high frequency) leads to replacement of 'C' with 'T', hence to GC pairs becoming AT pairs. This strong natural mutational bias toward AT says that all DNA should creep in the direction of low GC% and end up well below 50% GC. But again, this is not what we see. We see that high-GC organisms like Anaeromyxobacter (and many others) maintain their DNA's unusually high (75%) GC content across millions of generations. Even middle-of-the-road organisms like E. coli (with 50% GC content) don't slowly slip in the direction of high-AT/low-GC.

Clearly, something funny is going on. For a super-high-GC organism like Anaeromyxobacter to maintain its DNA's super-high GC content against the constant tug of mutations in the AT direction, it must be putting significant energy into maintaining that high GC percentage. But why? Why pay extra to maintain a high GC%? And how does the cost get paid?

I think I've come up with a possible answer. It has to do with DNA replication cost, where "cost" is figured in terms of time needed to synthesize a new copy of the DNA (for cell division). Anything that favors low replication cost (high replication speed) should favor survival; that's my main assumption.

My other assumption is that DNA polymerases (the enzymes involved in replication) are not clairvoyant. They can't know, until the need arises, which of the four deoxyribonucleoside triphosphates (dATP, dTTP, dGTP, dCTP) will be needed at a given moment to elongate the new strand of DNA. When the need arises for (let's say) an 'A', the 'A' (in the form of dATP) has to come from an existing endogenous pool of dNTPs containing all four bases (dATP, dTTP, dGTP, dCTP) in whatever concentrations they're in. The enzyme has to wait until a dATP (if that's what's needed) randomly happens to lock into the active site. Odds are only one in four (assuming equal concentrations of dNTPs) of a dATP coming along at exactly the right moment. Odds are 3 out of 4 that some incorrect dNTP (either dGTP, dTTP, or dCTP) will try, and fail, to fit the active site first, before dATP comes along.

But imagine that your DNA is 75% G+C. And suppose you've regulated your intracellular metabolism to maintain dGTP and dCTP in a 3:1 ratio over dATP and dTTP. The odds of a good random "first hit" go up.
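To put a number on that intuition, here's a quick back-of-the-envelope calculation (mine, not part of the simulation described below; the function name is made up for illustration). The chance that the very first fetch is correct is the sum, over the four bases, of (frequency of that base in the template) times (frequency of that base in the pool). For a 75%-GC genome keeping its pool at the same 3:1 skew, the first-hit odds rise from 1 in 4 to 5 in 16:

```javascript
// First-fetch success probability, assuming G and C are equally abundant
// (and likewise A and T) in both the template and the pool.
// templateGC and poolGC are fractions, e.g. 0.75 for 75% G+C.
function firstHitProbability(templateGC, poolGC) {
    var needG = templateGC / 2, needA = (1 - templateGC) / 2;  // template frequencies (G=C, A=T)
    var haveG = poolGC / 2,     haveA = (1 - poolGC) / 2;      // pool frequencies (G=C, A=T)
    return 2 * needG * haveG + 2 * needA * haveA;
}

firstHitProbability(0.50, 0.50);  // 0.25   -- one chance in four
firstHitProbability(0.75, 0.75);  // 0.3125 -- 75%-GC genome, 3:1 GC-skewed pool
```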

To explore the various possibilities, I wrote software (in JavaScript) that simulates DNA replication, where the template DNA molecule is 1000 base pairs in length and the dNTP pool size is 10,000 bases. The software lets you set the organism's genome GC% to whatever you want, and likewise set the dNTP pool's relative GC percentage. The template DNA is just a random string of A, T, G, and C bases (1000 total), reflecting their relative abundances as set in the GC% parameter. The pool of dNTPs is set up as a randomized array (again reflecting abundances set by a GC% parameter).

The way the software works is this. Read a base off the template. Fetch a base at random from the pool. If the fetched base happens to be the one (out of four) that's called for, count that fetch toward the timing parameter and move on to the next template base. If it's not the one called for, put it back in the pool array at a random location, then fetch another base at random, counting that fetch too. (Every fetch, successful or not, adds 1 to the timing parameter, which keeps track of the total number of fetch attempts.) Keep fetching, and throwing back mismatches, until the proper base comes up. When the correct base turns up, it's incorporated, so the pool shrinks by one base. In other words, replication consumes the pool, but as I said earlier, the pool contains ten times as many bases (to start) as the DNA template, so the pool ends up only 10% smaller at the end of replication.
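The original source code isn't reproduced in this post, but a minimal sketch of the procedure just described might look something like this (function and variable names are my own; the actual program may differ in its details):

```javascript
// Build a random sequence of n bases with the given GC fraction
// (G and C equally likely; likewise A and T).
function randomSequence(n, gcFraction) {
    var bases = [];
    for (var i = 0; i < n; i++) {
        if (Math.random() < gcFraction)
            bases.push(Math.random() < 0.5 ? 'G' : 'C');
        else
            bases.push(Math.random() < 0.5 ? 'A' : 'T');
    }
    return bases;
}

// Replicate one template against a pool of free bases.
// Returns the total number of fetch attempts (the timing parameter).
function replicate(template, pool) {
    var fetches = 0;
    for (var i = 0; i < template.length; i++) {
        var needed = template[i];
        while (true) {
            var j = Math.floor(Math.random() * pool.length);
            var candidate = pool[j];
            fetches++;                      // every fetch costs one time unit
            if (candidate === needed) {
                pool.splice(j, 1);          // incorporate it; the pool shrinks by one
                break;
            }
            // Wrong base: take it out and put it back at a random location.
            pool.splice(j, 1);
            var k = Math.floor(Math.random() * (pool.length + 1));
            pool.splice(k, 0, candidate);
        }
    }
    return fetches;
}

// One replication run: a 1000-bp template at 75% GC, a 10,000-base pool at 65% GC.
var template = randomSequence(1000, 0.75);
var pool = randomSequence(10000, 0.65);
var fetchesPerBase = replicate(template, pool) / template.length;
```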

Each point on this graph represents the average of 100 Monte Carlo runs, each run representing complete replication of a 1000-bp DNA template, drawing from a pool of 10,000 bases. The blue points are runs that used a DNA template containing 25% G+C content. The red points are runs that used DNA with 75% G+C. The X-axis represents different base-pool compositions. See text for details.

I ran Monte Carlo simulations for DNA templates having GC contents of 75%, 50%, and 25%, using base pools set up to have anywhere from 15% GC to 85% (in 2.5% increments). The results for the 75% GC and 25% GC templates (representing high- and low-GC organisms) are shown in the above graph. Each point on the graph represents the average of 100 complete replication runs. The Y-axis shows the average number of fetches per DNA base (so, a low value means fast replication; a high value means slower DNA replication). The X-axis shows the percentage of GC in the base-pool, in recognition of the fact that relative dNTP abundances in an organism may vary, in accordance with environmental constraints as well as with organism-specific homeostatic setpoints.
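For reference, the sweep behind these graphs can be approximated with a small driver on top of the randomSequence() and replicate() sketches above (again, my reconstruction rather than the author's original code):

```javascript
// Sweep the base-pool GC% from 15% to 85% in 2.5% steps, averaging 100
// complete replications (1000-bp template, 10,000-base pool) at each point.
function sweep(templateGC) {
    var points = [];
    for (var poolGC = 0.15; poolGC <= 0.85 + 1e-9; poolGC += 0.025) {
        var runs = 100, total = 0;
        for (var r = 0; r < runs; r++) {
            var template = randomSequence(1000, templateGC);
            var pool = randomSequence(10000, poolGC);
            total += replicate(template, pool) / template.length;
        }
        points.push({ poolGC: poolGC, avgFetchesPerBase: total / runs });
    }
    return points;
}

var lowGC  = sweep(0.25);  // blue points
var highGC = sweep(0.75);  // red points
```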

Maximal replication speed (the low point of each curve) happens at a base-pool GC percentage that is displaced in the direction of the DNA's own GC%. So, for the 25%-GC organism (blue data points), max replication efficiency comes when the base-pool is about 33% GC. For the 75% GC organism (red points) the sweet spot is at a base-pool GC concentration of 65%. (Why this is not exactly symmetrical with the other curve, I don't know; but bear in mind, these are Monte Carlo runs. Some variation is to be expected.)

The interesting thing to note is that max replication efficiency, for each organism, comes at 3.73 fetches per base (Y-axis). Cache that thought. It'll be important in a minute.

The real jaw-dropper is what happens when you plot a curve for template DNA with 50% GC content. In the graph below, I've shown the 50%-GC runs as black points. (The red and blue points are exactly as before.)

This is the same graph as before, but with replication data for a 50%-GC genome (black points). Again, each data point represents the average of 100 Monte Carlo runs. Notice that the black curve bottoms out at a higher level (4.0) than the red or blue curves (3.73). This means replication is less efficient for the 50%-GC genome.

Notice that the best replication efficiency comes in the middle of the graph (no big surprise), but check the Y-value: 4.00. The very fastest DNA replication, when the DNA template is 50% GC, requires 4 fetches per base, compared to a best-case base-fetching efficiency of 3.73 for the 25%-GC and 75%-GC DNAs. What does this mean? It means DNA replication, in a best-case scenario, is 6.75% more efficient for the skewed-GC organisms. (3.73 is 6.75% less than 4.00.)
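As a rough sanity check on those two numbers (my own back-of-the-envelope math, not part of the simulation): if the pool is treated as large enough that each fetch is effectively a draw with replacement, the expected number of fetches per base has a simple closed form, and it reproduces both the 4.00 and the 3.73 figures, as well as sweet spots near 63% and 37% pool GC (close to the 65% and 33% observed in the Monte Carlo runs):

```javascript
// With-replacement approximation: the wait for the needed base is geometric,
// so the expected number of fetches is 1 / (pool frequency of that base).
// Averaging over the template composition (t = template GC fraction,
// g = pool GC fraction, with G = C and A = T in both):
function expectedFetchesPerBase(t, g) {
    return 2 * t / g + 2 * (1 - t) / (1 - g);
}

expectedFetchesPerBase(0.50, 0.500);  // 4.00
expectedFetchesPerBase(0.75, 0.634);  // ~3.73 (best case for a 75%-GC template)
expectedFetchesPerBase(0.25, 0.366);  // ~3.73 (best case for a 25%-GC template)

// The minimizing pool GC works out to sqrt(t) / (sqrt(t) + sqrt(1 - t)):
// about 0.634 for t = 0.75 and 0.366 for t = 0.25.
```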

This goes a long way toward explaining why GC extremism is stable in organisms that pursue it. There is replication efficiency to be had in keeping your DNA biased toward high or low GC. (It doesn't seem to matter which.)

Consider the dynamics of an ATP drawdown. The energy economy of a cell revolves around ATP, which is both an energy molecule and a source of the adenine that goes into DNA and RNA. One would expect normal endogenous concentrations of ATP to be high relative to the other NTPs. For a low-GC% organism, that's also a near-ideal situation for DNA replication, because high AT in the base pool puts you near the max-replication-speed part of the curve (see blue points). A sudden drawdown in ATP (when the cell is in crisis) shifts the operating point to the right-hand part of the blue curve, slowing replication significantly. This is what you want if you're an intracellular symbiont (or a mitochondrion, incidentally). You want to stop dividing when the host cell is unable to divide because of an energy crisis.

Consider the high-GC organism (red dots), on the other hand. If ATP levels are high during normal metabolism, replication is not as efficient as it could be, but so what? It just means you're willing to tolerate less-efficient replication in good times. But as ATP draws down (perhaps because nutrients are becoming scarce), DNA replication actually becomes more efficient. This is what you want if you're a free-living organism in the wild. You want to be able to continue replicating your DNA even as ATP becomes scarce. And indeed that's what happens (according to the red data points): As the base pool becomes more GC-rich, replication efficiency increases. The best efficiency comes when base-pool A+T is down around 35%.

I think these simulations are meaningful and I think they help explain the DNA-composition extremism seen among microorganisms. If you're a professional scientist and you find these results tantalizing, and you'd like to co-author a paper for PLoS Genetics (or another journal), please get in touch. (My Google mail is kas-dot-e-dot-thomas.) I'd like to coauthor with someone who is good with statistics, who can contribute more ideas to this line of investigation. I think these results are worth sharing with the scientific community at large.