Molecular Epidemiology of Spike Protein Sequences in 2019-nCoV: Origin Still Uncertain and Transparency Needed
Here are the stats for readership of this blog for the last two weeks. Nearly 200,000 hits.
OUR INITIAL ASSESSMENT that the available 2019-nCoV sequences from human samples contained an inserted stretch of nucleotide sequence, upstream from the canonical position of the Spike (or Crown) protein sequence, that was similar to pShuttle-SN has been under useful and productive scrutiny since we first published that we, unlike other labs, were in fact able to find a match between the “middle fragment” and sequences in non-viridae databases. The match to pShuttle-SN vector technology, which led to the assessment that perhaps the sequence was the product of an attempt to modify a bat coronavirus in the lab, has raised controversy, but please note that it was not the only evidence of interest. We know of viruses to which the SARS protein gene sequence has in fact been added to study the transmission of SARS virus; it has also been added to adenovirus to create a hoped-for vaccine. So it is not beyond reason to consider whether the virus currently estimated to be infecting >200,000 people in China might be a product of laboratory manipulation. The reporting of the odd out-of-place sequence in the study that proposed recombination was also important, and the divergence of the nCoV Spike protein compared to the rest of nCoV and the bat coronaviruses was also compelling.
The specific mechanism by which those features could have arisen is unclear. They could also have been due to unwitting recombination between a SARS virus being studied in a lab and bat coronaviruses in animals that the same lab was also studying or housing, or to recombination in a human infected with both. The scientific community ruled out the possibility of natural recombination in the wild, whereas I preferred to leave a 5% chance that it might have been caused by a recombination event in the wild. Importantly, I still have not ruled that out.
The official Chinese position, as stated by Dr. Shi, is that the viruses are too different in comparison to other bat coronaviruses across the genome, with random, non-patterned changes, and that there are no endonuclease sites in bat coronaviruses and thus pShuttle-SN or other endonuclease technologies could not have been used, supporting recombination in the wild. The latter statement is demonstrably incorrect: there are many endonuclease sites in bat coronavirus sequences, as determined using the bat coronavirus most similar to the sequence clade in question (trees below).
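For readers who want to check the presence of endonuclease recognition sites for themselves, here is a minimal sketch of a site scan. The enzyme list is a small illustrative selection of well-known recognition sequences, not the set examined for this post, and the function name `find_sites` and the toy sequence are hypothetical. Because coronaviruses are RNA viruses, the sketch treats the input as the cDNA equivalent (U read as T).

```python
# Recognition sequences for a few common restriction endonucleases.
# Illustrative assumption: this short list is NOT the set used in the post.
# All five motifs are palindromic, so scanning one strand is sufficient.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "XhoI": "CTCGAG",
    "NotI": "GCGGCCGC",
}


def find_sites(sequence, sites=RECOGNITION_SITES):
    """Return {enzyme: [0-based start positions]} for every motif that occurs."""
    seq = sequence.upper().replace("U", "T")  # accept RNA-style input as well
    hits = {}
    for enzyme, motif in sites.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start)
            start = seq.find(motif, start + 1)  # step by 1 to allow overlaps
        if positions:
            hits[enzyme] = positions
    return hits


# Toy sequence (hypothetical, not a real coronavirus fragment).
toy = "ATGGAATTCCTAGGATCCTTTGAATTCAAGCTTAA"
print(find_sites(toy))
```

Running the same scan over a downloaded bat coronavirus sequence would show which of these motifs occur and where; a fuller analysis would use a complete enzyme catalogue rather than five entries.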
Dr. Shi is correct that there are scattered differences, but evolution need not pattern anything in an orderly manner in RNA viruses, which are fast evolving, and we do not yet know whether a recombination event might have occurred or been used, or whether any such recombination occurred outside the Spike protein coding sequence, within it, or perhaps both. Evolution does not care to “pattern” things for human consumption; evolution brings forth viruses that work to create more viruses. When recombination is suspected, artificial or natural, we study evolution at the sequence level best through the inheritance patterns of motifs; overall rates are, as we have seen and will see below, frustratingly limiting.
Before we look at the evolutionary trees, I want to stress that I have published, and will repeat, that given the mass casualties in China and the prospect of such events around the world, keeping the possibility of one or more recombination events, or even a laboratory origin of nCoV2019, on the table is important specifically and exclusively for scientific and humanitarian purposes. As human societies often do, people will want to rush to point fingers of blame, and my position holds even if it’s a vaccine type gone wrong, or even a bioweapon that backfired. Let’s not be hypocritical; the US and many other countries have of course been studying the SARS spike protein for vaccines and have of course been conducting research on bioweapons. That’s no longer the point. The point, and the only point that matters, is that we have a massive humanitarian crisis in China and therefore (1) any available data on the pathophysiology of this virus, man-made in part or in toto or not, must be brought forward, (2) China needs aid to bring R0 down below 1.0, and (3) the rest of the world needs to act now to stop the spread of the virus through behavioral changes, including a routine mass effort to sanitize common surfaces and social self-isolation (don’t touch other people or your face in public). The world cares, and should care, most about putting the fire out, not who or what started it.
However, Science might help provide clues for treatments and perhaps for rapid diagnosis. As promised to many who have contacted me, we have completed a more thorough (but unfortunately not exhaustive) analysis of Spike protein sequences using the sequences that are at hand. A note of caution: some of the sequences may differ from the ones we analyzed due to what we were informed were “database update errors”, whatever those include. We were contacted by a person who evidently knew about a “natural” bat coronavirus isolated in July 2013 at the Institute of Virology in Wuhan, China, and who pointed us to a sequence that is available in NCBI’s Nucleotide database but was uploaded only in January 2020. We do not know when the sequencing was done, whether it was a frozen sample just recently sequenced, or whether it was isolated from laboratory-propagated viral lines in human cell lines. Oddly, the original gap that we found, which was reported in the peer-reviewed study that pointed to potential recombination in snakes, and which a second, independent peer-reviewed study also found and could not match (and called a “middle fragment”), can no longer be found using the same accession numbers. I am not certain who curates NCBI’s databases at this time, but NCBI should have a record of any un-annotated updates, and I leave it to them to sort this issue out.
To help elucidate possible relationships among the available Spike proteins, based on current sequences, including the ins1378 segment, I present two phylogenetic trees, derived at https://mafft.cbrc.jp/ and rendered using phylo.io. The tree-generating algorithm was Neighbor Joining (NJ), invented by my postdoctoral mentor Dr. Masatoshi Nei, and Dr. Naruya Saitou in 1987. The tree was estimated using all variable positions, and raw differences. The Jukes-Cantor model was not used because it overweights nucleotide substitutions that might be more frequent and underweights nucleotide substitutions that might be less frequent during RNA virus evolution. N=1,000 bootstrap iterations were used to assess the confidence of the placement. Caveat: any within-sequence recombination is masked by the assumption that the process is tree-like; gappy areas were retained and not force-aligned. The full alignment is available here.
Size = 23 sequences × 1087 sites
Method = Neighbor-Joining
Distance = Raw difference
Bootstrap resampling = 1000
Alignment id = .200206212318090CHjzDGoNBMcL1glmdiZ7Plsfnormal
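To make the settings above concrete, the following is a minimal sketch of the same kind of analysis: raw-difference (p) distances, a neighbor-joining topology, and column-resampling bootstrap support. It is not the MAFFT server's implementation. The function names, the nested-tuple tree representation, the choice to skip gapped columns pairwise, and the four toy sequences are all assumptions made for illustration, and branch lengths are omitted.

```python
import random


def p_distance(a, b):
    """Raw proportion of differing sites, skipping columns gapped in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs) if pairs else 0.0


def distance_matrix(seqs):
    """Return (names, {(name_i, name_j): p-distance}) for a dict of aligned sequences."""
    names = list(seqs)
    return names, {(i, j): p_distance(seqs[i], seqs[j]) for i in names for j in names}


def neighbor_joining(names, dist):
    """Build an unrooted topology as nested tuples (branch lengths omitted)."""
    nodes, d = list(names), dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[i, k] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        new = (i, j)
        for k in nodes:
            if k not in (i, j):
                d[new, k] = d[k, new] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        d[new, new] = 0.0
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)


def splits(tree, all_names):
    """Non-trivial bipartitions of the tree, each stored as the side without all_names[0]."""
    ref, found = all_names[0], set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        side = leaves if ref not in leaves else frozenset(all_names) - leaves
        if 1 < len(side) < len(all_names) - 1:
            found.add(side)
        return leaves

    walk(tree)
    return found


def bootstrap_support(seqs, replicates=1000, seed=0):
    """Percent of column-resampled replicates that recover each split of the full-data tree."""
    rng = random.Random(seed)
    names = list(seqs)
    length = len(next(iter(seqs.values())))
    counts = dict.fromkeys(splits(neighbor_joining(*distance_matrix(seqs)), names), 0)
    for _ in range(replicates):
        cols = [rng.randrange(length) for _ in range(length)]  # sample with replacement
        resampled = {n: "".join(s[c] for c in cols) for n, s in seqs.items()}
        for s in splits(neighbor_joining(*distance_matrix(resampled)), names):
            if s in counts:
                counts[s] += 1
    return {s: 100 * c / replicates for s, c in counts.items()}


# Toy alignment (hypothetical sequences, for illustration only).
toy = {"A": "AAAAAAAAAA", "B": "AAAAAAAAAT", "C": "GGGGGAAAAA", "D": "GGGGGAAAAC"}
tree = neighbor_joining(*distance_matrix(toy))
print(tree)
print(bootstrap_support(toy, replicates=100))
```

The bootstrap here resamples alignment columns, so it measures how consistently the sites support a grouping under a tree-like model; it does not test whether the tree-like assumption itself holds.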
Tree 1. Bootstrap values.
Tree 2. Same estimated tree with branch lengths.
Clearly we see that the Spike protein from the 2019-nCoV human sequence is most similar to the sequence isolated from bat feces at the Wuhan Institute of Virology in 2013 and deposited in January 2020. The next closest sequence is from the Institute of Military Medicine, Nanjing Command.
Looking at the raw distances, compared to those for the overall genome, the spike protein appears to be more evolutionarily labile; that is, there are more variable sites and the evolutionary distance is greater in the Spike protein-encoding sequences. The great distance between the Wuhan sequences and the other bat-like coronaviruses is distinct for the Spike protein, and contrasts with other published results. There are plenty of variable sites to have moderate confidence in this result (BSV = 82). However, compared to most bootstrap values published for coronavirus sequences, and to most of those in this tree, the value of 82 points to some signal other than inherited variation that covaries well with the rest of the inherited variation, just as in the original analysis with the low bootstrap value. (The higher bootstrap value here, compared to the full genomic analysis that placed 2019-nCoV more within the bat coronaviruses albeit with lower bootstrap values, is likely due to a number of factors, including the use of a Jukes-Cantor constraint on the model of evolution in the original analysis.)
The data do not support a 1:1 relationship between pShuttle-SN (as published in 2005) and the SARS-like spike protein in 2019-nCoV, and I never posited that relationship. I merely pointed out that it was similar, when no one else could match the middle fragment to anything. But pShuttle-SN has ALSO been evolving in the lab, no doubt, and I would like to see a newly deposited sequence in NCBI’s Nucleotide database. Other vector tech has no doubt been used by other labs working with the SARS spike protein.
Parsimony (Occam’s Razor) would, with the existing sequences, tend to lead to the conclusion that the Spike protein is there simply because bat coronaviruses have Spike proteins. Does that mere fact lead to the assumption that a bat coronavirus never underwent recombination in nature or in the lab, whether experimentally or accidentally? No, it does not. The oddities in the behavior of the sequence data “updates” deserve further scrutiny. Why would two peer-reviewed publications, one from China, mention a middle fragment that could not be aligned? These questions require transparency.
Nevertheless, the Spike protein relationships still seem to tell a different story about the relationship of 2019-nCoV and related coronaviruses in Wuhan than published full genomic analyses do, and that difference might be very important. They stand out as different, distinct. Some important limitations here are that (1) all trees are estimates, not observations, and should not be used as “proof” of anything (science does not deal with “proof”); (2) the method assumes a tree-like relationship and cannot rule out recombination origins; and (3) the method is based on limited data in terms of samples (taxon sampling). So, as always, more data may clarify.
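Because a single tree cannot reveal a mosaic origin, one simple complement is a sliding-window similarity scan: compare a query sequence to candidate relatives window by window and look for places where the closest relative switches. The sketch below is only an illustration of that idea, assuming pre-aligned sequences of equal length; the function names, window sizes, and toy sequences are hypothetical, and a switch in the closest relative is a hint worth following up, not a demonstration of recombination.

```python
def identity(a, b):
    """Fraction of identical positions, ignoring columns gapped in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0


def window_scan(query, references, window=200, step=50):
    """Yield (start, {reference_name: identity}) for successive alignment windows."""
    for start in range(0, len(query) - window + 1, step):
        end = start + window
        yield start, {name: identity(query[start:end], seq[start:end])
                      for name, seq in references.items()}


def closest_by_window(query, references, window=200, step=50):
    """Return [(start, closest_reference)] so switches along the alignment stand out."""
    return [(start, max(scores, key=scores.get))
            for start, scores in window_scan(query, references, window, step)]


# Toy example (hypothetical sequences): the query matches P1 in its first half
# and P2 in its second half, so the closest reference switches at position 20.
refs = {"P1": "A" * 20 + "C" * 20, "P2": "G" * 20 + "T" * 20}
query = "A" * 20 + "T" * 20
print(closest_by_window(query, refs, window=10, step=10))
```

On real data the window and step sizes matter a great deal, and regions with few variable sites will give noisy answers, so any apparent switch should be checked with dedicated recombination-detection methods.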
Calls to Action for Scientists
I strongly encourage those who can to post their own analyses of the fasta file, or their own alignments, etc., in the comments, especially if they are more relevant to the questions of recombination. Please understand if we cannot comment on each and every post given the flurry of activities ongoing at IPAK about this and other pressing issues.
- Detailed sequence-level analyses are needed to determine if there has been recombination or other editing of these sequences both inside and outside of the Spike protein region. Are the numbers of synonymous and non-synonymous substitutions what would be expected under a natural inheritance model? Is there any test that could be done to detect very important changes that might be adaptive to 2019-nCoV?
- Analyses capable of detecting recombination, or ruling it out, should be applied and published ASAP. Feel free to post links to any such analyses in the comments.
- Also, please post your own interpretation and comments, and reference other information as may be relevant.
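As a starting point for the synonymous versus non-synonymous question in the list above, here is a minimal sketch that classifies codon differences between two aligned, in-frame coding sequences using the standard genetic code. It only tallies raw codon-level differences; it is not a dN/dS estimate, which would also require counting synonymous and non-synonymous sites and correcting for multiple hits. The function name and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_changes(seq_a, seq_b):
    """Count synonymous, nonsynonymous, and unscored codon differences.

    Assumes both sequences are aligned and start in the same reading frame.
    Codons containing gaps or ambiguous bases are tallied as 'unscored'.
    """
    counts = {"synonymous": 0, "nonsynonymous": 0, "unscored": 0}
    usable = min(len(seq_a), len(seq_b)) // 3 * 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        a = seq_a[i:i + 3].upper().replace("U", "T")
        b = seq_b[i:i + 3].upper().replace("U", "T")
        if a == b:
            continue
        if a not in CODON_TABLE or b not in CODON_TABLE:
            counts["unscored"] += 1
        elif CODON_TABLE[a] == CODON_TABLE[b]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts


# Toy example (hypothetical sequences): TTT->TTC keeps Phe, GGA->GAA changes Gly to Glu,
# and the gapped final codon cannot be scored.
print(classify_codon_changes("ATGTTTGGA---", "ATGTTCGAAAAA"))
```

Comparing such tallies between the Spike region and the rest of the genome, across several pairs of sequences, would give a rough first look before running a proper codon-model analysis.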
All deposited sequences will continue to be analyzed as we monitor the situation.
- Katoh, Rozewicki, Yamada 2019 (Briefings in Bioinformatics 20:1160-1166) MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization
- Kuraku, Zmasek, Nishimura, Katoh 2013 (Nucleic Acids Research 41:W22-W28)
- Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. Jul;4(4):406-25. https://www.ncbi.nlm.nih.gov/pubmed/3447015/