Pradhan et al. did a pairwise alignment of the spike protein of SARS2
against SARS1, and they identified 4 regions where SARS2 had an insert
relative to SARS1: [https://
When Pradhan et al. did BLAST searches for the inserts and the surrounding amino acids, they found that for example the last 6 aa of the first 7-aa insert had a perfect match to an HIV sequence from Thailand, and the last 2 aa of the second insert along with the next 4 aa had a perfect match to an HIV sequence from Kenya:
I tried searching for the first 7-aa insert in protein BLAST:
https://GTNGTKR to the text field at the top, I set the
database to refseq_protein, I set the organism as "Viruses (taxid:10239)",
and I clicked BLAST.
There were many different bacteriopages and a few non-phage viruses which matched 6 out of 7 aa so that one aa at either end was clipped by the local alignment, so that there was 100% identity but 85% query coverage:
HIV-1 was not included in the results above, because I searched in a database which only contained refseq proteins. The standard reference genome of HIV-1 is called HXB2, and its envelope protein matches only 3 out of 7 aa of Pradhan's first insert:
$curl ' https:// eutils. ncbi. nlm. nih. gov/ entrez/ eutils/ efetch. fcgi? db=nuccore& rettype=fasta_ cds_ aa& id=NC_ 001802' > hiv. aa $ seqkit locate -p GTNGTKR -m3 <( seqkit grep -nrp gene=env hiv. aa)| cut -f3, 5-| column -t # no matches for max mismatch 3 pattern start end matched $seqkit locate -p GTNGTKR -m4 <( seqkit grep -nrp gene=env hiv. aa)| cut -f3, 5-| column -t # 7 matches for max mismatch 4 pattern start end matched GTNGTKR 232 238 TFNGTGP GTNGTKR 300 306 NNNTRKR GTNGTKR 321 327 GKIGNMR GTNGTKR 354 360 GNNKTII GTNGTKR 404 410 GSNNTEG GTNGTKR 495 501 GVAPTKA GTNGTKR 822 828 VAEGTDR
Next I tried doing another BLAST search for Pradhan's first insert, but I set the "Non-redundant protein sequences (nr)" database and I set the organism to HIV-1. There were 3 different spots of the envelope protein which all matched 6/7 aa of the insert so that one aa at either end was cut off:
There's more than a million sequences of HIV-1 in GenBank:
https://
The env protein of HIV is divided into the gp120 and gp41
polyproteins, but Pradhan et al.'s first insert matched the gp120 part.
I downloaded 125,046 gp120 sequences from the HIV sequence database of
the Los Alamos National Laboratory:
https://GTNGTKR insert, but there
were 15 sequences that matched the first 6 aa and there were 12
sequences which matched the last 6 aa, and the matches were at 3
different spots:
$seqkit stat HIV- 1_ env. fasta| column -t file format type num_ seqs sum_ len min_ len avg_ len max_ len HIV- 1_ env. fasta FASTA Protein 125, 046 105, 184, 828 479 841. 2 1, 469 $ seqkit locate -p GTNGTK HIV- 1_ env. fasta| cut -f1, 5- 7| column -t seqID start end matched C. IN.-. VB49. EF694035 135 140 GTNGTK C. ZA. 2009. CAP177_ 39mo_ 30. KC833437 452 457 GTNGTK C. ZA. 2009. CAP177_ 4260_ 186wpi_ plasma_ 12. MK205618 452 457 GTNGTK C. ZA. 2009. CAP177_ 4260_ 186wpi_ cvl_ 73. MK205633 452 457 GTNGTK B. US. 1990. 1580m. 11a. MT861903 136 141 GTNGTK B. US. 1990. 1580m. 13s. MT861906 136 141 GTNGTK B. US. 1990. 1580m. 15. MT861908 136 141 GTNGTK B. US. 1990. 1580m. 16a. MT861910 136 141 GTNGTK B. US. 1990. 1580m. 17. MT861911 136 141 GTNGTK B. US. 1990. 1580m. 18. MT861913 136 141 GTNGTK B. US. 1990. 1580m. 22. MT861919 136 141 GTNGTK B. US. 1990. 1580m. 3a. MT861922 136 141 GTNGTK B. US. 1990. 1580m. 5. MT861925 136 141 GTNGTK B. US. 1990. 1580m. 7. MT861927 136 141 GTNGTK A1CD. TZ. 2012. 30196v23_ env12. OM825465 460 465 GTNGTK $ seqkit locate -p TNGTKR HIV- 1_ env. fasta| cut -f1, 5- 7| column -t seqID start end matched 01_ AE. TH. 2006. AA042a12R. JX447199 364 369 TNGTKR 01_ AE. TH. 2006. AA042a13R. JX447200 399 404 TNGTKR 01_ AE. TH. 2006. AA042a09R. JX447201 398 403 TNGTKR 01_ AE. TH. 2006. AA042a08R. JX447202 398 403 TNGTKR 01_ AE. TH. 2006. AA042a10R. JX447203 404 409 TNGTKR 01_ AE. TH. 2006. AA042a11R. JX447204 404 409 TNGTKR 01_ AE. TH. 2006. AA042a02R. JX447205 398 403 TNGTKR 01_ AE. TH. 2006. AA042a05R. JX447206 398 403 TNGTKR 01_ AE. TH. 2006. AA042a07R. JX447207 398 403 TNGTKR 01_ AE. TH. 2006. AA042a03R. JX447208 399 404 TNGTKR 01_ AE. TH. 2006. AA042a06R. JX447209 404 409 TNGTKR 01_ AE. TH. 2007. AA042d_ ENV24. MZ347079 408 413 TNGTKR
So again it's not that unusual that 29 out of over a 100,000 sequences would happen to match 6 out of 7 aa. The reason why all four of Pradhan's inserts happened to match HIV and not some other virus might simply be that BLAST has a very large number of HIV sequences relative to other viruses.
Pradhan's second so-called insert matches the last 2 aa of the
YYHK insert here and the next 4 aa:
But there's a total of 7 possible 6 aa segments which include at least 2 aa of the 4 aa insert, so again it's not that surprising that one of them would happen to match one out of hundreds of thousands or millions of HIV sequences at BLAST. The gp120 sequences I downloaded had a perfect match for only one out of the 7 segments:
$x=FLGVYYHKNNKS; for i in {0.. 6}; do seqkit locate -ip ${ x: i: 6} HIV- 1_ env. fasta; done seqID patternName pattern strand start end matched seqID patternName pattern strand start end matched seqID patternName pattern strand start end matched seqID patternName pattern strand start end matched seqID patternName pattern strand start end matched seqID patternName pattern strand start end matched seqID patternName pattern strand start end matched G. KE. 2006. 06KE275457V6. KT022379 HKNNKS hknnks + 462 467 hknnks
I also tried downloading all protein sequences of influenza A from
the NCBI's influenza virus database. I simply kept the default settings
here and clicked "add query":
https://
$seqkit stat influenzamega. fa file format type num_ seqs sum_ len min_ len avg_ len max_ len influenzamega. fa FASTA Protein 1, 234, 340 497, 828, 616 1 403. 3 775
The spike protein of Wuhan-Hu-1 is 1273 aa long so it has 1268 possible 6 aa subsegments. However 269 of them had an exact match to one or more of the protein sequences of influenza A:
$curl ' https:// eutils. ncbi. nlm. nih. gov/ entrez/ eutils/ efetch. fcgi? db=nuccore& rettype=fasta_ cds_ aa& id=MN908947'| seqkit grep -nrp gene=S> sars2spike. aa $ seqkit seq -s sars2spike. aa| awk '{ for( i=1; i< length- 5; i++) print substr($ 0, i, 6)} ' > kmer $ grep -f kmer <( seqkit seq -s influenzamega. fa) > temp $ for x in `< kmer`; do grep -m1 $l temp; done| wc -l 269
And in a FASTA file with 125,046 gp120 sequences of HIV, 53 out of 1268 6-mers had a perfect match (but the number would've been bigger if I was looking at all proteins of HIV and not just gp120):
$grep -fkmer <( seqkit seq -s HIV- 1_ env. fasta) > temp2 $ for x in `< kmer`; do grep -m1 $x temp2; done| wc -l 53
Pradhan's first insert was GTNGTKR, but Wuhan-Hu-1 only
has one or 2 inserted aa in that region relative to European and African
sarbecoviruses, so it rather looks like SARS1 has deletions in that
region. The TKR at the end of the first so-called insert is
also matched by ZC45 and RpYN06:
One of the major flaws in Pradhan's paper was that he didn't report the E-values of the matches, which would've shown how likely similarly close matches were to occur by chance.
I tried doing a BLAST search for Pradhan's first so-called insert
together with the 5 surrounding amino acids on either side, so that kept
the default nr database but I restricted the species to HIV-1:
https://
The matches to the Thai sequences that were listed in Pradhan's paper had an E-value of about 507, which means that about 507 similarly close matches were expected to occur by chance:
There were 97 results which had a better match to the query than Pradhan's Thai sequences. But the lowest E-value among the matches was still 11, so none of the matches were even close to reaching the significance level of 0.05:
One of the terms which is used to calculate the E-value is the length of the database in letters, so the E-value depends linearly on the length of the database. My database of HIV-1 sequences was about 300 million letters long. If I would've broadened the search to all viruses except SARS-CoV-2, then the length of the database would've been about 7 times bigger, so the E-values would've also been about 7 times higher.
In Pradhan's pairwise alignment SARS2 had the insert
YYHK relative to Tor2. Pradhan's second so-called insert
was HKNNKS, which consisted of the last two aa of
YYHK and the next 4 aa. However the YYHKNNK
without the K in the middle is also included in ZC45 and
ZXC21 (so it's weird that Pradhan didn't mention
them since they were the closest published relatives of SARS2 in January
2020):
Also from my alignment above it looks that relative to SARS1, SARS2
has the deletion SK and the insertion MESEFR.
Pradhan et al. identified the locations of their inserts by doing a
pairwise alignment of SARS2 and SARS1, but they would've been able to
identify the locations more accurately if they would've done a multiple
sequence alignment that also included other sarbecoviruses. Trevor
Bedford also pointed out that the YYHK wasn't really even
an insert relative to other sarbecoviruses, but from his alignment it
looks like SARS2 has the insertion SEFR relative to SARS1:
[https://
The third insert that was highlighted in Pradhan's alignment
consisted of the amino acids SSG, even though a bit before
it there was another 3 aa insert that was not highlighted. But the third
insert might have been placed in the wrong spot because Pradhan et
al. only did a pairwise alignment of SARS2 and SARS1. When I did an
alignment of more viruses, I got a single 6 aa insert in SARS2 relative
to SARS1 instead of two different 3 aa inserts. And my alignment also
shows that the G at the end of SSG is also
included in ZC45 and RpYN06 (and the middle
S in SSG is even included in European and
African sarbecoviruses, even though it might be a
coincidence):
The insert between inserts 2 and 3 that wasn't highlighted by Pradhan
consisted of the amino acids LHR. When I tried searching
for the insert and the surrounding amino acids on BLAST, I found that
the LHR and the next 4 amino acids were matched by a
bacteriophage from the class Caudoviricetes (so it's even better than Pradhan's first two inserts
which only matched 6 amino acids):
Pradhan's fourth insert consisted of QTNS followed by 11
aa with no match in SARS2 followed by PRRA, which was found
in a single random subtype B sequence of HIV from India. But if the
designers of SARS2 were inspired to borrow the PRRA motif
from HIV, how did they know to look at a random HIV sequence from India
out of hundreds of thousands of published sequences? Here the Indian
sequence is compared to the HXB2 refseq, which contains
QVTNS instead of QTNS, but which only matches
1 out of 4 aa of the PRRA:
$curl ' https:// eutils. ncbi. nlm. nih. gov/ entrez/ eutils/ efetch. fcgi? db=protein& rettype=fasta& id=AKR75206, NP_ 057850'| mafft -- clustalout - [...] AKR75206. 1 RVLAEAMSQ- TNS- S I L M Q R S N F K G P R R A V K C F N C G R E G H I A K N C R A P R K K G C W K C G K E G NP_ 057850. 1 R V L A E A M S Q V T N S A T I M M Q R G N F R N Q R K I V K C F N C G K E G H T A R N C R A P R K K G C W K C G K E G ********* *** :*:***.**:. *: *******:*** *:***************** ^^^^^^^^^^^^^^^^^^^^^ Pradhan insert 4 [...]
Here's code for making amino acid alignments similar to the previous sections of this file:
#download whole genome sequences of sarbecoviruses curl -sL sars2.net/ f/ sarbe. fa. xz| xz -dc> sarbe. fa # efetch multiple sequences with accessions from STDIN emu()(curl -sd " id=$( paste -sd,) " " https:// eutils. ncbi. nlm. nih. gov/ entrez/ eutils/ efetch. fcgi? db=${ 1:- nuccore} & rettype=${ 2:- fasta} & retmode=$ 3") # download spike protein sequences of sarbecoviruses and rename them to have same the defline as the whole genome sequences seqkit seq -ni sarbe.fa| emu ' ' fasta_ cds_ aa| seqkit grep -nrp gene=S, spike, surface| sed ' s/_ prot_.*//; s/ lcl|// '| seqkit fx2tab| awk -F\\ t ' NR==FNR{ a[$ 1] =$ 2; next}{ print" > "$ 1" " a[$ 1] "\ n"$ 2} ' <( seqkit seq -n sarbe. fa| sed $' s/ /\ t/ ') -> spike. aa # make a TSV matrix for percentage identity between spike protein sequences aln()(\mafft -- thread 3 -- quiet "${@--} ") curl -Ls sars2. net/ f/ pid. cpp> pid. cpp; g++ pid. cpp -O3 -o pid seqkit replace -isp '[^ X] ' -r ' ' spike. aa| seqkit seq -m 5| seqkit seq -ni| seqkit grep -vf- spike. aa| aln|./ pid> spike. pid # thin out a TSV percentage identity matrix by removing rows with over n% thin()(identity to any previous row (default 99%) awk -F\\ t ' NR> 1{ for( i=2; i< NR; i++) if($ i> x) next; print$ 1} ' x="${ 1- 99} " "${@: 2} ") # print color scheme aacolor()(printf %s\\ n A: 242: 121: 121 C: 242: 182: 121 D: 242: 242: 121 G: 182: 242: 121 F: 121: 242: 151 E: 121: 242: 222 T: 121: 182: 242 I: 121: 121: 242 K: 182: 121: 242 L: 242: 121: 242 M: 255: 191: 191 N: 255: 223: 191 P: 255: 255: 191 Q: 223: 255: 191 R: 191: 255: 207 S: 191: 255: 244 H: 191: 223: 255 V: 191: 191: 255 W: 223: 191: 255 Y: 255: 191: 255 -: 140: 140: 140 X: 255: 255: 255 \*: 255: 255: 255| tr : \ ) # display alignment with colorized amino acids seqcol()(seqkit seq -u| seqkit fx2tab| gawk -F\\ t ' NR==FNR{ a[$ 1] =$ 2; next}{ name[ FNR] =$ 1; split($ 2, z, " "); for( i in z) seq[ FNR][ i] =z[ i]; if( length($ 1) > max1) max1=length($ 1); if( length($ 2) > max2) max2=length($ 2)} END{ for( i=1; i< =max2; i+ =width){ printf( "%