Shell and R code for genetic analysis of SARS-like viruses - sars2.net

Contents

Download FASTA files from GenBank with curl or efetch

This downloads FASTA files for the nucleotide and protein sequences of a few SARS-like viruses:

$ printf %s\\n NC_045512\ sars2 NC_004718\ sars1 MN996532\ ratg13 MG772933\ zc45 MG772934\ zxc21 MZ937000\ banal52 >accession
$ cat accession|while read l m;do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=$l">$m.fa;done
$ cat accession|while read l m;do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id=$l">$m.aa;done
$ head -n3 sars2.fa
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
$ head -n3 sars2.aa
>lcl|NC_045512.2_prot_YP_009724389.1_1 [gene=ORF1ab] [locus_tag=GU280_gp01] [db_xref=GeneID:43740578] [protein=ORF1ab polyprotein] [exception=ribosomal slippage] [protein_id=YP_009724389.1] [location=join(266..13468,13468..21555)] [gbkey=CDS]
MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVF
IKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADL

Another option is to use efetch from the NCBI's E-utilities: [https://www.ncbi.nlm.nih.gov/books/NBK179288/]

curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh|sh
edirect/efetch -db nuccore -id NC_045512 -format fasta
edirect/efetch -db nuccore -id NC_045512 -format fasta_cds_aa

Align spike protein sequences with MAFFT

The code below creates a single FASTA file for the spike protein sequences of the viruses downloaded in the previous section. It then uses MAFFT to align the sequences and displays the alignment in the human-readable CLUSTAL format. From the output, you can see that the PRRA insert which encodes for the furin cleavage site is missing from each sequence except SARS2, even though BANAL-52 and RaTG13 include the QTNS part of the QTNSPRRA insert:

$ brew install mafft
[...]
$ for x in *.aa;do echo \>${x%%.*};awk '/protein=(spike|surface)/{x=1;next}/^>/{x=0}x' $x;done|mafft --clustalout -
[...]
banal52         GAEHVNNSYECDIPIGAGICASYQTQTNS----RSVASQSIIAYTMSLGAENSVAYSNNS
ratg13          GAEHVNNSYECDIPIGAGICASYQTQTNS----RSVASQSIIAYTMSLGAENSVAYSNNS
sars1           GAEHVDTSYECDIPIGAGICASYHTVSLL----RSTSQKSIVAYTMSLGADSSIAYSNNT
sars2           GAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNS
zc45            GAEHVNASYECDIPIGAGICASYHTASIL----RSTSQKAIVAYTMSLGAENSIAYANNS
zxc21           GAEHVNASYECDIPIGAGICASYHTASIL----RSTGQKAIVAYTMSLGAENSIAYANNS
                *****: ****************:* :      **...::*:********:.*:**:**:
[...]

MAFFT is much faster than older multiple sequence aligners like ClustalW. In my benchmark below, aligning the nucleotide sequences of four SARS-like genomes took 5 seconds with MAFFT, 129 seconds with ClustalOmega, 258 seconds with Muscle, and 278 seconds with ClustalW:

brew install mafft muscle clustal-omega clustal-w
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id='{NC_045512,NC_004718.3,MN996532,MG772933}>sars.fa
time mafft --thread 4 sars.fa>/dev/null
time clustalw sars.fa -output=fasta>/dev/null
time clustalo -i sars.fa>/dev/null
time muscle -in sars.fa>/dev/null

Make a heatmap which shows the percentage identity of genomes with pheatmap

First align the nucleotide sequences:

for x in {sars2,sars1,ratg13,zc45,zxc21,banal52}.fa;do echo ">${x%.fa}";sed 1d $x;done|mafft --thread 4 ->all.aln

Then run this in R:

install.packages("bio3d")
install.packages("pheatmap")
install.packages("colorspace")

# this uses `bio3d::read.fasta` because the `seqinr` package also has a function called `read.fasta`
t=bio3d::seqidentity(bio3d::read.fasta("all.aln"))

# this reorders the branches of the clustering tree by using the first dimension in an MDS matrix as the weight
hc=as.hclust(reorder(as.dendrogram(hclust(as.dist(1-t))),cmdscale(1-t)[,1]))

# this uses `pheatmap::pheatmap` because the `ComplexHeatmap` package also has a function called `pheatmap`
pheatmap::pheatmap(
  100*t,
  filename="1.png",
  clustering_callback=\(...)hc,
  legend=F,
  cellwidth=20,
  cellheight=20,
  fontsize=9,
  border_color=NA,
  display_numbers=T,
  number_format="%.0f",
  fontsize_number=8,
  number_color="black",
  colorRampPalette(colorspace::hex(colorspace::HSV(seq(240,0),.5,1)))(256)
)

Here you can see that SARS2 has about 80% identity with SARS1 and about 97% identity with BANAL-52:

Compile a FASTA file for SARS-like viruses through BLAST

You can find viruses that are similar to SARS2 by going to nucleotide BLAST. [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn] Enter the accession number of the SARS2 referece genome (NC_045512) to the big text field at the top. Then set the organism to "Coronaviridae (taxid:11118)", click "Add organism", enter "SARS-CoV-2 (taxid:2697049)" and click the "Exclude" checkbox next to it.

When I try to do BLAST searches among all coronaviruses, I usually get an error which says "CPU usage limit was exceeded. You may need to change your search strategy." However I figured out that I can avoid the error by excluding SARS2 from the search results:

Then when you click "BLAST", after a while it should show you the hundred closest matches. Then click "Download" and select "FASTA (complete sequences)" to download all sequences:

You can next repeat the same procedure for other SARS-like sequences, like SARS1 (NC_004718), a Kenyan bat virus BtKY72 (KY352407), a bat virus which is about 4000 nucleotide changes away from SARS1 (DQ648856), a bat virus which is about 100 nucleotide changes away from SARS1 (OK017854), and so on. Then you can concatenate all of the files you downloaded, remove duplicates, exclude partial sequences and monopartite sequences, and add the reference genome of SARS2. And then you can align the sequences with MAFFT:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_045512'>sars2.fa
brew install seqkit
cat ~/Downloads/seqdump*.txt sars2.fa|seqkit seq -w 0|paste - -|awk '!a[$1]++'|grep -Ev 'complete cds|partial genome|almost complete genome|genomic sequence, sequence|monopartite'|tr \\t \\n|mafft --thread 4 ->sarslike.fa

In the code above, seqkit -w 0 is used to linearize the FASTA file so that each sequence is placed on a single long line. paste - - joins every other line with a tab so that the title line and sequence are placed on a single line instead of two lines. awk '!a[$1]++' removes duplicate sequences, even though you can remove duplicates with seqkit rmdup -s.

If you don't bother creating the FASTA file yourself, you can also download this FASTA file I made: https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5.

Make a TSV distance matrix for an aligned FASTA file with snp-dists

The snp-dists utility calculates the number of nucleotide changes between each pair of sequences in an aligned FASTA file: [https://github.com/tseemann/snp-dists]

$ curl https://sars2.net/f/sarbe.aln.xz|xz -dc>sarbe.aln
$ brew install brewsci/bio/snp-dists
[...]
$ snp-dists sarbe.aln>sarbe.dist
[...]
$ head -n4 sarbe.dist|cut -f-4|tr \\t \; # the output of snp-dists only shows sequence IDs but not sequence names
snp-dists 0.7.0;OR233295.1;OR233321.1;OR233320.1
OR233295.1;0;2018;2014
OR233321.1;2018;0;10
OR233320.1;2014;10;0
$ grep ^\> sarbe.aln|cut -c2-|sed $'s/ /\t/;s/, complete genome//'|awk 'NR==FNR{a[$1]=$2;next}FNR==1{for(i=2;i<=NF;i++)$i=$i" "a[$i]}FNR>1{$1=$1" "a[$1]}1' {,O}FS=\\t - sarbe.dist>sarbe.distn
$ head -n4 sarbe.distn|cut -f-4|tr \\t \; # add sequence names to output of snp-dists
snp-dists 0.7.0;OR233295.1 Horseshoe bat sarbecovirus isolate Rt17DN420 surface glycoprotein gene, complete cds;OR233321.1 Horseshoe bat sarbecovirus isolate Rt22QT53 surface glycoprotein gene, complete cds;OR233320.1 Horseshoe bat sarbecovirus isolate Rt22QT48 surface glycoprotein gene, complete cds
OR233295.1 Horseshoe bat sarbecovirus isolate Rt17DN420 surface glycoprotein gene, complete cds;0;2018;2014
OR233321.1 Horseshoe bat sarbecovirus isolate Rt22QT53 surface glycoprotein gene, complete cds;2018;0;10
OR233320.1 Horseshoe bat sarbecovirus isolate Rt22QT48 surface glycoprotein gene, complete cds;2014;10;0
$ awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~/Wuhan-Hu-1/)break;next}$1!=x{print$i,$1}' sarbe.distn|sort -n|head
0 NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1
905 MZ937000.1 Bat coronavirus isolate BANAL-20-52/Laos/2020
1138 MN996532.2 Bat coronavirus RaTG13
1159 MZ937001.1 Bat coronavirus isolate BANAL-20-103/Laos/2020
1186 MZ937003.2 Bat coronavirus isolate BANAL-20-236/Laos/2020
1540 MZ081381.1 Betacoronavirus sp. RpYN06 strain bat/Yunnan/RpYN06/2020
1790 MZ937002.1 Bat coronavirus isolate BANAL-20-116/Laos/2020
1804 MZ937004.1 Bat coronavirus isolate BANAL-20-247/Laos/2020
1864 OR233327.1 Horseshoe bat sarbecovirus isolate Rp22DB167 surface glycoprotein gene, complete cds
2087 MW251311.1 Bat coronavirus RacCS264, partial genome

From the output above, you can see that the closest neighbor of SARS2 is BANAL-52 which has 905 nucleotide changes from SARS2.

The default behavior of snp-dists is to only count A, C, G, and T letters, so it ignores positions where either sequence has a gap or a degenerate base, and it produces the incorrect result for amino acid sequences. The -a flag enables counting all letters, but it also counts positions with a gap so there's no way to ignore positions with a gap if snp-dists is used with amino acid sequences.

Plot distance to SARS1 on x-axis and SARS2 on y-axis with ggplot2

# `geom_mark_hull` draws the hulls incorrectly in new versions of `ggforce` so you need to install an old version of `ggforce`:
# `install.packages("https://cran.r-project.org/src/contrib/Archive/ggforce/ggforce_0.3.4.tar.gz")`

library(ggplot2)
library(colorspace)

t=read.table("sarslike.dist",sep="\t",row.names=1,header=T)
name=readLines("sarslike.fa")
name=sub("^>","",name[grep("^>",name)])
name=sub(", complete genome","",name)
name=sub(" isolate "," ",name)
name=sub("^([^ ]+) ","",name)

xid="NC_045512.2"
yid="NC_004718.3"
xname=name[match(xid,rownames(t))]
yname=name[match(yid,rownames(t))]

xy=data.frame(x=t[,xid],y=t[,yid])

# # include only samples whose distance to both samples is below 6500 nucleotides
# pick=complete.cases(xy<=5000)
# xy=xy[pick,]
# name=name[pick]
# t=t[pick,pick]

# draw a line between each sample and its two closest neighbors
seg=cbind(xy[rep(1:nrow(xy),each=2),],xy[apply(t,1,\(x)order(x)[2:3]),])
colnames(seg)=paste0("V",1:4)

expand=max(xy)/25
lims=c(-expand,max(apply(xy,2,range))+expand)

k=as.factor(cutree(hclust(as.dist(t)),24))

set.seed(0)
pal=sample(hcl(head(seq(0,360,length.out=nlevels(k)+1),-1),90,60))

ggplot(xy,aes(x,y))+
geom_vline(xintercept=0,color="gray80",size=.3)+
geom_hline(yintercept=0,color="gray80",size=.3)+
geom_abline(linetype="dashed",color="gray80",size=.3)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray50",size=.1)+
ggforce::geom_mark_hull(aes(color=k,fill=k),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
geom_point(aes(color=k),size=.5)+
# geom_text(aes(color=k),label=name,size=2,vjust=-.7)+ # allow overlapping labels
ggrepel::geom_text_repel(aes(color=k),label=name,size=2,max.overlaps=40,segment.size=.2,min.segment.length=.2,box.padding=.15,force=3)+
coord_cartesian(xlim=lims,ylim=lims,expand=F)+
scale_x_continuous(breaks=seq(0,10000,500))+
scale_y_continuous(breaks=seq(0,10000,500))+
labs(x=paste("Distance to",xname),y=paste("Distance to",yname))+
scale_fill_manual(values=pal)+
scale_color_manual(values=pal)+
theme(
  axis.text=element_text(size=6),
  axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
  axis.ticks=element_blank(),
  axis.ticks.length=unit(0,"cm"),
  axis.title=element_text(size=8),
  legend.position="none",
  panel.background=element_rect(fill="white"),
  panel.border=element_rect(color="gray80",fill=NA,size=.6),
  panel.grid.major=element_line(color="gray90",size=.2),
  plot.background=element_rect(fill="white")
)

ggsave("1.png",width=10,height=10)

Make an MDS plot of SARS1 samples with ggplot2

# `geom_mark_hull` draws the hulls incorrectly in new versions of `ggforce` so you need to install an old version of `ggforce`:
# `install.packages("https://cran.r-project.org/src/contrib/Archive/ggforce/ggforce_0.3.4.tar.gz")`

library(ggplot2)

t=read.table("sarslike.dist",sep="\t",row.names=1,header=T)
name=readLines("sarslike.fa")
name=sub("^>","",name[grep("^>",name)])
name=sub(", complete genome","",name)
name=sub(" isolate "," ",name)
name=sub("SARS coronavirus "," ",name)
name=sub("^([^ ]+) ","",name)

# include only samples with 100 or fewer nucleotide changes from the SARS1 reference genome
pick=t[,"NC_004718.3"]<=100
t=t[pick,pick]
name=name[pick]

mds0=cmdscale(t,ncol(t)-1,eig=T)
mds=as.data.frame(mds0$points)
eig=head(mds0$eig,ncol(mds))
pct=paste0("MDS dimension ",1:ncol(mds)," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

k=as.factor(cutree(hclust(as.dist(t)),16))

set.seed(0)
pal=hcl(sample(head(seq(0,360,nlevels(k)+1),-1)),80,70)

seg=cbind(mds[rep(1:nrow(mds),each=2),1:2],mds[apply(t,1,\(x)order(x)[2:3]),1:2])
colnames(seg)=paste0("V",1:4)

ggplot(mds,aes(V1,V2),group=0)+
ggforce::geom_mark_hull(aes(group=k,color=k,fill=k),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="black",linewidth=.2)+
geom_point(aes(color=k),size=.5)+
# geom_text(aes(color=k),label=name,size=2,vjust=-.7)+
ggrepel::geom_text_repel(aes(color=k),label=name,size=2,max.overlaps=Inf,segment.size=.2,min.segment.length=.2,box.padding=.15,force=3)+
labs(x=pct[1],y=pct[2])+
scale_x_continuous(breaks=seq(-1000,1000,20),expand=expansion(mult=c(.06,.06)))+
scale_y_continuous(breaks=seq(-1000,1000,20),expand=expansion(mult=c(.04,.04)))+
scale_color_manual(values=pal)+
scale_fill_manual(values=pal)+
theme(
  axis.ticks=element_line(linewidth=.4,color="gray40"),
  axis.ticks.length=unit(-.15,"cm"),
  axis.text=element_text(color="black",size=6),
  axis.text.x=element_text(margin=margin(.3,0,0,0,"lines")),
  axis.text.y=element_text(angle=90,vjust=1,hjust=.5,margin=margin(0,.3,0,0,"lines")),
  axis.title=element_text(color="black",size=8),
  legend.position="none",
  panel.background=element_rect(fill="gray25"),
  panel.border=element_rect(color="gray40",fill=NA,linewidth=.6),
  panel.grid=element_blank(),
  plot.background=element_rect(fill="gray40",color=NA),
  plot.title=element_text(size=10,color="black")
)

ggsave("1.png",width=11,height=7)

The red cluster in the middle of the plot is occupied by early human strains of SARS1, and the yellow cluster next to it is occupied by later human stains of SARS2. The bottom left corner of the plot is occupied by lab-created strains like MA5, ExoN1, and the wtic strains, which are all derived from the Urbani strain. Above them there are a couple of infectious clones of the Urbani strain. The bottom right corner of the plot is occupied by samples which were allegedly found in Himalayan palm civets, even though they only have 50-100 nucleotide changes from the reference genome of SARS1:

Make a circular hierarchical clustering tree with Circlize

library(circlize)

t=read.table("sarslike.dist",header=T,row.names=1,sep="\t")
name=readLines("sarslike.fa")
name=sub("^>","",name[grep("^>",name)])
name=sub(", complete genome","",name)
name=sub(" isolate "," ",name)

# reorder clustering tree based on value of first dimension in MDS
hc=as.hclust(reorder(as.dendrogram(hclust(as.dist(t),"ward.D2")),cmdscale(t)[,1]))

labels=hc$label[hc$order]
cut=cutree(hc,30)

set.seed(0)
hue=c(0,30,60,90,130,170,210,240,280,320)+15
color=sample(c(hcl(hue,50,50),hcl(hue,40,60),hcl(hue,30,30)))

dend=dendextend::color_branches(as.dendrogram(hc),k=length(unique(cut)),col=color[unique(cut[labels])])

circos.clear()
size=4500
png("1.png",w=size,h=size,res=150)
circos.par(cell.padding=c(0,0,0,0))
circos.initialize(0,xlim=c(0,nrow(t)))

labelheight=.5 # use 50% of height for labels
circos.track(ylim=c(0,1),bg.border=NA,track.height=labelheight,track.margin=c(0,0),panel.fun=\(x,y)
  for(i in 1:nrow(t))circos.text(i-.5,0,name[hc$order][i],adj=c(0,.5),facing="clockwise",niceFacing=T,cex=.9,col=color[cut[labels[i]]]))

circos.track(ylim=c(0,attr(dend,"height")),track.height=1-labelheight-.01,track.margin=c(0,.005),bg.border=NA,panel.fun=\(x,y)circos.dendrogram(dend))

circos.clear()
dev.off()

# remove empty space around the plot but add back a 24px margin
system("mogrify -gravity center -trim -border 24 -bordercolor white 1.png")

Find genomes with the lowest combined distance to SARS1 and SARS2

This excludes genomes whose distance to either SARS1 or SARS2 is below 500:

t=read.table("sarslike.dist",header=T,row.names=1,sep="\t")
name=readLines("sarslike.fa")
name=sub("^>","",name[grep("^>",name)])
name=sub(", complete genome","",name)
name=sub(" isolate "," ",name)
colnames(t)=rownames(t)=name

s1="NC_004718.3 SARS coronavirus Tor2"
s2="NC_045512.2 Severe acute respiratory syndrome coronavirus 2 Wuhan-Hu-1"

dist=data.frame(sum=t[,s1]+t[,s2],d1=t[,s1],d2=t[,s2],row.names=rownames(t))
dist=dist[rowSums(dist[,-1]<=500)==0,]

# format table with row names on the right side
fotar=\(x)writeLines(apply(cbind(apply(rbind(colnames(x),x),2,\(c)sprintf(paste0("%",max(nchar(c)),"s"),c)),c("",rownames(x))),1,paste,collapse=" "))

fotar(head(dist[order(dist[,1]),],20))

Output:

 sum   d1   d2
6940 6035  905 MZ937000.1 Bat coronavirus BANAL-20-52/Laos/2020
7176 6017 1159 MZ937001.1 Bat coronavirus BANAL-20-103/Laos/2020
7193 1150 6043 KT444582.1 SARS-like coronavirus WIV16
7196 1176 6020 OK017856.1 Sarbecovirus sp. YN2020F
7199 1153 6046 KY417150.1 Bat SARS-like coronavirus Rs4874
7201 1177 6024 OK017857.1 Sarbecovirus sp. YN2020G
7211 1175 6036 OK017852.1 Sarbecovirus sp. YN2020B
7213 1176 6037 OK017855.1 Sarbecovirus sp. YN2020E
7213 1176 6037 OK017854.1 Sarbecovirus sp. YN2020D
7213 6027 1186 MZ937003.2 Bat coronavirus BANAL-20-236/Laos/2020
7214 1177 6037 OK017853.1 Sarbecovirus sp. YN2020C
7232 6094 1138 MN996532.2 Bat coronavirus RaTG13
7264 1254 6010 MK211376.1 Coronavirus BtRs-BetaCoV/YN2018B
7277 1273 6004 OK017849.1 Sarbecovirus sp. YN2016C
7278 1255 6023 KY417152.1 Bat SARS-like coronavirus Rs9401
7292 1283 6009 OK017848.1 Sarbecovirus sp. YN2016B
7294 1280 6014 OK017851.1 Sarbecovirus sp. YN2016E
7294 1279 6015 OK017850.1 Sarbecovirus sp. YN2016D
7299 1283 6016 OK017847.1 Sarbecovirus sp. YN2016A
7300 1256 6044 KY417151.1 Bat SARS-like coronavirus Rs7327

Compile a table of metadata from GenBank

$ mkdir xml
$ awk -F'[> ]' '/^>/{print$2}' sarslike.fa|head -n4|while read l;do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&retmode=xml&id=$l"|xml fo -D>xml/$l;done
$ brew install xmlstarlet
[...]
$ awk -F'[> ]' '/^>/{print$2}' sarslike.fa|head -n4|while read x;do printf %s\\n "$(xml sel -t -v //GBSeq_primary-accession xml/$x)" "$(xml sel -t -v //GBSeq_definition xml/$x)" "$(gdate -d$(xml sel -t -v //GBSeq_create-date xml/$x) +%F)" "$(xml sel -t -v //GBReference_title xml/$x|tail -n1)" "$(xml sel -t -v '//GBQualifier[GBQualifier_name="collection_date"]/GBQualifier_value' xml/$x)" "$(xml sel -t -v '//GBQualifier[GBQualifier_name="country"]/GBQualifier_value' xml/$x)"|tr \; _|paste -sd\; -;done
FJ882959;SARS coronavirus MA15 ExoN1 isolate P3pp6, complete genome;2009-12-13;A mouse-adapted SARS-coronavirus causes disease and mortality in BALB/c mice;26-Nov-2008;USA: Tennessee
FJ882951;SARS coronavirus MA15 ExoN1 isolate P3pp3, complete genome;2009-12-13;A mouse-adapted SARS-coronavirus causes disease and mortality in BALB/c mice;26-Nov-2008;USA: Tennessee
FJ882962;SARS coronavirus MA15 ExoN1 isolate P3pp10, complete genome;2009-12-13;A mouse-adapted SARS-coronavirus causes disease and mortality in BALB/c mice;18-Dec-2008;USA: Tennessee
FJ882942;SARS coronavirus MA15 ExoN1 isolate P3pp5, complete genome;2009-12-13;A mouse-adapted SARS-coronavirus causes disease and mortality in BALB/c mice;26-Nov-2008;USA: Tennessee

Search for a nucleotide sequence

When I searched for the nucleotide sequence that encodes for the PRRA furin cleavage site in SARS2, it was not included in any other genome:

$ curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'
$ sed 's/, complete genome//' sarslike.fa|tr \  _|seqkit locate -p cctcggcgggca|column -ts$'\t'
seqID                                                                           patternName   pattern       strand  start  end    matched
NC_045512.2_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_Wuhan-Hu-1  cggcgggcacgt  cggcgggcacgt  +       24286  24297  cggcgggcacgt
$ seqkit translate<<<$'>a\ncctcggcgggca'|sed 1d
PRRA

Assemble raw reads of SARS1 contaminants in French influenza A samples

A Chinese paper from 2021 found that influenza samples collected in France between 2007 and 2012 were contaminated with SARS1. [https://www.sciencedirect.com/science/article/pii/S2590053621001075] The SARS1 contaminants matched the lab-created wtic-MB and ExoN1 strains of SARS1. Kevin McKernan posted instructions on how you can assemble the SARS1 sequences in the French influenza samples. [https://x.com/Kevin_McKernan/status/1472265147221786639] I expanded his instructions to also generate the consensus sequence:

brew install sratoolkit bwa samtools bcftools cutadapt # cutadapt is needed by trim_galore
vdb-config # configure sra-tools in order to use fasterq-dump
wget https://github.com/FelixKrueger/TrimGalore/archive/0.6.10.tar.gz;tar -xf TrimGalore-0.6.10.tar.gz;cp TrimGalore-0.6.10/trim_galore /usr/local/bin
x=ERR1091914;fasterq-dump $x # download the raw reads of one sample (the other two samples were ERR1091919 and ERR1091920)
trim_galore $x.fastq
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_004718.3&rettype=fasta'>sars1.fa
bwa index sars1.fa
bwa mem -t 4 sars1.fa ${x}_trimmed.fq|samtools view -Sb ->$x.bam
samtools sort $x.bam -o $x.sort.bam
samtools index $x.sort.bam
bcftools mpileup -Ou -f sars1.fa $x.sort.bam|bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
<sars1.fa bcftools consensus calls.vcf.gz>$x.fa

Perform de-novo assembly of SARS2 from a metagenomic dataset with MEGAHIT

The code below downloads the raw reads from the Wu et al. 2020 paper where the reference genome of SARS2 was published. [https://www.nature.com/articles/s41586-020-2008-3] It then uses MEGAHIT to perform de-novo assembly:

brew install megahit -s # -s compiles from source
brew install trimmomatic sratoolkit
vdb-config # configure sra-tools in order to use fasterq-dump
fasterq-dump SRR10971381 # download raw reads from the NCBI's sequence read archive (about 2 GB)
trimmomatic PE SRR10971381_{1,2}.fastq {1,2}{,un}paired.fastq.gz AVGQUAL:20 HEADCROP:12 LEADING:3 TRAILING:3 MINLEN:75 -threads 4
megahit -1 1paired.fastq.gz -2 2paired.fastq.gz -o out
# megahit -1 SRR10971381_1.fastq -2 SRR10971381_2.fastq -o out2 # try running MEGAHIT with no trimming
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_045512'>sars2.fa
awk 'length>max{max=length;o=$0}END{print o}' out/final.contigs.fa|rev|sed y/ATCG/TAGC/|sed 1i\>longest_contig|cat - sars2.fa|mafft --clustalout -

The last line above prints the reverse complement of the longest contiguous sequence and aligns it to the SARS2 reference genome. Sometimes the longest contig is the reverse complement of the genome and sometimes not, so you have to check both versions. You can get the reverse complement with rev|sed y/ATCG/TAGC/ or seqtk seq -rp.

The parameters I used with Trimmomatic are probably not optimal, but I just copied them from this article: https://research.csc.fi/metagenomics_quality.

I used brew install megahit -s to compile MEGAHIT from source, because without the -s option I got the error dyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgomp.1.dylib (because I had GCC 11 and not 9).

When I didn't use Trimmomatic, my longest contiguous sequence was 29802 bases long and it was missing the last 102 bases of the third version of Wuhan-Hu-1 and it had one extra base at the beginning. When I did use Trimmomatic, my longest contiguous sequence was now 29875 bases long, it was only missing the last 30 bases from Wuhan-Hu-1, but it still had one extra base at the beginning. I probably used the wrong parameters with Trimmomatic though, or I'm still missing some steps in the pipeline.

When I ran MEGAHIT without Trimmomatic, my second-longest contig was 16,037 bases long and its best match at BLAST was "Leptotrichia hongkongensis JMUB5056 DNA, complete genome" with 99.18% identity and 99% query coverage. My third-longest contig was 13,657 bases long and its best match at BLAST was "Select seq LR778174.1 Veillonella parvula strain SKV38 genome assembly, chromosome: 1" with 99.82% identity. My fourth-longest contig was 11,777 bases long and its best match was "Leptotrichia sp. oral taxon 212 strain W10393, complete genome" with 99.24% identity. I didn't find any matches for my fifth-longest contig, but my sixth-longest contig was 8,062 bases long and its best match was "Select seq CP068263.2 Homo sapiens isolate CHM13 chromosome 15" with 99.89% identity. From the "Analysis" tab of NCBI's sequence read archive, you can see which organisms the reads are estimated to come from. [https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR10971381&display=analysis] For SRR10971381, 61% of reads are unidentified, 33% are classified under bacteria, 5% under eukaryotes, and only 0.2% under viruses. So the ratio of 33% bacteria to 5% eukaryotes is consistent with my BLAST searches, where three out of four reads were from bacteria and the fourth was a human read.

Find identical or similar proteins

Limeng Yan has pointed out that in the Zhoushan bat viruses ZC45 and ZXC21, the envolope protein is identical to the reference genome of SARS2. In the second Yan report, she pointed out that the envelope protein was also identical in many of the supposed pangolin and bat viruses that are similar to SARS2 and that were published in 2020 and 2021: [https://zenodo.org/record/4650821]

The code below takes the file of nucleotide sequences I've used as an example in this thread, downloads amino acid sequences for each virus, and prints the names of viruses with an identical envelope protein to Wuhan-Hu-1 along with the number of nucleotide changes from Wuhan-Hu-1:

$ curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'
$ brew install brewsci/bio/snp-dists
$ snp-dists sarslike.fa>sarslike.dist
$ sed -n 's/^>//p' sarslike.fa|cut -d\  -f1|while read l;do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id=$l";done>sarslike.aa
$ 1li()(seqkit seq -w 0 "$@"|paste - -) # convert a FASTA file to a format with one entry per line
$ masot()(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i==x)break;next}$1!=x{print$i,$1}' x="$1" "${2--}"|sort -n) # print a column which matches the specified column name from a matrix and display the sorted values with row names
$ 1li<sarslike.aa|grep $'\t'$(1li<sarslike.aa|grep 'NC_045512.*envelope'|cut -f2)\$|sed 's/_prot_.*//;s/>lcl|//'|awk 'NR==FNR{a[$0];next}$2 in a' - <(masot NC_045512.2 sarslike.dist)|awk 'NR==FNR{a[$1]=$0;next}{print$1,a[$2]}' <(sed -n 's/^>//p' sarslike.fa) -|sed 's/, complete genome//'
905 MZ937000.1 Bat coronavirus isolate BANAL-20-52/Laos/2020
1138 MN996532.2 Bat coronavirus RaTG13
1186 MZ937003.2 Bat coronavirus isolate BANAL-20-236/Laos/2020
1540 MZ081381.1 Betacoronavirus sp. RpYN06 strain bat/Yunnan/RpYN06/2020
2860 MT121216.1 Pangolin coronavirus isolate MP789
3092 OK287355.1 Sarbecovirus sp. isolate BetaCoV_Yunnan_Rp_JCC9_2020
3518 OK017803.1 Sarbecovirus sp. isolate HN2021A
3518 OK017804.1 Sarbecovirus sp. isolate HN2021B
3553 MG772933.1 Bat SARS-like coronavirus isolate bat-SL-CoVZC45
3554 MG772934.1 Bat SARS-like coronavirus isolate bat-SL-CoVZXC21
4312 MT040335.1 Pangolin coronavirus isolate PCoV_GX-P5L
4312 MT072864.1 Pangolin coronavirus isolate PCoV_GX-P2V
4313 MW532698.1 Pangolin coronavirus isolate GX_P2V
4314 MT040333.1 Pangolin coronavirus isolate PCoV_GX-P4L
4317 MT040334.1 Pangolin coronavirus isolate PCoV_GX-P1E
4319 MT040336.1 Pangolin coronavirus isolate PCoV_GX-P5E
5629 LC663959.1 Sarbecovirus sp. Rc-mk2 RNA, complete geneome

Another option is to use seqkit locate, even though it will also match proteins that have additional insertions at the beginning or end of the sequence: seqkit grep -nrp 'NC_045512.*envelope' sarslike.aa|seqkit locate -f- sarslike.aa.

However maybe there's nothing that unusual about the envelope protein being idential, because there's viruses that have over 3000 nucleotide changes from SARS1 but that still have an identical envelope protein to SARS1:

$ 1li<sarslike.aa|grep $'\t'$(1li<sarslike.aa|grep 'NC_004718.*envelope'|cut -f2)\$|sed 's/_prot_.*//;s/>lcl|//'|awk 'NR==FNR{a[$0];next}$2 in a' - <(masot NC_004718.3 sarslike.dist)|awk 'NR==FNR{a[$1]=$0;next}{print$1,a[$2]}' <(sed -n 's/^>//p' sarslike.fa) -|sed 's/, complete genome//'|tail -n1
3549 GQ153548.1 Bat SARS coronavirus HKU3-13

In order to return similar proteins in addition to identical proteins, you can use BLAST:

$ brew install blast
[...]
$ makeblastdb -in sarslike.aa -dbtype prot
[...]
$ blastp -query <(seqkit grep -nrp 'NC_004718.*envelope' sarslike.aa) -db sarslike.aa
[...]
$ seqkit grep -nrp NC_004718.\*envelope sarslike.aa|blastp -db sarslike.aa -outfmt 6|head -n2|tr \\t \; # `-outfmt 6` uses TSV output
lcl|NC_004718.3_prot_YP_009825054.1_6;lcl|MT308984.1_prot_QJE50592.1_6;100.000;76;0;0;1;76;1;76;7.96e-49;146
lcl|NC_004718.3_prot_YP_009825054.1_6;lcl|GU553363.1_prot_ADC35486.1_6;100.000;76;0;0;1;76;1;76;7.96e-49;146
$ seqkit grep -nrp NC_004718.\*envelope sarslike.aa|blastp -db sarslike.aa -outfmt 6|cut -f2,5|sed $'s/lcl|//;s/_prot_[^\t]*//'|awk 'NR==FNR{a[$1]=$2;next}$2 in a{print a[$2],$0}' - <(masot NC_004718.3 sarslike.dist)|awk 'NR==FNR{a[$1]=$0;next}{print$1,$2,a[$3]}' <(sed -n s/^\>//p sarslike.fa) -|sed 's/, complete genome//'|sort -nk1,1 -k2,2|head -n2
0 0 JX163927.1 SARS coronavirus Tor2 isolate Tor2/FP1-10851
0 1 JX163923.1 SARS coronavirus Tor2 isolate Tor2/FP1-10912

In the output of the last command above, the first column shows the number of amino acid changes in the envelope protein and the second column shows the number of nucleotide changes in the whole genome.

Find pairs of viruses with a similar spike protein but a divergent whole genome

To calculate a distance matrix for the number of nucleotide or amino acid changes in an aligned FASTA file, you can use Biostrings: stringDist(readDNAStringSet("input.fa"),"hamming")). The following code finds pairs of viruses where there's two or fewer amino acid changes in the spike protein, and it then sorts the results by the number of nucleotide changes in the whole genome:

# install.packages("BiocManager")
# BiocManager::install("Biostrings")

library(Biostrings)

d1=as.matrix(stringDist(readDNAStringSet("sarslike.fa"),"hamming"))
name=sub(", complete genome","",rownames(d1))
name=sub(" isolate "," ",name)
id=sub(" .*","",rownames(d1))
rownames(d1)=colnames(d1)=id
name=setNames(name,id)

d2=as.matrix(stringDist(readAAStringSet("spike.aln"),"hamming"))
colnames(d2)=rownames(d2)=sub("_prot_.*","",sub("lcl\\|","",rownames(d2)))
d1=d1[rownames(d2),rownames(d2)]

long=data.frame(c(d1),c(d2),name[rownames(d1)[row(d1)]],name[colnames(d1)[col(d1)]])
long=long[long[,3]<long[,4],]
pick=long[long[,2]<=2,]
pick=pick[rev(tail(order(pick[,1]),10)),]
write.table(pick,sep=";",quote=F,row.names=F,col.names=F)

Output:

1044;0;KC881005.1 Bat SARS-like coronavirus RsSHC014;MT308984.1 Mutant SARS coronavirus Urbani clone SARS-Urbani-MA_SHC014-spike
871;1;KY417144.1 Bat SARS-like coronavirus Rs4084;MT308984.1 Mutant SARS coronavirus Urbani clone SARS-Urbani-MA_SHC014-spike
786;1;KY417151.1 Bat SARS-like coronavirus Rs7327;KY417152.1 Bat SARS-like coronavirus Rs9401
559;2;KC881006.1 Bat SARS-like coronavirus Rs3367;KF367457.1 Bat SARS-like coronavirus WIV1
432;0;OL674074.1 Severe acute respiratory syndrome-related coronavirus Rs7896;OL674078.1 Severe acute respiratory syndrome-related coronavirus Rs7921
425;0;OK017844.1 Sarbecovirus sp. JX2021O;OK017845.1 Sarbecovirus sp. JX2021P
421;0;OL674074.1 Severe acute respiratory syndrome-related coronavirus Rs7896;OL674077.1 Severe acute respiratory syndrome-related coronavirus Ra7909
417;0;OL674074.1 Severe acute respiratory syndrome-related coronavirus Rs7896;OL674080.1 Severe acute respiratory syndrome-related coronavirus Rs7931
409;1;KC881005.1 Bat SARS-like coronavirus RsSHC014;KY417144.1 Bat SARS-like coronavirus Rs4084
397;0;OL674074.1 Severe acute respiratory syndrome-related coronavirus Rs7896;OL674076.1 Severe acute respiratory syndrome-related coronavirus Rs7907

From the first line of the output above, you can see that the viruses titled "Bat SARS-like coronavirus RsSHC014" and "Mutant SARS coronavirus Urbani clone SARS-Urbani-MA_SHC014-spike" have an identical spike protein even though they have 1044 nucleotide changes in the whole genome. The mutant virus was published in a paper from 2015 by the Wuhan Institute of Virology whose last author was Ralph Baric. [https://www.ncbi.nlm.nih.gov/nuccore/MT308984] They used a mouse-adapted strain of SARS1 called MA15 as the backbone and inserted the spike protein from the SHC014 virus, which had been published in a paper from 2013 whose last author was Zhengli Shi and whose authors also included Peter Daszak. The mutant virus was also mentioned in a report which the EcoHealth Alliance produced in order to renew their research grant from the NIH: "In collaboration with Ralph Baric (UNC), we used the SARS-CoV reverse genetics system to generate a chimeric virus with a mouse-adapted SARS-CoV backbone expressing SHC014 S protein with 10% sequence divergence from SARS-CoV S. This chimera replicated in primary human airway epithelium, using the human ACE2 receptor to enter into cells." [https://s3.documentcloud.org/documents/21055989/understanding-risk-bat-coronavirus-emergence-grant-notice.pdf]

From the second line of the output above, you can see that the spike protein of Rs4084 has only one amino acid change from the MA15-SHC014 mutant. Rs4084 was published in a paper by the Wuhan Institute of Virology from 2017 whose last author was Zhengli Shi. [https://www.ncbi.nlm.nih.gov/nuccore/KY417144.1]

You can also use awk to calculate the distance matrix for the number of amino acid changes in the spike protein:

sed 's/>lcl|\(.*\)_prot_.*/>\1/' spike.aln|awk '{printf(/^>/?"\n%s\t":"%s"),$0}'|grep .|paste - -|cut -c2-|gawk -F\\t '{n[NR]=$1;split($2,a,"");l=length(a);for(i=1;i<=l;i++)x[NR][i]=a[i]}END{for(i=1;i<=NR;i++)printf"\t%s",n[i];print"";for(i=1;i<=NR;i++){printf"%s",n[i];for(j=1;j<=NR;j++){s=0;for(k=1;k<=l;k++)if(x[i][k]!=x[j][k])s++;printf"\t%s",s};print""}}'>spike.dist

Make a vertical hierarchical clustering tree with ggtree

# install.packages("BiocManager")
# BiocManager::install("ggtree")
# BiocManager::install("Biostrings")

library(ggtree)
library(Biostrings)

t=as.matrix(stringDist(readDNAStringSet("https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5"),"hamming"))
name=rownames(t)
name=sub(", complete genome","",name)
name=sub(" isolate "," ",name)
name=sub(", genomic sequence, sequence"," ",name)
rownames(t)=name

pick=t[grep("Wuhan-Hu-1",rownames(t)),]<=5000
t=t[pick,pick]

set.seed(0)
hue=c(0,30,60,90,130,170,210,240,280,320)+15
color=c("black",sample(c(hcl(hue,50,30),hcl(hue,60,60))))

# reorder branches based on first dimension in MDS
hc=as.hclust(reorder(as.dendrogram(hclust(as.dist(t))),cmdscale(t)[,1]))
cut=cutree(hc,20)

p=ggtree(hc,size=.15,ladderize=F) # `ladderize=F` is needed to preserve the order of the tree
clades=sapply(split(names(cut),cut),\(n)MRCA(p,n))
p=groupClade(p,clades,group_name="subtree")+aes(color=subtree)

p%<+%data.frame(label=names(cut))+
geom_tiplab(size=2)+
scale_color_manual(values=color)+
xlim(0,1.3*max(hc$height))+ # control ratio of label width to tree width
theme(legend.position="none")

ggsave("1.png",height=.09*nrow(t),width=6)

In the output below, the first date on each line is the date when the sequence was published at GenBank and the second date is the collection date of the sample. There are only two viruses that have less than 5000 nucleotide changes from Wuhan-Hu-1 and that had been published at GenBank before 2020. They are the Zhoushan bat viruses ZX45 and ZXC21, which were submitted to GenBank by "Institute of Military Medicine Nanjing Command":

Find unique K-mers

In a pre-print published on February 2th 2020, a program called FAST-NA was used to identify unique regions within the genome of SARS2 by finding k-mers of amino acid sequences which did not appear in other coronaviruses : [https://www.biorxiv.org/content/10.1101/2020.01.31.929497v1.full]

To identify unique sequences, we adapted FAST-NA, a software tool for screening DNA synthesis orders for pathogens(2,3) that uses methods for automatic signature generation developed originally for cybersecurity malware detection(4). In particular, FAST-NA compares all k-mer sequences of a collection of target sequences to a collection of contrasting sequences in order to identify all k-mer sequences that are unique to the target population. These unique sequences are diagnostic of membership in the population, whereas shared sequences indicate structure that is conserved to some degree.

Here, we applied FAST-NA to identify all of the unique 10-mer sequences in all of the amino acid sequences for 2019-nCoV then available from NCBI: 63 amino acid sequences available in NCBI, comprising a total of 49379 amino acids (5-8). For contrasting sequences, we used a July, 2019 snapshot of all protein sequences in family Coronaviridae available from NCBI, a total of 50574 sequences comprising a total of approximately 40 million residues. The resulting collection of unique 10-mer amino acids sequences were then concatenated where overlapping within the same parent sequence and trimmed to remove non-unique flanking portions.

The FAST-NA program is not available for free, so I wrote my own redneck version of the program. My dataset was about an order of magnitude smaller because it included only SARS-like viruses instead of all coronaviruses, even though I downloaded all viruses available in 2023 and not 2019 so my dataset also includes sequences like RaTG13 and BANAL-52:

$ curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'
$ sed -n 's/^>//p' sarslike.fa|cut -d\  -f1|while read l;do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id=$l";done>sarslike.aa
$ brew install seqkit
[...]
$ seqkit seq -w0 sarslike.aa|paste - -|grep NC_045512|sed $'s/.*gene=//;s/\][^\t]*//'|grep -v ORF1a\ |awk '{l=length($2);for(i=1;i<=l-k+1;i++)print$1,i,substr($2,i,k)}' k=10 >kmer # ORF1a is omitted because it is also included in ORF1ab
$ awk '{print">"NR"\n"$3}' kmer|seqkit locate -Ff - <(seqkit grep -vnrp NC_045512 sarslike.aa)>match # `-F` uses FM-index which makes searching for a long list of patterns faster
$ awk 'NR==FNR{a[$7];next}!($3 in a)' match kmer|head -n4 # show unique 10-mers along with their protein and starting position within the protein
ORF1ab 29 RGFGDSVEEV
ORF1ab 30 GFGDSVEEVL
ORF1ab 31 FGDSVEEVLS
ORF1ab 32 GDSVEEVLSE
$ awk 'NR==FNR{a[$7];next}!($1 in b){b[$1]=0}!($3 in a){b[$1]++}END{for(i in b)print i,b[i]}' match kmer|awk 'NR==FNR{a[$1]++;next}{print$1,$2/a[$1],$2,a[$1]}' OFMT=%.4f kmer -|sort -rnk2|column -t # show the ratio of unique k-mers to total k-mers within each protein
ORF8    0.2679  30   112
ORF3a   0.1053  28   266
S       0.0791  100  1264
ORF7b   0.0588  2    34
ORF1ab  0.0230  163  7087
M       0.0141  3    213
N       0.0098  4    410
ORF7a   0       0    112
ORF6    0       0    52
ORF10   0       0    29
E       0       0    66
$ awk 'NR==FNR{a[$7];next}!($3 in a)' match kmer|awk 'NR==1||$1!=prev1||lastprint==NR-1{start=$2;prev1=$1;prev2=$2;next}$2!=prev2+1{print prev1,start,prev2+k-1;lastprint=NR;prev1=$1;prev2=$2;next}{prev1=$1;prev2=$2}END{if(lastprint!=NR)print $1,start,$2+k-1}' k=10|awk '{print$0,$3-$2+1}' # show the ranges within each protein which feature unique k-mers
ORF1ab 29 47 19
ORF1ab 272 289 18
ORF1ab 368 385 18
ORF1ab 856 868 13
ORF1ab 981 991 11
ORF1ab 993 1008 16
ORF1ab 1008 1017 10
ORF1ab 1152 1168 17
ORF1ab 1385 1402 18
ORF1ab 1726 1742 17
ORF1ab 1771 1788 18
ORF1ab 1787 1804 18
ORF1ab 1814 1831 18
ORF1ab 2025 2038 14
ORF1ab 2074 2091 18
ORF1ab 2397 2410 14
ORF1ab 3135 3152 18
ORF1ab 3598 3615 18
S 23 41 19
S 42 59 18
S 68 85 18
S 210 227 18
S 338 355 18
S 364 379 16
S 395 411 17
S 490 507 18
S 511 528 18
ORF3a 1 19 19
ORF3a 20 37 18
ORF8 1 19 19
ORF8 18 35 18
N 28 40 13

Show positions where all sequences in an alignment are not identical

In a paper from 2020 by Ren et al. titled "Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study", they described de-novo assembly of a sample of SARS2 collected in December 2019. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7147275/] Compared to the third version of Wuhan-Hu-1 at GenBank, the EPI_ISL_402123/IPBCAMS-WH-01 sequence from the Ren et al. paper is missing the last 4 bases of the poly(A) tail and it has 3 nucleotide changes in the region of the ORF1a protein:

$ curl https://download.cncb.ac.cn/gwh/Viruses/Severe_acute_respiratory_syndrome_coronavirus_2_IPBCAMS-WH-01_GWHABKF00000000/GWHABKF00000000.genome.fasta.gz|gzip -dc>ren
$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947'>wuhanhu1
$ cat ren wuhanhu1|mafft --quiet -|awk '{printf(/^>/?"\n%s\n":"%s"),$0}'|grep .>temp.aln;paste <(sed -n 2p temp.aln|grep -o .) <(sed -n 4p temp.aln|grep -o .)|awk '$1!=$2{print NR,$1,$2}'
3778 g a
8388 g a
8987 a t
29900 - a
29901 - a
29902 - a
29903 - a

The shell code above only works with pairwise alignments, but the following code can also be used to compare three or more sequences:

> m=t(as.matrix(Biostrings::readDNAStringSet("temp.aln")))
> pick=which(rowSums(m!=m[,1])>0)
> write.table(`rownames<-`(m[pick,],pick),sep=";",quote=F,col.names=NA) # without `col.names=NA` the semicolon at the start of the header line is omitted
;GWHABKF00000001    OriSeqID=IPBCAMS-WH-01_2019 Len=29899;MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
3778;G;A
8388;G;A
8987;A;T
29900;-;A
29901;-;A
29902;-;A
29903;-;A

Or the same in gawk (this doesn't work with nawk which doesn't support arrays of arrays):

$ awk 'NR>1{gsub("\t"," ");sub("\n","\t");gsub("\n","");print}' RS=\> temp.aln|awk -F\\t '{o=o";"$1;l=length($2);for(i=1;i<=l;i++){x=substr($2,i,1);a[i][NR]=x;b[i][x]}}END{print o;for(i in b)if(length(b[i])>1){o=i;for(j=1;j<=NR;j++)o=o";"a[i][j];print o}}'
;GWHABKF00000001 OriSeqID=IPBCAMS-WH-01_2019 Len=29899;MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
3778;g;a
8388;g;a
8987;a;t
29900;-;a
29901;-;a
29902;-;a
29903;-;a

Find sequences which match a primer pair

In February 2020 the WHO published a PDF which featured tables of primer and probe sequences used in seven different RT-PCR protocols. [https://www.who.int/docs/default-source/coronaviruse/whoinhouseassays.pdf] I tried searching for the first primer pair mentioned in the PDF to see if it matches other viruses besides SARS2, but it only matched RaTG13:

$ seqkit amplicon -F atgagcttagtcctgttg -R ctccctttgttgtgttgt sarslike.fa # `-F` specifies forward primer and `-R` specifies reverse primer
[INFO] 1 primer pair loaded
>MN996532.2 Bat coronavirus RaTG13, complete genome
atgagcttagtcctgttgcactacgacagatgtcttgtgctgccggtactacacaaactg
cttgcactgatgacaatgcgttagcttactacaacacaacaaagggag
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
atgagcttagtcctgttgcactacgacagatgtcttgtgctgccggtactacacaaactg
cttgcactgatgacaatgcgttagcttactacaacacaacaaagggag
$ awk '{printf(/^>/?"\n%s\n":"%s"),$0}' sarslike.fa|grep .|paste - -|cut -c2-|awk '{if(match($2,x))print RSTART,RLENGTH,substr($2,RSTART,RLENGTH),$1}' {,O}FS=\\t x=atgagcttagtcctgttg.\*$(rev<<<ctccctttgttgtgttgt|sed y/atcg/tagc/)
13118   108 atgagcttagtcctgttgcactacgacagatgtcttgtgctgccggtactacacaaactgcttgcactgatgacaatgcgttagcttactacaacacaacaaagggag    MN996532.2 Bat coronavirus RaTG13, complete genome
13118   108 atgagcttagtcctgttgcactacgacagatgtcttgtgctgccggtactacacaaactgcttgcactgatgacaatgcgttagcttactacaacacaacaaagggag    NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

Translate a nucleotide sequence to an amino acid sequence

$ seqkit translate<<<$'>a\ncctcggcgggca'|sed 1d
PRRA
$ R -e 'if(!require("BiocManager"))install.packages("BiocManager");BiocManager::install("Biostrings")'
[...]
$ Rscript -e 'writeLines(as.character(Biostrings::translate(Biostrings::DNAString("cctcggcgggca"))))'
PRRA
$ printf %s\\n AAA:K AAC:N AAG:K AAT:N ACA:T ACC:T ACG:T ACT:T AGA:R AGC:S AGG:R AGT:S ATA:I ATC:I ATG:M ATT:I CAA:Q CAC:H CAG:Q CAT:H CCA:P CCC:P CCG:P CCT:P CGA:R CGC:R CGG:R CGT:R CTA:L CTC:L CTG:L CTT:L GAA:E GAC:D GAG:E GAT:D GCA:A GCC:A GCG:A GCT:A GGA:G GGC:G GGG:G GGT:G GTA:V GTC:V GTG:V GTT:V TAA:\* TAC:Y TAG:\* TAT:Y TCA:S TCC:S TCG:S TCT:S TGA:\* TGC:C TGG:W TGT:C TTA:L TTC:F TTG:L TTT:F|tr : \ >codontable
$ echo cctcggcgggca|awk 'NR==FNR{a[tolower($1)]=$2;next}{o="";for(i=1;i<length($0);i+=3)o=o a[tolower(substr($0,i,3))];print o}' codontable -
PRRA
$ seqkit translate -l 1 # list standard table (table number 1)
The Standard Code (transl_table=1)
Source: https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1

Initiation Codons:
  ATG, CTG, TTG

Stop Codons:
  TAA, TAG, TGA

Stranslate Table:
  AAA: K, AAC: N, AAG: K, AAT: N
  ACA: T, ACC: T, ACG: T, ACT: T
  AGA: R, AGC: S, AGG: R, AGT: S
  ATA: I, ATC: I, ATG: M, ATT: I

  CAA: Q, CAC: H, CAG: Q, CAT: H
  CCA: P, CCC: P, CCG: P, CCT: P
  CGA: R, CGC: R, CGG: R, CGT: R
  CTA: L, CTC: L, CTG: L, CTT: L

  GAA: E, GAC: D, GAG: E, GAT: D
  GCA: A, GCC: A, GCG: A, GCT: A
  GGA: G, GGC: G, GGG: G, GGT: G
  GTA: V, GTC: V, GTG: V, GTT: V

  TAA: *, TAC: Y, TAG: *, TAT: Y
  TCA: S, TCC: S, TCG: S, TCT: S
  TGA: *, TGC: C, TGG: W, TGT: C
  TTA: L, TTC: F, TTG: L, TTT: F

Compare methods of calculating a distance matrix between sequences

snp-dists and bio3d::seqidentity only count nucleotide changes but not insertions or deletions, so that they ignore positions where one sequence has a gap and another sequence does not. Biostrings::stringDist(method="hamming") also counts positions where one sequence has a gap but another sequence does not:

$ printf %s\\n \>a gattaca \>b aatt--->test.fa
$ snp-dists -qb test.fa
    a   b
a   0   1
b   1   0
$ Rscript -e 'bio3d::seqidentity(bio3d::read.fasta("test.fa"))'
     a    b
a 1.00 0.75
b 0.75 1.00
$ Rscript -e 'as.matrix(Biostrings::stringDist(Biostrings::readDNAStringSet("test.fa"),"hamming"))'
  a b
a 0 4
b 4 0

The genomes of viruses are often missing a piece from the beginning or end of the genome, so when you have a set of aligned sequences where some sequences have gaps at either end and other sequences don't, it's common to cut out the parts from the ends of the alignment which feature gaps. For example in a paper where they produced an aligned set of genomes of SARS1, they wrote: "For network analysis, an 81-bp block at the 5'-end including gaps and a 77-bp block at the 3'-end including gaps and the poly-A tails were trimmed out of the alignment, and the final alignment contains 30,169 nucleotides." [https://www.sciencedirect.com/science/article/pii/S2590053621001075]

In the code block below where I tried aligning ten random genomes of SARS1, only three out of ten sequences included the poly(A) tail, but its length was really short in all of them: 4 bases in one sequence and 2 bases in two other sequences. The sequence with the 4-base poly-A tail was the only sequence which didn't have any gaps at the end. Two sequences had 2 gaps at the end, three sequences had 4 gaps, two sequences had 24 gaps, one sequence had 45 gaps, and one sequence had 47 gaps. At the beginning of the aligned set of sequences, six sequences had no gaps, one sequence had 20 gaps, one sequence had 32 gaps, and two sequences had 40 gaps:

$ curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'
$ seqkit seq -w0 sarslike.fa|paste - -|grep SARS\ coronavirus|gshuf --random-source=<(seq 1000)|head -n10|tr \\t \\n|mafft --quiet ->random.fa
$ mafft --clustalout random.fa # `--clustalout` uses a human-readable output format
[... lines before the beginning of the alignment omitted]
EU371559.1      atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgtt
MK062180.1      atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgtt
AY502926.1      atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgtt
FJ882926.1      --------------------------------accaacctcgatctcttgtagatctgtt
AY502931.1      atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgtt
FJ882938.1      --------------------caggaaaagccaaccaacctcgatctcttgtagatctgtt
AP006557.1      atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgtt
HQ890532.1      ----------------------------------------cgatctcttgtagatctgtt
AY321118.1      atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgtt
HQ890535.1      ----------------------------------------cgatctcttgtagatctgtt
                                                        ********************
[... lines in the middle of the alignment omitted]
EU371559.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
MK062180.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
AY502926.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
FJ882926.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
AY502931.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
FJ882938.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
AP006557.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
HQ890532.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtag---------------
AY321118.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtg
HQ890535.1      ctgcctatatggaagagccctaatgtgtaaaattaattttagtagt--------------
                *********************************************

EU371559.1      attttaatagcttcttaggagaatgacaaaa
MK062180.1      attttaatagcttcttaggagaatgac----
AY502926.1      attttaatagcttcttaggagaatgacaa--
FJ882926.1      attttaa------------------------
AY502931.1      attttaatagcttcttaggagaatgacaa--
FJ882938.1      attttaa------------------------
AP006557.1      attttaatagcttcttaggagaatgac----
HQ890532.1      -------------------------------
AY321118.1      attttaatagcttcttaggagaatgac----
HQ890535.1      -------------------------------

When I used Biostrings::stringDist(method="hamming") to calculate a distance matrix between the ten aligned sequences, the maximum distance was at first 120 when I didn't remove any bases, but the maximum distance decreased to 36 when I removed the last 47 and the first 40 bases of the alignment. When I used snp-dists to calculate the distance matrix, the maximum distance was only 34 because snp-dists ignores positions where one sequence has a gap and another one doesn't:

$ Rscript -e 'max(Biostrings::stringDist(Biostrings::readDNAStringSet("random.fa"),"hamming"))'
[1] 120
$ seqkit subseq -r 41:-48 random.fa>random.trim;Rscript -e 'max(Biostrings::stringDist(Biostrings::readDNAStringSet("random.trim"),"hamming"))'
[1] 36
$ snp-dists -q random.fa|sed 1d|cut -f2-|tr \\t \\n|sort -n|tail -n1
34

Reorder sequences based on position in a hierarchical clustering tree

When you have a set of data that can be represented as a distance matrix, a convenient way to reorder items within the dataset so that similar items are grouped together is to order the items based on their position in a hierarchical clustering tree. I already used that method to order the sarslike.fa file I have used as an example in this thread, so in this section I'll use a file of HIV and SIV genomes as an example. The HIV Sequence Database of the Los Alamos National Laboratory has a convenient interface where you can download prealigned sets of HIV and SIV genomes. Go here: https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html. Set "Alignment type" to "Compendium" in order to download a small curated set of sequences. Set "Organism" to "Other SIV (includes HIV1 and HIV2 sequences)". Set "Region" to "GENOME" to download the whole genome. Then click "Get Alignment" and click "Download this alignment". Then run this:

library(Biostrings)

fa=readDNAStringSet("SIV_COM_2021_genome_DNA.fasta")
fa=subseq(fa,101,width(fa[1])-100)
d=stringDist(fa,"hamming")
hc=as.hclust(reorder(as.dendrogram(hclust(d)),cmdscale(as.matrix(d))[,1])) # reorder the clustering tree based on the first dimension in a multidimensional scaling matrix
writeXStringSet(fa[hc$order],"sorted.fa")

From the output, you'll see that HIV-2 samples are grouped together with SIV samples from sooty mangabey monkeys (SMM), and chimpanzee strains of SIV are clustered together with HIV-1.

The default method used by Biostrings::stringDist is Levenshtein edit distance, which is extremely slow to calculate and which is not needed with sequences that have already been aligned, but the Hamming method simply returns the number of positions that are different between two sequences.

The code above omits the first 100 and last 100 bases from distance calculation because many samples are missing a piece from the beginning or end of the sequence. However actually in the aligned set of HIV and SIV genomes, many sequences are missing more than 100 bases from the beginning or end of the alignment, so maybe it would've been better to calculate the distance matrix between the genomes so that I only count nucleotide changes but ignore insertions and deletions, like what's done by bio3d::seqidentity and snp-dists.

Search for assembly errors at the beginning or end of sequences

I tried aligning 194 sequences of SARS1 and then printing the last 100 bases of the alignment:

$ curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'
$ seqkit grep -nrp SARS\ coronavirus sarslike.fa|mafft --quiet --thread 4 ->sars1.aln
$ seqkit subseq -r-100:-1 sars1.aln|seqkit fx2tab|awk -F\\t '{print$2,$1}'|sed 's/, complete genome//'
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882959.1 SARS coronavirus MA15 ExoN1 isolate P3pp6
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882951.1 SARS coronavirus MA15 ExoN1 isolate P3pp3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882962.1 SARS coronavirus MA15 ExoN1 isolate P3pp10
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882942.1 SARS coronavirus MA15 ExoN1 isolate P3pp5
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882943.1 SARS coronavirus MA15 ExoN1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882957.1 SARS coronavirus MA15
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882952.1 SARS coronavirus MA15 isolate P3pp4
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882958.1 SARS coronavirus MA15 isolate P3pp7
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882948.1 SARS coronavirus MA15 isolate P3pp3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882945.1 SARS coronavirus MA15 isolate P3pp6
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- FJ882961.1 SARS coronavirus MA15 isolate P3pp5
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- MK062180.1 SARS coronavirus Urbani isolate icSARS-MA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaa-------------------------------------------------------- FJ882930.1 SARS coronavirus ExoN1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaa-------------------------------------------------------- FJ882926.1 SARS coronavirus ExoN1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- MK062179.1 SARS coronavirus Urbani isolate icSARS
tgtgtaaaattaattttagtagtgctatccccatgtgattttaa-------------------------------------------------------- FJ882938.1 SARS coronavirus wtic-MB
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- HQ890526.1 SARS coronavirus MA15 ExoN1 isolate d2ym1
tgtgtaaaattaattttagtag------------------------------------------------------------------------------ HQ890532.1 SARS coronavirus MA15 ExoN1 isolate d4ym2
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- HQ890538.1 SARS coronavirus MA15 ExoN1 isolate d2om5
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- HQ890535.1 SARS coronavirus MA15 ExoN1 isolate d2om2
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- HQ890529.1 SARS coronavirus MA15 ExoN1 isolate d2ym4
tgtgtaaaattaattttagtag------------------------------------------------------------------------------ HQ890531.1 SARS coronavirus MA15 ExoN1 isolate d4ym1
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JF292906.1 SARS coronavirus MA15 ExoN1 isolate d3om5
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JF292905.1 SARS coronavirus MA15 ExoN1 isolate d3om4
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- DQ497008.1 SARS coronavirus strain MA-15
tgtgtaaaattaattttagtag------------------------------------------------------------------------------ JF292903.1 SARS coronavirus MA15 ExoN1 isolate d4ym5
tgtgtaaaattaattttagtagtgctatccccatgtgattttaa-------------------------------------------------------- FJ882928.1 SARS coronavirus ExoN1 isolate P1pp1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaa-------------------------------------------------------- FJ882927.1 SARS coronavirus wtic-MB isolate P1pp1
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JF292909.1 SARS coronavirus MA15 isolate d2ym4
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JF292915.1 SARS coronavirus MA15 isolate d4ym5
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- HQ890541.1 SARS coronavirus MA15 isolate d2ym1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY278741.1 SARS coronavirus Urbani
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882953.1 SARS coronavirus MA15 ExoN1 isolate P3pp4
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882933.1 SARS coronavirus wtic-MB isolate P3pp6
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882935.1 SARS coronavirus wtic-MB isolate P3pp21
tgtgtaaaattaattttagtagtg---------------------------------------------------------------------------- FJ882934.1 SARS coronavirus wtic-MB isolate P3pp29
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882936.1 SARS coronavirus wtic-MB isolate P3pp2
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY323977.2 SARS coronavirus HSR 1
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882937.1 SARS coronavirus wtic-MB isolate P3pp18
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- DQ898174.1 SARS coronavirus strain CV7
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY291451.1 SARS coronavirus TW1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502928.1 SARS coronavirus TW5
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882949.1 SARS coronavirus wtic-MB isolate P3pp23
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882939.1 SARS coronavirus wtic-MB isolate P3pp16
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882932.1 SARS coronavirus wtic-MB isolate P3pp14
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- JQ316196.1 SARS coronavirus HKU-39849 isolate UOB
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- NC_004718.3 SARS coronavirus Tor2
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY427439.1 SARS coronavirus AS
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY283794.1 SARS coronavirus Sin2500
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaa------------------------------- EU371559.1 SARS coronavirus ZJ02
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502929.1 SARS coronavirus TW6
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502927.1 SARS coronavirus TW4
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502926.1 SARS coronavirus TW3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394998.1 SARS coronavirus LC1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY362699.1 SARS coronavirus TWC3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502931.1 SARS coronavirus TW8
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502930.1 SARS coronavirus TW7
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY282752.2 SARS coronavirus CUHK-Su10
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AP006557.1 SARS coronavirus TWH genomic RNA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY283796.1 SARS coronavirus Sin2679
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY350750.1 SARS coronavirus PUMC01
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY321118.1 SARS coronavirus TWC
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY345986.1 SARS coronavirus CUHK-AG01
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502932.1 SARS coronavirus TW9
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaa-------------------- AY485278.1 SARS coronavirus Sino3-11
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- JN854286.1 SARS coronavirus HKU-39849 isolate recSARS-CoV HKU-39849
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502923.1 SARS coronavirus TW10
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AP006561.1 SARS coronavirus TWY genomic RNA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AP006560.1 SARS coronavirus TWS genomic RNA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY283798.2 SARS coronavirus Sin2774
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882940.1 SARS coronavirus ExoN1 isolate P3pp37
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaaaaa-------- AY357075.1 SARS coronavirus PUMC02
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY864805.1 SARS coronavirus BJ162
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY291315.1 SARS coronavirus Frankfurt 1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394990.1 SARS coronavirus HZS2-E
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394989.1 SARS coronavirus HZS2-D
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaa---------------------- AY310120.1 SARS coronavirus FRA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY345988.1 SARS coronavirus CUHK-AG03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AP006559.1 SARS coronavirus TWK genomic RNA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgaca---------------------------------- AY559081.1 SARS coronavirus Sin842
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394991.1 SARS coronavirus HZS2-Fc
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaa------------------------------ AY559087.1 SARS coronavirus Sin3725V
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AP006558.1 SARS coronavirus TWJ genomic RNA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- MK062182.1 SARS coronavirus Urbani isolate icSARS-C3-MA
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgaca---------------------------------- AY559096.1 SARS coronavirus Sin850
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394993.1 SARS coronavirus HGZ8L2
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394992.1 SARS coronavirus HZS2-C
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaa-------------------- AY278491.2 SARS coronavirus HKU-39849
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY283797.1 SARS coronavirus Sin2748
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY278554.2 SARS coronavirus CUHK-W1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa AY357076.1 SARS coronavirus PUMC03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaa------------------ AY485277.1 SARS coronavirus Sino1-11
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882955.1 SARS coronavirus ExoN1 isolate P3pp19
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY864806.1 SARS coronavirus BJ202
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgaca---------------------------------- AY559088.1 SARS coronavirus SinP1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgaca---------------------------------- AY559092.1 SARS coronavirus SinP5
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882929.1 SARS coronavirus ExoN1 isolate P3pp1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaa------------------------ AY559084.1 SARS coronavirus Sin3765V
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882931.1 SARS coronavirus ExoN1 isolate P3pp12
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153547.1 Bat SARS coronavirus HKU3-12
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- DQ084199.1 bat SARS coronavirus HKU3-2
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- DQ022305.2 Bat SARS coronavirus HKU3-1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- DQ084200.1 bat SARS coronavirus HKU3-3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153540.1 Bat SARS coronavirus HKU3-5
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153539.1 Bat SARS coronavirus HKU3-4
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153541.1 Bat SARS coronavirus HKU3-6
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153542.1 Bat SARS coronavirus HKU3-7
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153546.1 Bat SARS coronavirus HKU3-11
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153544.1 Bat SARS coronavirus HKU3-9
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153545.1 Bat SARS coronavirus HKU3-10
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153548.1 Bat SARS coronavirus HKU3-13
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- GQ153543.1 Bat SARS coronavirus HKU3-8
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaa------------------- DQ412043.1 Bat SARS coronavirus Rm1
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacgaaaaaaaaaaaaaaaaaaa--------------- DQ071615.1 Bat SARS coronavirus Rp3
---------------------------------------------------------------------------------------------------- AY348314.1 SARS coronavirus Taiwan TC3
---------------------------------------------------------------------------------------------------- AY338174.1 SARS coronavirus Taiwan TC1
---------------------------------------------------------------------------------------------------- AY338175.1 SARS coronavirus Taiwan TC2
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY502924.1 SARS coronavirus TW11
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaa------------------- AY313906.1 SARS coronavirus GD69
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatcac----------------------------------- AY304486.1 SARS coronavirus SZ3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY395003.1 SARS coronavirus ZS-C
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394996.1 SARS coronavirus ZS-B
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaa------------------------------- AY390556.1 SARS coronavirus GZ02
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatc------------------------------------- AY515512.1 SARS coronavirus HC/SZ/61/03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394995.1 SARS coronavirus HSZ-Cc
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394994.1 SARS coronavirus HSZ-Bc
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394987.1 SARS coronavirus HZS2-Fb
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394986.1 SARS coronavirus HSZ-Cb
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394985.1 SARS coronavirus HSZ-Bb
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaa------------------ AY278490.3 SARS coronavirus BJ03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394983.1 SARS coronavirus HSZ2-A
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaa------------------ AY278488.2 SARS coronavirus BJ01
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JX163927.1 SARS coronavirus Tor2 isolate Tor2/FP1-10851
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JX163926.1 SARS coronavirus Tor2 isolate Tor2/FP1-10912
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JX163925.1 SARS coronavirus Tor2 isolate Tor2/FP1-10895
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JX163924.1 SARS coronavirus Tor2 isolate Tor2/FP1-10851
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JX163923.1 SARS coronavirus Tor2 isolate Tor2/FP1-10912
tgtgtaaaattaattttagtagtgctatccccatgtga-------------------------------------------------------------- FJ882963.1 SARS coronavirus P2
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaa------------------------------- EU371564.1 SARS coronavirus BJ182-12
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaa------------------------------- EU371563.1 SARS coronavirus BJ182-8
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaa------------------------------- EU371561.1 SARS coronavirus BJ182b
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaa------------------------------- EU371560.1 SARS coronavirus BJ182a
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY559095.1 SARS coronavirus Sin847
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatga------------------------------------ AY559086.1 SARS coronavirus Sin849
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaa--------------------------------- AY559085.1 SARS coronavirus Sin848
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaa----------------------------- AY559083.1 SARS coronavirus Sin3408
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaa---------------- AY654624.1 SARS coronavirus TJF
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- JX163928.1 SARS coronavirus Tor2 isolate Tor2/FP1-10895
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaa------------------------------ AY595412.1 SARS coronavirus LLJ-2004
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY395002.1 SARS coronavirus LC5
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY395000.1 SARS coronavirus LC3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaa------------------------------- EU371562.1 SARS coronavirus BJ182-4
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttctagg-aagaatga------------------------------------ AY559091.1 SARS coronavirus SinP4
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- MK062183.1 SARS coronavirus Urbani isolate icSARS-C7
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- MK062181.1 SARS coronavirus Urbani isolate icSARS-C3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaa----------------- AY278487.3 SARS coronavirus BJ02
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaa-------------- AY279354.2 SARS coronavirus BJ04
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttctaggagaaaatgaca---------------------------------- AY559090.1 SARS coronavirus SinP3
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttctta---------------------------------------------- AY304488.1 SARS coronavirus SZ16
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgaca---------------------------------- AY559093.1 SARS coronavirus Sin845
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgaca---------------------------------- AY559082.1 SARS coronavirus Sin852
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882947.1 SARS coronavirus wtic-MB isolate P3pp7
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882950.1 SARS coronavirus ExoN1 isolate P3pp60
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- FJ882944.1 SARS coronavirus ExoN1 isolate P3pp23
tgtgtaaaattaattttagta------------------------------------------------------------------------------- FJ882956.1 SARS coronavirus ExoN1 isolate P3pp53
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttctta---------------------------------------------- AY545914.1 SARS coronavirus isolate HC/SZ/79/03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatcacaaaaaaaaaaaaaaaa------------------- AY545918.1 SARS coronavirus isolate HC/GZ/32/03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaa------------------------------ AY545919.1 SARS coronavirus isolate CFB/SZ/94/03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatcacaaaaaaaaaaaaaaaaaaaaa-------------- AY545917.1 SARS coronavirus isolate HC/GZ/81/03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatcac----------------------------------- AY545916.1 SARS coronavirus isolate HC/SZ/266/03
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatc------------------------------------- AY572038.1 SARS coronavirus civet020
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY613948.1 SARS coronavirus PC4-13
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY613950.1 SARS coronavirus PC4-227
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY613949.1 SARS coronavirus PC4-136
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY613947.1 SARS coronavirus GZ0402
---------------------------------------------------------------------------------------------------- AY572034.1 SARS coronavirus civet007
---------------------------------------------------------------------------------------------------- AY686864.1 SARS coronavirus B039
---------------------------------------------------------------------------------------------------- AY572035.1 SARS coronavirus civet010
---------------------------------------------------------------------------------------------------- AY686863.1 SARS coronavirus A022
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaa------------------ AY278489.2 SARS coronavirus GD01
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatcacaaaaaaaa--------------------------- AY304495.1 SARS coronavirus GZ50
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gaga---------------------------------------- DQ182595.1 SARS coronavirus ZJ0301 from China
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- GU553363.1 SARS coronavirus HKU-39849 isolate TCVSP-HARROD-00001
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaa---------------------- DQ640652.1 SARS coronavirus GDH-BJH01
tgtgtaaaattaattttagtagtgctatccccatgtgatttta----cacattag-ggctcttccatatagg---------------------------- AY461660.1 SARS coronavirus SoD
tgtgtaaaattaattttagtagt----------------------------------------------------------------------------- GU553365.1 SARS coronavirus HKU-39849 isolate TCVSP-HARROD-00003
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY568539.1 SARS coronavirus GZ0401
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatcac----------------------------------- FJ959407.1 SARS coronavirus isolate A001
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394999.1 SARS coronavirus LC2
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- AY283795.1 SARS coronavirus Sin2677
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaa-------------------------------- AY394850.2 SARS coronavirus WHU
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgac----------------------------------- MT308984.1 Mutant SARS coronavirus Urbani clone SARS-Urbani-MA_SHC014-spike
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394978.1 SARS coronavirus GZ-B
tgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttag-gagaatgacaaaaaaaaaaaaaaaaaaaaaaaa----------- AY394979.1 SARS coronavirus GZ-C

There was only one sequence which didn't have any gaps at the end, but the number of gaps at the end ranged from 8 to 242 in the other sequences, where in over half of the cases the only difference was the length of the poly(A) tail.

From the tenth-last line of the output above, you can see that AY461660 (SARS coronavirus SoD) has a 24-base segment at the end which is not shared by any other sequence in my alignment (CACATTAGGGCTCTTCCATATAGG). However the 24-base segment turned out to be the reverse complement of another 24-base segment near the end of the genome which started 63 bases earlier:

$ seqkit seq -g sars1.aln|seqkit grep -nrp AY461660|seqkit locate -p cacattagggctcttccatatagg|awk '{$2=""}1'|column -t
seqID       pattern                   strand  start  end    matched
AY461660.1  cacattagggctcttccatatagg  +       29692  29715  cacattagggctcttccatatagg
AY461660.1  cacattagggctcttccatatagg  -       29629  29652  cacattagggctcttccatatagg
$ seqkit seq -g sars1.aln|seqkit grep -nrp AY461660|seqkit subseq -r29629:29652 # print the region of the reverse complement
>AY461660.1 SARS coronavirus SoD, complete genome
cctatatggaagagccctaatgtg

seqkit locate doesn't ignore gaps so I used seqkit seq -g to delete gaps. Another way to delete gaps is sed '/^>/!s/-//g'.

When I also looked at the first 100 bases of each sequence, I noticed another assembly error which was that in the beginning of AY559083 (SARS coronavirus Sin3408), there's two copies of a 29-base segment (GGAAAAGCCAACCAACCTCGATCTCTTGT), where one copy starts at position 22 and another copy starts at position 57:

$ seqkit seq -g sars1.aln|seqkit grep -nrp AY559083|seqkit locate -p ggaaaagccaaccaacctcgatctcttgt|awk '{$2=""}1'|column -t
seqID       pattern                        strand  start  end  matched
AY559083.1  ggaaaagccaaccaacctcgatctcttgt  +       22     50   ggaaaagccaaccaacctcgatctcttgt
AY559083.1  ggaaaagccaaccaacctcgatctcttgt  +       57     85   ggaaaagccaaccaacctcgatctcttgt

Return variant name and mutations of SARS2 samples with Nextclade CLI

See https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli.html.

$ brew install --cask miniconda;conda init ${SHELL##*/} # this didn't work without `--cask`
[...]
$ conda install -c bioconda nextclade
[...]
$ nextclade dataset get --name sars-cov-2 --output-dir sars-cov-2 # download files required to analyze SARS2 variants
[...]
$ nextclade run --input-dataset sars-cov-2 --output-all outdir sars-cov-2/sequences.fasta # `sequences.fasta` is a demo dataset of SARS2 sequences
[...]
$ awk -F\\t 'NR>1{print$2": "$8" ("$17" substitutions, "$18" insertions, "$19" deletions)"}' outdir/nextclade.tsv|tail -n3
USA/MA-CDCBI-CRSP_6WUXOVDJ3B5G2RNQ/2021: 21J (Delta) (32 substitutions, 4 insertions, 0 deletions)
USA/CA-CDC-FG-161774/2021: 21J (Delta) (34 substitutions, 4 insertions, 0 deletions)
OU819939: 21J (Delta) (31 substitutions, 6 insertions, 0 deletions)
$ head -n2 outdir/nextclade.tsv|csvtk -t transpose|column -s$'\t' -t # show the results of one sample in the demo dataset
index                                            1
seqName                                          OP971202
clade                                            23A
Nextclade_pango                                  XBB.1.5.13
partiallyAliased                                 XBB.1.5.13
clade_nextstrain                                 23A
clade_who                                        Omicron
clade_legacy                                     23A (Omicron)
qc.overallScore                                  0
qc.overallStatus                                 good
totalSubstitutions                               92
totalDeletions                                   30
totalInsertions                                  0
totalFrameShifts                                 0
totalMissing                                     0
totalNonACGTNs                                   0
totalAminoacidSubstitutions                      65
totalAminoacidDeletions                          13
totalAminoacidInsertions                         0
totalUnknownAa                                   0
alignmentScore                                   88550
alignmentStart                                   55
alignmentEnd                                     29733
coverage                                         0.992509112798047
isReverseComplement                              false
substitutions                                    C241T,A405G,T670G,C2790T,C3037T,G4184A,C4321T,C9344T,A9424G,C9534T,C9866T,C10029T,C10198T,G10447A,C10449A,C12880T,C14408T,G15451A,C15714T,C15738T,T15939C,T16342C,T17124C,C17410T,T17859C,A18163G,A19326G,C19955T,A20055G,C21618T,T21810C,T21880C,G21987A,C22000A,C22109G,T22200A,G22317T,G22577C,G22578A,G22599C,C22664A,C22674T,T22679C,C22686T,A22688G,G22775A,A22786C,G22813T,T22882G,G22895C,T22896C,G22898A,T22942G,G22992A,C22995A,A23013C,T23018C,T23019C,T23031C,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25416T,C25584T,C26060T,C26270T,A26275G,C26577G,G26709A,C26858T,A27259C,G27382C,A27383T,T27384C,C27807T,G27915T,G28079T,A28271T,C28311T,G28881A,G28882A,G28883C,A29510C
deletions                                        11288-11296,21633-21641,21992-21994,28362-28370
insertions
frameShifts
aaSubstitutions                                  E:T9I,E:T11A,M:Q19E,M:A63T,N:P13L,N:R203K,N:G204R,N:S413R,ORF1a:K47R,ORF1a:S135R,ORF1a:T842I,ORF1a:G1307S,ORF1a:L3027F,ORF1a:T3090I,ORF1a:L3201F,ORF1a:T3255I,ORF1a:P3395H,ORF1b:P314L,ORF1b:G662S,ORF1b:S959P,ORF1b:R1315C,ORF1b:I1566V,ORF1b:T2163I,ORF3a:T223I,ORF6:D61L,ORF8:G8*,ORF9b:P10S,S:T19I,S:A27S,S:V83A,S:G142D,S:H146Q,S:Q183E,S:V213E,S:G252V,S:G339H,S:R346T,S:L368I,S:S371F,S:S373P,S:S375F,S:T376A,S:D405N,S:R408S,S:K417N,S:N440K,S:V445P,S:G446S,S:N460K,S:S477N,S:T478K,S:E484A,S:F486P,S:F490S,S:Q498R,S:N501Y,S:Y505H,S:D614G,S:H655Y,S:N679K,S:P681H,S:N764K,S:D796Y,S:Q954H,S:N969K
aaDeletions                                      N:E31-,N:R32-,N:S33-,ORF1a:S3675-,ORF1a:G3676-,ORF1a:F3677-,ORF9b:E27-,ORF9b:N28-,ORF9b:A29-,S:L24-,S:P25-,S:P26-,S:Y144-
aaInsertions
privateNucMutations.reversionSubstitutions
privateNucMutations.labeledSubstitutions
privateNucMutations.unlabeledSubstitutions
privateNucMutations.totalReversionSubstitutions  0
privateNucMutations.totalLabeledSubstitutions    0
privateNucMutations.totalUnlabeledSubstitutions  0
privateNucMutations.totalPrivateSubstitutions    0
missing
unknownAaRanges
nonACGTNs
qc.missingData.missingDataThreshold              3000
qc.missingData.score                             0
qc.missingData.status                            good
qc.missingData.totalMissing                      0
qc.mixedSites.mixedSitesThreshold                10
qc.mixedSites.score                              0
qc.mixedSites.status                             good
qc.mixedSites.totalMixedSites                    0
qc.privateMutations.cutoff                       24
qc.privateMutations.excess                       -8
qc.privateMutations.score                        0
qc.privateMutations.status                       good
qc.privateMutations.total                        0
qc.snpClusters.clusteredSNPs
qc.snpClusters.score                             0
qc.snpClusters.status                            good
qc.snpClusters.totalSNPs                         0
qc.frameShifts.frameShifts
qc.frameShifts.totalFrameShifts                  0
qc.frameShifts.frameShiftsIgnored
qc.frameShifts.totalFrameShiftsIgnored           0
qc.frameShifts.score                             0
qc.frameShifts.status                            good
qc.stopCodons.stopCodons
qc.stopCodons.totalStopCodons                    0
qc.stopCodons.score                              0
qc.stopCodons.status                             good
totalPcrPrimerChanges                            7
pcrPrimerChanges                                 Charité_RdRp_F:G15451A,Charité_E_F:C26270T;A26275G,ChinaCDC_N_F:G28881A;G28882A;G28883C,USCDC_N1_P:C28311T
failedGenes
warnings
errors

Nextclade also produces a FASTA file where each sequence is aligned against Wuhan-Hu-1, and it additionally produces an aligned FASTA file for each protein. Below you can see that among the samples included in the demo dataset, the QTNSPRRA motif of the spike protein has changed to QTNSRRRA in most delta samples and to QTKSHRRA in most Omicron samples:

$ seqkit subseq -r $(seqkit locate -p QTNSPRRA outdir/nextclade_gene_S.translation.fasta|awk 'NR==2{print$5":"$6}') outdir/nextclade_gene_S.translation.fasta|grep -v \>|sort|uniq -c|sort -r
     37 QTNSPRRA
     32 QTKSHRRA
     27 QTNSRRRA
     18 QTNSHRRA
      2 XXXXXXXX
      1 QTNXXXXX
      1 QTNSPRXX
      1 QTNSPRRX
      1 QTKSHRRX
      1 HTNSPRXX
      1 HTNSPRRA
$ seqkit subseq -r $(seqkit locate -p QTNSPRRA outdir/nextclade_gene_S.translation.fasta|awk 'NR==2{print$5":"$6}') outdir/nextclade_gene_S.translation.fasta|seqkit grep -vsp X|seqkit fx2tab|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print$2" "a[$1]}' <(awk -F\\t 'NR>1{print$2"\t"$8}' outdir/nextclade.tsv) -|sort|uniq -c|sort -r
     18 QTNSRRRA 21J (Delta)
      7 QTNSHRRA 20I (Alpha, V1)
      5 QTKSHRRA 23A (Omicron)
      5 QTKSHRRA 21L (Omicron)
      4 QTNSPRRA 20A
      4 QTKSHRRA recombinant
      4 QTKSHRRA 22D (Omicron)
      4 QTKSHRRA 22B (Omicron)
      3 QTNSRRRA 21I (Delta)
      3 QTNSRRRA 21B (Kappa)
      3 QTNSRRRA 21A (Delta)
      3 QTNSPRRA 21G (Lambda)
      3 QTNSPRRA 21F (Iota)
      3 QTNSPRRA 20J (Gamma, V3)
      3 QTNSPRRA 20H (Beta, V2)
      3 QTNSPRRA 20G
      3 QTNSPRRA 20F
      3 QTNSPRRA 20E (EU1)
      3 QTNSPRRA 20D
      3 QTNSPRRA 19B
      3 QTNSHRRA 21H (Mu)
      3 QTNSHRRA 21E (Theta)
      3 QTNSHRRA 20B
      3 QTKSHRRA 21K (Omicron)
      2 QTNSPRRA 21C (Epsilon)
      2 QTNSPRRA 20C
      2 QTNSHRRA recombinant
      2 QTKSHRRA 22C (Omicron)
      2 QTKSHRRA 22A (Omicron)
      1 QTNSPRRA 21D (Eta)
      1 QTNSPRRA 19A
      1 QTKSHRRA 22F (Omicron)
      1 QTKSHRRA 22E (Omicron)
      1 QTKSHRRA 21M (Omicron)
      1 HTNSPRRA 21D (Eta)

Download a global subsample of SARS2 sequences from Nextstrain

The SARS2 data from GISAID is not publicly available from Nextstrain but the data from GenBank is. Nextstrain's website has links to FASTA files of global and regional subsamples from GenBank. [https://docs.nextstrain.org/projects/ncov/en/latest/reference/remote_inputs.html] Each file consists of about 3000-4000 sequences.

$ curl -s https://data.nextstrain.org/files/ncov/open/global/aligned.fasta.xz|gzip -dc>global.fa # macOS gzip supports decompressing XZ
$ curl -s https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz|gzip -dc>global.tsv
$ seqkit stat global.fa # all sequences have the same length because they are aligned against Wuhan-Hu-1
file       format  type  num_seqs     sum_len  min_len  avg_len  max_len
global.fa  FASTA   DNA      2,929  87,585,887   29,903   29,903   29,903
$ awk -F\\t 'NR>1{x=substr($7,1,4);a[x]+=$37;n[x]++}END{for(i in a)print i,n[i],a[i]/n[i]}' OFMT=%.1f global.tsv|sort # show the number of sequences from each year and the average number of nucleotide changes for each year
2019 1 0
2020 280 14.1
2021 346 34.8
2022 1791 75.5
2023 512 86.9
$ sed 1d global.tsv|cut -f8|sort|uniq -c|sort # number of samples per continent
     43 Oceania
    217 South America
    525 North America
    711 Africa
    715 Europe
    719 Asia
$ sed 1d global.tsv|cut -f20|sort|uniq -c|sort # number of samples per variant
      1
      1 21E (Theta)
      1 21M (Omicron)
      3 20F
      3 21F (Iota)
      5 21C (Epsilon)
      6 21H (Mu)
      7 20E (EU1)
      7 21A (Delta)
      7 21D (Eta)
      9 21G (Lambda)
     13 22C (Omicron)
     13 recombinant
     14 20G
     17 20D
     19 20H (Beta, V2)
     20 19B
     22 20J (Gamma, V3)
     23 21I (Delta)
     27 19A
     40 20C
     50 22A (Omicron)
     68 20I (Alpha, V1)
     69 21L (Omicron)
     80 21K (Omicron)
     93 20A
    103 21J (Delta)
    113 20B
    180 23A (Omicron)
    221 22D (Omicron)
    291 22F (Omicron)
    569 22E (Omicron)
    835 22B (Omicron)

Here's all metadata for one sample:

$ tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?"\n":FS),a[i][j]}' "FS=${1-$'\t'}";} # transpose
$ tab(){ awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf("%-"(b[i]+n)"s",a[h][i]);print""}}' "${1+FS=$1}" "n=${2-1}"|sed 's/ *$//';} # this is like `column -s` but it tolerates empty cells and prints one instead of two spaces between columns
$ sed 2q global.tsv|tp|tab \\t
strain                 DJI/NAMRU3_IT001/2021
virus                  ncov
gisaid_epi_isl         EPI_ISL_6693791
genbank_accession      OL640827
genbank_accession_rev  OL640827.1
sra_accession
date                   2021-10-12
region                 Africa
country                Djibouti
division               Djibouti
location
region_exposure        Africa
country_exposure       Djibouti
division_exposure      Djibouti
segment                genome
length                 29764
host                   Homo sapiens
age                    ?
sex                    ?
Nextstrain_clade       21J (Delta)
pango_lineage          AY.127.1
GISAID_clade           ?
originating_lab        ?
submitting_lab         Naval Medical Research Center, Biological Defense Research Directorate, Genomics and Bioinformatics Department
authors                Arnold et al
url                    https://www.ncbi.nlm.nih.gov/nuccore/OL640827
title
paper_url              ?
date_submitted         2021-11-26
date_updated           2021-11-26
sampling_strategy      ?
database               genbank
Nextclade_pango        AY.127.1
immune_escape          ?
ace2_binding           ?
missing_data           0
divergence             43
nonACGTN               0
coverage               0.9957863759489014
rare_mutations         2
reversion_mutations    0
potential_contaminants 0
QC_missing_data        good
QC_mixed_sites         good
QC_rare_mutations      good
QC_snp_clusters        good
QC_frame_shifts        good
QC_stop_codons         good
QC_overall_score       0
QC_overall_status      good
frame_shifts
deletions              22029-22034,28248-28253,28271
insertions
substitutions          G210T,C241T,G242A,C3037T,G4181T,T4634C,C6402T,T7049C,C7124T,T7816C,C8986T,G9053T,C10029T,A11201G,A11332G,C14408T,G15451A,C16466T,C19220T,A19445G,C21618G,C21846T,G21987A,T22917G,C22995A,A23403G,C23604G,G23608T,C24138A,G24410A,C25469T,G25538T,C26054A,T26767C,T27638C,C27752T,C27874T,G28378T,A28461G,G28881T,G28916T,G29402T,G29742T
aaSubstitutions        M:I82T,N:D63G,N:R203M,N:G215C,N:D377Y,ORF1a:A1306S,ORF1a:P2046L,ORF1a:Y2262H,ORF1a:P2287S,ORF1a:V2930L,ORF1a:T3255I,ORF1a:T3646A,ORF1b:P314L,ORF1b:G662S,ORF1b:P1000L,ORF1b:A1918V,ORF1b:K1993R,ORF3a:S26L,ORF3a:G49V,ORF3a:T221K,ORF7a:V82A,ORF7a:T120I,ORF7b:T40I,ORF9b:R32L,ORF9b:T60A,S:T19R,S:T95I,S:G142D,S:R158G,S:L452R,S:T478K,S:D614G,S:P681R,S:T859N,S:D950N
clock_deviation        1.00

I tried searching for samples dated 2023 with the lowest number of nucleotide changes from Wuhan-Hu-1. The first two samples below have only 3 nucleotide changes, but it's probably because they have an incorrect collection date at GenBank, because at GenBank the collection date of OQ244525 is also listed as January 3rd 2023:

$ awt '$7~/^2023/{print$37,$1,$4,$20,$7}' global.tsv|sort -n|head|tab \\t
3  OQ244525              OQ244525 19A           2023-01-03
3  OQ519925              OQ519925 19A           2023-02-02
47 OQ346308              OQ346308 22B (Omicron) 2023-01-11
51 OQ439757              OQ439757 22F (Omicron) 2023-01-27
52 OQ346309              OQ346309 recombinant   2023-01-11
57 THA/CONI-3970/2023    OQ585995 22F (Omicron) 2023-02-02
58 USA/OKPHL0025701/2023 OQ458506 23A (Omicron) 2023-01-23
61 KEN/SS11785/2023      OQ608726 22E (Omicron) 2023-01-19
62 JPN/SZ-NTC-166/2023   LC758541 22B (Omicron) 2023-02-19
62 OX416655              OX416655 22E (Omicron) 2023-01-10

There's three samples that are missing more than two thirds of all SNPs. It causes them to have only 12 nucleotide changes from Wuhan-Hu-1 even though they are Omicron samples:

$ copt(){ awk -F\\t 'NR==FNR{a[NR]=$0;next}{for(i in a)printf"%s"(i==length(a)?"\n":FS),$(a[i])}' <(printf %s\\n "$@") -; }
$ copt 36 1 20 37 <global.tsv|sort -rn|head|tab \\t
22930 Switzerland/ZH-UZH-IMV-3ba67d27/2022      21L (Omicron) 12
22930 Switzerland/ZH-UZH-IMV-3ba678a9/2022      21L (Omicron) 12
22930 Switzerland/ZH-UZH-IMV-3ba66eac/2022      21L (Omicron) 12
11898 OQ439756                                  22F (Omicron) 76
11437 CHN/CPL-12-43/2022                        22B (Omicron) 29
11275 BGD/Laila_489/2021                        21J (Delta)   25
9420  Switzerland/GR-ETHZ-150039/2020           20A           5
9109  OQ244528                                  22B (Omicron) 46
8945  PRY/WOR12903A608_202103275_2-4710519/2020 19B           7
8787  OQ692394                                  22F (Omicron) 64

This downloads NextStrain's subsamples for each continent, which you can concatenate to get a bigger global dataset:

u=https://data.nextstrain.org/files/ncov/open;for x in africa asia europe north-america oceania south-america;do curl $u/$x/aligned.fasta.xz>continent.$x.fa.xz;curl $u/$x/metadata.tsv.xz>continent.$x.tsv.xz;done

Make an MDS plot of a global subsample from Nextstrain

curl https://data.nextstrain.org/files/ncov/open/global/aligned.fasta.xz|gzip -dc>global.fa
curl https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz|gzip -dc>global.tsv
brew install brewsci/bio/snp-dists
snp-dists global.fa>global.dist
# `geom_mark_hull` draws the hulls incorrectly in new versions of `ggforce` so you need to install an old version of `ggforce`:
# `install.packages("https://cran.r-project.org/src/contrib/Archive/ggforce/ggforce_0.3.4.tar.gz")`

library(ggplot2)
library(ggrepel)

t=read.table("global.dist",header=T,row.names=1,sep="\t")
meta=read.table("global.tsv",header=T,row.names=1,sep="\t",quote="")
meta=meta[rownames(t),]

pick=meta[,35]<1000 # reject samples with 1000 or more missing SNPs
t=t[pick,pick]
meta=meta[pick,]
variant=meta$Nextstrain_clade

mds0=cmdscale(t,ncol(t)-1,eig=T) # `cmdscale(t,2,eig=T)` would be a lot faster but it doesn't return all eigenvalues so the proportion of explained variance cannot be calculated correctly
mds=as.data.frame(mds0$points)
eig=head(mds0$eig,ncol(mds))
pct=paste0("Dim  ",1:ncol(mds)," (",sprintf("%.1f",100*eig/sum(eig)),"%)")

centers=data.frame(aggregate(mds,list(variant),mean),row.names=1)
k=as.factor(variant)

set.seed(0)
color=as.factor(sample(length(unique(k))))
hue=c(0,30,60,100,170,210,240,300)
cl=list(c(70,60),c(70,40),c(30,70),c(30,30))
pal1=unlist(lapply(cl,\(x)hcl(hue+15,x[1],x[2])))
pal2=unlist(lapply(cl,\(x)hcl(hue+15,if(x[2]>=60).5*x[1]else .4*x[1],if(x[2]>=60).1*x[2]else 90)))

i=1 # change to `i=3` to plot dimensions 3 and 4
xdim=as.name(paste0("V",i))
ydim=as.name(paste0("V",i+1))

# draw a line between each point and its two closest neighbors
seg=cbind(mds[rep(1:nrow(mds),each=2),c(i,i+1)],mds[apply(t,1,\(x)order(x)[2:3]),c(i,i+1)])
colnames(seg)=paste0("V",1:4)

ggplot(mds,aes(!!xdim,!!ydim),group=0)+
ggforce::geom_mark_hull(aes(group=k),color=pal1[color[k]],fill=pal1[color[k]],concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray75",linewidth=.15)+
geom_point(color=pal1[color[k]],size=.5)+
labs(x=pct[i],y=pct[i+1])+
# geom_label(data=centers,aes(V1,V2,label=rownames(centers)),fill=pal1[color],color=pal2[color],alpha=.8,size=2.2,label.r=unit(.1,"lines"),label.padding=unit(.1,"lines"),label.size=.1)+ # allow overlapping labels
ggrepel::geom_label_repel(data=centers,aes(!!xdim,!!ydim),fill=pal1[color],color=pal2[color],label=rownames(centers),size=2.2,alpha=.8,label.r=unit(.1,"lines"),label.padding=unit(.1,"lines"),label.size=.1,box.padding=0.1,max.overlaps=Inf,point.size=0,segment.size=.2,force=3)+
scale_x_continuous(breaks=seq(-1000,1000,20),expand=expansion(mult=c(.04,.04)))+
scale_y_continuous(breaks=seq(-1000,1000,20),expand=expansion(mult=c(.04,.04)))+
theme(
  axis.ticks=element_line(linewidth=.4,color="gray80"),
  axis.ticks.length=unit(-.15,"cm"),
  axis.text=element_text(color="black",size=6),
  axis.text.x=element_text(margin=margin(.3,0,0,0,"lines")),
  axis.text.y=element_text(angle=90,vjust=1,hjust=.5,margin=margin(0,.4,0,0,"lines")),
  axis.title=element_text(size=8),
  legend.position="none",
  panel.background=element_rect(fill="white"),
  panel.border=element_rect(color="gray80",fill=NA,linewidth=.8),
  panel.grid=element_blank(),
  plot.background=element_rect(fill="white")
)

ggsave(paste0(i,".png"),width=7,height=7)

Make a FASTA file of the 1000 closest matches to the SARS2 spike protein on BLAST

I entered the accession number of Wuhan-Hu-1's spike protein to protein BLAST (QHD43416). [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins] I entered "SARS-CoV-2 (taxid:2697049)" in the organism field and clicked the "exclude" checkbox next to it. Then I clicked "Algorithm parameters" and I set "Max target sequences" to 1000. Then I clicked I clicked "BLAST" and I selected "FASTA (complete sequence)" from the "Download" menu.

The following code adds Wuhan-Hu-1's spike protein to the file I downloaded from BLAST, aligns the sequences, excludes sequences without a parent organism in the nuccore database, creates a TSV file for metadata about the sequences, and creates a new FASTA file with renamed sequences:

# download the SARS2 spike and align the spike proteins (`--reorder` groups similar proteins together instead of preserving the previous order)
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id=YP_009724390'>sars2.spike.fa
cat seqdump.txt sars2.spike.fa|mafft --reorder ->spike1k.aln

# `emu` stands for "Entrez multi"
# this uses `curl -d` to send the data via a POST request, because if you specify more than a few hundred accessions at a time as part of the URL, you get an error that the URL is too long
# `xml fo -D` deletes the DOCTYPE so xmlstarlet won't later try to fetch external entities
emu(){ curl -d id=$(paste -sd, -) "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=${1-nuccore}&retmode=${2-xml}${3+&rettype=$3}";}
sed -n 's/^>\([^ ]*\).*/\1/p' spike1k.aln|emu protein|xml fo -D>spike1k.xml

# the full organism title is not included in the XML metadata of the protein sequences, so you need to get the organism title from the parent nucleotide sequence
# 84 out of 1001 of the spike sequences were missing the parent accession number, but they consisted of 83 entries like "pdb: molecule 7QUR, chain A" and one contaminated sequence which had been deleted from nuccore
# my code also omitted 8 entries from the UniProt database whose `GBSeq_source-db` field started with text like "UniProtKB: locus SPIKE_BC133, accession Q0Q4F2"
xml sel -t -v //GBSeq_source-db spike1k.xml|sed -En 's/^(REFSEQ: )?accession //p'|emu|xml fo -D>spike1k.nuc.xml

# get TSV metadata from protein XML
<spike1k.xml xml sel -t -m //GBSeq -v GBSeq_accession-version -o $'\t' -v .//GBSeq_source-db -o $'\t' -v './/GBQualifier[GBQualifier_name="isolate"]/GBQualifier_value' -o $'\t' -v GBSeq_definition -o $'\t' -v GBSeq_length -o $'\t' -v GBSeq_create-date -o $'\t' -v './/GBQualifier[GBQualifier_name="collection_date"]/GBQualifier_value' -o $'\t' -v './/GBQualifier[GBQualifier_name="country"]/GBQualifier_value' -o $'\t' -v './/GBReference[last()]//GBAuthor[1]' -o ... -v './/GBReference[last()]//GBAuthor[last()]' -o $'\t' -v '(.//GBReference_title[text()!="Direct Submission"])[last()]' -n>temp.tsv

# get the title of the parent organism from the nucleotide XML and make a short version of the organism title
xml sel -t -v //GBSeq_definition spike1k.nuc.xml|sed -E 's/, (complete|partial|nearly complete) (gene?ome|cds|sequence)| whole genome shotgun sequence| (spike glycoprotein|spike glycoprotein precursor|s protein|surface glycoprotein).*| (spike|s)( protein)?( \(S\))? gene| spike protein gene| S protein \(S\) gene| [A-Za-z0-9] gene;.*|gene;.*| putative .*|[oO][rR][fF]1ab?[ ,].*|; and .*| S protein gene.*| spike protein .*| S gene .*| S[12] protein.*| polyprotein .*|( genomic)? RNA$| genomic RNA.*//g;s/ isolate / /;s/Middle East respiratory syndrome/MERS/;s/Severe acute respiratory syndrome/SARS/;s/ *$//'|paste <(xml sel -t -m //GBSeq -v GBSeq_accession-version -o $'\t' -v GBSeq_definition -n spike1k.nuc.xml) ->temp2.tsv

# generate the final TSV and FASTA files
redate(){ tr A-Z a-z|sed -f <(paste -d/ <(printf s/%s\\n jan feb mar apr may jun jul aug sep oct nov dec) <(printf %s\\n {01..12}/g))|sed -E 's/(..)-(..)-(....)/\3-\2-\1/g;s/(..)-([0-9]{4})/\2-\1/g';}
paste <(cut -f-5 temp.tsv|awk '{sub("^(REFSEQ: )?accession ","",$2)}1' {,O}FS=\\t) <(cut -f6-7 temp.tsv|redate) <(cut -f8- temp.tsv)|awk -F\\t 'NR==FNR{a[$1]=$2"\t"$3;next}$2 in a{print$0"\t"a[$2]}' temp2.tsv -|cat <(printf %s\\n protein_accession organism_accession protein_title isolate_name protein_length genbank_added_date collection_date collection_country oldest_authors oldest_reference organism_title organism_short_title|paste -s -) ->spike1k.tsv
awk -F\\t '{print$1"\t"$12"|"$4"|"$6"|"$7"|"$8"|"$2"|"$10}' spike1k.tsv|awk -F\\t 'NR==FNR{a[$1]=$0;next}$1 in a{print a[$1]"\t"$2}' - <(sed 's/ .*//' spike1k.aln|seqkit fx2tab)|sed $'s/\t/ /'|seqkit tab2fx>spike1k.fa

I uploaded the FASTA file generated by the code above here: https://drive.google.com/uc?export=download&id=1riMniCLMe4Aqg-hj4O6LXQaqvx2gPnHx. And I uploaded the TSV file here: https://drive.google.com/uc?export=download&id=1o79qDLcrbxAppnK15t7tjdn7vKU2H3dT.

Display a part of a protein alignment with colorized amino acids and sequence metadata

curl -Lso spike1k.fa 'https://drive.google.com/uc?export=download&id=1riMniCLMe4Aqg-hj4O6LXQaqvx2gPnHx'
curl -Lso spike1k.tsv 'https://drive.google.com/uc?export=download&id=1o79qDLcrbxAppnK15t7tjdn7vKU2H3dT'

# show the number of amino acid changes of the sequence whose ID is specified as an argument to each sequence from STDIN; ignore positions where either sequence has an X letter
dif2x(){ x=$(cat);awk 'NR==1{split($0,a,"");l=length;next}{split($0,b,"");n=0;for(i=1;i<=l;i++)if(a[i]!="X"&&b[i]!="X"&&a[i]!=b[i])n++;print n}' <(seqkit grep -p "$1"<<<"$x"|seqkit seq -s;seqkit seq -s<<<"$x");}

# colorize amino acids
cola(){ sed -f <(awk -F: '{print"s/"$1"/\033[38;2;0;0;0m\033[48;2;"$2";"$3";"$4"m"$1"\033[0m/g"}' <(printf %s\\n A:242:121:121 C:242:182:121 D:242:242:121 E:182:242:121 F:121:242:151 G:121:242:222 H:121:182:242 I:121:121:242 K:182:121:242 L:242:121:242 M:255:191:191 N:255:223:191 P:255:255:191 Q:223:255:191 R:191:255:207 S:191:255:244 T:191:223:255 V:191:191:255 W:223:191:255 Y:255:191:255 -:140:140:140)) "$@";}

# `tab \\t` is like `column -ts$'\t'` but it doesn't get thrown off by empty fields and you can change the number of spaces that get printed between columns
tab(){ awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf(i==m?"%s\n":"%-"(b[i]+n)"s",a[h][i])}}' "${1+FS=$1}" "n=${2-1}";}

f=spike1k.fa;beg=1830;end=1860;targ=YP_009825051.1;sub=$(seqkit subseq -r $beg:$end $f);paste <(dif2x $targ<<<"$sub") <(dif2x $targ<$f) <(seqkit seq -n $f)|tab \\t|cut -c-$[COLUMNS-(end-beg+2)]|paste -d\  <(seqkit seq -s<<<"$sub"|cola) -

From the output, you can see that the RRA part of the PRRA insert is included in sequences like "Murine coronavirus strain JHM-WU-Dns2", "Murine coronavirus MHV-3 UNICAMP", "Pipistrellus bat coronavirus HKU5 16S", "BtPa-BetaCoV/GD2013", "Bat coronavirus HKU5-2", "Bat coronavirus HKU5-3", and "Bat coronavirus (BtCoV/A434/2005)". The second and third columns show the number of amino acid changes from Wuhan-Hu-1, where the second column includes only the displayed region and the third column includes the whole genome:

Get a consensus sequence of raw reads which overlap with a given read

The code below downloads the first 100,000 reads from one of the two paired read files from a paper from 2020 titled "In situ structural analysis of SARS-CoV-2 spike reveals flexibility mediated by three hinges". It then makes a local BLAST database of the reads and takes the 100 reads which have the best match to the first read of the file, and it aligns the 100 reads and prints the most common nucleotide value at each position of the alignment:

$ brew install blast seqkit mafft emboss
$ curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR445/004/ERR4451134/ERR4451134_1.fastq.gz|seqkit head -n100000|seqkit fq2fa>temp.fa
$ makeblastdb -in temp.fa -dbtype nucl
$ seqkit head -n1 temp.fa|blastn -db temp.fa -outfmt '6 qseqid sseqid pident length bitscore qlen slen stitle qstart qend sstart send'|cut -f2|seqkit grep -f- temp.fa|mafft --quiet ->temp.aln
$ seqkit fx2tab temp.aln|head -n4 # show first 4 aligned reads
ERR4451134.1 M00758:969:CG8RD:1:1102:16602:1915/1   ------------------------ctcataataattagtaatatctctgctatagtaacctgaaagtcaacgagatgaaacatctgttttctcttactgtacaagcaaagcaatattgtcactgctactggtatggtctgtgtttaatttatagttgccaatcctgtagcgact------------------------
ERR4451134.353 M00758:969:CG8RD:1:1102:18404:2089/1 -----------------------cctcataataattagtaatatctctgctatagtaacctgaaagtcaacgagatgacacatctgttatcacttactgtacaagcaaagcactattttcactgctactggaatggtctgtgtttaatttatagttgccaatcctgtagcgac-------------------------
ERR4451134.487 M00758:969:CG8RD:1:1102:16023:2116/1 ---------------------ctcctcataataattagtaatatctctgctatagtaacctgaaagtcaacgagatgaaacatctgttgtcacttactgtacaagcaaagcaatattgtcactgctactggaatggtctgtgtttaatttatagttgccaatcctgtagcg---------------------------
ERR4451134.566 M00758:969:CG8RD:1:1102:15979:2131/1 ------------------------ctcataataattagtaatatctctgctatagtaacctgaaagtcaacgagatgaaacatctgttgtcacttactgtacaagcaaagcaatattgtcactgctactggaatggtctgtgtttaatttatagttgccaatcctgtagcgact------------------------
$ seqkit fx2tab temp.aln|cut -f2|awk -F '' '{for(i=1;i<=length;i++)a[i][$i]++}END{for(i in a){max=0;for(j in a[i])if(a[i][j]>max){max=a[i][j];o=j}printf"%s",o};print""}'|tee consensus
-----------------------cctcataataattagtaatatctctgctatagtaacctgaaagtcaacgagatgaaacatctgttgtcacttactgtacaagcaaagcaatattgtcactgctactggaatggtctgtgtttaatttatagttgccaatcctgtagcgac-------------------------
$ cons temp.aln -outseq /dev/stdout|seqkit seq -w0 # `cons` is from EMBOSS
Create a consensus sequence from a multiple alignment
>EMBOSS_001
nnnnnnnnnnnnnnnnnnnnnnnnCTCATAATAATTAGTAATATCTCTGCTATAGTAACCTGAAAGTCAACGAGATGAAACATCTGTTGTCACTTACTGTACAAGCAAAGCAATATTGTCACTGCTACTGGAATGGTCTGTGTTTAATTTATAGTTGCCAATCCTGTAGCGACTnnnnnnnnnnnnnnnnnnnnnnnn
$ Rscript -e 'writeLines(Biostrings::consensusString(Biostrings::readAAStringSet("temp.aln")))'
------------------------ctcataataattagtaatatctctgctatagtaacctgaaagtcaacgagatgaaacatctgttgtcacttactgtacaagcaaagcaatattgtcactgctactggaatggtctgtgtttaatttatagttgccaatcctgtagcgact------------------------
$ awk 'NR==1{split($0,a,"");l=length;next}{split($0,b,"");n=0;for(i=1;i<=l;i++)if(a[i]!="-"&&b[i]!="-"&&a[i]!=b[i])n++;print n}' consensus <(seqkit seq -s temp.aln)|sort|uniq -c # 415 reads have 0 nucleotide changes from consensus, 61 reads have 1 change, etc.
    414 0
     61 1
     11 2
      5 3
      3 4
      2 5
      2 6
      1 7
      1 8

Calculate percentage identity with awk

This compares the sequence whose ID is specified as an argument against every sequence in the FASTA file from STDIN:

$ pid1(){ x=$(cat);awk 'NR==1{split($0,a,"");l=length;next}{split($0,b,"");d=0;n=0;for(i=1;i<=l;i++)if(a[i]!="-"&&b[i]!="-"){n++;if(a[i]!=b[i])d++}print 100*(1-d/n)}' <(seqkit grep -p "$1"<<<"$x"|seqkit seq -s;seqkit seq -s<<<"$x");}
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_045512,NC_004718,MN996532,MG772933,MZ937000,MZ190137,KY352407'|mafft --quiet --reorder ->temp.fa
$ pid1 NC_006577.1 <temp.fa|paste - <(seqkit seq -n temp.fa|sed 's/, complete genome//')
100 NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1
96.967  MZ937000.1 Bat coronavirus isolate BANAL-20-52/Laos/2020
96.1983 MN996532.2 Bat coronavirus RaTG13
88.0544 MG772933.1 Bat SARS-like coronavirus isolate bat-SL-CoVZC45
79.8987 NC_004718.3 SARS coronavirus Tor2
76.5423 MZ190137.1 Bat SARS-like coronavirus Khosta-1 strain BtCoV/Khosta-1/Rh/Russia/2020
76.6389 KY352407.1 Severe acute respiratory syndrome-related coronavirus strain BtKY72

This compares each 2-combination of sequences and sorts the results based on percentage identity:

$ seqkit fx2tab temp.fa|sed 's/, complete genome//;s/.* //'|awk -F\\t '{split($2,x,"");l=length($2);for(i=1;i<=l;i++)a[NR][i]=x[i];name[NR]=$1;seqs=NR}END{for(i=1;i<=seqs-1;i++)for(j=i+1;j<=seqs;j++){d=0;n=0;for(k=1;k<=l;k++)if(a[i][k]!="-"&&a[j][k]!="-"){n++;if(a[i][k]!=a[j][k])d++}print 100*(1-d/n)"\t"name[i]"\t"name[j]}}'|sort -rn|column -ts$'\t'
96.967   Wuhan-Hu-1                     BANAL-20-52/Laos/2020
96.1983  Wuhan-Hu-1                     RaTG13
96.0374  BANAL-20-52/Laos/2020          RaTG13
88.0656  BANAL-20-52/Laos/2020          bat-SL-CoVZC45
88.0544  Wuhan-Hu-1                     bat-SL-CoVZC45
87.9844  RaTG13                         bat-SL-CoVZC45
81.7179  BtCoV/Khosta-1/Rh/Russia/2020  BtKY72
81.2542  bat-SL-CoVZC45                 Tor2
79.9074  BANAL-20-52/Laos/2020          Tor2
79.8987  Wuhan-Hu-1                     Tor2
79.7034  RaTG13                         Tor2
78.8248  Tor2                           BtCoV/Khosta-1/Rh/Russia/2020
77.9839  Tor2                           BtKY72
77.1394  bat-SL-CoVZC45                 BtCoV/Khosta-1/Rh/Russia/2020
76.8069  bat-SL-CoVZC45                 BtKY72
76.6909  BANAL-20-52/Laos/2020          BtKY72
76.6389  Wuhan-Hu-1                     BtKY72
76.6371  BANAL-20-52/Laos/2020          BtCoV/Khosta-1/Rh/Russia/2020
76.5523  RaTG13                         BtKY72
76.5423  Wuhan-Hu-1                     BtCoV/Khosta-1/Rh/Russia/2020
76.4696  RaTG13                         BtCoV/Khosta-1/Rh/Russia/2020

Calculate percentage identity with C++

#include <iostream>
#include <vector>
#include <string>
#include <cctype>

using namespace std;

int main(int argc,char**argv){

  string id,buf;
  vector<string>seq,name;

  for(string line;getline(cin,line);){
    if(line.empty())continue;
    if(line[0]=='>'){
      if(!id.empty()){name.push_back(id);transform(buf.begin(),buf.end(),buf.begin(),::tolower);seq.push_back(buf);}
      id=line.substr(1);
      buf.clear();
    }else buf+=line;
  }
  if(!id.empty()){name.push_back(id);transform(buf.begin(),buf.end(),buf.begin(),::tolower);seq.push_back(buf);}

  int seqs=seq.size(),positions=seq[0].size();

  vector<vector<int> >diffs(seqs,vector<int>(seqs));
  vector<vector<int> >totals(seqs,vector<int>(seqs));

  int ind=0;
  for(int i=0;i<seqs-1;i++){
    for(int j=i;j<seqs;j++){
      int diff=0,total=0;
      for(int k=0;k<positions;k++)
        if(seq[i][k]!='-'&&seq[j][k]!='-'){total++;if(seq[i][k]!=seq[j][k])diff++;}
      diffs[i][j]=diff;
      totals[i][j]=total;
    }
  }

  for(int i=0;i<seqs;i++)cout<<"\t"<<name[i];
  cout<<endl;
  for(int i=0;i<seqs;i++){
    cout<<name[i];
    for(int j=0;j<seqs;j++){
      int x=i<j?i:j,y=i<j?j:i;
      int diff=diffs[x][y],total=i==j?1:totals[x][y];
      printf("\t%.4f",100*(1-(float)diff/total));
    }
    cout<<endl;
  }
}
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_019843,NC_039207,NC_014470,NC_006577'|mafft --quiet --reorder ->temp.fa
$ tab(){ awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf(i==m?"%s\n":"%-"(b[i]+n)"s",a[h][i])}}' "${1+FS=$1}" "n=${2-1}";}
$ g++ -O3 -o pid pid.cpp
$ sed 's/, complete genome//;s/>.* />/' temp.fa|./pid|tab \\t
                      Wuhan-Hu-1 BANAL-20-52/Laos/2020 RaTG13   bat-SL-CoVZC45 bat-SL-CoVZXC21 Tor2
Wuhan-Hu-1            100.0000   96.9669               96.1983  88.0342        88.0136         79.9149
BANAL-20-52/Laos/2020 96.9669    100.0000              96.0374  88.0455        88.0296         79.9270
RaTG13                96.1983    96.0374               100.0000 87.9676        87.9973         79.7196
bat-SL-CoVZC45        88.0342    88.0455               87.9676  100.0000       97.4706         81.2809
bat-SL-CoVZXC21       88.0136    88.0296               87.9973  97.4706        100.0000        81.3978
Tor2                  79.9149    79.9270               79.7196  81.2809        81.3978         100.0000

The code ignores positions where either sequence has a gap, but if you also want to ignore N bases and degenerate bases, you can replace them with gaps: seqkit replace -isp '[^acgt]' -r -.

Print closest neighbors of sequences as plain text

# download a FASTA file for the thousand closest matches to the spike protein of Wuhan-Hu-1 on BLAST with the taxid of SARS2 excluded
curl -Lso spike1k.fa 'https://drive.google.com/uc?export=download&id=1riMniCLMe4Aqg-hj4O6LXQaqvx2gPnHx'
curl -Lso spike1k.tsv 'https://drive.google.com/uc?export=download&id=1o79qDLcrbxAppnK15t7tjdn7vKU2H3dT'

# omit sequences with too many X letters or gaps and realign the sequences
seqkit seq -s spike1k.fa|tr -d X-|awk '{print length}'|paste - <(seqkit seq -ni spike1k.fa)|awk '$1>=1200'|cut -f2|seqkit grep -f- spike1k.fa|mafft --reorder ->spike1k.aln

# `-a` counts all letters and not just ACGT
snp-dists -a spike1k.aln>spike1k.dist
t=read.table("spike1k.dist",sep="\t",r=1,head=T,quote="",comment.char="",check.names=F)
t2=read.table("spike1k.tsv",sep="\t",r=1,head=T,quote="",comment.char="",check.names=F)

name=apply(t2[rownames(t),c(11,5,6,7)],1,paste,collapse="|")

pick=!grepl("Synthetic|Recombinant|Cloning|Expression",name)
t=t[pick,pick]
name=name[pick]
colnames(t)=name

pick2=grepl("BANAL",name)
r=apply(t[pick2,],1,\(x){h=sort(x)[2:5];paste("  ",h,names(h),collapse="\n")})
writeLines(paste0(name[pick2],"\n",r))

From the output you can see that BANAL-52 is similar to SARS2, BANAL-103 and -236 are similar to Chinese pangolin viruses, and BANAL-116 and -247 are similar to the Thai bat coronavirus RacCS203:

Bat coronavirus BANAL-20-52/Laos/2020|2021-09-19|2020-07-05|Laos
   20 SARS coronavirus 2 Wuhan-Hu-1|2020-01-13|2019-12|China
   28 Bat coronavirus RaTG13|2020-01-29|2013-07-24|China
   29 Bat SARS-like virus BtSY2|2023-01-24|2018|China: Yunnan
   91 Pangolin coronavirus GX_P2V|2022-02-01||
Bat coronavirus BANAL-20-103/Laos/2020|2021-09-19|2020-07-07|Laos
   5 Bat coronavirus BANAL-20-236/Laos/2020|2021-09-19|2020-07-10|Laos
   35 Pangolin coronavirus cDNA8-S|2020-08-01|2019|China
   35 Pangolin coronavirus cDNA18-S|2020-08-01|2019|China
   36 Pangolin coronavirus cDNA16-S|2020-08-01|2019|China
Bat coronavirus BANAL-20-236/Laos/2020|2021-09-19|2020-07-10|Laos
   5 Bat coronavirus BANAL-20-103/Laos/2020|2021-09-19|2020-07-07|Laos
   37 Pangolin coronavirus cDNA8-S|2020-08-01|2019|China
   37 Pangolin coronavirus cDNA18-S|2020-08-01|2019|China
   38 Pangolin coronavirus cDNA16-S|2020-08-01|2019|China
Bat coronavirus BANAL-20-247/Laos/2020|2021-09-19|2020-07-10|Laos
   4 Bat coronavirus BANAL-20-116/Laos/2020|2021-09-19|2020-07-07|Laos
   25 Bat coronavirus RacCS203|2021-01-13|2020-06-19|Thailand
   210 Sarbecovirus sp. HN2021C|2021-09-22|2021-02|China: Hunan
   215 Bat SARS-like coronavirus Rf4092|2017-11-01|2012-09-18|China
Bat coronavirus BANAL-20-116/Laos/2020|2021-09-19|2020-07-07|Laos
   4 Bat coronavirus BANAL-20-247/Laos/2020|2021-09-19|2020-07-10|Laos
   26 Bat coronavirus RacCS203|2021-01-13|2020-06-19|Thailand
   211 Sarbecovirus sp. HN2021C|2021-09-22|2021-02|China: Hunan
   216 Bat SARS-like coronavirus Rf4092|2017-11-01|2012-09-18|China

Make a simplot with ggplot2

library(ggplot2)

t=as.matrix(Biostrings::readAAStringSet("https://pastebin.com/raw/h0nHf55b"))

targ="ZC45"
step=100
window=500
xbreak=2000
ybreak=10

t1=t[rownames(t)!=targ,]
t2=t[targ,]
ncol=ncol(t)
starts=seq(1,ncol,step)

sim=apply(t1,1,\(seq)sapply(starts,\(i){
  start=max(1,i-window%/%2)
  end=min(ncol,i+window%/%2)
  mean(seq[start:end]==t2[start:end])
}))

long=data.frame(V1=starts[row(sim)],V2=colnames(sim)[col(sim)],V3=100*c(sim))
group=factor(long$V2,levels=unique(long$V2))

hue=c(0,210,100,60,300,160,30,250)+15
color=c("black",hcl(hue,60,70),"gray50",hcl(hue,50,40))

rmu=\(x,y)x+(y-x%%y)%%y
xend=rmu(max(long$V1-1),xbreak)+1
ystart=rmu(min(long$V3),-ybreak)
yend=rmu(max(long$V3),ybreak)

ggplot(long,aes(x=V1,y=V3))+
geom_line(aes(color=group),linewidth=.3)+
geom_vline(xintercept=ncol,color="gray90",linewidth=.3,linetype="dashed")+
ggtitle(paste0("Similarity to ",targ," (step ",step,", window ",window,")"))+
scale_x_continuous(limits=c(1,xend),breaks=seq(1,xend,xbreak),expand=expansion(mult=c(.005,.005)))+
scale_y_continuous(limits=c(ystart,yend),breaks=seq(ystart,yend,ybreak),expand=expansion(mult=c(.012,.012)))+
scale_color_manual(values=color)+
guides(color=guide_legend(ncol=4,byrow=T))+
theme(
  axis.text=element_text(size=6,color="black"),
  axis.text.x=element_text(angle=90,vjust=.5,hjust=1),
  axis.ticks=element_blank(),
  axis.ticks.length=unit(0,"cm"),
  axis.title=element_blank(),
  legend.background=element_blank(),
  legend.box.margin=margin(0),
  legend.box.spacing=unit(0,"in"),
  legend.key=element_rect(fill=NA),
  legend.margin=margin(-4,0,-1.5,0),
  legend.position="top",
  ## legend.position=c(.7,.2),
  legend.spacing.x=unit(.03,"in"),
  legend.spacing.y=unit(0,"in"),
  legend.text=element_text(size=7,vjust=.5),
  legend.title=element_blank(),
  panel.background=element_rect(fill="white"),
  panel.grid.major=element_line(linewidth=.2,color="gray90"),
  panel.grid.minor=element_blank(),
  plot.background=element_rect(fill="white"),
  plot.title=element_text(size=9,color="black")
)

ggsave("1.png",width=4.5,height=3)

There's also a JavaScript utility for creating simplots: http://babarlelephant.free-hoster.net/simplotfirstN.html.

Make a simplot with gene annotations

library(ggplot2)

t=as.matrix(Biostrings::readAAStringSet("input.fa"))

gene=read.csv(head=F,text="ORF1a,266,13468
ORF1b,13468,21555
S,21563,25384
ORF3a,25393,26220
E,26245,26472
M,26523,27191
ORF6,27202,27387
ORF7a,27394,27759
ORF8,27894,28259
N,28274,29533
ORF10,29558,29674")

ref=t[grepl("Wuhan-Hu-1",rownames(t)),]
pos=seq_along(ref)-cumsum(ref=="-")
gene[,2]=sapply(gene[,2],\(x)which(pos==x)[1])
gene[,3]=sapply(gene[,3],\(x)which(pos==x)[1])

genebars=c(gene[,2],gene[nrow(gene),2])

rownames(t)=sub("(,| ORF| surface ).*","",rownames(t))
rownames(t)=sub("[^ ]* ","",rownames(t))
rownames(t)=sub(" isolate "," ",rownames(t))
rownames(t)=sub("Severe acute respiratory syndrome coronavirus 2 ","",rownames(t))

targ=rownames(t)[grep("Tor2",rownames(t))]

window=500
step=100

t1=t[rownames(t)!=targ,]
t2=t[targ,]
ncol=ncol(t)
starts=seq(1,ncol,step)

sim=apply(t1,1,\(seq)sapply(starts,\(i){
  start=max(1,i-window%/%2)
  end=min(ncol,i+window%/%2)
  mean((seq[start:end]==t2[start:end])[seq[start:end]!="-"&t2[start:end]!="-"])
}))

xy=data.frame(x=starts[row(sim)],y=100*c(sim),group=colnames(sim)[col(sim)])
xy$group=factor(xy$group,unique(xy$group))

hue=c(0,210,100,60,300,160,30,250)+15
color=c("black",hcl(hue,80,70),"gray50",hcl(hue,60,40))

color=setNames(color[1:nrow(t)],rownames(t))[levels(xy$group)]

candidates=c(sapply(c(1,2,5),\(x)x*10^c(-10:10)))
xbreak=candidates[which.min(abs(candidates-max(xy$x)/10))]
ybreak=candidates[which.min(abs(candidates-(max(xy$y)-min(xy$y))/7))]
xend=max(xy$x)
ystart=ybreak*floor(min(xy$y,na.rm=T)/ybreak)
yend=ybreak*ceiling(max(xy$y,na.rm=T)/ybreak)

legend=data.frame(x=.02*xend,y=.02*(yend-ystart)+rev(seq(ystart+(yend-ystart)*.05,by=(yend-ystart)/24,length.out=nlevels(xy$group))),label=levels(xy$group))

label=data.frame(label=gene[,1],x=rowMeans(gene[,-1]),y=ystart+(yend-ystart)*(c(1,1,1,1,2,3,1,2,3,1,2)-.2)/32)

ggplot(xy)+
# geom_rect(data=gene[c(T,F),],aes(xmin=V2,xmax=V3,ymin=ystart,ymax=yend),fill="gray90")+
geom_vline(xintercept=genebars,color="gray80",linewidth=.25)+
geom_vline(xintercept=c(0,xend),color="gray80",linewidth=.25)+
geom_line(aes(x,y,color=group),linewidth=.25)+
geom_label(data=legend,aes(x,y,label=label),color=color,size=2.4,hjust=0,fill=alpha("white",.7),label.r=unit(0,"lines"),label.padding=unit(.04,"lines"),label.size=0)+
geom_label(data=label,aes(x=x,y=y,label=label),fill=alpha("white",.7),label.r=unit(0,"lines"),label.padding=unit(.04,"lines"),label.size=0,size=1.8,hjust=.5)+
ggtitle(paste0("Similarity to ",targ," (step ",step,", window ",window,")"))+
coord_cartesian(clip="off")+
scale_x_continuous(limits=c(0,xend),breaks=seq(0,xend,xbreak),expand=c(0,0),labels=\(x)ifelse(x>1e3,paste0(x/1e3,"k"),x))+
# scale_x_continuous(limits=c(0,xend),breaks=seq(0,xend,xbreak),expand=c(0,0))+
scale_y_continuous(limits=c(ystart,yend),breaks=seq(ystart,yend,ybreak),expand=c(0,0))+
scale_color_manual(values=color)+
theme(
  axis.text=element_text(size=6,color="black"),
  axis.ticks=element_line(linewidth=.25,color="gray80"),
  axis.ticks.length=unit(.2,"lines"),
  axis.title=element_blank(),
  legend.position="none",
  panel.background=element_rect(fill="white"),
  panel.grid.major=element_blank(),
  panel.grid.major.y=element_line(linewidth=.25,color="gray80"),
  plot.background=element_rect(fill="white"),
  plot.margin=margin(.5,.6,.5,.6,"lines"),
  plot.title=element_text(size=8,color="black")
)

ggsave("1.png",width=5,height=3.5)

Rp22DB159 is one of the new Vietnamese sequences from Hassanin et al. 2023. It's a recombinant virus where some regions are similar to SARS2-like viruses and other regions are similar to SARS1-like viruses, like in ZC45, BtSY2, PrC31, and YN2021. The plot above shows that in Rp22DB159 the S1 is similar to BtSY2 but the ORF1b is similar to PrC31.

Download sarbecovirus sequences from GenBank and make a percentage identity matrix

curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh|sh
esearch -db nuccore -query '(viruses[filter] AND sarbecovirus NOT txid2697049) OR NC_045512.2 OR OM240725 OR MZ328294 OR (Bat coronavirus isolate BANAL-20)'|efetch -format fasta|seqkit seq -m 25000 >sarbe0.fa
seqkit grep -nrvp '^[^ ]* [^ ]* [^ ]*:|Patent|MODIFIED SMALL INTERFERING|Modified Microbial|A sensitive and specific test|NOVEL STRAIN|IMAGEABLE|isolate H_SC_|AY274119.3|Tor2 isolate' sarbe0.fa|mafft --thread 4 --reorder ->sarbe.fa
curl https://pastebin.com/raw/pDwYNf1r>pid.cpp;g++ pid.cpp -O3 -o pid

# ignore sequences with 1000 or more N letters or degenerate bases
seqkit seq -s sarbe.fa|tr -d ACGT-|paste - <(seqkit seq -ni sarbe.fa)|awk 'length($1)>1000{print$2}'>sarbe.ignore

# matrix sort grep (sort by first column matching regex and print the value of the column with rownames)
masog()(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~x)break;next}{print$i,$1}' x="$1" "${2--}"|sort -n)

# ignore other SARS1 sequences except Tor2
masog Tor2, sarbe.pid|awk '$1<100&&$1>98{print$2}'>>sarbe.ignore

# `pid` ignores positions where either sequence has a gap, so N letters and degenerate bases are replaced with gaps
seqkit grep -vf sarbe.ignore sarbe.fa|seqkit replace -isp '[^ACGT-]' -r -|./pid>sarbe.pid

I uploaded the output files here: f/sarbe.fa.xz, f/sarbe.pid.

Fill in N bases from the consensus sequence of related viruses

The sequence of the Guangdong pangolin virus GD/P2S is available from GISAID but not GenBank. The sequence has about 30% N bases, but if you ignore positions with an N letter, degenerate base, or a gap, then GD/P2S has about 99.6-99.7% identity with MP789 and GD/P79-9 and about 99.1% identity with GD/P44-9. Therefore in order to fill in the N letters and degenerate bases, you can replace them with the consensus sequence of other pangolin viruses:

$ curl https://pastebin.com/raw/AuXrLcLx|tr -d \\r>gdp2s.fa
$ seqkit subseq -r1:200 gdp2s.fa
>hCoV-19/pangolin/Guangdong/P2S/2019
TTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAA
TAACTAATTACTGKCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCATACCTAGGTTTCGTCCGGGTGTGACCG
AAAGGTAAGATGGAGAGCCT
$ curl https://sars2.net/f/sarbe.fa.xz|xz -dc>sarbe.fa
$ seqkit grep -nrp 'GD/P44-9|GD/P44-9|MT121216' sarbe.fa|cat gdp2s.fa -|mafft --thread 4 ->pang.fa
[...]
$ seqkit replace -isp '[^acgtn-]' -r n pang.fa|seqkit seq -s|awk -F '' '{for(i=1;i<=NF;i++)if($i!="n")a[i][$i]++}END{for(i in a){max=0;for(j in a[i])if(a[i][j]>max){max=a[i][j];o=j}printf"%s",o};print""}'|cat - <(seqkit seq -s pang.fa|sed 1q)|awk -F '' 'NR==1{split($0,a,"");next}{for(i=1;i<=NF;i++)printf"%s",$i=="n"?a[i]:$i}'|(echo \>filled_in;tr -d -)|seqkit subseq -r1:200
>filled_in
ttaaaatctgtgtggctgtcactcggctgcatgcttagtgcactcacgcagtataattaa
taactaattactgkcgttgacaggacacgagtaactcgtctatcttctgctgcttagact
cggtggagatattgttagaactaatatatcgcatacctaggtttcgtccgggtgtgaccg
aaaggtaagatggagagcct

Plot average nucleotide changes from Wuhan-Hu-1 by clade and month

This downloads metadata about a global subsample of about 14,000 SARS2 sequences from NextStrain:

curl https://data.nextstrain.org/files/ncov/open/{africa,asia,europe,north-america,oceania,south-america}/metadata.tsv.xz|gzip -dc|awk 'NR==1||$1!="strain"'>continent.tsv
library(ggplot2)

meta=read.table("continent.tsv",header=T,sep="\t",quote="")

pick=meta$divergence<200&meta$length>29e3&meta$Nextstrain_clade!=""&meta$Nextstrain_clade!="recombinant"
pick=pick&!(grepl("Omicron",meta$Nextstrain_clade)&as.Date(meta$date)<as.Date("2021-10-1"))
pick=pick&!(grepl("19B",meta$Nextstrain_clade)&as.Date(meta$date)>as.Date("2021-10-1"))
pick=pick&!(grepl("19A",meta$Nextstrain_clade)&as.Date(meta$date)>as.Date("2021-10-1"))
pick=pick&!(grepl("20A",meta$Nextstrain_clade)&as.Date(meta$date)>as.Date("2021-08-1"))
pick=pick&!(grepl("20G",meta$Nextstrain_clade)&as.Date(meta$date)<as.Date("2020-07-1"))
pick=pick&!(grepl("20B",meta$Nextstrain_clade)&as.Date(meta$date)>as.Date("2021-10-1"))
pick=pick&!(grepl("21H",meta$Nextstrain_clade)&as.Date(meta$date)<as.Date("2020-10-1"))
pick=pick&!(grepl("21J",meta$Nextstrain_clade)&as.Date(meta$date)<as.Date("2020-10-1"))
pick=pick&!(grepl("Gamma",meta$Nextstrain_clade)&as.Date(meta$date)<as.Date("2020-12-1"))
meta=meta[pick,]

month=tapply(meta$divergence,sub("-..$","",meta$date),mean)
xy=data.frame(x=as.Date(paste0(names(month),"-1")),y=month)

t=aggregate(meta$divergence,list(sub("-..$","",meta$date),meta$Nextstrain_clade),mean)
count=aggregate(meta$divergence,list(sub("-..$","",meta$date),meta$Nextstrain_clade),length)
t=t[count[,3]>1,]

colnames(t)=paste0("V",1:3)
t[,1]=as.Date(paste0(t[,1],"-1"))
group=factor(t[,2],unique(t[,2]))

hue=c(0,30,60,90,150,210,250,310)+15
color=rep(c(hcl(hue,60,80),hcl(hue,100,40)),3)

xstart=as.Date("2019-12-1")
xend=as.Date("2023-08-01")
xbreak=seq(xstart,xend,"2 months")
ystart=0
yend=120
ybreak=seq(ystart,yend,10)

ggplot(t,aes(x=V1,y=V3,color=V2,fill=V2,linetype=V2))+
geom_vline(xintercept=c(xstart,xend),color="gray75",linewidth=.35,lineend="square")+
geom_hline(yintercept=c(ystart,yend),color="gray75",linewidth=.35,lineend="square")+
geom_point(size=.5)+
geom_line(linewidth=.5)+
labs(title="Monthly average divergence from Wuhan-Hu-1 by strain among 14,358 samples from NextStrain",x=NULL,y="Nucleotide changes from Wuhan-Hu-1")+
scale_x_date(limits=c(xstart,xend),breaks=xbreak,expand=expansion(0),date_labels="%b %y")+
scale_y_continuous(limits=c(ystart,yend),breaks=ybreak,expand=expansion(0))+
scale_color_manual(values=color)+
scale_fill_manual(values=color)+
scale_linetype_manual(values=rep(c(1,2,5),each=16))+
coord_cartesian(clip="off")+
guides(color=guide_legend(nrow=9))+
theme(
  axis.text.x=element_text(angle=90,vjust=.5,hjust=1),
  axis.text=element_text(size=7,color="black"),
  axis.ticks=element_line(linewidth=.35,color="gray75"),
  axis.ticks.length=unit(.2,"lines"),
  axis.title=element_text(size=8),
  legend.background=element_blank(),
  legend.box.just="center",
  legend.box.margin=margin(0,unit="cm"),
  legend.box.spacing=unit(0,"in"),
  legend.direction="vertical",
  legend.justification="left",
  legend.key=element_rect(fill=NA),
  legend.position=c(.0,.73),
  legend.text=element_text(size=7,vjust=.5),
  legend.title=element_blank(),
  panel.grid=element_blank(),
  panel.background=element_rect(fill="white"),
  plot.margin=margin(.4,.6,.4,.6,"lines"),
  plot.title=element_text(size=9)
)

ggsave("1.png",width=6,height=4.5)

Even though the script filtered out many samples with incorrect collection date, some samples with an incorrect collection date remained, like one Mu sample from July 2020. But you can still see that for example in December 2020, the line for Alpha starts a lot higher than the average height of the lines for other variants in November:

Find longest RNA virus refseqs

Wikipedia says that almost all known RNA viruses are contained within the realm of Riboviria: "As of May 2020, all known RNA viruses encoding an RNA-directed RNA polymerase are believed to form a monophyletic group, known as the realm Riboviria.[4] The majority of such RNA viruses fall into the kingdom Orthornavirae and the rest have a positioning not yet defined.[5] The realm does not contain all RNA viruses: Deltavirus, Asunviroidae, and Pospiviroidae are taxa of RNA viruses that were mistakenly included in 2019,[a] but corrected in 2020.[6]" The genus Deltavirus consists solely of hepatitis delta which has an extremely short genome, and Asunviroidae and Pospiviroidae are families of viroids that have short genomes.

I wanted to search which RNA viruses had the longest genomes, so I searched the NCBI's nucleotide database for all reference sequences classified under the realm of Riboviria. [https://www.ncbi.nlm.nih.gov/nuccore/?term=%22Riboviria%22%5BOrganism%5D+AND+refseq%5Bfilter%5D] You can download a FASTA file of the results by clicking "Send to", setting the format to FASTA, and clicking "Create File". Or alternatively you can run esearch -db nuccore -query '"Riboviria"[Organism] AND refseq[filter]'|efetch -format fasta.

There was a total of 9,116 results. The longest sequence among the results was 41,178 bases long and titled "Planarian secretory cell nidovirus isolate UIUC". It was published in a paper from 2018 titled "A planarian nidovirus expands the limits of RNA genome size". The longest coronavirus was "Beluga whale coronavirus SW1" which was 31,686 bases long and came on 8th place:

$ curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh|sh
$ esearch -db nuccore -query '"Riboviria"[Organism] AND refseq[filter]'|efetch -format fasta>riboviridae.fa
$ seqkit fx2tab riboviridae.fa|awk -F\\t '{print length($2),$1}'|sort -nr|head
41178 NC_040361.1 Planarian secretory cell nidovirus isolate UIUC, complete genome
36652 NC_076697.1 MAG: Pacific salmon nidovirus isolate H14 ORF1a polyprotein (QKQ71_gp1), ORF1b polyprotein (QKQ71_gp2), LAP1C-like protein (QKQ71_gp3), spike protein (QKQ71_gp4), envelope (QKQ71_gp5), membrane protein (QKQ71_gp6), and nucleoprotein (QKQ71_gp7) genes, complete cds
36144 NC_076911.1 Veiled chameleon serpentovirus B, complete sequence
35906 NC_040711.1 Aplysia californica nido-like virus isolate EK, complete genome
33452 NC_024709.1 Ball python nidovirus strain 07-53, complete genome
32753 NC_076657.1 Morelia viridis nidovirus isolate BH171/14-7, complete genome
32399 NC_035465.1 Morelia viridis nidovirus strain S14-1323_MVNV, complete genome
31686 NC_010646.1 Beluga Whale coronavirus SW1, complete genome
31537 NC_076912.1 Veiled chameleon serpentovirus A, complete sequence
31526 AC_000192.1 Murine hepatitis virus strain JHM, complete genome

Download SRA runs from European Nucleotide Archive

Downloading reads from SRA with fastq-dump is so slow that people have developed tools like fasterq-dump and parallel-fastq-dump which run multiple fastq-dump processes in parallel. However fasterq-dump doesn't support downloading gzipped reads. An easy way to download gzipped reads fast is to just use ENA instead:

$ enal()(printf %s\\n "${@-$(cat)}"|while read x;do curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=$x&result=read_run&fields=fastq_ftp"|sed 1d|cut -f2|tr \; \\n|sed s,^,ftp://,;done) # ENA list
$ enad()(enal "$@"|xargs wget) # ENA download
$ enal SRR10971381
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_2.fastq.gz

Download a CSV file for metadata of SRA search results

$ curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh|sh
[...]
$ esearch -db sra -query 'pangolin OR manis'|efetch -format runinfo>pangolin.runinfo
$ sed '1n;/Manis/!d' pangolin.runinfo|head -n3
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR24150370,2023-04-13 00:39:14,2023-04-12 23:17:46,1320057,356415390,1320057,270,106,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos4/sra-pub-zq-3/SRR024/24150/SRR24150370/SRR24150370.lite.1,SRX19948497,HC-HKU06,RAD-Seq,Restriction Digest,GENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP432345,PRJNA954554,,954554,SRS17296698,SAMN34161674,simple,9974,Manis javanica,HKU06,,,,,female,,no,,,,,THE UNIVERSITY OF HONG KONG,SRA1619804,,public,3C0FCC08646788EEAD35C7134007B7B5,26E77FA1DF1CC34214421679C1BED294
SRR24150359,2023-04-13 00:39:14,2023-04-12 23:17:44,837476,226118520,837476,270,68,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos4/sra-pub-zq-3/SRR024/24150/SRR24150359/SRR24150359.lite.1,SRX19948508,LC-HKU07,RAD-Seq,Restriction Digest,GENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP432345,PRJNA954554,,954554,SRS17296709,SAMN34161675,simple,9974,Manis javanica,HKU07,,,,,male,,no,,,,,THE UNIVERSITY OF HONG KONG,SRA1619804,,public,B34A9AC17DAB6A628733E087D7F5D80B,F1E59AB9CC3872A2B867863FFDE81792

This downloads FASTQ files for all runs in a BioProject:

esearch -db sra -query PRJNA605983|efetch -format runinfo|sed 1d|cut -d, -f1|parallel fastq-dump --gzip --split-3

Find longest repeats within a sequence

This code finds the longest non-inverted repeat within a sequence, and it prints the sequence of the repeat along with the length, starting position, and ending position of each instance of the repeat:

$ printf %s\\n \>sv40_promoter GGTGTGGAAAGTCCCCAGGCTCCCCAGCAGGCAGAAGTATGCAAAGCATGCATCTCAATTAGTCAGCAACCAGGTGTGGAAAGTCCCCAGGCTCCCCAGCAGGCAGAAGTATGCAAAGCATGCATCTCAATTAGTCAGCAACCATAGTCCCGCCCCTAACTCCGCCCATCCCGCCCCTAACTCCGCCCAGTTCCGCCCATTCTCCGCCCCATGGCTGACTAATTTTTTTTATTTATGCAGAGGCCGAGGCCGCCTCGGCCTCTGAGCTATTCCAGAAGTAGTGAGGAGGCTTTTTTGGAGGCCTAGGCTTTTGCAAA>sv40.fa
$ seqkit seq -s sv40.fa|awk '{for(i=1;;i++){delete a;for(j=1;j<=length-i+1;j++)a[substr($0,j,i)][j];found=0;for(x in a){if(length(a[x])>1){found=1;for(y in a[x])print i,y,y+length(x)-1,x}}if(!found)exit}}'|sort -srnk1,1|awk 'NR>1&&$1!=x{exit}{x=$1}1'
72 1 72 GGTGTGGAAAGTCCCCAGGCTCCCCAGCAGGCAGAAGTATGCAAAGCATGCATCTCAATTAGTCAGCAACCA
72 73 144 GGTGTGGAAAGTCCCCAGGCTCCCCAGCAGGCAGAAGTATGCAAAGCATGCATCTCAATTAGTCAGCAACCA

From the output above, you can see that the SV40 promoter sequence that was found in Pfizer vaccines starts with a tandem repeat where the same 72-base segment is repeated twice in a row.

Find GISAID samples with lowest mutation distance to RaTG13

The code below shows how many mutations closer each GISAID sample is to RaTG13 than to Wuhan-Hu-1. There is one sample from Guangdong which has five mutations relative to Wuhan-Hu-1 but all of them are shared with RaTG13. The five mutations are the three mutations of proCoV2/WA1 in addition to two additional mutations, even though the collection date of the sample is in late February, so the two additional mutations might be reversions that occurred due to chance:

$ curl https://sars2.net/f/gisaid2020.tsv.xz|xz -dc>gisaid2020.tsv
$ grep RaTG13 gisaid2020.tsv|cut -f12|tr , \\n|awk -F\\t 'NR==FNR{a[$0];next}$11<10{split($12,b,",");n=0;for(i in b)n+=b[i]in a?1:-1;print n FS$0}' - gisaid2020.tsv|awk '$1>=3'|sort -rn|cut -f1,2,4-8,12-13|tr \\t \||head
5|EPI_ISL_413892|2020-03-09|2020-02-28|A|China|Guangdong|5|C8782T,C18060T,C21711T,C24382T,T28144C
3|EPI_ISL_855195|2021-01-19|2020-03-15|A|USA|New York|3|C8782T,C18060T,T28144C
3|EPI_ISL_671375|2020-12-02|2020-03-20|A|Ireland|Kerry|3|C8782T,C18060T,T28144C
3|EPI_ISL_654866|2020-11-25|2020-01-31|A|Vietnam|Ho Chi Minh City|3|C8782T,T28144C,C29095T
3|EPI_ISL_653255|2020-11-23|2020-03-04|A.1|USA|Florida|7|C6286T,C8782T,C17010T,C17747T,A17858G,C18060T,T28144C
3|EPI_ISL_653254|2020-11-23|2020-03-04|A.1|USA|Florida|7|C6286T,C8782T,C17010T,C17747T,A17858G,C18060T,T28144C
3|EPI_ISL_604189|2020-10-30|2020-03-14|A.1|USA|California|7|C6286T,C8782T,C17010T,C17747T,A17858G,C18060T,T28144C
3|EPI_ISL_596386|2020-10-23|2020-01-19|A|USA|Washington|3|C8782T,C18060T,T28144C
3|EPI_ISL_582763|2020-10-16|2020-03-27|A.5|Spain|Basque Country|3|C8683T,C8782T,T28144C
3|EPI_ISL_522942|2020-08-26|2020-08-06|A|Mexico|Mexico City|3|C8782T,C18060T,T28144C

Make a heatmap for number of nucleotide changes to a target sequence within each fixed-length block of an alignment

brew install seqkit mafft
wget https:/sars2.net/f/pid.cpp;g++ pid.cpp -O3 -o pid

# percentage identity matrix of 529 sarbecovirus sequences
wget https://sars2.net/f/sarbe.pid

# matrix sort grep (for a matrix with row names and column names, print the values of a column whose name matches a substring and sort by the column)
masog()(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~x)break;next}{print$i,$1}' x="$1" "${2--}"|sort -n)

# efetch multi (download list of accessions from STDIN)
emu()(curl -sd "id=$(paste -sd, -)" "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=${1:-nuccore}&rettype=${2:-fasta}&retmode=$3")

# align spike proteins of Tor2 and sequences with 95-98% identity to Tor2
masog Tor2, sarbe.pid|awk '$1==100||($1<=98&&$1>=95){print$2}'|emu '' fasta_cds_na|seqkit grep -nrp 'gene=S|spike|surface'|sed 's/_cds_.*//;s/lcl|//'|seqkit fx2tab|awk 'NR==FNR{a[$1]=$0;next}{print">"a[$1]"\n"$2}' <(cut -f1 sarbe.pid|cut -d, -f1) -|mafft --thread 4 --quiet ->temp.fa;./pid<temp.fa>temp.pid
# install.packages("BiocManager");BiocManager::install("Biostrings")

library(ggplot2)
library(colorspace)
library(Biostrings)

t=readDNAStringSet("temp.fa")
names(t)=sub("^.*? ","",sub(",.*","",names(t)))

targ=grep("Tor2",names(t),value=T)

step=300

max=width(t)[1]
ranges=lapply(seq(1,max,step),\(i)c(i,min(i+step-1,max)))
heat=100*sapply(ranges,\(x)as.matrix(stringDist(subseq(t,x[1],x[2]),"hamming"))[targ,]/(x[2]-x[1]))
# heat=100*sapply(ranges,\(x){block=t[,x[1]:x[2]];rowSums(block[targ,]!="-"&block!="-"&block[targ,]==block)/ncol(block)}) # don't count positions where either sequence has a gap

# colnames(heat)=sapply(ranges,paste,collapse="-") # display nucleotide positions instead of amino acid positions
colnames(heat)=apply(ranges,1,\(x)paste0((x[1]-1)/3+1,"-",x[2]/3))

dist=100-read.table("temp.pid",sep="\t",row.names=1,header=T)

pick=!grepl("Mutant|related coronavirus isolate",rownames(heat))
heat=heat[pick,]
dist=dist[pick,pick]

hc=as.hclust(reorder(as.dendrogram(hclust(as.dist(dist))),cmdscale(dist,1)))

rownames(heat)=sub("Severe acute respiratory syndrome","SARS",rownames(heat))
rownames(heat)=sub(" (ORF|surface).*","",rownames(heat))
rownames(heat)=sub(" isolate "," ",rownames(heat))

pheatmap::pheatmap(
  heat,
  filename="0.png",
  clustering_callback=\(...)hc,
  cluster_cols=F,
  legend=F,
  cellwidth=18,
  cellheight=14,
  fontsize=9,
  treeheight_row=80,
  treeheight_column=80,
  border_color=NA,
  display_numbers=T,
  number_format="%.0f",
  fontsize_number=8,
  number_color=matrix(ifelse(heat>.8*max(heat),"white","black"),nrow(heat))[hc$order,],
  na_col="white",
  breaks=seq(0,max(heat),,256),
  colorRampPalette(hex(HSV(c(210,210,210,160,110,60,40,20,0,0,0),c(0,.25,rep(.5,9)),c(rep(1,9),.5,0))))(256)
)

# system(paste0("convert -gravity north \\( -splice 0x26 -pointsize 46 label:'Target: ",targ," (percentage of nucleotide changes in each ",step," nt block)' \\) 0.png -append 1.png"))
system(paste0("convert -gravity north \\( -splice 0x26 -pointsize 46 label:'Target: Tor2 (percentage of nucleotide changes in each ",(step/3)," aa block of spike protein)' \\) 0.png -append 1.png"))

The heatmap above shows that you can divide the viruses that are the closest to SARS1 to four groups based on the N-terminal domain and receptor binding motif. WIV1 has a different NTD but similar RBM, Rs4231 has a different RBM but similar NTD, in SHC014 both NTD and RBM are different, and in WIV16 both the NTD and RBM are similar.

This image from a patent by Ralph Baric et al. shows that the NTD is located approximately at positions 1-318 and the RBM is located at approximately positions 438-509: [https://patents.google.com/patent/US9884895B2/en]

Calculate distance to Wuhan-Hu-1 in RdRp region

Valentin Bruttel has said that because the RdRp of RaTG13 has only 4 amino acid changes from Wuhan-Hu-1 so that there's about 99.5% amino acid identity, the rest of the genome of RaTG13 may have been artificially manipulated to have a higher distance to SARS2, because the whole genome of RaTG13 has only about 96.2% identity with SARS2 on the nucleotide level. [https://x.com/VBruttel/status/1722617170663010438, https://www.youtube.com/watch?v=EuuY94tsbls&t=1h1m58s]

However RdRp is a conserved region, so the number of amino acid changes from Wuhan-Hu-1 in the RdRp is only 1 in all five BANAL viruses and RpYN06, and it's 3 in some of the new Vietnamese sequences from Hassanin et al. 2023, and it's only 34 even in Tor2. So since the RdRp of Wuhan-Hu-1 is 932 aa long, the RdRp of Tor2 has about 96.4% amino acid identity to Wuhan-Hu-1. (The RdRp might have slightly different starting and ending positions in different sarbecoviruses, but I'm using the positions corresponding to Wuhan-Hu-1 in my calculation.)

Here the first column shows the number of amino acid changes from Wuhan-Hu-1 in the RdRp, the second column shows the number of nucleotide changes in the RdRp, and the third column shows the nucleotide percentage identity with Wuhan-Hu-1 in the whole genome:

$ wget -q https://sars2.net/f/sarbe.{pid,fa.xz}
$ range=$(seqkit grep -nrp Wuhan-Hu-1 sarbe.fa.xz|seqkit seq -s|awk -F '' '{for(i=1;i<=length;i++){if($i=="-")gaps++;if(i-gaps==13442)printf"%s:",i;if(i-gaps==16237)print i}}')
$ awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~/Hu-1/)break;next}$1!=x{print$i,$1}' sarbe.pid|awk '($1>=90&&$1<100)||/Tor2/{print$2}'|seqkit grep -f- sarbe.fa.xz|cat <(seqkit grep -nrp Hu-1 sarbe.fa.xz) -|seqkit subseq -r$range|mafft --thread 4 ->temp.fa
[...]
$ x=$(seqkit seq -n temp.fa|sed 's/,.*//;s/ \(ORF\|surface\).*//;s/ isolate / /'|paste - <(seqkit seq -s temp.fa|sed 's/./&&/26')|seqkit tab2fx)
$ dist1()(seqkit fx2tab "$@"|awk -F\\t 'NR==1{split($2,a,"");l=length(a);next}{split($2,b,"");d=0;for(i=1;i<=l;i++)if(a[i]!="-"&&b[i]!=""&&a[i]!=b[i])d++;print d,$1}')
$ dist1<<<"$x"|sed $'s/ /\t/;s/ /\t/'|paste <(seqkit translate<<<"$x"|dist1|cut -d\  -f1) -|sed 's/,.*//;s/ \(ORF\|surface\).*//;s/ isolate / /'|awk 'NR==FNR{a[$2]=$1;next}{$3=sprintf("%.1f",a[$3])}1' {,O}FS=\\t <(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~/Hu-1/)break;next}{sub(/ .*/,"",$1);print$i FS$1}' sarbe.pid) -|sort -nk2|column -ts$'\t'
1   43   96.0  Bat coronavirus BANAL-20-236/Laos/2020
1   44   97.0  Bat coronavirus BANAL-20-52/Laos/2020
1   46   94.8  Betacoronavirus sp. RpYN06 strain bat/Yunnan/RpYN06/2020
1   48   93.9  Bat coronavirus BANAL-20-116/Laos/2020
1   48   93.9  Bat coronavirus BANAL-20-247/Laos/2020
1   48   96.1  Bat coronavirus BANAL-20-103/Laos/2020
4   61   96.2  Bat coronavirus RaTG13
7   85   92.8  Horseshoe bat sarbecovirus Ra22QT106
7   85   92.8  Horseshoe bat sarbecovirus Ra22QT137
7   85   92.8  Horseshoe bat sarbecovirus Ra22QT77
8   86   92.8  Horseshoe bat sarbecovirus Ra22QT135
3   94   90.6  Bat coronavirus RacCS253
3   94   90.9  Bat coronavirus RacCS271
3   94   91.9  Bat coronavirus RacCS203
5   241  90.4  MAG: Pangolin coronavirus GD/P79-9/2019
5   242  90.3  Pangolin coronavirus MP789
34  320  79.7  SARS coronavirus Tor2
34  329  92.3  Bat SARS-like virus BtSY2
34  332  91.0  Sarbecovirus sp. PrC31
34  333  92.2  Horseshoe bat sarbecovirus Rp22DB159
37  335  90.1  Sarbecovirus sp. YN2021

The RdRp of SARS2 is very distant from BtSY2, PrC31, and YN2021, but that's because they are recombinant viruses where some regions are similar to SARS2-like viruses and other regions are similar to SARS1-like viruses, but the RdRp is one of the regions that is more similar to SARS1.

ORF1ab has a ribosomal slippage where the frame is shifted one base backwards at the start of ORF1b, so that the last base of ORF1a gets duplicated as the first base of ORF1b. In Wuhan-Hu-1 the first 26 bases of the RdRp come from ORF1a and the rest comes from ORF1b, so the code above inserts the last base of ORF1a to position 27, because otherwise the amino acid translation of the RdRp gets messed up:

$ seqkit subseq -r 13442:16237 sars2.fa|seqkit translate|seqkit subseq -r1:60 # the translation is messed up because of ribosomal slippage
>MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
SADAQSFLNGFAV*VQPVLHRAAQALVLMSYTGLLTSTMIK*LVLLNS*KLIVVASKKRT
$ seqkit seq -s sars2.fa|cut -c 13442-16237|(echo \>a;sed 's/./&&/26')|seqkit translate|seqkit subseq -r1:60 # insert last base of ORF1a to start of ORF1b
>a
SADAQSFLKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKD
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id=YP_009725307'|seqkit subseq -r1:60 # compare to RdRp sequence from GenBank
>YP_009725307.1 RNA-dependent RNA polymerase [Severe acute respiratory syndrome coronavirus 2]
SADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKD

Search for ORFs on reverse strand in SARS2 and Pfizer vaccine contigs

When Kevin McKernan sequenced Pfizer vaccine vials, he found that the reverse strand of the spike protein sequence had a 1,252-aa open reading frame. [https://anandamide.substack.com/p/what-do-you-make-of-this-sars1-patent] In an RNA sequence it wouldn't matter but the vaccines were contaminated with double-stranded DNA. McKernan has been calling the long ORF on the reverse strand the "spidroin" because its closest hit on UniProt was a spidroin protein which is the main protein in spider silk, even though the similarity was not very high and the match was almost certainly accidental.

If you for search for ORFs in all three frames of the reverse strand of Wuhan-Hu-1, the longest ORF is only 150 amino acids long:

$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>sars2.fa
$ r=$(seqkit seq -rp sars2.fa|seqkit seq -s);for x in $r ${r:1} ${r:2};do sed 's/.../& /g'<<<$x|grep -Po '(ATG|CTG|TTG).*?(?=(TAA|TAG|TGA))'|tr -d \ ;done|awk '{print">"NR}1'|seqkit translate|seqkit seq -gm4|seqkit stat
file  format  type     num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   Protein       669   13,036        4     19.5      150

However the spike sequence in the Pfizer jabs was codon-optimized, which eliminated some of the stop codons on the reverse strand.

This PDF includes a full sequence for a "spike-encoding contig assembled from BioNTech/Pfizer BNT-162b2 vaccine": https://github.com/NAalytics/Assemblies-of-putative-SARS-CoV2-spike-encoding-mRNA-sequences-for-vaccines-BNT-162b2-and-mRNA-1273/blob/main/Assemblies+of+putative+SARS-CoV2-spike-encoding+mRNA+sequences+for+vaccines+BNT-162b2+and+mRNA-1273.docx.pdf. When I searched for ORFs on all 3 frames on the reverse strand, my longest ORF was 1,341 aa:

$ echo $'>pfizerspike\nGAGAATAAACTAGTATTCTTCTGGTCCCCACAGACTCAGAGAGAACCCGCCACCATGTTCGTGTTCCTGGTGCTGCTGCCTCTGGTGTCCAGCCAGTGTGTGAACCTGACCACCAGAACACAGCTGCCTCCAGCCTACACCAACAGCTTTACCAGAGGCGTGTACTACCCCGACAAGGTGTTCAGATCCAGCGTGCTGCACTCTACCCAGGACCTGTTCCTGCCTTTCTTCAGCAACGTGACCTGGTTCCACGCCATCCACGTGTCCGGCACCAATGGCACCAAGAGATTCGACAACCCCGTGCTGCCCTTCAACGACGGGGTGTACTTTGCCAGCACCGAGAAGTCCAACATCATCAGAGGCTGGATCTTCGGCACCACACTGGACAGCAAGACCCAGAGCCTGCTGATCGTGAACAACGCCACCAACGTGGTCATCAAAGTGTGCGAGTTCCAGTTCTGCAACGACCCCTTCCTGGGCGTCTACTACCACAAGAACAACAAGAGCTGGATGGAAAGCGAGTTCCGGGTGTACAGCAGCGCCAACAACTGCACCTTCGAGTACGTGTCCCAGCCTTTCCTGATGGACCTGGAAGGCAAGCAGGGCAACTTCAAGAACCTGCGCGAGTTCGTGTTTAAGAACATCGACGGCTACTTCAAGATCTACAGCAAGCACACCCCTATCAACCTCGTGCGGGATCTGCCTCAGGGCTTCTCTGCTCTGGAACCCCTGGTGGATCTGCCCATCGGCATCAACATCACCCGGTTTCAGACACTGCTGGCCCTGCACAGAAGCTACCTGACACCTGGCGATAGCAGCAGCGGATGGACAGCTGGTGCCGCCGCTTACTATGTGGGCTACCTGCAGCCTAGAACCTTCCTGCTGAAGTACAACGAGAACGGCACCATCACCGACGCCGTGGATTGTGCTCTGGATCCTCTGAGCGAGACAAAGTGCACCCTGAAGTCCTTCACCGTGGAAAAGGGCATCTACCAGACCAGCAACTTCCGGGTGCAGCCCACCGAATCCATCGTGCGGTTCCCCAATATCACCAATCTGTGCCCCTTCGGCGAGGTGTTCAATGCCACCAGATTCGCCTCTGTGTACGCCTGGAACCGGAAGCGGATCAGCAATTGCGTGGCCGACTACTCCGTGCTGTACAACTCCGCCAGCTTCAGCACCTTCAAGTGCTACGGCGTGTCCCCTACCAAGCTGAACGACCTGTGCTTCACAAACGTGTACGCCGACAGCTTCGTGATCCGGGGAGATGAAGTGCGGCAGATTGCCCCTGGACAGACAGGCAAGATCGCCGACTACAACTACAAGCTGCCCGACGACTTCACCGGCTGTGTGATTGCCTGGAACAGCAACAACCTGGACTCCAAAGTCGGCGGCAACTACAATTACCTGTACCGGCTGTTCCGGAAGTCCAATCTGAAGCCCTTCGAGCGGGACATCTCCACCGAGATCTATCAGGCCGGCAGCACCCCTTGTAACGGCGTGGAAGGCTTCAACTGCTACTTCCCACTGCAGTCCTACGGCTTTCAGCCCACAAATGGCGTGGGCTATCAGCCCTACAGAGTGGTGGTGCTGAGCTTCGAACTGCTGCATGCCCCTGCCACAGTGTGCGGCCCTAAGAAAAGCACCAATCTCGTGAAGAACAAATGCGTGAACTTCAACTTCAACGGCCTGACCGGCACCGGCGTGCTGACAGAGAGCAACAAGAAGTTCCTGCCATTCCAGCAGTTTGGCCGGGATATCGCCGATACCACAGACGCCGTTAGAGATCCCCAGACACTGGAAATCCTGGACATCACCCCTTGCAGCTTCGGCGGAGTGTCTGTGATCACCCCTGGCACCAACACCAGCAATCAGGTGGCAGTGCTGTACCAGGACGTGAACTGTACCGAAGTGCCCGTGGCCATTCACGCCGATCAGCTGACACCTACATGGCGGGTGTACTCCACCGGCAGCAATGTGTTTCAGACCAGAGCCGGCTGTCTGATCGGAGCCGAGCACGTGAACAATAGCTACGAGTGCGACATCCCCATCGGCGCTGGAATCTGCGCCAGCTACCAGACACAGACAAACAGCCCTCGGAGAGCCAGAAGCGTGGCCAGCCAGAGCATCATTGCCTACACAATGTCTCTGGGCGCCGAGAACAGCGTGGCCTACTCCAACAACTCTATCGCTATCCCCACCAACTTCACCATCAGCGTGACCACAGAGATCCTGCCTGTGTCCATGACCAAGACCAGCGTGGACTGCACCATGTACATCTGCGGCGATTCCACCGAGTGCTCCAACCTGCTGCTGCAGTACGGCAGCTTCTGCACCCAGCTGAATAGAGCCCTGACAGGGATCGCCGTGGAACAGGACAAGAACACCCAAGAGGTGTTCGCCCAAGTGAAGCAGATCTACAAGACCCCTCCTATCAAGGACTTCGGCGGCTTCAATTTCAGCCAGATTCTGCCCGATCCTAGCAAGCCCAGCAAGCGGAGCTTCATCGAGGACCTGCTGTTCAACAAAGTGACACTGGCCGACGCCGGCTTCATCAAGCAGTATGGCGATTGTCTGGGCGACATTGCCGCCAGGGATCTGATTTGCGCCCAGAAGTTTAACGGACTGACAGTGCTGCCTCCTCTGCTGACCGATGAGATGATCGCCCAGTACACATCTGCCCTGCTGGCCGGCACAATCACAAGCGGCTGGACATTTGGAGCAGGCGCCGCTCTGCAGATCCCCTTTGCTATGCAGATGGCCTACCGGTTCAACGGCATCGGAGTGACCCAGAATGTGCTGTACGAGAACCAGAAGCTGATCGCCAACCAGTTCAACAGCGCCATCGGCAAGATCCAGGACAGCCTGAGCAGCACAGCAAGCGCCCTGGGAAAGCTGCAGGACGTGGTCAACCAGAATGCCCAGGCACTGAACACCCTGGTCAAGCAGCTGTCCTCCAACTTCGGCGCCATCAGCTCTGTGCTGAACGATATCCTGAGCAGACTGGACCCTCCTGAGGCCGAGGTGCAGATCGACAGACTGATCACAGGCAGACTGCAGAGCCTCCAGACATACGTGACCCAGCAGCTGATCAGAGCCGCCGAGATTAGAGCCTCTGCCAATCTGGCCGCCACCAAGATGTCTGAGTGTGTGCTGGGCCAGAGCAAGAGAGTGGACTTTTGCGGCAAGGGCTACCACCTGATGAGCTTCCCTCAGTCTGCCCCTCACGGCGTGGTGTTTCTGCACGTGACATATGTGCCCGCTCAAGAGAAGAATTTCACCACCGCTCCAGCCATCTGCCACGACGGCAAAGCCCACTTTCCTAGAGAAGGCGTGTTCGTGTCCAACGGCACCCATTGGTTCGTGACACAGCGGAACTTCTACGAGCCCCAGATCATCACCACCGACAACACCTTCGTGTCTGGCAACTGCGACGTCGTGATCGGCATTGTGAACAATACCGTGTACGACCCTCTGCAGCCCGAGCTGGACAGCTTCAAAGAGGAACTGGACAAGTACTTTAAGAACCACACAAGCCCCGACGTGGACCTGGGCGATATCAGCGGAATCAATGCCAGCGTCGTGAACATCCAGAAAGAGATCGACCGGCTGAACGAGGTGGCCAAGAATCTGAACGAGAGCCTGATCGACCTGCAAGAACTGGGGAAGTACGAGCAGTACATCAAGTGGCCCTGGTACATCTGGCTGGGCTTTATCGCCGGACTGATTGCCATCGTGATGGTCACAATCATGCTGTGTTGCATGACCAGCTGCTGTAGCTGCCTGAAGGGCTGTTGTAGCTGTGGCAGCTGCTGCAAGTTCGACGAGGACGATTCTGAGCCCGTGCTGAAGGGCGTGAAACTGCACTACACATGATGACTCGAGCTGGTACTGCATGCACGCAATGCTAGCTGCCCCTTTCCCGTCCTGGGTACCCCGAGTCTCCCCCGACCTCGGGTCCCAGGTATGCTCCCACCTCCACCTGCCCCACTCACCACCTCTGCTAGTTCCAGACACCTCCCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAAACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACACCCTGGAGCTAGCA'>pfizerspike.fa
$ r=$(seqkit seq -rp pfizerspike.fa|seqkit seq -s);for x in $r ${r:1} ${r:2};do sed 's/.../& /g'<<<$x|grep -Po '(ATG|CTG|TTG).*?(?=(TAA|TAG|TGA))'|tr -d \ ;done|awk '{print">"NR}1'|seqkit translate|seqkit seq -gm4|seqkit stat
file  format  type     num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   Protein        53    2,903        4     54.8    1,341

In McKernan's longest contig assembled from Pfizer vial 1, the longest ORF on the reverse strand was 1,271 aa: [https://osf.io/b9t7m/; link to contigs on page 28]

$ r=$(seqkit sort -lr PbiV1-cs-Megahit-assem.fa|seqkit head -n1|seqkit seq -rp|seqkit seq -s);for x in $r ${r:1} ${r:2};do sed 's/.../& /g'<<<$x|grep -Po '(ATG|CTG|TTG).*?(?=(TAA|TAG|TGA))'|tr -d \ ;done|awk '{print">"NR}1'|seqkit translate|seqkit seq -gm4|seqkit stat
file  format  type     num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   Protein        69    4,588        4     66.5    1,271

Retrieve STAT results from SRA Database Backend API

NCBI's Sequence Read Archive has a little-known "SRA Database Backend" API: https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/. It can be used to retrieve the full STAT results of an SRA run, which are not displayed anywhere on the SRA's website.

This displays the most abundant leaf nodes in the metagenomic sequencing run of Wuhan-Hu-1:

$ curl -Ls "https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?acc=SRR10971381&cluster_name=public">SRR10971381.stat
$ jq -r '[.[]|.tax_table[]|.parent]as$par|[.[]|.tax_table[]|select(.tax_id as$x|$par|index($x)|not)]|sort_by(-.total_count)[]|((.total_count|tostring)+";"+.org)' SRR10971381.stat|head
171042;Prevotella salivae F0493
155602;Prevotella veroralis F0319
89078;Prevotella scopos JCM 17725
57886;Prevotella melaninogenica D18
54238;Severe acute respiratory syndrome coronavirus 2
51814;Leptotrichia sp. oral taxon 212
41272;Prevotella nanceiensis DSM 19126 = JCM 15639
33716;Homo sapiens
28813;Prevotella veroralis DSM 19559 = JCM 6290
23601;Leptotrichia hongkongensis

This displays viral leaf nodes along with the chain of ancestors of each node:

$ jq -r '.[]|.tax_table[]|[.org,.total_count,.tax_id,.parent]|@tsv' SRR10971381.stat|awk -F\\t '{out[$3]=$0;parent[$3]=$4;name[$3]=$1;nonleaf[$4]}END{for(i in out){j=parent[i];chain="";do{chain=(chain?chain":":"")name[j];j=parent[j]}while(j!="");print out[i]"\t"!(i in nonleaf)"\t"chain}}'|sort -nr|grep Virus|sort -t$'\t' -rnk2|awk -F\\t '$5{print$2 FS$1" - "$6}'|head -n20|column -ts$'\t'
54238  Severe acute respiratory syndrome coronavirus 2 - Severe acute respiratory syndrome-related coronavirus:Sarbecovirus:Betacoronavirus:Orthocoronavirinae:Coronaviridae:Cornidovirineae:Nidovirales:Riboviria:Viruses
693    CRESS virus sp. - CRESS viruses:unclassified ssDNA viruses:unclassified DNA viruses:unclassified viruses:Viruses
159    uncultured virus - environmental samples:Viruses
20     Parvovirus NIH-CQV - unclassified Parvovirinae:Parvovirinae:Parvoviridae:Viruses
15     Curvibacter phage P26059B - unclassified Autographivirinae:Autographivirinae:Podoviridae:Caudovirales:Viruses
13     Pseudomonas phage HU1 - unclassified Podoviridae:Podoviridae:Caudovirales:Viruses
12     Bat coronavirus RaTG13 - Severe acute respiratory syndrome-related coronavirus:Sarbecovirus:Betacoronavirus:Orthocoronavirinae:Coronaviridae:Cornidovirineae:Nidovirales:Riboviria:Viruses
9      Iodobacter phage PhiPLPE - Iodobacter virus PLPE:Iodovirus:Myoviridae:Caudovirales:Viruses
9      Bat SARS-like coronavirus - Severe acute respiratory syndrome-related coronavirus:Sarbecovirus:Betacoronavirus:Orthocoronavirinae:Coronaviridae:Cornidovirineae:Nidovirales:Riboviria:Viruses
8      Rhodoferax phage P26218 - unclassified Podoviridae:Podoviridae:Caudovirales:Viruses
8      Prokaryotic dsDNA virus sp. - unclassified dsDNA viruses:unclassified DNA viruses:unclassified viruses:Viruses
8      Pangolin coronavirus - unclassified Betacoronavirus:Betacoronavirus:Orthocoronavirinae:Coronaviridae:Cornidovirineae:Nidovirales:Riboviria:Viruses
5      uncultured cyanophage - environmental samples:unclassified bacterial viruses:unclassified viruses:Viruses
5      Rimavirus - Siphoviridae:Caudovirales:Viruses
4      Genomoviridae sp. - unclassified Genomoviridae:Genomoviridae:Viruses
3      Streptococcus phage Javan363 - unclassified Siphoviridae:Siphoviridae:Caudovirales:Viruses
3      Sewage-associated circular DNA virus-14 - unclassified viruses:Viruses
3      RDHV-like virus SF1 - unclassified viruses:Viruses
3      Paramecium bursaria Chlorella virus NW665.2 - unclassified Chlorovirus:Chlorovirus:Phycodnaviridae:Viruses
3      Microviridae sp. - unclassified Microviridae:Microviridae:Viruses

Find most common viruses in wastewater metagenomes

I tried looking at the STAT results of 1000 random SRA runs which matched the phrase "wastewater metagenome 2023[dp]". These were the most common leaf nodes in the STAT hits. SARS2 has by far the highest number of reads but it's probably because in many runs they were specifically targeting SARS2:

$ curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh|sh
[...]
$ esearch -db sra -query 'wastewater 2023[dp]'|efetch -format runinfo>waste.runinfo
$ mkdir waste;sed 1d waste.runinfo|gshuf -n1000|cut -d, -f1|parallel -j10 -q curl -m 30 -s 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?acc={}&cluster_name=public' -o waste/{}.stat
$ for f in waste/*;do jq -r '.[]|.tax_table[]|[.org,.total_count,.tax_id,.parent]|@tsv' $f|awk -F\\t '{out[$3]=$0;parent[$3]=$4;name[$3]=$1;nonleaf[$4]}END{for(i in out){j=parent[i];chain="";do{chain=(chain?chain":":"")name[j];j=parent[j]}while(j!="");print out[i]"\t"!(i in nonleaf)"\t"chain}}'|awk -F\\t '$5{print$2 FS$1" - "$6}';done>temp
$ grep Viruses temp|awk -F\\t '{a[$2]+=$1}END{for(i in a)print a[i]"\t"i}'|sort -rn|head -n20|column -ts$'\t'
139686660  Severe acute respiratory syndrome coronavirus 2 - Severe acute respiratory syndrome-related coronavirus:Sarbecovirus:Betacoronavirus:Orthocoronavirinae:Coronaviridae:Cornidovirineae:Nidovirales:Pisoniviricetes:Pisuviricota:Orthornavirae:Riboviria:Viruses
321282     Caudoviricetes sp. - unclassified Caudoviricetes:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
131071     Sapovirus Hu/Chiba/010451S/2001 - Sapovirus isolates:Sapporo virus:Sapovirus:Caliciviridae:Picornavirales:Pisoniviricetes:Pisuviricota:Orthornavirae:Riboviria:Viruses
100819     EBPR siphovirus 1 - unclassified Caudoviricetes:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
85839      Sapovirus GI.1 - Sapovirus GI:Sapporo virus:Sapovirus:Caliciviridae:Picornavirales:Pisoniviricetes:Pisuviricota:Orthornavirae:Riboviria:Viruses
54246      Bacteriophage sp. - unclassified bacterial viruses:Viruses
22293      Microviridae sp. - unclassified Microviridae:Microviridae:Petitvirales:Malgrandaviricetes:Phixviricota:Sangervirae:Monodnaviria:Viruses
20054      Sapovirus GI.2 - Sapovirus GI:Sapporo virus:Sapovirus:Caliciviridae:Picornavirales:Pisoniviricetes:Pisuviricota:Orthornavirae:Riboviria:Viruses
12342      Vibrio phage VvAW1 - unclassified Caudoviricetes:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
9327       Gordonia phage GMA7 - Getseptimavirus GMA7:Getseptimavirus:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
8646       Sapovirus Hu/S07026/2006/DNK - Sapovirus isolates:Sapporo virus:Sapovirus:Caliciviridae:Picornavirales:Pisoniviricetes:Pisuviricota:Orthornavirae:Riboviria:Viruses
7626       Mycobacterium phage Myrna - Myrnavirus myrna:Myrnavirus:Ceeclamvirinae:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
5031       Microvirus sp. - unclassified Microviridae:Microviridae:Petitvirales:Malgrandaviricetes:Phixviricota:Sangervirae:Monodnaviria:Viruses
4684       Caulobacter phage CcrRogue - Poindextervirus rogue:Poindextervirus:Dolichocephalovirinae:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
4269       Tombunodavirus UC1 - unclassified viruses:Viruses
4185       Aeromonas phage phiA005 - unclassified Caudoviricetes:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
3827       Sphaerotilus phage vB_SnaP-R1 - unclassified Krylovirinae:Krylovirinae:Autographiviridae:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
2891       uncultured human fecal virus - environmental samples:Viruses
2762       Brevundimonas phage vB_BsubS-Delta - unclassified Caudoviricetes:Caudoviricetes:Uroviricota:Heunggongvirae:Duplodnaviria:Viruses
2722       Murine coronavirus - Embecovirus:Betacoronavirus:Orthocoronavirinae:Coronaviridae:Cornidovirineae:Nidovirales:Pisoniviricetes:Pisuviricota:Orthornavirae:Riboviria:Viruses

Get consensus set of mutations in a variant

This downloads metadata for a global subsample of about 3000 SARS2 sequences from NextStrain, and it selects mutations which appear in over half of all sequences where the variant is listed as Alpha:

$ curl https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz|gzip -dc>global.tsv
$ awk -F\\t '$34=="Alpha"' global.tsv|cut -f56|awk '{split($0,z,",");for(i in z)a[z[i]]++}END{for(i in a)if(a[i]>NR/2)print substr(i,2),i}'|sort -n|cut -d\  -f2
C241T
C913T
C3037T
C3267T
C5388A
C5986T
T6954C
C14408T
C14676T
C15279T
T16176C
A23063T
C23271A
A23403G
C23604A
C23709T
T24506G
G24914C
C27972T
G28048T
A28111G
G28280C
A28281T
T28282A
G28881A
G28882A
G28883C
C28977T

Find ORFs with seqkit

I used only ATG as a start codon but I'm not sure, because all CDS sequences of SARS2 start with ATG and Gemini said that RNA viruses generally use ATG as a start codon, but I'm not sure if I should've also included CTG or TTG as start codons.

This also includes overlapping ORFs where the ending position of a previous ORF is after the starting position of the next ORF:

$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947'>sars2.fa
$ seqkit locate -Prp 'ATG(...)*?(TAA|TAG|TGA)' sars2.fa|cut -f5-7|awk '{if(length($3)>50)$3=substr($3,1,50)"..."}1'|awk '{$2=$2" "(NR==1?"length":$2-$1+1)}1'|column -t
start  end    length  matched
107    136    30      ATGCTTAGTGCACTCACGCAGTATAATTAA
266    13483  13218   ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAG...
13526  13567  42      ATGTCGTATACAGGGCTTTTGACATCTACAATGATAAAGTAG
13625  13636  12      ATGACAATTTAA
13685  13759  75      ATGAAGAAACAATTTATAATTTACTTAAGGATTGTCCAGCTGTTGCTAAA...
13768  21555  7788    ATGGTACCACATATATCACGTCAACGTCTTACTAAATACACAATGGCAGA...
21536  25384  3849    ATGTTCTTGTTAACAACTAAACGAACAATGTTTGTTTTTCTTGTTTTATT...
25332  25448  117     ATGAAGACGACTCTGAGCCAGTGCTCAAAGGAGTCAAATTACATTACACA...
25393  26220  828     ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCA...
26183  26281  99      ATGATGAACCGACGACGACTACTAGCGTGCCTTTGTAAGCACAAGCTGAT...
26245  26472  228     ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGT...
26523  27191  669     ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCT...
27202  27387  186     ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAAT...
27359  27397  39      ATGAAGAGCAACCAATGGAGATTGATTAAACGAACATGA
27394  27759  366     ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCT...
27756  27887  132     ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCT...
27868  27897  30      ATGAAACTTGTCACGCCTAAACGAACATGA
27894  28259  366     ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCA...
28274  29533  1260    ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTT...
29538  29570  33      ATGCAGACCACACAAGGCAGATGGGCTATATAA
29558  29674  117     ATGGGCTATATAAACGTTTTCGCTTTTCCGTTTACGATATATAGTCTACT...
29771  29806  36      ATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAA
29806  29844  39      ATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGA

Highlight positions in an alignment which differ from the consensus sequence

# efetch multi (download accessions from STDIN from GenBank)
emu()(curl -sd "id=$(paste -sd,)" "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=${1:-nuccore}&rettype=${2:-fasta}&retmode=$3")

# download envelope proteins of viruses with less than 10% distance to Wuhan-Hu-1 in the whole genome
awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~/Wuhan-Hu-1/)break;next}$i>90{print$1}' sarbe.pid|cut -d\  -f1|emu '' fasta_cds_aa|seqkit grep -nrp 'gene=E|protein=small envelo|protein=envelope'>env.aa

# awk consensus
acons()(awk -F '' '{for(i=1;i<=NF;i++)a[i][$i]++}END{for(i in a){max=0;for(j in a[i])if(a[i][j]>max){max=a[i][j];o=j}printf"%s",o};print""}')

# highlight positions that are different from the first row
cohl()(awk 'NR==1{split($0,a,"");next}{split($0,b,"");for(i=1;i<=length;i++)printf"%s",b[i]==a[i]?b[i]:"\033[31m"b[i]"\033[0m";print""}' "$@")

# highlight positions that are different from the consensus using input with one sequence per line
cohl2()(x=$(cat);acons<<<"$x"|cohl - <(printf %s\\n "$x"))

# highlight positions that are different from the consensus using FASTA input
cohl3()(x=$(seqkit fx2tab);paste -d\   <(cut -f1 <<<"$x") <(cut -f2<<<"$x"|cohl2))

seqkit fx2tab env.aa|sed $'s/lcl|//;s/_prot_[^\t]*//'|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print">"a[$1]"\n"$2}' <(seqkit seq -n sarbe.fa|sed 's/,.*//;s/ \(ORF\|surface\).*//;s/ .* /\t/') -|mafft --quiet -|cohl3|column -t

Show amino acid alignment next to nucleotide alignment and highlight positions that are different from the consensus

# awk consensus
acons()(awk -F '' '{for(i=1;i<=NF;i++)a[i][$i]++}END{for(i in a){max=0;for(j in a[i])if(a[i][j]>max){max=a[i][j];o=j}printf"%s",o};print""}')

# highlight positions that are different from the first row
cohl()(awk 'NR==1{split($0,a,"");next}{split($0,b,"");for(i=1;i<=length;i++)printf"%s",b[i]==a[i]?b[i]:"\033[31m"b[i]"\033[0m";print""}' "$@")

# highlight positions that are different from the consensus using input with one sequence per line
cohl2()(x=$(cat);acons<<<"$x"|cohl - <(printf %s\\n "$x"))

brew install seqkit bowtie2
curl -sL sars2.net/f/sarbe.fa.xz|gzip -dc>sarbe.fa;seqkit seq -g sarbe.fa>sarbe.ungap
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_na&id=NC_045512'>sars2.na
bow()([[ -e $1.1.bt2 ]]||bowtie2-build --threads 3 "$1" "$1";bowtie2 --no-unal -p3 -x "$@")

seqkit grep -nrp =S sars2.na|seqkit subseq -r $[438*3-2]:$[508*3]|bow sarbe.ungap -fU- -a --score-min L,-1,-1|grep -v ^@|ruby -ane'puts [$F[2],$F[3],$F[3].to_i+$F[5].scan(/\d+(?=[MD])/).map(&:to_i).sum-1]*" "'|while read l m n;do seqkit grep -p $l sarbe.ungap|seqkit subseq -r $m:$n;done|seqkit grep -nrvp 'MP789 genomic sequence'|sed -E $'s/(,| surface| ORF1| genomic sequence).*//;s/.* />/'>rbm.fa

seqkit translate rbm.fa|seqkit seq -s|sed 's/./ & /g'|cohl2|sed s/31m/35m/g|paste -d\\n - <(seqkit seq -s rbm.fa|cohl2)|paste -d\  - <(seqkit seq -n rbm.fa|sed 'x;p;x')

Make a heatmap for the percentage of nucleotide changes in different blocks of the genome

# matrix sort grep (in a TSV file with rownames and colnames, pick first column whose name
# matches substring and print rownames and the column sorted by the value of the column)
masog()(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~x)break;next}{print$i,$1}' x="$1" "${2--}"|sort -n)

# make a TSV file for percentage identity between aligned sequences
curl -Ls sars2.net/f/pid.cpp>pid.cpp;g++ pid.cpp -O3 -o pid

(masog ZC45 sarbe.pid|awk '$1>88{print$2}'|awk 'NR==FNR{a[$0];next}$0 in a' <(awk -F\\t 'NR>1{for(i=2;i<NR;i++)if($i>98&&$1!~/BANAL|ZC45|ZXC21/)next;print$1}' sarbe.pid|cut -d\  -f1);seqkit grep -nrp 'GD/P79-9/2019|BANAL|Tor2|MT121216.1' sarbe.fa|seqkit seq -ni)|seqkit grep -f- sarbe.fa|mafft --quiet --thread 4 ->nuc.fa;./pid<nuc.fa>nuc.pid
# install.packages("BiocManager");BiocManager::install("Biostrings")

library(ggplot2)
library(colorspace)
library(Biostrings)

t=readDNAStringSet("nuc.fa")
names(t)=sub("^.*? ","",sub(",.*","",names(t)))

gene=read.csv(header=F,text="ORF1a,266,13468
ORF1b,13468,21555
S,21563,25384
ORF3a,25393,26220
E,26245,26472
M,26523,27191
ORF6,27202,27387
ORF7a,27394,27759
ORF8,27894,28259
N,28274,29533
ORF10,29558,29674")

spike=read.csv(header=F,text="NTD,13,312
RBD,319,528
RBM,438,508
S1,1,686
S2,709,1273")

gene=rbind(cbind(spike[1],(spike[2]-1)*3+gene[3,2],(spike[3]-1)*3+2+gene[3,2]),gene)

m=as.matrix(t)
ref=m[grepl("Wuhan-Hu-1",rownames(m)),]
pos=seq_along(ref)-cumsum(ref=="-")
gene[,2]=sapply(gene[,2],\(x)which(pos==x)[1])
gene[,3]=sapply(gene[,3],\(x)which(pos==x)[1])
ranges=c(gene[1,3]+.5,gene[-1,2]-.5,gene[1,2],gene[nrow(gene),ncol(gene)])

targname="BtSY2"
targ=grep(targname,names(t),value=T)

step=1e3

heat=100*apply(gene[-1],1,\(x)as.matrix(stringDist(subseq(t,x[1],x[2]),"hamming"))[targ,]/(x[2]-x[1]+1))

# # don't count positions where either sequence has a gap
# heat=100*apply(gene[-1],1,\(x){block=ifelse(m=="-",NA,m)[,x[1]:x[2]];1-colMeans(block[targ,]==t(block),na.rm=T)})

colnames(heat)=gene[,1]

dist=100-read.table("nuc.pid",sep="\t",row.names=1,header=T)

hc=as.hclust(reorder(as.dendrogram(hclust(as.dist(dist))),cmdscale(dist,1)))

rownames(heat)=sub("Severe acute respiratory syndrome","SARS",rownames(heat))
rownames(heat)=sub(" (ORF|surface).*","",rownames(heat))
rownames(heat)=sub(" isolate "," ",rownames(heat))

pheatmap::pheatmap(heat,filename="0.png",legend=F,
  clustering_callback=\(...)hc,cluster_cols=F,
  gaps_col=nrow(spike),
  cellwidth=17,cellheight=17,fontsize=9,treeheight_row=80,fontsize_number=8,treeheight_column=80,
  border_color=NA,na_col="white",
  display_numbers=round(heat),
  number_color=matrix(ifelse(heat>.8*max(heat),"white","black"),nrow(heat))[hc$order,],
  breaks=seq(0,max(heat),,256),
  colorRampPalette(hex(HSV(c(210,210,210,160,110,60,40,20,0,0,0),c(0,.25,rep(.5,9)),c(rep(1,9),.5,0))))(256)
)

system(paste0("convert -gravity north \\( -splice 0x26 -pointsize 46 label:'Target: ",targname," (percentage of nucleotide changes)' \\) 0.png -append 1.png"))

Make a line plot of cumulative mutations compared to a target sequence

curl -sL sars2.net/f/sarbe.fa.xz|gzip -dc>sarbe.fa
wget sars2.net/f/sarbe.pid
masog()(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i~x)break;next}{print$i,$1}' x="$1" "${2--}"|sort -n)
masog Hu-1 sarbe.pid|awk '$1>91{print$2}'|seqkit grep -f- sarbe.fa|mafft --thread 4 --quiet ->cum.fa
# install.packages("BiocManager");BiocManager::install("Biostrings")

library(ggplot2)

t=as.matrix(Biostrings::readAAStringSet("cum.fa"))

t[t=="n"]="-"

gene=read.csv(head=F,text="ORF1a,266,13468
ORF1b,13468,21555
S,21563,25384
ORF3a,25393,26220
E,26245,26472
M,26523,27191
ORF6,27202,27387
ORF7a,27394,27759
ORF8,27894,28259
N,28274,29533
ORF10,29558,29674")
geneh=c(1,1,1,1,2,3,4,1,2,3,4)

ref=t[grepl("Hu-1",rownames(t)),]
pos=1:length(ref)-cumsum(ref=="-")
gene[,2]=sapply(gene[,2],\(x)which(pos==x)[1])
gene[,3]=sapply(gene[,3],\(x)which(pos==x)[1])

t[,1:(gene[1,2]-1)]="N"
t[,gene[nrow(gene),3]:ncol(t)]="N"

genebars=c(gene[,2],gene[nrow(gene),3])

rownames(t)=sub("(,| ORF| orf| surface ).*","",rownames(t))
rownames(t)=sub("[^ ]* ","",rownames(t))
rownames(t)=sub(" isolate "," ",rownames(t))
rownames(t)=sub("Severe acute respiratory syndrome coronavirus 2 ","",rownames(t))

targ=rownames(t)[grep("Hu-1",rownames(t))]

m=apply(t(t[!rownames(t)==targ,])!=t[targ,],2,cumsum)

m=m[,order(-m[nrow(m),])]

xy=data.frame(x=1:nrow(m),y=c(m),z=rep(colnames(m),each=nrow(m)))
xy$z=factor(xy$z,unique(xy$z))

hue=c(0,210,100,60,300,160,25,250)+15
color=c("black",hcl(hue,120,60),"gray50",hcl(hue,120,30)[-c(6,7)],hcl(hue,40,80))

color=setNames(color[1:nrow(t)],rownames(t))[levels(xy$z)]

candidates=c(sapply(c(1,2,5),\(x)x*10^c(-10:10)))
xbreak=candidates[which.min(abs(candidates-max(xy$x)/10))]
ybreak=candidates[which.min(abs(candidates-(max(xy$y)-min(xy$y))/7))]
xend=max(xy$x)
ystart=ybreak*floor(min(xy$y,na.rm=T)/ybreak)
yend=ybreak*ceiling(max(xy$y,na.rm=T)/ybreak)

legend=data.frame(x=.02*xend,y=seq(yend*.9,,-yend/20,nlevels(xy$z)),label=levels(xy$z))

label=data.frame(label=gene[,1],x=rowMeans(gene[,-1]),y=yend-yend*(geneh-.2)/22)

ggplot(xy)+
geom_vline(xintercept=genebars,color="gray80",linewidth=.25)+
geom_vline(xintercept=c(0,xend),color="gray80",linewidth=.25)+
geom_hline(yintercept=c(ystart,yend),color="gray80",linewidth=.25)+
geom_line(aes(x,y,color=z),linewidth=.25)+
geom_label(data=legend,aes(x,y,label=label),color=color,size=2.2,hjust=0,fill=alpha("white",.85),label.r=unit(0,"lines"),label.padding=unit(.04,"lines"),label.size=0)+
geom_label(data=label,aes(x=x,y=y,label=label),fill=alpha("white",.85),label.r=unit(0,"lines"),label.padding=unit(.04,"lines"),label.size=0,size=2,hjust=.5)+
ggtitle(paste0("Cumulative mutations from ",targ," (UTRs masked)"))+
coord_cartesian(clip="off",expand=F)+
scale_x_continuous(limits=c(0,xend),breaks=seq(0,xend,xbreak),labels=\(x)ifelse(x>1e3,paste0(x/1e3,"k"),x),position="top")+
scale_y_continuous(limits=c(ystart,yend),breaks=seq(ystart,yend,ybreak),labels=\(x)ifelse(x>=1e3,paste0(x/1e3,"k"),x))+
scale_color_manual(values=color)+
theme(axis.text=element_text(size=6,color="black"),
  axis.ticks=element_line(linewidth=.25,color="gray80"),
  axis.ticks.length=unit(.2,"lines"),
  axis.title=element_blank(),
  legend.position="none",
  panel.grid=element_blank(),
  panel.background=element_rect(fill="white"),
  plot.margin=margin(.5,.6,.5,.6,"lines"),
  plot.title=element_text(size=8,color="black"))
ggsave("1.png",width=5,height=2.9,dpi=450)

If you plot the cumulative number of mutations between two sarbecoviruses, usually there is an increase in the slope of mutations at the start of the spike protein (because the NTD is more prone to mutations than the end of ORF1b).

But if SARS2 is compared to DB159, BtSY2, RaTG13, or BANAL-52, there is no increase in the slope, which suggests that there was a recombination event. ZC45, DB59, and BtSY2 are recombinants of SARS2-like and SARS1-like viruses, where the ORF1b is more similar to SARS1-like viruses. But then it's weird that there seems to be another recombination event in the S1, and the slope of cumulative mutations doesn't even change between the ORF1b and S1 (except in ZC45).

Make a SNAP plot of cumulative dN and dS for concatenated coding sequences

# download sarbecovirus sequences in order to look up accessions by searching for the name
curl -sL sars2.net/f/sarbe.fa.xz|xz -dc>sarbe.fa

# efetch multi
emu()(curl -sd "id=$(paste -sd,)" "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=${1:-nuccore}&rettype=${2:-fasta}&retmode=$3")

# install SNAP
mkdir -p ~/bin;curl https://www.hiv.lanl.gov/repository/aids-db/PROGS/Snap/SNAP.pl>~/bin/snap;chmod +x ~/bin/snap

# SNAP requires whitespace-delimited input with ID in first column and sequence in second column
sna()(snap <(sed 's/ .*//' "$@"|seqkit fx2tab))

# convert the results of a pairwise SNAP run to a simple CSV format
sna3()(sna "$@";(echo position,synonymous,nonsynonymous,indel;awk '/^[0-9]/{print$1,$7?$7:0,$8?$8:0,$2=="indel"}' OFS=, $(ls -t codons.*|head -n1))>snap.csv)

# codon align (align by translated amino acid sequence and insert gap triplets to nucleotide sequence)
coda()(x=$(cat);seqkit seq -g<<<"$x"|seqkit translate|mafft --quiet --thread 4 -|seqkit seq -s|awk -F '' 'NR==FNR{gaps=0;for(i=1;i<=NF;i++)if($i=="-")a[NR][i-(gaps++)]++;next}{for(i=1;i<=NF;i+=3){i2=(i-1)/3+1;for(j=1;j<=a[FNR][i2];j++)printf"---";printf"%s",substr($0,i,3)};print""}' - <(seqkit seq -gs<<<"$x")|paste <(seqkit seq -n<<<"$x") -|seqkit tab2fx)

# seqkit grep (`-r` enables regex, `-n` searches in whole name and not only first field, `-p` specifies pattern)
seg()(seqkit grep -inrp "$@")
segn()(seqkit grep -inrvp "$@")

# fasta_cds_na is nucleotide coding sequences
seqkit grep -nrp Rs3367,RsSHC014 sarbe.fa|seqkit seq -ni|emu '' fasta_cds_na>temp

# see if example both ORF1ab and ORF1a are included so ORF1a has to be excluded
seqkit seq -n temp

# remove ORF1a, do codon alignment of concatenated CDS, and run SNAP
seqkit grep -nrvp '\b1a\b' temp|awk -F '_cds_' '!/>/||!a[$1]++'|coda|sna3
library(ggplot2);library(colorspace)

# # Wuhan-Hu-1
# gene=read.csv(text="name,start,end
# 1a,266,13483
# 1b,13468,21555
# S,21563,25384
# 3a,25393,26220
# E,26245,26472
# M,26523,27191
# 6,27202,27387
# 7a,27394,27759
# 7b,27756,27887
# 8,27894,28259
# N,28274,29533
# 10,29558,29674")
# gene$height=c(1,1,1,1,2,3,1,2,3,1,2,3)

# # Tor2
# gene=read.csv(text="name,start,end
# 1a,266,13483
# 1b,13468,21555
# S,21563,25384
# 3a,25393,26220
# E,26245,26472
# M,26523,27191
# 6,27202,27387
# 7a,27394,27759
# 7b,27756,27887
# 8,27894,28259
# N,28274,29533
# 10,29558,29674")
# gene$height=c(1,1,1,1,2,3,4,1,2,3,4,1)

# SHC014
gene=read.csv(text="name,start,end
1a,265,13413
1b,13398,21485
S,21492,25262
3a,25271,26095
3b,25692,26036
E,26120,26350
M,26401,27066
6,27077,27268
7a,27276,27644
7b,27641,27775
8,27782,28147
N,28162,29430")
gene$height=c(1,1,1,1,2,3,4,1,2,3,4,1)

gene$length=gene$end-gene$start+1
start=cumsum(c(0,gene$length/3))+1
gene$start=head(start,-1)
gene$end=start[-1]-1

t=read.csv("snap.csv")
xy=data.frame(x=t[,1],y=c(apply(t[-1],2,cumsum)),z=rep(colnames(t)[-1],each=nrow(t)))
xy$z=factor(xy$z,levels=unique(xy$z))

xstart=1;xend=nrow(t);ystart=0
cand=c(sapply(c(1,2,5),\(x)x*10^c(0:5)))
xstep=cand[which.min(abs(cand-max(xy$x)/8))]
ystep=cand[which.min(abs(cand-max(xy$y)/6))]
yend=ystep*ceiling(max(xy$y)*1.2/ystep)

ggplot(xy)+
geom_vline(color="gray70",xintercept=c(xstart,xend),linewidth=.2,lineend="square")+
geom_vline(color="gray70",xintercept=gene$start,linewidth=.2,lineend="square")+
geom_hline(color="gray70",yintercept=c(ystart,yend),linewidth=.2,lineend="square")+
geom_line(aes(x,y,color=z),linewidth=.4)+
geom_label(data=gene,hjust=.5,vjust=.5,aes(x=rowMeans(cbind(start,end)),label=name,y=yend-yend*(height-.5)*.06),size=2,fill=alpha("white",.85),label.r=unit(0,"lines"),label.padding=unit(.04,"lines"),label.size=0)+
labs(title="Concatenated CDS: Rs3367 vs RsSHC014",x=NULL,y=NULL)+
coord_cartesian(clip="off",expand=F)+
scale_x_continuous(limits=c(1,xend),breaks=seq(1,xend,xstep),position="top")+
scale_y_continuous(limits=c(0,yend),breaks=seq(0,yend,ystep))+
scale_color_manual(values=c("#00cc00","red","blue"))+
guides(colour=guide_legend(override.aes=list(linewidth=.4)))+
theme(axis.text=element_text(size=5.8,color="black"),
  axis.ticks.length=unit(.15,"lines"),
  axis.ticks=element_line(linewidth=.2,color="gray70"),
  legend.background=element_blank(),
  legend.key=element_rect(fill=alpha("white",0)),
  legend.spacing.x=unit(.15,"lines"),
  legend.key.size=unit(.8,"lines"),
  legend.position="bottom",
  legend.justification=c(1,0),
  legend.margin=margin(-.7,0,-.2,0,"lines"),
  legend.text=element_text(size=6.5,vjust=.5),
  legend.title=element_blank(),
  panel.background=element_rect(fill="white"),
  panel.grid=element_blank(),
  plot.margin=margin(.3,.4,.3,.3,"lines"),
  plot.title=element_text(size=7,color="black",margin=margin(.1,0,.5,0,"lines")))
ggsave("1.png",width=3.3,height=2.3,dpi=450)