This HTML file addresses misconceptions about the sequencing and assembly of Wuhan-Hu-1, which is the common reference genome of SARS-CoV-2.
In the paper by Wu et al. where Wuhan-Hu-1 was described, the authors wrote that when they did metagenomic sequencing for a sample of lung fluid from a COVID patient, they got a total of 384,096 contigs when they did de-novo assembly for the reads using MEGAHIT, and they used the longest contig as the initial reference genome of SARS2. Some no-virus people thought it meant that the other contigs were alternative candidates for the genome of SARS2, but actually the other contigs mostly consisted of fragments of the human genome or bacterial genomes. In Supplementary Table 2 which shows the best match on BLAST for the 50 most abundant contigs, the best match is bacterial for all other contigs except the contig for SARS2.
Wu et al.'s longest MEGAHIT contig happened to be the contig for SARS2 by chance, because there were no bacterial contigs longer than 30,000 bases even though one bacterial contig listed in Supplementary Table 1 was longer than 10,000 bases. When I ran MEGAHIT for Wu et al.'s raw reads, my second-longest contig was a 16,036-base contig for Leptotrichia hongkongensis and my third-longest contig was a 13,656-base contig for Veillonella parvula. However in other cases MEGAHIT will of course produce longer contigs for bacteria, and for example when I used MEGAHIT to assemble metagenomic reads of the bat sarbecovirus RmYN02, my longest contig was a contig for E. coli which was about 50,000 bases long, and the contig for RmYN02 was only the 4th-longest contig.
When I ran MEGAHIT Wu et al.'s raw reads, I got a total of about 30,000 contigs, but when I aligned my contigs against a set of about 15,000 virus reference sequences, I got only 8 contigs which matched viruses. I got only one contig which matched SARS2. And I got 4 contigs which matched the human endogenous retrovirus K113, but the contigs probably just came from the human genome since the sequences of HERVs are incorporated into the human genome. And I got one short contig which matched a Streptococcus phage and two short contigs which matched a parvovirus, both of which also had a handful of reads in the STAT results so they probably represent genuine matches to viruses. So among approximately 30,000 contigs, there may have been only 4 contigs which came from viruses. The reason why my number of contigs was an order of magnitude lower than the number reported by Wu et al. is probably because almost all human reads were masked with N bases in the reads uploaded to SRA, so my contigs included only a small number of contigs for fragments of the human genome.
About half of Wu et al.'s reads consist of only N bases, but it's probably human reads were masked because of NCBI's privacy policy. The NCBI has even published a command-line utility called Human Scrubber, which is used to replace human reads with N bases in raw reads that are submitted to the Sequence Read Archive. Wu et al. wrote that about 24 million reads out of about 57 million reads remained after they filtered out human reads, which roughly matches the reads at the SRA where there's about 27 million reads that consist of only N bases out of a total of about 57 million reads.
The first version of Wuhan-Hu-1 submitted to GenBank was 30,474 bases long but the current version is only 29,903 bases long. That's because the first version accidentally included a 618-base segment of human DNA at the 3' end, where the first 20 bases of the segment are part of both the actual genome of SARS2 and the human genome. The error was soon fixed in the second version of Wuhan-Hu-1 which was published at GenBank two days after the first version, and the error was even mentioned by Eddie Holmes on Twitter. I have found similar assembly errors in other sarbecovirus sequences, like RfGB02, Sin3408, Rs7907, Rs7931, GX-P1E, and Ra7909. So the reason why the original 30,474-base contig cannot be reproduced from the reads at the SRA is because almost all human reads were masked with N bases at SRA.
When the anonymous mathematician from Hamburg used BBMap to align Wu
et al.'s raw reads against the reference genomes of SARS2 and other
viruses, he got a large number of reads to align against the reference
genomes of HIV-1, measles, and hepatitis delta, but that's because he
ran BBMap with the parameter minratio=0. which used
extremely loose alignment criteria. Among the reads which aligned
against HIV-1, the average error rate was about 40% and the most common
read length 37 bases, but almost all of the reads would've remained
unaligned if he would've ran BBMap with the default parameters which
tolerate a much lower number of errors. The error rate was not mentioned
in his paper but I was able to calculate it by running his code myself.
However for the reads which aligned against SARS2, the most common
length was 151 bases and the average error rate of the 151-base reads
was only about 2%.
In the paper by the mathematician from Hamburg, there's plots for each virus reference sequence which show how many reads of each length aligned against the reference. The plot for SARS2 has a bimodal distribution with one peak around length 151 and another peak around length 35-40, but the plots for other viruses only have a single peak around length 35-40, which consists of reads which did not actually match the reference genome and which would have remained unaligned if he ran BBMap with the default parameters. But in the plot for SARS2, the y-axis is cut off in a deceptive way so you cannot see that the peak for reads around length 151 is much higher than the peaks around length 35-40 in the plots for other viruses.
Wu et al. wrote that the longest contig they got with Trinity was
only 11,760 bases long, but that's because Trinity split the genome of
SARS2 into multiple shorter contigs, which probably either overlapped or
had only small gaps in between. In order to merge the contigs, Wu et
al. could've aligned all contigs against a related virus like ZC45 or
SARS1, and they could've selected the most common base at each spot of
the alignment. And if any gaps remained, Wu et al. could've filled in
the gaps by using segments on either end of the gap as PCR primers and
by sequencing the PCR amplicons. By default Trinity and MEGAHIT both
require a region of a genome to be covered by at least two contigs
before the contigs get merged, so Wu et al. may have also gotten Trinity
to merge the shorter contigs into a single contig if they would've added
the option -- which reduces the minimum coverage
requirement from 2 to 1.
When I ran Trinity on Wu et al.'s raw reads without trimming, my longest contig was 29,879 bases long, and it was otherwise identical to the third version of Wuhan-Hu-1 except it included a 5-base insertion at the 5' end, it had one nucleotide change in the middle, and it was missing the last 29 bases of the poly(A) tail. When I ran Trinity again after trimming the reads, my longest contig was 31,241 bases long because there was a 1,379-base insertion in the middle which consisted of two segments that were identical to two different parts of Wuhan-Hu-1. The reason why Wu et al. failed to get a single complete contig with Trinity may have been because they used a different method to trim the reads, or because they used a different version of Trinity, or because Trinity was confused by the human reads which had been masked in my Trinity run.
In the current 29,903-base version of Wuhan-Hu-1, the poly(A) tail is 33 bases long, but the full poly(A) tail cannot be assembled from the metagenomic raw reads because the full poly(A) tail is not included in the raw reads. Wu et al. wrote that they sequenced the ends of the genome with RACE (rapid amplification of cDNA ends), which is a method similar to PCR which only requires specifying a single primer instead of two primers, so it can be used to amplify the end of a genome by specifying a single segment near the end as the primer. In October 2020 when the sequence of RaTG13 was updated at GenBank so that a new 15-base segment was added to the 5' end, some people from DRASTIC were suspecting fraud because the 15-base segment was not included in the metagenomic raw reads for RaTG13, but the segment was however included in the RACE amplicon sequences for RaTG13 which were only published later in 2021. But I believe the RACE amplicon sequences of Wuhan-Hu-1 have never been published.
When Wu et al. wrote that they determined the viral genome organization of WHCV by aligning it against SARS-CoV Tor2 and bat SL-CoVZC45, they meant that after they had already assembled the complete genome of SARS2, they aligned it against SARS1 and ZC45 in order to make annotating the proteins easier. So if for example they got one open reading frame that started around position 21,500 and ended around position 25,000, they knew that it was the spike protein if it covered roughly the same region as the spike protein of SARS1. But they did not mean that they aligned the raw reads of SARS2 against a SARS1 genome in order to do reference-based assembly.
USMortality did an experiment where he downloaded the reference genomes of various viruses, he generated simulated short reads for the genomes, and he tried to assemble the reads back together with MEGAHIT, but he failed to get complete contigs for a couple of viruses, like HKU1, HIV-1, HIV-2, and porcine adenovirus. But that's because the reference genome of HKU1 contains a region where the same 30-base segment is repeated 14 times in a row. And in HIV-1 and HIV-2 there's a long terminal repeat where a long segment at the 5' end of the genome is repeated at the 3' end of the genome. And porcine adenovirus contains a tandem repeat where the same 724-base segment is repeated twice in a row. Also when USMortality did another experiment where he mixed together simulated reads from multiple different viruses and he ran MEGAHIT on the mixed reads, he failed to get complete contigs for SARS2 and SARS1, but that's because the contigs were split at a spot where there's a 74-base segment that is identical in the reference genomes of SARS2 and SARS1, and when I repeated his experiment, I was able to get complete contigs for SARS2 and SARS1 when I simply increased the maximum k-value of MEGAHIT from 141 to 161.
One reason why USMortality says that the genome has not been validated is that in all of Wu et al.'s reads which align against the very start of the 5' end of Wuhan-Hu-1, there are 3-5 extra bases at the start of the read that usually end with one or more G bases. However the G bases are part of the template-switching oligo which anneals to complementary C bases that are added to the 5' end of the RNA strand by reverse transcriptase. USMortality also got two extra G bases at the start of one of his MEGAHIT contigs, but I was able to get rid of them by trimming 5 bases from the ends of reads.
The standard reference genome of SARS2 is known as Wuhan-Hu-1, and it
was described in a paper by Wu et al. titled "A new
coronavirus associated with human respiratory disease in China".
[https://
Apart from a single nucleotide change in the N gene, the only differences between the three versions are at the beginning of the genome in the 5' UTR and at the end of the genome in the 3' UTR:
$brew install mafft [...] $ curl -s ' https:// eutils. ncbi. nlm. nih. gov/ entrez/ eutils/ efetch. fcgi? db=nuccore& rettype=fasta& id=MN908947. '{ 1, 2, 3} > wuhu. fa $ mafft -- globalpair -- clustalout -- thread 4 wuhu. fa [...] CLUSTAL format alignment by MAFFT G- INS- 1 (v7. 490) MN908947. 1 c g g t g a c g c a t a c a a a a c a t t c c c a c c a t a c c t t c c c a g g t a a c a a a c c a a c c a a c t t t c MN908947. 2 c g g t g a c g c a t a c a a a a c a t t c c c a c c a t a c c t t c c c a g g t a a c a a a c c a a c c a a c t t t c MN908947. 3 ----------- attaaaggtt----- t a t a c c t t c c c a g g t a a c a a a c c a a c c a a c t t t c *. *** .** .********************************* [... 29, 160 bases omitted] MN908947. 1 t g a c c t a c a c a g c t g c c a t c a a a t t g g a t g a c a a a g a t c c a a a t t t c a a a g a t c a