The abstract of a Brazilian paper from 2021 said: [https://
Human sewage from Florianopolis (Santa Catarina, Brazil) was analyzed for severe acute respiratory syndrome coronavirus-2 (SARS-CoV2) from October 2019 until March 2020. Twenty five ml of sewage samples were clarified and viruses concentrated using a glycine buffer method coupled with polyethylene glycol precipitation, and viral RNA extracted using a commercial kit. SARS-CoV-2 RNA was detected by RT-qPCR using oligonucleotides targeting N1, S and two RdRp regions. The results of all positive samples were further confirmed by a different RT-qPCR system in an independent laboratory. S and RdRp amplicons were sequenced to confirm identity with SARS-CoV-2. Genome sequencing was performed using two strategies; a sequence-independent single-primer amplification (SISPA) approach, and by direct metagenomics using Illumina's NGS. SARS-CoV-2 RNA was detected on 27th November 2019 (5.49 ± 0.02 log10 SARS-CoV-2 genome copies (GC) L−1), detection being confirmed by an independent laboratory and genome sequencing analysis. The samples in the subsequent three events were positive by all RT-qPCR assays; these positive results were also confirmed by an independent laboratory. The average load was 5.83 ± 0.12 log10 SARS-CoV-2 GC L−1, ranging from 5.49 ± 0.02 log10 GC L−1 (27th November 2019) to 6.68 ± 0.02 log10 GC L−1 (4th March 2020). Our findings demonstrate that SARS-CoV-2 was likely circulating undetected in the community in Brazil since November 2019, earlier than the first reported case in the Americas (21st January 2020).
Guy Gadboit downloaded the raw reads from the paper and wrote a
Twitter thread about them. [https://
Gadboit said that he got only 7% coverage for SARS2 even though the paper said that they got 25% coverage. However, I got about 15% coverage:
$ enad()(printf %s\\n "${@-$(cat)}"|while IFS= read x;do curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=$x&result=read_run&fields=fastq_ftp"|sed 1d|cut -f2|tr \; \\n;done|sed s,^,ftp://,|xargs wget -q)
$ x=SRR13152144;enad $x
$ fastp -35 -i $x\_1.fastq.gz -I $x\_2.fastq.gz -o $x\_1.fq.gz -O $x\_2.fq.gz
$ minimap2 -a --sam-hit-only sars2.fa $x\_[12].fq.gz|samtools sort ->$x.bam
$ samtools coverage $x.bam|column -t
#rname      startpos  endpos  numreads  covbases  coverage  meandepth  meanbaseq  meanmapq
MN908947.3  1         29903   92        4348      14.5403   0.293649   36.7       43.5
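The coverage column in the `samtools coverage` output is the number of covered bases divided by the length of the region, expressed as a percentage. As a sanity check on the figure of about 15%, the arithmetic can be reproduced from the other columns. This is only a sketch: `coverage_percent` is a hypothetical helper name, not part of samtools.

```ruby
# Recompute the "coverage" column of `samtools coverage` from covbases and the
# region boundaries. coverage_percent is a hypothetical helper for this sketch.
def coverage_percent(covbases, startpos, endpos)
  region_length = endpos - startpos + 1 # positions are 1-based and inclusive
  100.0 * covbases / region_length
end

# Values copied from the samtools coverage row for MN908947.3.
puts format("%.4f", coverage_percent(4348, 1, 29903)) # prints 14.5403
```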
Gadboit said that he didn't see any lineage-defining positions other than D614G (A23403G) and C8782. However, I didn't get a single read that covered the position of D614G, regardless of whether I trimmed the reads or not. I got two reads each for positions 8782 and 28144, and at both positions the reads had the lineage B allele. I didn't get any reads that covered the position of the third mutation of WA1 (18060) or the A0 mutation (29095):
$ samtools mpileup -f sars2.fa SRR13152144.bam|ruby -alne'next unless[8782,18060,28144,29095,23403].include?($F[1].to_i);next if$F[3]==0||$F[4]=="*";s=$F[4].upcase.gsub(/[.,]/,$F[2]).gsub(/\^./,"");s2=s.dup;s.enum_for(:scan,/[+-]\d+/).each{m=Regexp.last_match;m.begin(0).upto(m.end(0)+s[m.begin(0)+1,m.end(0)].to_i){|i|s2[i]=" "}};puts $F[1]+" "+$F[2]+" "+"ACGT".chars.map{|x|q=$F[5].chars.values_at(*s2.chars.each_index.select{|i|s2[i]==x}).map{|c|c.ord-33};q.size.to_s+(q.size==0?"":"["+"%.0f"%(q.sum*1.0/q.size)+"]")}*" "'|sed s/^/$x\ /|sort -n|(echo pos ref A C G T;cat)|column -t
pos    ref  A  C      G  T
8782   C    0  2[34]  0  0
28144  T    0  0      0  2[38]
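The one-liner above is dense, so here is a longer sketch of the same idea: walk the base column of one `samtools mpileup` line, skip read-start markers, read-end markers and indel annotations, and collect a Phred score for each base call. It is not the code that produced the table. `allele_summary` is a hypothetical helper, and the input line is invented to mirror the row for position 8782 (the real quality strings are not shown in the post).

```ruby
# Count alleles and collect base qualities at one site from an mpileup line.
# allele_summary is a hypothetical helper written for this sketch.
def allele_summary(line)
  _chrom, pos, ref, _depth, bases, quals = line.chomp.split("\t")
  counts = Hash.new { |hash, key| hash[key] = [] }
  qual_index = 0
  i = 0
  while i < bases.length
    c = bases[i]
    case c
    when "^" # read start: skip the marker and the mapping-quality character
      i += 2
    when "$" # read end: carries no base call
      i += 1
    when "+", "-" # indel: skip the sign, the length digits and the sequence
      digits = bases[(i + 1)..-1][/\A\d+/]
      i += 1 + digits.length + digits.to_i
    else # one base call, which consumes one quality character
      allele = ".,".include?(c) ? ref.upcase : c.upcase
      counts[allele] << quals[qual_index].ord - 33
      qual_index += 1
      i += 1
    end
  end
  [pos.to_i, ref, counts]
end

# Invented input line: two reads matching the reference C with quality "C" (34).
pos, ref, counts = allele_summary("MN908947.3\t8782\tC\t2\t.,\tCC")
summary = %w[A C G T].map do |base|
  q = counts.fetch(base, [])
  q.empty? ? "0" : "#{q.size}[#{(q.sum.to_f / q.size).round}]"
end
puts "#{pos} #{ref} #{summary.join(' ')}" # prints 8782 C 0 2[34] 0 0
```

Keeping a separate index into the quality string avoids the alignment problems that arise when markers and indel text are left in the base string.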
I got 4 different mutations with QUAL over 10. The QUAL of each of these mutations was between 46 and 49, because there were two high-quality reads that supported each mutation:
$ x=SRR13152144;samtools index $x.bam;bcftools mpileup -f sars2.fa $x.bam|bcftools call -mv|grep -v ^\#\#|cut -f-6|column -t
#CHROM      POS    ID  REF  ALT  QUAL
MN908947.3  3392   .   G    C    3.22451
MN908947.3  5706   .   A    G    48.4146
MN908947.3  11437  .   TAA  TA   5.7555
MN908947.3  14408  .   C    T    48.4146
MN908947.3  14422  .   C    T    46.4146
MN908947.3  14895  .   T    G    47.4146
I don't understand why G3392C gets a QUAL of only 3.2, even though it's supported by two high-quality reads like the other mutations. The quality of both bases is 39 (H, which is ASCII 72 minus 33):
$ samtools mpileup -f sars2.fa $x.bam|awk '$2~/^(3392|5706|14408|14422|14895)$/'|cut -f2-|column -t
3392   G  2  Cc  HH
5706   A  2  Gg  HH
14408  C  2  Tt  HH
14422  C  2  Tt  HF
14895  T  2  Gg  HG
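The last column of the mpileup output uses the Phred+33 encoding, so each character's ASCII code minus 33 gives the base quality. A minimal sketch of the decoding, using the quality strings from the rows above (`phred_scores` is a hypothetical helper name):

```ruby
# Decode a Phred+33 quality string, as printed by samtools mpileup, into scores.
# phred_scores is a hypothetical helper written for this sketch.
def phred_scores(quality_string)
  quality_string.bytes.map { |byte| byte - 33 }
end

# Quality strings taken from the mpileup rows for the five sites.
%w[HH HF HG].each do |quals|
  puts "#{quals} -> #{phred_scores(quals).inspect}"
end
# prints:
# HH -> [39, 39]
# HF -> [39, 37]
# HG -> [39, 38]
```

By this decoding, the two bases at position 3392 have the same quality as those at 5706 and 14408.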
In a nearly complete set of GISAID submissions with a collection date in 2020, I didn't find any sample with all 4 mutations or even 3 out of 4 mutations, but the two earliest samples with 2 out of the 4 mutations were both from Santa Catarina in Brazil (which was the same state where the wastewater samples were collected):
$ curl -Ls sars2.net/f/gisaid2020.tsv.xz|xz -dc>gisaid2020.tsv
$ printf %s\\n A5706G C14408T C14422T T14895G|awk -F\\t 'NR==FNR{a[$0];next}{n=0;split($12,z,",");for(i in z)if(z[i]in a)n++;if(n>1)print n"\t"$0}' - gisaid2020.tsv|sort -k5|head|cut -f1-8,11-14|tr \\t \|
2|EPI_ISL_541370|hCoV-19/Brazil/SC-FIOCRUZ-770/2020|2020-09-22|2020-03-12|B.1.1.33|Brazil|Santa Catarina|Human|10|C241T,C3037T,A5706G,C14408T,A23403G,T27299C,G28881A,G28882A,G28883C,T29148C||
2|EPI_ISL_476278|hCoV-19/Brazil/SC-L16-CD314/2020|2020-06-25|2020-03-24|B.1.1.33|Brazil|Santa Catarina|Human|11|C241T,C3037T,A5706G,C14408T,G18803T,A23403G,T27299C,G28881A,G28882A,G28883C,T29148C||
2|EPI_ISL_470601|hCoV-19/Brazil/PA-0227/2020|2020-06-18|2020-04-01|B.1.1.33|Brazil|Para|Human|11|C241T,C3037T,A5706G,C14408T,A23403G,T27299C,A28119T,G28881A,G28882A,G28883C,T29148C||
2|EPI_ISL_2959070|hCoV-19/USA/MI-MDHHS-SC26889/2020|2021-07-15|2020-04-02|B.1|USA|Michigan|Human|9|C180T,C241T,C1059T,C3037T,C14408T,C14422T,A23403G,G25563T,G28314A||
2|EPI_ISL_447105|hCoV-19/USA/MI-MDHHS-SC20308/2020|2020-05-16|2020-04-02|B.1|USA|Michigan|Human|9|C180T,C241T,C1059T,C3037T,C14408T,C14422T,A23403G,G25563T,G28314A||
2|EPI_ISL_525869|hCoV-19/USA/OR-OHSU-0648/2020|2020-09-01|2020-06-16|B.1.134|USA|Oregon|Human|9|C241T,C3037T,A5706G,C8818T,A12579G,C14408T,C15952T,G20980A,A23403G||
2|EPI_ISL_525840|