Consensus sequences for each Pango lineage
This repository contains semi-automatically generated prototype sequences for each Pango lineage. These sequences are not real sequences in databases but are algorithmically constructed consensus sequences that try to represent the common ancestor sequence of that lineage. They are based on the sequences designated in the cov-lineages/pango-designation repository. There is some manual curation involved to overwrite erroneous sites - errors happen when a lineage's designated sequences have dropout or reversions. The algorithm used to create these sequences has a high threshold to allow reversions, almost all sequences need to be reverted, otherwise it's assumed the reversions are an artefact.
Please be aware that due to the semi-automatic generation of these synthetic sequences, they can contain errors.
The data in this repo is automatically updated every day.
For lineages that are derived from BA.2*, BA.4* and BA.5*, there has been significant curation/overwriting, but for previous lineages the amount of curation is very limited. So do expect errors.
If you find errors, please open an issue here. The same is true if you have ideas what other data about lineages you would like to be included here.
The sequences contained here are the ones used in Nextclade reference trees and produced by code contained in the nextclade_data_workflows/sars-cov-2 repository.
The repository contains:
data/
folder of this repository. You can easily decompress them after downloading with zstd -d data/pango-consensus-sequences_nuc.fasta.zst
. You can pick a single sequence of interest using seqkit grep -r -p "B.1.1.7" pango_lineages.fasta
.:
data/pango-consensus-sequences_nuc.fasta.zst
data/pango-consensus-sequences_S.fasta.zst
etc.BQ.1
and dicts with the following values per sequence:
lineage
: Same as key, so that one can treat dict as list and not need to use the keyunaliased
: Unaliased pango lineage name, e.g. B.1.1.529.5.3.1.1.1.1
parent
: The direct parent lineage, e.g. BE.1.1.1
(or the first parent for which there is a consensus sequence if the direct parent doesn't have one)children
: Array of child lineages that are present in the dataset, e.g. ["BQ.1.1", "BQ.1.2"]
nextstrainClade
: The Nextstrain clade this lineage belongs to, e.g. 22E
nucSubstitutions
: Array of nucleotide substitution strings, e.g. ["C241T","C8782T"]
aaSubstitutions
: Array of amino acid substitution strings, e.g. ["S:L452R", ORF1b:P1427"]
nucDeletions
: Array of nucleotide deletion strings, e.g. ["21983-21991"]
aaDeletions
: Array of amino acid deletion strings, e.g. ["S:Y144-"]
nucSubstitutionsNew
: Like nucSubstitutions
, but only includes substitutions that are new with respect to the parent lineage (can be considered as defining mutations of this lineage). Same format as nucSubstitutions
.aaSubstitutionsNew
: Like nucSubstitutionsNew
but for amino acid substitutions. Same format as aaSubstitutions
.nucDeletionsNew
: Like nucSubstitutions
, but for nucleotide deletions. Same format as nucDeletions
.aaDeletionsNew
: Like nucDelertionsNew
, but for amino acid deletions. Same format as aaDeletions
.nucSubstitutionsReverted
: Like nucSubstitutions
, but only includes substitutions that are present in the parent but not in this lineageaaSubstitutionsReverted
: Like nucSubstitutionsReverted
, but for amino acid substitutionsnucDeletionsReverted
: Like nucSubstitutionsReverted
, but for nucleotide deletions. Same format as nucDeletions
.aaDeletionsReverted
: Like nucDelertionsReverted
, but for amino acid deletions. Same format as aaDeletions
.frameShifts
: Array of frame shift strings, e.g. ["S:142-143"]
designationDate
: Designation date of this lineage, e.g. 2022-12-01
or null
if not available486F->S->L
this gives rise to both a reversion and a new substitution. This might be changed in the future.If you need to be sure that a sequence is correct, e.g. when you're creating a Spike protein for an experiment, please double check using the pango designation issue (if such an issue exists) and the annotation on the Usher tree - which is independently curated by @AngieHinrichs. Also, see https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/utils/otto/sarscov2phylo/pango.clade-mutations.tsv for the paths extracted from the Usher tree for each lineage.