Cython bindings and Python interface to FAMSA, an algorithm for ultra-scale multiple sequence alignments.
⚠️ This package is based on FAMSA 2.
FAMSA is a method published in 2016 by Deorowicz et al.[1] for large-scale multiple sequence alignments. It uses state-of-the-art time and memory optimizations as well as a fast guide tree heuristic to reach very high performance and accuracy.
PyFAMSA is a Python module that provides bindings to FAMSA using Cython. It implements a user-friendly, Pythonic interface to align protein sequences using different parameters and access results directly. It interacts with the FAMSA library interface.
PyFAMSA can be installed directly from PyPI, which hosts some pre-built wheels for the x86-64 architecture (Linux/OSX) and the Aarch64 architecture (Linux only), as well as the code required to compile from source with Cython:
$ pip install pyfamsa
PyFAMSA is also available as a Bioconda package:
$ conda install -c bioconda pyfamsa
For other installation methods, have a look at the Installation page of the online documentation.
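To confirm which version was actually installed, a small check using only the standard library could look like the sketch below. The helper name `installed_version` is hypothetical and not part of PyFAMSA; it simply queries the package metadata and returns `None` when the distribution is missing.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None.

    Hypothetical helper for illustration; it only reads package metadata
    and never imports the package itself.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyfamsa")
print("pyfamsa:", version if version is not None else "not installed")
```

Because this reads metadata rather than importing the module, it also works in environments where the compiled extension would fail to load.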
Let's create some sequences in memory, align them using the UPGMA method (without any heuristic), and simply print the alignment on screen:
from pyfamsa import Aligner, Sequence

# Identifiers and sequences are passed as bytes.
sequences = [
    Sequence(b"Sp8", b"GLGKVIVYGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII"),
    Sequence(b"Sp10", b"DPAVLFVIMLGTITKFSSEWFFAWLGLEINMMVII"),
    Sequence(b"Sp26", b"AAAAAAAAALLTYLGLFLGTDYENFAAAAANAWLGLEINMMAQI"),
    Sequence(b"Sp6", b"ASGAILTLGIYLFTLCAVISVSWYLAWLGLEINMMAII"),
    Sequence(b"Sp17", b"FAYTAPDLLLIGFLLKTVATFGDTWFQLWQGLDLNKMPVF"),
    Sequence(b"Sp33", b"PTILNIAGLHMETDINFSLAWFQAWGGLEINKQAIL"),
]

# Build the guide tree with UPGMA and run the alignment.
aligner = Aligner(guide_tree="upgma")
msa = aligner.align(sequences)

# Print each aligned row with its identifier padded to 10 characters.
for sequence in msa:
    print(sequence.id.decode().ljust(10), sequence.sequence.decode())
This should output the following:
Sp10       --------DPAVLFVIMLGTIT-KFS--SEWFFAWLGLEINMMVII
Sp17       ---FAYTAPDLLLIGFLLKTVA-TFG--DTWFQLWQGLDLNKMPVF
Sp26       AAAAAAAAALLTYLGLFLGTDYENFA--AAAANAWLGLEINMMAQI
Sp33       -------PTILNIAGLHMETDI-NFS--LAWFQAWGGLEINKQAIL
Sp6        ------ASGAILTLGIYLFTLCAVIS--VSWYLAWLGLEINMMAII
Sp8        ------GLGKVIVYGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII
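If the aligned rows need to be saved rather than printed, one option is to format them as FASTA text. The sketch below is an illustration, not part of PyFAMSA: `to_fasta` is a hypothetical helper that works on plain `(id, sequence)` byte pairs, so it runs without the library installed. The two demo rows are copied from the output above.

```python
def to_fasta(rows, width=60):
    """Format (id, gapped_sequence) byte pairs as FASTA text.

    Hypothetical helper: raises ValueError if the rows do not all have
    the same length, which every valid alignment should satisfy.
    """
    rows = list(rows)
    if len({len(seq) for _, seq in rows}) > 1:
        raise ValueError("aligned rows must all have the same length")
    lines = []
    for name, seq in rows:
        lines.append(">" + name.decode())
        text = seq.decode()
        # Wrap the sequence at `width` characters per line.
        lines.extend(text[i:i + width] for i in range(0, len(text), width))
    return "".join(line + "\n" for line in lines)


# Two rows taken from the expected output of the example.
rows = [
    (b"Sp10", b"--------DPAVLFVIMLGTIT-KFS--SEWFFAWLGLEINMMVII"),
    (b"Sp8", b"------GLGKVIVYGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII"),
]
print(to_fasta(rows), end="")
```

With a real result, the rows could be built as `[(s.id, s.sequence) for s in msa]`, using the same `id` and `sequence` attributes as the printing loop in the example.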
Aligner objects are thread-safe, and the align method is re-entrant. You could batch process several alignments in parallel using a ThreadPool with a single aligner object:
import glob
import multiprocessing.pool

import Bio.SeqIO
from pyfamsa import Aligner, Sequence

# Load each FASTA file as one family (a list of Sequence objects).
families = [
    [Sequence(r.id.encode(), r.seq.encode()) for r in Bio.SeqIO.parse(file, "fasta")]
    for file in glob.glob("pyfamsa/tests/data/*.faa")
]

# A single Aligner is shared by all worker threads.
aligner = Aligner()
with multiprocessing.pool.ThreadPool() as pool:
    alignments = pool.map(aligner.align, families)
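The batch example relies on Biopython to read the input files. Where that dependency is not wanted, a minimal FASTA reader can be sketched with the standard library alone. `read_fasta` below is a hypothetical helper, not part of PyFAMSA or Biopython; it assumes plain-text FASTA input, keeps only the first word of each header as the identifier, and ignores any lines that appear before the first header.

```python
import io


def read_fasta(handle):
    """Yield (id, sequence) byte pairs from a FASTA text handle.

    Hypothetical helper: the id is the first whitespace-delimited token
    of the header line, and sequence lines are concatenated as-is.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name.encode(), "".join(chunks).encode()
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name.encode(), "".join(chunks).encode()


# In-memory demo using two of the sequences from the first example.
example = io.StringIO(
    ">Sp8 example\nGLGKVIVYGIVLGTKSDQFS\nNWVVWLFPWNGLQIHMMGII\n"
    ">Sp10\nDPAVLFVIMLGTITKFSSEWFFAWLGLEINMMVII\n"
)
for name, seq in read_fasta(example):
    print(name.decode(), len(seq))
```

Each yielded pair has the same shape as the arguments of the `Sequence(id, sequence)` constructor used in the examples, so a family could be built with `[Sequence(name, seq) for name, seq in read_fasta(handle)]`.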
Done with your protein alignment? You may be interested in trimming it: in that case, you could use the pytrimal Python package, which wraps trimAl 2.0. Or perhaps you want to build an HMM from the alignment? Then maybe have a look at pyhmmer, a Python package which wraps HMMER.
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
Contributions are more than welcome! See CONTRIBUTING.md for more details.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the GNU General Public License v3.0. FAMSA is developed by the REFRESH Bioinformatics Group and is distributed under the terms of the GPLv3 as well. See vendor/FAMSA/LICENSE for more information. In addition, FAMSA vendors several libraries for compatibility, all of which are redistributed with PyFAMSA under their own terms: atomic_wait (MIT License), mimalloc (MIT License), libdeflate (MIT License), Boost (Boost Software License).
This project is in no way affiliated with, sponsored by, or otherwise endorsed by the FAMSA authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.