yabul
Yet Another Bioinformatics Utility Library
Functions
read_fasta(): Parse a FASTA file to a pandas DataFrame.
write_fasta(): Write sequences to a FASTA file.
align_pair(): Align two protein or DNA sequences.
yabul.read_fasta(filename)
Parse a FASTA file to a pandas DataFrame.
Compression is supported (via pandas read_csv) and is inferred by extension: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’.
- Parameters
filename (string) – Path of the FASTA file to read.
- Returns
pandas.DataFrame with columns “description” and “sequence”. The index of the
DataFrame is the “sequence ID”, i.e. the first space-separated token of the
description.
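To make the relationship between the description and the sequence ID concrete, here is a minimal standard-library sketch of FASTA parsing. It is not yabul's implementation: the helper name parse_fasta and the sample records are hypothetical, and it yields plain tuples rather than building a DataFrame.

```python
import io


def parse_fasta(handle):
    """Yield (sequence_id, description, sequence) tuples from an open FASTA handle.

    Hypothetical helper for illustration only; yabul.read_fasta returns a
    pandas DataFrame instead.
    """
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description.split(" ", 1)[0], description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if description is not None:
        yield description.split(" ", 1)[0], description, "".join(chunks)


# Made-up example input: two records, the first wrapped over two lines.
example = io.StringIO(">sp|P1 example protein\nMKV\nLAA\n>seq2\nACGT\n")
records = list(parse_fasta(example))
for sequence_id, description, sequence in records:
    print(sequence_id, "|", description, "|", sequence)
```

The sequence ID is taken as the first space-separated token of the header line, which matches the index described above.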
yabul.write_fasta(filename, sequences)
Write sequences to a FASTA file.
- Parameters
filename (string) – File to write. If it ends with ‘.gz’ the file will be gzip compressed.
sequences (iterable of (name, sequence) pairs) – Sequences to write. Both name and sequence should be strings.
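The behaviour described above (one record per (name, sequence) pair, gzip compression chosen by the '.gz' suffix) can be sketched with the standard library alone. This is an illustration under stated assumptions, not yabul's code: write_fasta_sketch is a hypothetical name, and the sketch writes each sequence on a single line without wrapping.

```python
import gzip
import os
import tempfile


def write_fasta_sketch(filename, sequences):
    """Write (name, sequence) pairs as FASTA records.

    Hypothetical stand-in for illustration; assumes gzip compression is used
    only when the filename ends with '.gz'.
    """
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "wt") as handle:
        for name, sequence in sequences:
            handle.write(f">{name}\n{sequence}\n")


# Made-up example data.
pairs = [("seq1", "MKVLAA"), ("seq2", "ACGT")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta.gz")
    write_fasta_sketch(path, pairs)
    # Read the compressed file back to check the round trip.
    with gzip.open(path, "rt") as handle:
        round_trip = handle.read()

print(round_trip)
```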
yabul.align_pair(query_seq, reference_seq, local=False, gap_open_penalty=11, gap_extension_penality=1, substitution_matrix='blosum62', alignment_function=None)
Align two protein or DNA sequences.
By default, a protein substitution matrix (blosum62) is used. If you are aligning DNA or RNA, you should use a nucleotide substitution matrix by passing, for example, substitution_matrix="dnafull".
This is a thin wrapper over the Parasail library implementation.
Returns a pandas.Series with the results of the alignment.
- Parameters
query_seq (string) – First sequence to align.
reference_seq (string) – Second sequence to align.
local (boolean) –
If True, a local alignment is performed using the Smith-Waterman algorithm. This means that gaps at the beginning or end of the sequences are not penalized, and only the parts of the sequences that align are returned.
If False, a global alignment is performed using the Needleman-Wunsch algorithm. This means that the two sequences will be aligned in their entirety.
gap_open_penalty (int) – Penalty for starting a gap.
gap_extension_penality (int) – Penalty for extending a gap.
substitution_matrix (string) –
Name of substitution matrix. Examples: “blosum62”, “blosum90”, “dnafull”, “pam100”. If you are aligning DNA or RNA you should use a nucleotide substitution matrix, such as “dnafull”.
Full list of supported matrices: https://github.com/jeffdaily/parasail/tree/master/parasail/matrices
alignment_function (function) – Advanced use. If you know the underlying parasail alignment function you would like to use, you can pass it here. Otherwise a reasonable default is used.
- Returns
query (string) – Aligned query sequence.
reference (string) – Aligned reference sequence.
correspondence (string) – Characters (similar to the BLAST "midline") indicating the correspondence between the query and reference strings.
score (int) – Alignment score. Higher indicates a better alignment.
- Return type
pandas.Series with the keys listed above
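The difference between local=True (Smith-Waterman) and local=False (Needleman-Wunsch) is easiest to see on a toy scoring scheme. The sketch below is a deliberately simplified, score-only dynamic program and is not what yabul or Parasail run: the function name alignment_score, the match/mismatch scores, and the single linear gap penalty are all assumptions chosen for illustration, in place of blosum62 and the affine gap_open_penalty / gap_extension_penality pair.

```python
def alignment_score(a, b, local=False, match=2, mismatch=-1, gap=-2):
    """Return the best alignment score of a and b under a toy scoring scheme.

    Hypothetical helper: linear gap penalty and fixed match/mismatch scores,
    unlike the affine gaps and substitution matrices used by align_pair.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a cell never drops below zero, so an
                # alignment can start anywhere, and the best cell wins.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global alignment must span both sequences end to end.
    return best if local else score[-1][-1]


# Made-up sequences: a short motif embedded in a longer sequence.
global_score = alignment_score("TTTACGTTTT", "ACG")
local_score = alignment_score("TTTACGTTTT", "ACG", local=True)
print("global:", global_score, "local:", local_score)
```

The global score is dragged down by the gaps needed to span the longer sequence, while the local score reflects only the matching region. With yabul itself, the same choice is made through the local argument of align_pair.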