yabul

Yet Another Bioinformatics Utility Library

Functions

yabul.read_fasta(filename)

Parse a fasta file to a pandas DataFrame.

Compression is supported (via pandas read_csv) and is inferred by extension: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’.

Parameters

filename (string) – FASTA file to read.

Returns

pandas.DataFrame with columns “description” and “sequence”. The index of the DataFrame is the “sequence ID”, i.e. the first space-separated token of the description.
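
To make the “sequence ID” rule concrete, here is a minimal standard-library sketch of how a FASTA header splits into an ID and a description. It is not yabul's implementation (which goes through pandas); the helper name `parse_fasta` and the example records are made up for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (sequence_id, description, sequence) tuples from a FASTA handle.

    Hypothetical helper for illustration only; not part of yabul.
    """
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description.split(" ")[0], description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if description is not None:
        yield description.split(" ")[0], description, "".join(chunks)


example = io.StringIO(">sp|P12345 example protein\nMKT\nAYI\n>seq2\nACGT\n")
records = list(parse_fasta(example))
for sequence_id, description, sequence in records:
    print(sequence_id, "|", description, "|", sequence)
```

In the first record the ID is `sp|P12345` while the description keeps the full header text; in the second record the header has no spaces, so the ID and the description are identical.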

yabul.write_fasta(filename, sequences)

Write sequences to a FASTA file.

Parameters
  • filename (string) – File to write. If it ends with ‘.gz’ the file will be gzip compressed.

  • sequences (iterable of (name, sequence) pairs) – Sequences to write. Both name and sequence should be strings.
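
The behaviour described above (one record per (name, sequence) pair, gzip chosen by the ‘.gz’ suffix) can be sketched with the standard library alone. This is an assumption-laden illustration, not yabul's code: the helper name `write_fasta_sketch`, the one-line-per-sequence layout, and the temporary file path are all hypothetical.

```python
import gzip
import os
import tempfile


def write_fasta_sketch(filename, sequences):
    """Write (name, sequence) pairs as FASTA records.

    Hypothetical stand-in for illustration; gzip is used only when the
    filename ends with '.gz', mirroring the documented behaviour.
    """
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "wt") as handle:
        for name, sequence in sequences:
            # Assumed layout: header line, then the sequence on one line.
            handle.write(f">{name}\n{sequence}\n")


pairs = [("seq1", "MKTAYI"), ("seq2", "ACGT")]
path = os.path.join(tempfile.mkdtemp(), "example.fasta.gz")
write_fasta_sketch(path, pairs)

# Read the compressed file back to confirm what was written.
with gzip.open(path, "rt") as handle:
    written = handle.read()
print(written)
```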

yabul.align_pair(query_seq, reference_seq, local=False, gap_open_penalty=11, gap_extension_penality=1, substitution_matrix='blosum62', alignment_function=None)

Align two protein or DNA sequences.

By default, a protein substitution matrix (blosum62) is used. If you are aligning DNA or RNA, you should use a nucleotide substitution matrix by passing, for example, substitution_matrix="dnafull".

This is a thin wrapper over the Parasail library implementation.

Returns a pandas.Series with the results of the alignment.

Parameters
  • query_seq (string) – First sequence to align.

  • reference_seq (string) – Second sequence to align.

  • local (boolean) –

    If True, a local alignment is performed using the Smith-Waterman algorithm. This means that gaps at the beginning or end of the sequences are not penalized, and only the aligned portions of the sequences are returned.

    If False, a global alignment is performed using the Needleman-Wunsch algorithm. This means that the two sequences will be aligned in their entirety.

  • gap_open_penalty (int) – Penalty for starting a gap.

  • gap_extension_penality (int) – Penalty for extending a gap.

  • substitution_matrix (string) –

    Name of substitution matrix. Examples: “blosum62”, “blosum90”, “dnafull”, “pam100”. If you are aligning DNA or RNA you should use a nucleotide substitution matrix, such as “dnafull”.

    Full list of supported matrices: https://github.com/jeffdaily/parasail/tree/master/parasail/matrices

  • alignment_function (function) – Advanced use. If you know the underlying parasail alignment function you would like to use, you can pass it here. Otherwise a reasonable default is used.

Returns

pandas.Series with keys:

  • query (string) – Aligned query sequence.

  • reference (string) – Aligned reference sequence.

  • correspondence (string) – Characters (similar to the BLAST “midline”) indicating the correspondence between the query and reference strings.

  • score (int) – Alignment score. Higher indicates a better alignment.
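
The difference between the global and local modes can be seen in a small dynamic-programming sketch. It is deliberately simplified and is not what align_pair runs: it assumes a flat match/mismatch score and a linear gap penalty instead of a substitution matrix with separate gap-open and gap-extension penalties, and the function name `alignment_score` is hypothetical.

```python
def alignment_score(query, reference, local=False, match=1, mismatch=-1, gap=-1):
    """Return the best alignment score under a simplified scoring scheme.

    Illustration only: linear gap penalty and flat match/mismatch scores,
    unlike the affine-gap, matrix-based scoring used by parasail.
    """
    rows, cols = len(query) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global (Needleman-Wunsch): leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == reference[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local (Smith-Waterman): scores never drop below zero,
                # and the best cell anywhere in the table wins.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


global_score = alignment_score("AATESTDD", "TEST")
local_score = alignment_score("AATESTDD", "TEST", local=True)
print("global:", global_score, "local:", local_score)
```

Under these assumed scores the global alignment must pay for the four unmatched flanking residues and scores 0, while the local alignment keeps only the matching `TEST` region and scores 4.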