BioPython Demo

Last updated: June 10th, 2021

rmotr


Biopython: Multiple Sequence Alignments (MSAs)

We use a multiple sequence alignment (MSA) when we want to align more than two sequences.

MSA algorithms are much more complicated (than pairwise alignments) and thus more computationally expensive.

We have a couple options. We can:

  1. import alignment files produced by an external program for analysis in our notebooks; or
  2. use BioPython to run the applications here.

We're going to use Clustal Omega and do it right here in the notebook because it documents all the code:

https://pypi.org/project/clustalo/

We are still interested in the glycoprotein, but let's back up a bit in the phylogeny and look at the glycoprotein's similarity to other viruses.

"We have concentrated our efforts on the GP gene to understand EBOV evolution for two reasons: first, there are more GP sequences available than any other gene sequence; and second, the proteins it encodes interact directly with the host immune system and therefore are expected to evolve by positive selection." https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4968807/#pone.0160410.s001

The GP is of notable interest:

  • unique editing site
  • causes variants in the resulting GP
    • one of which has been linked to high virulence
  • ssGP
  • sSP
  • SP

The GP interacts directly with the host immune system, so it is a good target for studying how the virus evolves in response to its effects on the human population. It is also known as the spike glycoprotein.

For the most part we perform alignments using established, external programs.

There are plenty of alignment programs out there, but we are going to use clustalo, which is pretty established for performing alignments of multiple sequences.

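Remember, you can always ask for help. The current spelling is -h or --help; the old-style -help used here still works, but clustalo converts it and prints a warning first: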

In [157]:
! clustalo -help
WARNING: Your old-style command-line options were converted to:  clustalo -h -o clustal.aln --outfmt=clustal -v --force
Clustal Omega - 1.2.4 (AndreaGiacomo)

If you like Clustal-Omega please cite:
 Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG.
 Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.
 Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75. PMID: 21988835.
If you don't like Clustal-Omega, please let us know why (and cite us anyway).

Check http://www.clustal.org for more information and updates.

Usage: clustalo [-hv] [-i {<file>,-}] [--hmm-in=<file>]... [--hmm-batch=<file>] [--dealign] [--profile1=<file>] [--profile2=<file>] [--is-profile] [-t {Protein, RNA, DNA}] [--infmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]}] [--distmat-in=<file>] [--distmat-out=<file>] [--guidetree-in=<file>] [--guidetree-out=<file>] [--pileup] [--full] [--full-iter] [--cluster-size=<n>] [--clustering-out=<file>] [--trans=<n>] [--posterior-out=<file>] [--use-kimura] [--percent-id] [-o {file,-}] [--outfmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]}] [--residuenumber] [--wrap=<n>] [--output-order={input-order,tree-order}] [--iterations=<n>] [--max-guidetree-iterations=<n>] [--max-hmm-iterations=<n>] [--maxnumseq=<n>] [--maxseqlen=<l>] [--auto] [--threads=<n>] [--pseudo=<file>] [-l <file>] [--version] [--long-version] [--force] [--MAC-RAM=<n>]

A typical invocation would be: clustalo -i my-in-seqs.fa -o my-out-seqs.fa -v
See below for a list of all options.
                            
Sequence Input:
  -i, --in, --infile={<file>,-} Multiple sequence input file (- for stdin)
  --hmm-in=<file>           HMM input files
  --hmm-batch=<file>        specify HMMs for individual sequences
  --dealign                 Dealign input sequences
  --profile1, --p1=<file>   Pre-aligned multiple sequence file (aligned columns will be kept fix)
  --profile2, --p2=<file>   Pre-aligned multiple sequence file (aligned columns will be kept fix)
  --is-profile              disable check if profile, force profile (default no)
  -t, --seqtype={Protein, RNA, DNA} Force a sequence type (default: auto)
  --infmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} Forced sequence input file format (default: auto)
                            
Clustering:
  --distmat-in=<file>       Pairwise distance matrix input file (skips distance computation)
  --distmat-out=<file>      Pairwise distance matrix output file
  --guidetree-in=<file>     Guide tree input file (skips distance computation and guide-tree clustering step)
  --guidetree-out=<file>    Guide tree output file
  --pileup                  Sequentially align sequences
  --full                    Use full distance matrix for guide-tree calculation (might be slow; mBed is default)
  --full-iter               Use full distance matrix for guide-tree calculation during iteration (might be slowish; mBed is default)
  --cluster-size=<n>        soft maximum of sequences in sub-clusters
  --clustering-out=<file>   Clustering output file
  --trans=<n>               use transitivity (default: 0)
  --posterior-out=<file>    Posterior probability output file
  --use-kimura              use Kimura distance correction for aligned sequences (default no)
  --percent-id              convert distances into percent identities (default no)
                            
Alignment Output:
  -o, --out, --outfile={file,-} Multiple sequence alignment output file (default: stdout)
  --outfmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} MSA output file format (default: fasta)
  --residuenumber, --resno  in Clustal format print residue numbers (default no)
  --wrap=<n>                number of residues before line-wrap in output
  --output-order={input-order,tree-order} MSA output order like in input/guide-tree
                            
Iteration:
  --iterations, --iter=<n>  Number of (combined guide-tree/HMM) iterations
  --max-guidetree-iterations=<n> Maximum number of guidetree iterations
  --max-hmm-iterations=<n>  Maximum number of HMM iterations
                            
Limits (will exit early, if exceeded):
  --maxnumseq=<n>           Maximum allowed number of sequences
  --maxseqlen=<l>           Maximum allowed sequence length
                            
Miscellaneous:
  --auto                    Set options automatically (might overwrite some of your options)
  --threads=<n>             Number of processors to use
  --pseudo=<file>           Input file for pseudo-count parameters
  -l, --log=<file>          Log all non-essential output to this file
  -h, --help                Print this help and exit
  -v, --verbose             Verbose output (increases if given multiple times)
  --version                 Print version information and exit
  --long-version            Print long version information and exit
  --force                   Force file overwriting
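
The notebook does not show the command that produced the alignment file used later (data/GP.clustal). As a hedged sketch, the options listed above could be combined as follows. The helper name build_clustalo_cmd is hypothetical, and the choice of flags is our assumption rather than the command that was actually run:

```python
import os
import shutil
import subprocess


def build_clustalo_cmd(infile, outfile, outfmt="clustal"):
    """Assemble a clustalo command line from options listed in the help text."""
    return [
        "clustalo",
        "-i", infile,          # multiple sequence input file
        "-o", outfile,         # alignment output file
        f"--outfmt={outfmt}",  # output format (clustalo's default is fasta)
        "--force",             # overwrite the output file if it already exists
    ]


# Assumed paths: the input FASTA and the Clustal file used in this notebook.
cmd = build_clustalo_cmd("data/GP.fasta", "data/GP.clustal")
print(" ".join(cmd))

# Only run the aligner when both the binary and the input file are present.
if shutil.which("clustalo") and os.path.exists(cmd[2]):
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems, and check=True raises an error if clustalo exits with a failure status.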
In [69]:
! cat data/GP.fasta
>1976 AF086833.2
GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTTTTCCAGAGTAGGGGT
CGTCAGGTCCTTTTCAATCGTGTAACCAAAATAAACTCCACTAGAAGGATATTGTGGGGCAACAACACAA
TGGGCGTTACAGGAATATTGCAGTTACCTCGTGATCGATTCAAGAGGACATCATTCTTTCTTTGGGTAAT
TATCCTTTTCCAAAGAACATTTTCCATCCCACTTGGAGTCATCCACAATAGCACATTACAGGTTAGTGAT
GTCGACAAACTAGTTTGTCGTGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCG
AAGGGAATGGAGTGGCAACTGACGTGCCATCTGCAACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAAAAAACCTGAC
GGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAG
TATCAGGAACGGGACCGTGTGCCGGAGACTTTGCCTTCCATAAAGAGGGTGCTTTCTTCCTGTATGATCG
ACTTGCTTCCACAGTTATCTACCGAGGAACGACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCC
CAAGCTAAGAAGGACTTCTTCAGCTCACACCCCTTGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCTA
GTGGCTACTATTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACCAATGAGACAGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCACACCACAGTTTCTGCTCCAGCTGAAT
GAGACAATATATACAAGTGGGAAAAGGAGCAATACCACGGGAAAACTAATTTGGAAGGTCAACCCCGAAA
TTGATACAACAATCGGGGAGTGGGCCTTCTGGGAAACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAA
GAGTTGTCTTTCACAGTTGTATCAAACGGAGCCAAAAACATCAGTGGTCAGAGTCCGGCGCGAACTTCTT
CCGACCCAGGGACCAACACAACAACTGAAGACCACAAAATCATGGCTTCAGAAAATTCCTCTGCAATGGT
TCAAGTGCACAGTCAAGGAAGGGAAGCTGCAGTGTCGCATCTAACAACCCTTGCCACAATCTCCACGAGT
CCCCAATCCCTCACAACCAAACCAGGTCCGGACAACAGCACCCATAATACACCCGTGTATAAACTTGACA
TCTCTGAGGCAACTCAAGTTGAACAACATCACCGCAGAACAGACAACGACAGCACAGCCTCCGACACTCC
CTCTGCCACGACCGCAGCCGGACCCCCAAAAGCAGAGAACACCAACACGAGCAAGAGCACTGACTTCCTG
GACCCCGCCACCACAACAAGTCCCCAAAACCACAGCGAGACCGCTGGCAACAACAACACTCATCACCAAG
ATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTCGCAGG
ACTGATCACAGGCGGGAGAAGAACTCGAAGAGAAGCAATTGTCAATGCTCAACCCAAATGCAACCCTAAT
TTACATTACTGGACTACTCAGGATGAAGGTGCTGCAATCGGACTGGCCTGGATACCATATTTCGGGCCAG
CAGCCGAGGGAATTTACATAGAGGGGCTAATGCACAATCAAGATGGTTTAATCTGTGGGTTGAGACAGCT
GGCCAACGAGACGACTCAAGCTCTTCAACTGTTCCTGAGAGCCACAACTGAGCTACGCACCTTTTCAATC
CTCAACCGTAAGGCAATTGATTTCTTGCTGCAGCGATGGGGCGGCACATGCCACATTCTGGGACCGGACT
GCTGTATCGAACCACATGATTGGACCAAGAACATAACAGACAAAATTGATCAGATTATTCATGATTTTGT
TGATAAAACCCTTCCGGACCAGGGGGACAATGACAATTGGTGGACAGGATGGAGACAATGGATACCGGCA
GGTATTGGAGTTACAGGCGTTATAATTGCAGTTATCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGT
TTTTCTTCAGATTGCTTCATGGAAAAGCTCAGCCTCAAATCAATGAAACCAGGATTTAATTATATGGATT
ACTTGAATCTAAGATTACTTGACAAATGATAATATAATACACTGGAGCTTTAAACATAGCCAATGTGATT
CTAACTCCTTTAAACTCACAGTTAATCATAAACAAGGTTTGACATCAATCTAGTTATCTCTTTGAGAATG
ATAAACTTGATGAAGATTAAGAAAAA

>2004 AY526100.1 
ACTTCACTAGAAGGATATTGTGGGGCAACAACACAATGGGTGTTACAGGAATATTGCAGTTACCTCGTGA
TCGATTCAAGAGGACATCATTCTTTCTTTGGGTAATTATCCTTTTCCAAAGAACATTTTCCATCCCACTT
GGAGTCATCCACAATAGCACATTACAGGTTAGTGATGTCGACAAACTGGTTTGCCGTGACAAACTGTCAT
CCACGAATCAATTGAGATCAGTTGGACTGAATCTCGGAGGGAATGGAGTGGCAACTGACGTGCCATCTGC
AACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACCAAAAGTGGTCAATTATGAAGCTGGTGAATGGGCT
GAAAACTGCTACAATCTTGAAATCAAAAAACCTGACGGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGA
TTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAGTATCAGGAACGGGACCGTGTGCCGGAGACTTTGC
CTTCCACAAAGAGGGTGCTTTCTTCCTGTATGATCGACTTGCTTCCACAGTTTTCTACCGAGGAACGACT
TTCGCTGAAGGTGTCGTGGCATTTCTGATACTGCCCCAAGCTAAGAAGGACTTCTTCAGCTCACACCCTT
TGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCTAGTGGCTACTATTCTACCACAATTAAATATCAGGC
TACCGGCTTTGGAACCAATGAGACAGAGTATTTGTTCGAGGTTGACAATTTGACCTACGTCCAACTTGAA
TCAAGATTCACACCACAGTTTCTGGTCCAGCTGAATGAGACAATATATACAAGTGGGAAAAGGAGCAATA
CCACGGGAAAACTAATTTGGAAGGTCAACCCCGAAATTGATACAACAATCGGGGAGTGGGCCTTCTGGGA
AACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAAGAGTTGTCTTTCACAGCTGTATCAAACAGAGCCA
AAAACATCAGTGGTCAGAGTCCGGCGCGAACTTCTTCCGACCCAGGGACCAACACAACAACTGAAGACCA
CAAAATCATGGCTTCAGAAAATTCCTCTGCAATGGTTCAAGTGCACAGTCAAGGAAGGGAAGCTGCAGTG
TCGCATCTGACAACCCCTGCCACAATGTCCACGAGTCTTCAACCCCCCACAACCAAACCAGGTCCGGACA
ACAGCACCCAAAATACACCCGTGTATAAACTTGACATCTCTGAGGCAACTCAAGTTGAACAACATCACCG
CAGAACAGACTACGCCAGCACAACCTCCGACACTCCCCCCGCCACGACCGCAGCCGGACCCCTAAAAGCA
GAGAACACCAACACGAGCAAGGGCACTGACCTCCTGGACCCCGCCACCACAACAAGTCCCCAAAACCACA
GCGAGACCGCTGGCAACAACAACACTCATCACCAAGATACCGGAGAAGAGAGTACCAGCAGCGGGAAGCT
AGGCTGAATTACCAATACTATTGCTGGAGTCGCAGGACTGATCACAGGCGGGAGAAGAACTCGAAGAGAT
GCAATTGTCAATGCTCAACCCAAATGCAACCCTAATTTACATTGCTGGACTACTCAGGATGAAGGTGCTG
CAATCGGACTGGCCTGGATACCATATTTCGGGCCAGCAGCCGAGGGAATTTACACAGAGGGGCTGATGCA
CAAACAAGATGGTTTAATCTGTGGGTTGAGACAGCTGGCCAACGAGACGACTCAAGCTCTTCAACTATTC
CTGAGAGCCACAACCGAGCTACGCACCTTTTCAATCCTCAACCGTAAGGCAATTGATTTCTTGCTGCAGC
GATGGGGCGGCACATGCCACATTTTGGGACCGGACTGCTGTATCGAACCACATGATTGGACTAAGAACAT
AACGGACAAAATTGATCAGATTATTCATGATTTTGTTGATAAAACCCTTCCGGACCAGGGGGACAATGAC
AATTGGTGGACAGGATGGAGACAGTGGATACCGGCAGGTATTGGAGTTACAGGCGTTATAATTGCAGTTA
TCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGTTTTTCTTCAGATTGCTTCATGGCAAAGCTCAGCC
TCAAATCAATGAAACCAGGATTTAATTATATGGATTACTTGAATCTAAGATTACTTGACAAATGATAATG
TAA

>05/2014 KM034550
GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCAGAGTAGGGGT
CATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGATATTGTGAGGCGACAACACAA
TGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATTCAAGAGGACATCATTCTTTCTTTGGGTAAT
TATCCTTTTCCAAAGAACATTTTCCATCCCGCTTGGAGTTATCCACAATAGTACATTACAGGTTAGTGAT
GTCGACAAACTAGTTTGTCGTGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCG
AGGGGAATGGAGTGGCAACTGACGTGCCATCTGTGACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAAAAAACCTGAC
GGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAG
TATCAGGAACGGGACCATGTGCCGGAGACTTTGCCTTCCACAAAGAGGGTGCTTTCTTCCTGTATGATCG
ACTTGCTTCCACAGTTATCTACCGAGGAACGACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCC
CAAGCTAAGAAGGACTTCTTCAGCTCACACCCCTTGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCGA
GTGGCTATTATTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACTAATGAGACAGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCACACCACAGTTTCTGCTCCAGCTGAAT
GAGACAATATATGCAAGTGGGAAGAGGAGCAACACCACGGGAAAACTAATTTGGAAGGTCAACCCCGAAA
TTGATACAACAATCGGGGAGTGGGCCTTCTGGGAAACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAA
GAGTTGTCTTTCACAGCTGTATCAAACGGACCCAAAAACATCAGTGGTCAGAGTCCGGCGCGAACTTCTT
CCGACCCAGAGACCAACACAACAAATGAAGACCACAAAATCATGGCTTCAGAAAATTCCTCTGCAATGGT
TCAAGTGCACAGTCAAGGAAGGAAAGCTGCAGTGTCGCATCTGACAACCCTTGCCACAATCTCCACGAGT
CCTCAACCTCCCACAACCAAAACAGGTCCGGACAACAGCACCCATAATACACCCGTGTATAAACTTGACA
TCTCTGAGGCAACTCAAGTTGGACAACATCACCGTAGAGCAGACAACGACAGCACAGCCTCCGACACTCC
CCCCGCCACGACCGCAGCCGGACCCTTAAAAGCAGAGAACACCAACACGAGTAAGAGCGCTGACTCCCTG
GACCTCGCCACCACGACAAGCCCCCAAAACTACAGCGAGACTGCTGGCAACAACAACACTCATCACCAAG
ATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTAGCAGG
ACTGATCACAGGCGGGAGAAGGACTCGAAGAGAAGTAATTGTCAATGCTCAACCCAAATGCAACCCCAAT
TTACATTACTGGACTACTCAGGATGAAGGTGCTGCAATCGGATTGGCCTGGATACCATATTTCGGGCCAG
CAGCCGAAGGAATTTACACAGAGGGGCTAATGCACAACCAAGATGGTTTAATCTGTGGGTTGAGGCAGCT
GGCCAACGAAACGACTCAAGCTCTCCAACTGTTCCTGAGAGCCACAACTGAGCTGCGAACCTTTTCAATC
CTCAACCGTAAGGCAATTGACTTCCTGCTGCAGCGATGGGGTGGCACATGCCACATTTTGGGACCGGACT
GCTGTATCGAACCACATGATTGGACCAAGAACATAACAGACAAAATTGATCAGATTATTCATGATTTTGT
TGATAAAACCCTTCCGGACCAGGGGGACAATGACAATTGGTGGACAGGATGGAGACAATGGATACCGGCA
GGTATTGGAGTTACAGGTGTTATAATTGCAGTTATCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGT
CTTTCTTCAGATTGTTTCACGGCAAAACTCAACCTCAAATCAATGAAACTAGGATTTAATTATATGAATC
ACTTGAATCTAAGATTACTTGACAAATGATAACATAATACACTGGAGCTTCAAACATAGCCAATGTGATT
CTAACTCCTTTAAACTCACAGTTAATCATAAACAAGGTTTGACATCAATCTAGCTATATCTTTAAGAATG
ATAAACTTGATGAAGATTAAGAAAAA

>08/2014 KP178538
GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCAGAGTAGGGGT
CATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGATATTGTGAGGCGACAACACAA
TGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATTCAAGAGGACATCATTCTTTCTTTGGGTAAT
TATCCTTTTCCAAAGAACATTTTCCATCCCGCTTGGAGTTATCCACAATAGTACATTACAGGTTAGTGAT
GTCGACAAACTAGTTTGTCGTGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCG
AGGGGAATGGAGTGGCAACTGACGTGCCATCTGTGACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAAAAAACCTGAC
GGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAG
TATCAGGAACGGGACCATGTGCCGGAGACTTTGCCTTCCACAAAGAGGGTGCTTTCTTCCTGTATGATCG
ACTTGCTTCCACAGTTATCTACCGAGGAACGACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCC
CAAGCTAAGAAGGACTTCTTCAGCTCACACCCCTTGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCGA
GTGGCTATTATTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACTAATGAGACAGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCACACCACAGTTTCTGCTCCAGCTGAAT
GAGACAATATATGCAAGTGGGAAGAGGAGCAACACCACGGGAAAACTAATTTGGAAGGTCAACCCCGAAA
TTGATACAACAATCGGGGAGTGGGCCTTCTGGGAAACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAA
GAGTTGTCTTTCACAGCTGTATCAAACGGACCCAAAAACATCAGTGGTCAGAGTCCGGCGCGAACTTCTT
CCGACCCAGAGACCAACACAACAAATGAAGACCACAAAATCATGGCTTCAGAAAATTCCTCTGCAATGGT
TCAAGTGCACAGTCAAGGAAGGAAAGCTGCAGTGTCGCATCTGACAACCCTTGCCACAATCTCCACGAGT
CCTCAACCTCCCACAACCAAAACAGGTCCGGACAACAGCACCCATAATACACCCGTGTATAAACTTGACA
TCTCTGAGGCAACTCAAGTTGGACAACATCACCGTAGAGCAGACAACGACAGCACAGCCTCCGACACTCC
CCCCGCCACGACCGCAGCCGGACCCTTAAAAGCAGAGAACACCAACACGAGTAAGAGCGCTGACTCCCTG
GACCTCGCCACCACGACAAGCCCCCAAAACTACAGCGAGACTGCTGGCAACAACAACACTCATCACCAAG
ATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTAGCAGG
ACTGATCACAGGCGGGAGAAGGACTCGAAGAGAAGTAATTGTCAATGCTCAACCCAAATGCAACCCCAAT
TTACATTACTGGACTACTCAGGATGAAGGTGCTGCAATCGGATTGGCCTGGATACCATATTTCGGGCCAG
CAGCCGAGGGAATTTACACAGAGGGGCTAATGCACAACCAAGATGGTTTAATCTGTGGGTTGAGGCAGCT
GGCCAACGAAACGACTCAAGCTCTCCAACTGTTCCTGAGAGCCACAACTGAGCTGCGAACCTTTTCAATC
CTCAACCGTAAGGCAATTGACTTCCTGCTGCAGCGATGGGGTGGCACATGCCACATTTTGGGACCGGACT
GCTGTATCGAACCACATGATTGGACCAAGAACATAACAGACAAAATTGATCAGATTATTCATGATTTTGT
TGATAAAACCCTTCCGGACCAGGGGGACAATGACAATTGGTGGACAGGATGGAGACAATGGATACCGGCA
GGTATTGGAGTTACAGGTGTTATAATTGCAGTTATCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGT
CTTTCTTCAGATTGTTTCACGGCAAAACTCAACCTCAAATCAATGAAACTAGGATTTAATTATATGAATC
ACTTGAATCTAAGATTACTTGACAAATGATAACATAATACACTGGAGCTTCAAACATAGCCAATGTGATT
CTAACTCCTTTAAACTCACAGTTAATCATAAACAAGGTTTGACATCAATCTAGCTATATCTTTAAGAATG
ATAAACTTGATGAAGATTAAGAAAAA

>10/2014 KP759598
GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCAGAGTAGGGGT
CATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGATATTGTGAGGCGACAACACAA
TGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATTCAAGAGGACATCATTCTTTCTTTGGGTAAT
TATCCTTTTCCAAAGAACATTTTCCATCCCGCTTGGAGTTATCCACAATAGTACATTACAGGTTAGTGAT
GTCGACAAACTAGTTTGTCGTGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCG
AGGGGAATGGAGTGGCAACTGACGTGCCATCTGTGACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAAAAAACCTGAC
GGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAG
TATCAGGAACGGGACCATGTGCCGGAGACTTTGCCTTCCACAAAGAGGGTGCTTTCTTCCTGTATGATCG
ACTTGCTTCCACAGTTATCTACCGAGGAACGACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCC
CAAGCTAAGAAGGACTTCTTCAGCTCACACCCCTTGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCGA
GTGGCTATTATTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACTAATGAGGCAGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCACACCACAGTTTCTGCTCCAGCTGAAT
GAGACAATATATGCAAGTGGGAAGAGGAGCAACACCACGGGAAAACTAATTTGGAAGGTCAACCCCGAAA
TTGATACAACAATCGGGGAGTGGGCCTTCTGGGAAACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAA
GAGTTGTCTTTCACAGCTGTATCAAACGGACCCAAAAACATCAGTGGTCAGAGTCCGGCGCGAACTTCTT
CCGACCCAGAGACCAACACAACAAATGAAGACCACAAAAACATGGCTTCAGAAAATTCCTCTGCAATGGT
TCAAGTGCACAGTCAAGGAAGGAAAGCTGCAGTGTCGCATCTGACAACCCTTGCCACAATCTCCACGAGT
CCTCAACCTCCCACAACCAAAACAGGTCCGGACAACAGCACCCATAATACACCCGTGTATAAACTTGACA
TCTCTGAGGCAACTCAAGTTGGACAACATCACCGTAGAGCAGACAACGACAGCACAGCCTCCGACACTCC
CCCCGCCACGACCGCAGCCGGACCCTTAAAAGCAGAGAACACCAACACGAGTAAGAGCGCTGACTCCCTG
GACCTCGCCACCACGACAAGCCCCCAAAACTACAGCGAGACTGCTGGCAACAACAACACTCATCACCAAG
ATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTAGCAGG
ACTGATCACAGGCGGGAGAAGGACTCGAAGAGAAGTAATTGTCAATGCTCAACCCAAATGCAACCCCAAT
TTACATTACTGGACTACTCAGGATGAAGGTGCTGCAATCGGATTGGCCTGGATACCATATTTCGGGCCAG
CAGCCGAAGGAATTTACACAGAGGGGCTAATGCACAACCAAGATGGTTTAATCTGTGGGTTGAGGCAGCT
GGCCAACGAAACGACTCAAGCTCTCCAACTGTTCCTGAGAGCCACAACTGAGCTGCGAACCTTTTCAATC
CTCAACCGTAAGGCAATTGACTTCCTGCTGCAGCGATGGGGTGGCACATGCCACATTTTGGGACCGGACT
GCTGTATCGAACCACATGATTGGACCAAGAACATAACAGACAAAATTGATCAGATTATTCATGATTTTGT
TGATAAAACCCTTCCGGACCAGGGGGACAATGACAATTGGTGGACAGGATGGAGACAATGGATACCGGCA
GGTATTGGAGTTACAGGTGTTATAATTGCAGTTATCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGT
CTTTCTTCAGATTGTTTCACGGCAAAACTCAACCTCAAATCAATGAAACTAGGATTTAATTATATGAATC
ACTTGAATCTAAGATTACTTGACAAATGATAACATAATACACTGGAGCTTCAAACATAGCCAATGTGATT
CTAACTCCTTTAAACTCACAGTTAATCATAAACAAGGTTTGACATCAATCTAGCTATATCTTTAAGAATG
ATAAACTTGATGAAGATTAAGAAAAA

>11/2014 KP759704
GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCAGAGTAGGGGT
CATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGATATTGTGAGGCGACAACACAA
TGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATTCAAGAGGACATCATTCTTTCTTTGGGTAAT
TATCCTTTTCCAAAGAACATTTTCCATCCCGCTTGGAGTTATCCACAATAGTACATTACAGGTTAGTGAT
GTCGACAAACTAGTTTGTCGTGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCG
AGGGGAATGGAGTGGCAACTGACGTGCCATCTGTGACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAAAAAACCTGAC
GGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAG
TATCAGGAACGGGACCATGTGCCGGAGACTTTGCCTTCCACAAAGAGGGTGCTTTCTTCCTGTATGATCG
ACTTGCTTCCACAGTTATCTACCGAGGAACGACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCC
CAAGCTAAGAAGGACTTCTTCAGCTCACACCCCTTGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCGA
GTGGCTATTATTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACTAATGAGACAGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCACACCACAGTTTCTGCTCCAGCTGAAT
GAGACAATATATGCAAGTGGGAAGAGGAGCAACACCACGGGAAAACTAATTTGGAAGGTCAACCCCGAAA
TTGATACAACAATCGGGGAGTGGGCCTTCTGGGAAACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAA
GAGTTGTCTTTCACAGCTGTATCAAACGGACCCAAAAACATCAGTGGTCAGAGTCCGGCGCGAACTTCTT
CCGACCCAGAGACCAACACAACAAATGAAGACCACAAAATCATGGCTTCAGAAAATTCCTCTGCAATGGT
TCAAGTGCACAGTCAAGGAAGGAAAGCTGCAGTGTCGCATCTGACAACCCTTGCCACAATCTCCACGAGT
CCTCAACCTCCCACAACCAAAACAGGTCCGGACAACAGCACCCATAATACACCCGTGTATAAACTTGACA
TCTCTGAGGCAACTCAAGTTGGACAACATCACCGTAGAGCAGACAACGACAGCACAGCCTCCGACACTCC
CCCGGCCACGACCGCAGCCGGACCCTTAAAAGCAGAGAACACCAACACGAGTAAGAGCGCTGACTCCCTG
GACCTCGCCACCACGACAAGCCCCCAAAACTACAGCGAGACTGCTGGCAACAACAACACTCATCACCAAG
ATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTAGCAGG
ACTGATCACAGGCGGGAGAAGGACTCGAAGAGAAGTAATTGTCAATGCTCAACCCAAATGCAACCCCAAT
TTACATTACTGGACTACTCAGGATGAAGGTGCTGCAATCGGATTGGCCTGGATACCATATTTCGGGCCAG
CAGCCGAAGGAATTTACACAGAGGGGCTAATGCACAACCAAGATGGTTTAATCTGTGGGTTGAGGCAGCT
GGCCAACGAAACGACTCAAGCTCTCCAACTGTTCCTGAGAGCCACAACTGAGCTGCGAACCTTTTCAATC
CTCAACCGTAAGGCAATTGACTTCCTGCTGCAGCGATGGGGTGGCACATGCCACATTTTGGGACCGGACT
GCTGTATCGAACCACATGATTGGACCAAGAACATAACAGACAAAATTGATCAGATTATTCATGATTTTGT
TGATAAAACCCTTCCGGACCAGGGGGACAATGACAATTGGTGGACAGGATGGAGACAATGGATACCGGCA
GGTATTGGAGTTACAGGTGTTATAATTGCAGTTATCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGT
CTTTCTTCAGATTGTTTCACGGCAAAACTCAACCTCAAATCAATGAAACTAGGATTTAATTATATGAATC
ACTTGAATCTAAGATTACTTGACAAATGATAACATAATACACTGGAGCTTCAAACATAGCCAATGTGATT
CTAACTCCTTTAAACTCACAGTTAATCATAAACAAGGTTTGACATCAATCTAGCTATATCTTTAAGAATG
ATAAACTTGATGAAGATTAAGAAAAA

>2007 KC242785
GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTTTTCCAGAGTAGGGGT
CATCAGGTCCTTTTCAATCGTATAACCAAAGTAAACTTCACTAGAAGGATATTGTGGGGCAACAACACAA
TGGGTGTCACAGGAATATTGCAGTTACCTCGTGATCGATTCAAGAGGACATCATTCTTTCTTTGGGTAAT
TATCCTTTTCCAAAGAACATTTTCCATCCCACTTGGAGTCATCCACAATAGCACATTACAGGTTAGTGAT
GTCGACAAACTAGTTTGTCGTGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCG
AAGGGAATGGAGTGGCAACTGATGTGCCATCTGCAACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCACC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAAAAAACCTGAC
GGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAG
TATCAGGAACAGGACCGTGTGCCGGAGACTTTGCCTTCCACAAAGAGGGTGCTTTCTTCCTGTATGATCG
ACTTGCTTCCACAGTTATTTACCGAGGGACGACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCC
CAAGCTAAGAAGGACTTCTTCAGCTCACACCCCTTGAGAGAGCCGGTCAATGCAACGGAGGACCCGTCTA
GTGGCTACTATTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACCAATGAGACAGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCACACCACAGTTTCTGCTCCAGCTGAAT
GAGACAATATATGCAAGTGGGAAAAGGAGCAACACCACGGGAAAACTAATTTGGAAAGTCAACCCCGAAA
TTGATACAACAATCGGGGAGTGGGCCTTCTGGGAAACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAA
GAGTTGTCTTTCACAGCTGTATCAAACGGAGCCAAAAACCTCAGTGGTCAGAGTCCGGCGCGAACTTCTT
CCGACCCAAAGACCAACACAACAACTGAAGACCACAAAATCGTGGCTTCAGAAAATTCCTCTGCAATGGT
TCAAGTGCACAGTCAAGGAAGGGAAGCTGCAGTGTCGCATCTGACAACCCTTGCCACAATCTCCACGAGT
CCTCAACCCCCCACAACCAAACCAGGTCCGGACAACAGCACTTATAATACACCCGTATATAAACTTGACA
CCTCTGAGGCAACTCAAGTTGAACAACATCACCGCAGAACAGACAACGACAGCACAGCCTCCGACACTCC
CCCCGCCACGACCGCAGCCGGACACCCAAAAGCAGAGAACACCAACACGAGCAAGAGCGCTGACTCCCTG
GACCCCGCCACCACGACAAGTCCCCCAAACCACAGCGAGACCGCTGGCAACAACAACACTCATCACCAAG
ATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTCGCAGG
ACTGATCACAGGCGGGAGAAGAACTCGAAGAGAAGCAATTGTCAATGCTCAACCCAAATGCAACCCTAAC
TTACATTACTGGACTACTCAGGATGAAGGTGCTGCAATCGGATTGGCCTGGATACCATATTTCGGGCCAG
CAGCCGAGGGAATTTACACAGAGGGGCTAATGCACAATCAAGATGGTTTAATCTGTGGGTTGAGGCAGCT
GGCCAACGAGACGACTCAAGCTCTTCAACTGTTCCTGAGAGCTACAACTGAGCTACGCACCTTTTCAATC
CTCAACCGTAAGGCAATTGATTTCTTGCTGCAGCGATGGGGCGGCACATGCCATATTTTGGGACCGGACT
GCTGTATCGAACCACATGATTGGACCAAGAACATAACAGACAAAATTGATCAGATTATTCATGATTTTGT
TGATAAAACCCTTCCGGACCAGGGGGACAATGACAATTGGTGGACAGGATGGAGACAATGGATACCGGCA
GGTATTGGAGTTACAGGCGTTATAATTGCAGTTATCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGT
TTTTCTTCAGATTGCTTCATGGCAAAGCTCAGCCTCAAATCAATGAAATTAGGATTTAATTATATGGATC
ACTTGAATCTAAGATTACTTGACAAATGATAATATAATACACTGGAGCTTCAAACATAGCCAATGTGATT
CTAACTCCTTTAAACTCACAGTTAATCATAAACAAGGTTTGACATCAATCTAGTTATATCTTTGAGAATG
ATAAACTTGATGAAGATTAAGAAAAA

>2003 KF113528
GATGAAGATTAAGCCGACAGTGAGCGCAATCTTCATCTCTCTTAGATTATTTGTTTTCCAGAGTAGGGGT
CATCAGGTCCTTTCCAATCATATAACCAAAATAAACTTCACTAGAAGGATATTGTGAGGCAACAACACAA
TGGGTATTACAGGAATATTGCAGTTACCTCGTGATCGATTCAAGAGGACATCATTCTTTCTTTGGGTAAT
TATCCTTTTCCAAAGAACATTTTCCATCCCACTTGGAGTCATCCACAATAGCACATTACAAGTTAGTGAT
GTCGACAAACTAGTTTGTCGTGACAAACTGTCATCCACAAATCAATTGAGATCAGTTGGACTGAATCTCG
AAGGGAATGGAGTGGCAACTGACGTGCCATCTGCAACTAAAAGATGGGGCTTCAGGTCCGGTGTCCCTCC
AAAGGTGGTCAATTATGAAGCTGGTGAATGGGCTGAAAACTGCTACAATCTTGAAATCAAAAAACCTGAC
GGGAGTGAGTGTCTACCAGCAGCGCCAGACGGGATTCGGGGCTTCCCCCGGTGCCGGTATGTGCACAAAG
TATCAGGAACGGGACCGTGTGCCGGAGACTTTGCCTTCCACAAAGAGGGTGCTTTCTTCCTGTATGATCG
ACTTGCTTCCACAGTTATCTACCGAGGAACGACTTTCGCTGAAGGTGTCGTTGCATTTCTGATACTGCCC
CAAGCTAAGAAGGACTTCTTCAGCTCACACCCCTTAAGAGAGCCGGTCAATGCAACGGAGGACCCGTCCA
GTGGCTACTATTCTACCACAATTAGATATCAGGCTACCGGTTTTGGAACCAATGAGACGGAGTACTTGTT
CGAGGTTGACAATTTGACCTACGTCCAACTTGAATCAAGATTCACGCCACAGTTTTTGCTCCAGCTGAAT
GAGACAATATATGCAAGTGGGAAAAGGAGCAACACCACGGGAAAACTAATTTGGAAGGTCAACCCCGAAA
TTGATACAACAATCGGGGAGTGGGCCTTCTGGGAAACTAAAAAAACCTCACTAGAAAAATTCGCAGTGAA
GAGTTGTCTTTCACAGCTGTATCAAACGGAGCCAAAGACATCAGTGGTCAGAGTCCGGCGCGAACTTCTT
CCGACCCAGAGACCTACACAACAACTGGAGACCACAAAATCATGGCTTCAGAAGATTCCTCTGCAATGGT
TCAAGTGCACAATCAAGGAAGGGAAGCTGCAGTGTCGCATCTGATAACCTTTGCCACAATCTCCACGAGT
CCTCAATCCCCCACAACCAAACCAGGTCAGGACAACAGCACCCATAATACACCCGTGTATAAACTTGACA
TCTCTGAGGCAACTCAAGTTGAACAACATCATCGCAGAACAGACAACGACAGCACAGCCTCCGACACCCC
CCCCGCCACGACCGCAGCCGGACCCCCAAAAGCAGAGAACATCAACACGAGCAAGAGCGCTGACTCCCTG
GACCCCGCCACCACGACAAGTCCCCAAAACCACAGCGAGACCGCTGGCAACAACAACACTCATCACCAAG
ACACCGGAGAAGAGAGTGCCGGCAGCGGGAAGCTGGGCTCGATTACCAATACTATTGCTGGAGTCGCAGG
ACTGATCACAGGCGGGAGAAGAACTCGAAGAGAAGCAATTGTCAATGCTCAACCCAAATGCAACCCCAAT
CTACATTACTGGACTACTCAGGATGAAGGTGCTGCAATCGGATTGGCCTGGATACCATATTTCGGGCCAG
CAGCCGAGGGAATTTACACAGAGGGGCTAATGCACAATCAAGATGGTTTAATCTGTGGATTGAGGCAGCT
GGCCAATGAGACGACTCAAGCTCTTCAACTGTTCCTGAGAGCCACAACTGAGCTACGCACCTTTTCAATC
CTCAACCGTAAGGCAATTGATTTCTTGCTGCAGCGATGGGGCGGCACATGCCACATTTTGGGACCGGACT
GCTGTATCGAACCACATGATTGGACCAAGAACATAACAGACAAAATTGATCAGATTATTCATGATTTTGT
TGATAAAACCCTTCCGGACCAGGGGGACAATGACAATTGGTGGACTGGATGGAGACAATGGATACCGGCA
GGGATTGGAGTTACAGGGGGTATAATTGCAGTTATCGCTTTATTCTGTATATGCAAATTTGTCTTTTAGT
TTTTCTTTAGATTGCTTCATGGCAAAGCTCAGCCTCAAATCAATGAGATTAGGATTTAATTATATGGATC
ACTTGAATCTAAGATTACTTGACAAATGATAATATAATACACTGGAGCTTTAAATATAGCCAATGTGATT
CTAACTCCTTTAAACTCACAATTAATCATAAACAAGGTTTGACATCAATCTAGTTATATCTTTGAGAATG
ATAAACTTGATGAAGATTAAGAAAAA
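
Each record above starts with a > header, and the first word of that header (for example 1976 or 05/2014) is what shows up as the sequence label in the alignment. A minimal stdlib sketch of that parsing, assuming a hypothetical helper read_fasta and using truncated copies of the first two records:

```python
def read_fasta(text):
    """Return {record_id: sequence} for FASTA-formatted text."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate records in this file
        if line.startswith(">"):
            # The ID is the first whitespace-delimited word after '>'.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {rid: "".join(parts) for rid, parts in records.items()}


# Truncated excerpt of data/GP.fasta, for illustration only.
example = """>1976 AF086833.2
GATGAAGATT
AAGCCGACAG

>2004 AY526100.1
ACTTCACTAG
"""

records = read_fasta(example)
for rid, seq in records.items():
    print(rid, len(seq))
```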


In [13]:
! head -32 data/GP.clustal
CLUSTAL O(1.2.4) multiple sequence alignment


1976         GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTTTTCCA	60
2004         ------------------------------------------------------------	0
05/2014      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
08/2014      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
10/2014      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
11/2014      GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTCTTCCA	60
2007         GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGTTTTCCA	60
2003         GATGAAGATTAAGCCGACAGTGAGCGCAATCTTCATCTCTCTTAGATTATTTGTTTTCCA	60
                                                                         

1976         GAGTAGGGGTCGTCAGGTCCTTTTCAATCGTGTAACCAAAATAAACTCCACTAGAAGGAT	120
2004         --------------------------------------------ACTTCACTAGAAGGAT	16
05/2014      GAGTAGGGGTCATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGAT	120
08/2014      GAGTAGGGGTCATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGAT	120
10/2014      GAGTAGGGGTCATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGAT	120
11/2014      GAGTAGGGGTCATCAGGTCCTTTTCAATTGGATAACCAAAATAAGCTTCACTAGAAGGAT	120
2007         GAGTAGGGGTCATCAGGTCCTTTTCAATCGTATAACCAAAGTAAACTTCACTAGAAGGAT	120
2003         GAGTAGGGGTCATCAGGTCCTTTCCAATCATATAACCAAAATAAACTTCACTAGAAGGAT	120
                                                          ** ************

1976         ATTGTGGGGCAACAACACAATGGGCGTTACAGGAATATTGCAGTTACCTCGTGATCGATT	180
2004         ATTGTGGGGCAACAACACAATGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATT	76
05/2014      ATTGTGAGGCGACAACACAATGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATT	180
08/2014      ATTGTGAGGCGACAACACAATGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATT	180
10/2014      ATTGTGAGGCGACAACACAATGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATT	180
11/2014      ATTGTGAGGCGACAACACAATGGGTGTTACAGGAATATTGCAGTTACCTCGTGATCGATT	180
2007         ATTGTGGGGCAACAACACAATGGGTGTCACAGGAATATTGCAGTTACCTCGTGATCGATT	180
2003         ATTGTGAGGCAACAACACAATGGGTATTACAGGAATATTGCAGTTACCTCGTGATCGATT	180
             ****** *** *************  * ********************************
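
The last line of each block in the Clustal output marks conserved columns: a * appears where every sequence has the same nucleotide, and columns containing a gap or a mismatch are left blank. A small sketch of that rule, where conservation_line is a hypothetical helper and the rows are a toy alignment:

```python
def conservation_line(rows):
    """Return '*' for columns where all rows agree and none is a gap, else ' '."""
    marks = []
    for column in zip(*rows):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Toy alignment: the second row starts with gaps, and column 7 has a mismatch.
rows = [
    "GATACTCCA",
    "---ACTTCA",
    "GATACTTCA",
]
for row in rows:
    print(row)
print(conservation_line(rows))
```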
In [16]:
from Bio import AlignIO
from Bio.Align import AlignInfo

alignment = AlignIO.read('data/GP.clustal', 'clustal')
summary_align = AlignInfo.SummaryInfo(alignment)
summary_align
Out[16]:
<Bio.Align.AlignInfo.SummaryInfo at 0x7fdac87a1550>
In [17]:
dir(summary_align)
Out[17]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_get_all_letters',
 '_get_base_letters',
 '_get_base_replacements',
 '_get_column_info_content',
 '_get_gap_char',
 '_get_letter_freqs',
 '_guess_consensus_alphabet',
 '_pair_replacement',
 'alignment',
 'dumb_consensus',
 'gap_consensus',
 'get_column',
 'ic_vector',
 'information_content',
 'pos_specific_score_matrix',
 'replacement_dictionary']

There are plenty of built-in methods we can use to find out more about our alignment:

In [21]:
summary_align.dumb_consensus()
Out[21]:
Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', SingleLetterAlphabet())
In [22]:
summary_align.gap_consensus()
Out[22]:
Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', SingleLetterAlphabet())
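
To make the idea of a consensus concrete, here is a rough stdlib sketch of a threshold-based consensus. To our understanding, dumb_consensus works along these lines (gaps ignored, most common residue kept if it is frequent enough, otherwise an ambiguity character), while gap_consensus also counts gaps as residues. The function name, the 0.7 threshold, the X placeholder, and the handling of all-gap columns are assumptions of this sketch, not a copy of Biopython's code:

```python
from collections import Counter


def simple_consensus(rows, threshold=0.7, ambiguous="X"):
    """Per column, keep the most common residue if it is frequent enough."""
    consensus = []
    for column in zip(*rows):
        residues = [c for c in column if c not in "-."]  # ignore gap characters
        if not residues:
            consensus.append(ambiguous)  # column contains only gaps
            continue
        counts = Counter(residues).most_common()
        top_residue, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        if not tied and top_count / len(residues) >= threshold:
            consensus.append(top_residue)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


toy = ["ACGT", "ACGA", "AC-A", "TCGA"]
print(simple_consensus(toy))
```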
In [24]:
consensus = str(summary_align.dumb_consensus())
In [25]:
from Bio import SeqUtils

search_cons = "AAAAAAA"
SeqUtils.nt_search(consensus, search_cons)
Out[25]:
['AAAAAAA', 1018]

woo hoo we found the editing site!
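
In the result above, the first element is the search pattern and the number after it is the 0-based start position of the match in the consensus. A plain-Python sketch of the same kind of search, for exact motifs only (find_motif is a hypothetical helper and does not handle ambiguity codes):

```python
def find_motif(sequence, motif):
    """Return every 0-based start position of motif, including overlapping hits."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are also found.
        start = sequence.find(motif, start + 1)
    return positions


print(find_motif("GGAAAAAAATT", "AAAAAAA"))  # a run of exactly seven A's
print(find_motif("GAAAAAAAAT", "AAAAAAA"))   # a run of eight A's matches twice
```

A single hit, as in our consensus, means the run is exactly seven A's long; a longer run would produce several overlapping positions.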

How different are the proteins from each other? - Hamming Distance

After we perform an alignment, we can extract the aligned sequences and calculate the Hamming distance between them.

To compute the Hamming distance we will use the skbio.alignment.TabularMSA class.

This creates an object that holds our alignment as tabular data so we can work with it.
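
For intuition, the Hamming distance between two aligned sequences of equal length is the number of positions at which they differ. As far as we know, scikit-bio's hamming metric reports this as a fraction of the sequence length, so the sketch below does the same. The helper name and the toy sequences are ours, and treating a gap as an ordinary mismatch is an assumption of this sketch:

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (align them first)")
    if not a:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


# Toy aligned sequences; a gap ('-') counts as a mismatch against a base here.
aligned = {
    "seq1": "GATGAAGA",
    "seq2": "--TGAAGA",
    "seq3": "GATGAGGA",
}

distances = {
    (i, j): hamming_fraction(aligned[i], aligned[j])
    for i, j in combinations(aligned, 2)
}
for pair, dist in distances.items():
    print(pair, dist)
```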

First let's convert the alignment file into a fasta file using AlignIO's convert method.

We do this because we can then index (or label) the sequences by the IDs in the header lines of our fasta file:

In [38]:
from Bio import AlignIO

fasta_aln = AlignIO.convert('data/GP.clustal', 'clustal', 'data/GP_aligned.fasta', 'fasta')
fasta_aln
Out[38]:
1

The output here is 1, the number of alignments converted; we should now see the new file, data/GP_aligned.fasta, in the data directory.
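
To see what such a conversion involves, here is a simplified stdlib sketch that reads Clustal-formatted text and writes aligned FASTA. It is not how AlignIO is implemented; parse_clustal and to_fasta are hypothetical helpers, and the parser assumes sequence labels contain no spaces:

```python
def parse_clustal(text):
    """Collect aligned sequences from Clustal text as {label: sequence}."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and conservation lines
        # (conservation lines begin with whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()  # label, sequence chunk, optional residue count
        seqs[fields[0]] = seqs.get(fields[0], "") + fields[1]
    return seqs


def to_fasta(seqs, width=60):
    """Format {label: sequence} as FASTA text wrapped at `width` columns."""
    out = []
    for name, seq in seqs.items():
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Tiny made-up alignment in the same layout as data/GP.clustal.
clustal_text = """CLUSTAL O(1.2.4) multiple sequence alignment


1976         GATG\t4
2004         --TG\t2
               **

1976         AAGA\t8
2004         AAGA\t6
             ****
"""

fasta_text = to_fasta(parse_clustal(clustal_text))
print(fasta_text)
```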

In [13]:
from skbio import DNA, TabularMSA

# create a `TabularMSA` object from our newly converted fasta file, specifying that the sequences are DNA
tabbed_alignment = TabularMSA.read('data/GP_aligned.fasta', format='fasta', constructor=DNA)

# replace the default integer index (0, 1, 2, ...) with the IDs from the fasta headers
tabbed_alignment.reassign_index(minter='id')
tabbed_alignment
Out[13]:
TabularMSA[DNA]
-----------------------------------------------------------------------
Stats:
    sequence count: 8
    position count: 2406
-----------------------------------------------------------------------
GATGAAGATTAAGCCGACAGTGAGCGTAATCTT ... GAGAATGATAAACTTGATGAAGATTAAGAAAAA
--------------------------------- ... ---------------------------------
...
GATGAAGATTAAGCCGACAGTGAGCGTAATCTT ... GAGAATGATAAACTTGATGAAGATTAAGAAAAA
GATGAAGATTAAGCCGACAGTGAGCGCAATCTT ... GAGAATGATAAACTTGATGAAGATTAAGAAAAA

skbio is really awesome because it stores the alignment in a tidy, table-like object with a pandas-style index, so we can use what we know about dataframes to access the aligned sequences. Let's check out the first:

In [29]:
tabbed_alignment[0]
Out[29]:
DNA
----------------------------------------------------------------------
Metadata:
    'description': ''
    'id': '1976'
Stats:
    length: 2406
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 45.64%
----------------------------------------------------------------------
0    GATGAAGATT AAGCCGACAG TGAGCGTAAT CTTCATCTCT CTTAGATTAT TTGTTTTCCA
60   GAGTAGGGGT CGTCAGGTCC TTTTCAATCG TGTAACCAAA ATAAACTCCA CTAGAAGGAT
...
2340 AACAAGGTTT GACATCAATC TAGTTATCTC TTTGAGAATG ATAAACTTGA TGAAGATTAA
2400 GAAAAA
In [1]:
from skbio import DistanceMatrix
from skbio.sequence.distance import hamming

dm = DistanceMatrix.from_iterable(tabbed_alignment, metric=hamming, keys=tabbed_alignment.index)
print(dm)
/usr/local/lib/python3.8/site-packages/skbio/util/_testing.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as pdt
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-13d494d2818f> in <module>
      2 from skbio.sequence.distance import hamming
      3 
----> 4 dm = DistanceMatrix.from_iterable(tabbed_alignment, metric=hamming, keys=tabbed_alignment.index)
      5 print(dm)

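NameError: name 'tabbed_alignment' is not defined

This NameError is an execution-order problem rather than a bug in the code: the cell was run in a fresh kernel (note the In [1] counter) before the cell that defines tabbed_alignment. Re-running the TabularMSA cell first should let this cell compute the distance matrix.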
In [49]:
# let's customize a bit

plot = dm.plot(cmap = 'plasma')
In [73]:
! clustalo -i data/GP.fasta --distmat-out=data/GP_percentage.mat --full --percent-id 
In [77]:
! cat data/GP_percentage.mat
8
1976    100.000000 97.745053 96.425603 96.467165 96.342477 96.425603 98.004988 97.049044
2004    97.745053 100.000000 95.720202 95.766222 95.628164 95.674183 96.962724 96.180396
05/2014 96.425603 95.720202 100.000000 99.958437 99.916874 99.958437 96.799667 96.051538
08/2014 96.467165 95.766222 99.958437 100.000000 99.875312 99.916874 96.841230 96.093101
10/2014 96.342477 95.628164 99.916874 99.875312 100.000000 99.875312 96.674979 95.926850
11/2014 96.425603 95.674183 99.958437 99.916874 99.875312 100.000000 96.758105 96.009975
2007    98.004988 96.962724 96.799667 96.841230 96.674979 96.758105 100.000000 97.381546
2003    97.049044 96.180396 96.051538 96.093101 95.926850 96.009975 97.381546 100.000000
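
The matrix file starts with the number of sequences, followed by one row per sequence: a label and then the percent identity to every sequence in order. A hedged sketch for reading it and picking out the most similar pair, using a three-sequence subset of the values above (parse_distmat and most_similar_pair are hypothetical helpers, and labels are assumed to contain no spaces):

```python
def parse_distmat(text):
    """Parse a clustalo-style matrix file into (labels, rows of floats)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])  # first line holds the number of sequences
    labels, rows = [], []
    for line in lines[1:n + 1]:
        fields = line.split()
        labels.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    return labels, rows


def most_similar_pair(labels, rows):
    """Return (label_a, label_b, percent_identity) for the closest distinct pair."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):  # upper triangle, skips the diagonal
            if best is None or rows[i][j] > best[2]:
                best = (labels[i], labels[j], rows[i][j])
    return best


# Subset of data/GP_percentage.mat (first three sequences only).
subset = """3
1976    100.000000 97.745053 96.425603
2004    97.745053 100.000000 95.720202
05/2014 96.425603 95.720202 100.000000
"""

labels, rows = parse_distmat(subset)
best = most_similar_pair(labels, rows)
print(best)
```

In the full matrix the 2014 isolates are nearly identical to one another (above 99.8%), while the older isolates differ from them by roughly 3 to 4 percent.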