Build Status

Downloads

Prwlr - profiles crawler

Prwlr integrates Genetic Interactions and Phylogenetic Profiles.

Nothing is more fun that BLASTing each protein sequence from the organisms of interest!

Prwlr uses KEGG Orthology to determine who is the ortholog of whom. You don’t have to download it manually - Prwlr uses its API!

We all love to use 20-or-so different software pieces just to annotate one network! And to store the profiles in some unintelligible form!

Phylogenetic Profiles are simple python objects. They are represented as binary lists with characters of choice (but + and - are my favourite) and hold a couple of small-but-useful methods.

Bioinformatics software is difficult to create so it should be hard for someone else!

Prwlr is numpy- and pandas-based wherever possible. It integrates well with pandas.DataFrames.

Let’s use Prwlr!

Get the Phylogenetic Profiles for each of the organism’s ORF.

import prwlr as prwl

species=[
    'Aeropyrum pernix',
    'Agrobacterium fabrum',
    'Arabidopsis thaliana',
    'Bacillus subtilis',
    'Caenorhabditis elegans',
    'Chlamydophila felis',
    'Dictyostelium discoideum',
    'Drosophila melanogaster',
    'Escherichia coli',
    'Homo sapiens',
    'Plasmodium falciparum',
    'Staphylococcus aureus',
    'Sulfolobus islandicus',
    'Tetrahymena thermophila',
    'Trypanosoma cruzi',
    'Volvox carteri',
]

profiles = prwlr.profilize_organism(
    organism="Saccharomyces cerevisiae",
    reference_species=species
)

profiles.head()
ORF_ID PROF
YNL113W -+–+-+-++–+++
YNL130C ——–+——
YNL141W -++-+-++++—++
YNL151C -+–+—+——
YNL162W ++–+-+-++-++++

Please note that KEGG Orthology put restrictions on the download bandwidth. Profilizing the ORFs set for an entire organism can take up to several minutes

You’re short on time and interested in just a subset of ORFs? Use restrict_to parameter:

profiles = prwl.profilize_organism(
    organism="Saccharomyces cerevisiae",
    reference_species=species,
    restrict_to=["YNL113W", "YNL130C", "YNL141W", "YNL151C"],
)

The order of the organisms in the profile’s .query attribute is always imposed by prwlr by sorting and removing the duplicates. Thanks to that, you will always get the same profile for the same set of organisms, no need to worry about that.

Parse your Genetic Interactions network. It can come from the widely-known Costanzo Network or from any other source.

ExN_NxE = prwl.read_sga('./SGA_ExN_NxE.txt')

OK, now let’s integrate it…

ExN_NxE_profiles = prwl.merge_sga_profiles(
    ExN_NxE,
    profiles,
)

…and calculate the distances between the profiles!

ExN_NxE_profiles_pss = prwl.calculate_pss(
    ExN_NxE_profiles,
    method='jaccard',
)

How does it look now?

ExN_NxE_profiles_pss
ORF_Q GENE_Q ENTRY_Q PROF_Q ORF_A GENE_A ENTRY_A PROF_A GIS SMF_Q SMF_A DMF PSS
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL110C gde1 K18696 ——+——–+ 0.0219 0.8542 1.0235 0.8962 0.8333333
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL115C bem3 K19840 —————- 0.0121 0.8542 0.9865 0.8547 1.0
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL116W hos3 K11484 —————- -0.0147 0.8542 1.01 0.8481 1.0
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL119C dbp1 K11594 -+–+-++-++–+-+ -0.0036 0.8542 1.013 0.8617 0.5555556
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL120W vps30 K08334 -+–+-++-+—-++ -0.0488 0.8542 0.871 0.6952 0.2857143
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL127C hho1 K11275 -+–+-++-+—–+ 0.0082 0.8542 0.996 0.8589 0.42857143
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL134C odc1 K15110 —-+-++-+—–+ -0.0139 0.8542 1.025 0.8616 0.5714286
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL135W isu1 K22068 -+–+-++-++–++- 0.0368 0.8542 0.9295 0.8308 0.375
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL138C spp1 K14960 ——-+-+—— -0.0763 0.8542 0.9973 0.7756 0.6
YBL097W brn1-16 K06676 -+—-++-+—-+- YPL140C mkk2 K08294 —————- -0.025 0.8542 1.011 0.8386 1.0

Maybe you would like to see what’s inside on of the profiles?

ExN_NxE_profiles_pss.iloc[0].PROF_A.get_present()
['DDI', 'VCN']

With something more human-readable?

IDs_names = prwl.get_IDs_names(species)

[
    IDs_names[i]
    for i in ExN_NxE_profiles_pss.iloc[0].PROF_A.get_present()
]
['Dictyostelium discoideum', 'Volvox carteri']