Nov 17 10

New versions: MEME and InterProScan

by A. Bretaudeau

Hello all,

Just a few words to keep you informed about the progress of the Dr Motifs project.

InterProScan

InterProScan 4.7 was released a few days ago, and we have installed this new version on our cluster.

The main new feature is that it now uses hmmer 3 for the Gene3D and Pfam databanks. Hopefully, this means faster and better results!

The next big release planned is 5.0 (no release date yet known). According to the authors, it will be a complete rewrite of the tool.

Our web form is still available at iprscan.genouest.org and as a webservice on our Opal server.

The EBI has also created a new web form. The results visualization seems more convenient, but I experienced a few instabilities while testing it… I will test it more extensively once it becomes more stable.

MEME suite

We have updated MEME to the brand new 4.5 version. The main change concerns the TOMTOM application, which lets you compare a motif against a database of known motifs from databanks like Jaspar. The submission form is now more usable and a proper webservice is available.

GenOuest is an alternate MEME server


We are also proud to appear now in the list of alternate servers for MEME!

Web interface for the MEME suite is available here.

More details on how to use this software suite will come in one of my next articles.

What’s next

In the next weeks, I will be:

  • preparing a training session on motifs study;
  • installing new tools;
  • writing more tutorials about these tools;
  • finishing a little surprise for drmotifs website.

Just a word about the tools I am going to install: now that the most useful publicly available tools are deployed, I am going to focus on some tools developed by the Symbiose INRIA team we are working with.

Oct 15 10

Sequence hammering

by A. Bretaudeau

Hi all,

As promised in my last post, I’m going to show you what the Hmmer tools are and how you can use them for your sequence analysis.

First, you have to know that Hmmer is a collection of tools dedicated to the manipulation of Hidden Markov Models. So it is particularly useful for studying complex motifs with subtle signals. And it is designed to work with protein sequences.

Hmmer 3 is a suite of 12 tools (hmmalign, hmmbuild, hmmconvert, hmmemit, hmmfetch, hmmpress, hmmscan, hmmsearch, hmmsim, hmmstat, jackhmmer and phmmer). But don’t be afraid: with only 5 of them, you can already do great things! And there is good documentation for each tool on the official website.

The most common things you can do are summarized in the following figure:

Hmmer usage

So let’s have a look at the most important tools.

From sequences to HMM: hmmbuild

Let’s suppose you have a set of sequences and you think that they contain a common motif. With hmmer, you can build an HMM representing this motif. To do this, there are basically two steps:

  1. First create a multiple alignment of your sequences using a program like clustalw (there are other programs capable of doing this task, but this is not really the topic of this post);
  2. Give this multiple alignment to the hmmbuild tool included in the suite.

Several alignment formats can be read by hmmbuild: aln (clustalw output format), or Stockholm for example.

The output of hmmbuild is a text file representing the found HMM.

How to use it?

Hmmbuild is available on our platform:

Mobyle web interface

SOAP webservice on our Opal server

Alternatively, the whole hmmer suite is available from command line on genocluster2: just issue the command “source /local/env/envhmmer-3.0” (or “. /local/env/envhmmer-3.0.sh” if you use bash) and you’ll get access to the 12 hmmer tools.

Blastp-like search: phmmer and jackhmmer

The first rudimentary version of Jackhmmer

Hmmer comes with two other useful tools: phmmer and jackhmmer. They do the same kind of analysis as blastp and psi-blast respectively.

Phmmer takes a protein sequence as input and searches for similar sequences in a protein databank (NR for example).

Jackhmmer does the same work as phmmer, except that it repeats it iteratively. This means that, as psi-blast does, it launches phmmer a first time, then looks at the results, selects the best matches to the query sequence, builds a new HMM from them and searches the databank again for new similar sequences. You can set the maximum number of iterations (and if you set it to 1, it will behave exactly like a normal phmmer search).
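To make the iterative idea more concrete, here is a small Python sketch of the control flow only. It is a toy and not how jackhmmer is implemented: the "model" is just a set of 3-letter words and the "search" keeps every databank sequence sharing one of these words, whereas the real tool builds a profile HMM and filters hits on statistical scores. All function names and the tiny databank are made up for the illustration.

```python
def kmers(sequence, k=3):
    """Return the set of all k-letter words found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_model(sequences):
    """Toy 'model': the union of the 3-letter words of all given sequences."""
    model = set()
    for sequence in sequences:
        model |= kmers(sequence)
    return model


def search(model, databank):
    """Toy 'search': keep databank sequences sharing a word with the model."""
    return {sequence for sequence in databank if kmers(sequence) & model}


def iterative_search(query, databank, max_iterations=5):
    """Search, rebuild the model from the hits, and search again."""
    model = build_model([query])
    hits = set()
    for _ in range(max_iterations):
        new_hits = search(model, databank)
        if new_hits == hits:
            break  # nothing new was found, so stop early
        hits = new_hits
        model = build_model([query, *hits])
    return hits


databank = ["MKVIG", "VIGHH", "GHHPP", "WWWWW"]  # hypothetical sequences
print(sorted(iterative_search("MKVLA", databank, max_iterations=1)))
print(sorted(iterative_search("MKVLA", databank)))
```

With a single iteration only the closest sequence is found. With more iterations, each round widens the model a little, so more distant relatives are picked up while the unrelated sequence is still left out.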

You may wonder why you should use phmmer/jackhmmer (or not) instead of the traditional blastp/psi-blast. Performance is the answer: it seems that phmmer runs a bit faster than the good old blastp (well, I’ve only done a quick test, so check it with your data). But keep in mind that it only works for protein sequences. And you’re not guaranteed to get the same results as with blastp (scores are not identical). So try it and see if you’re happy with it!

How to use it?

Both tools are available on our platform:

Phmmer on Mobyle and Jackhmmer on Mobyle

Phmmer SOAP webservice and Jackhmmer SOAP webservice

Alternatively, the whole hmmer suite is available from command line on genocluster2: just issue the command “source /local/env/envhmmer-3.0” (or “. /local/env/envhmmer-3.0.sh” if you use bash) and you’ll get access to the 12 hmmer tools.

Searching some known HMM in a new sequence: hmmscan

Do you remember how InterProScan works? Hmmscan is quite similar: it takes a fasta sequence as input and searches it for occurrences of the HMMs registered in a specific databank. The main HMM databank used by hmmscan is Pfam, a databank of HMMs representing protein families. In fact there are two sections in Pfam: Pfam-A is a manually curated collection of protein families, while Pfam-B is of somewhat lower quality, as it contains automatically generated families. So the usual process is to search first using Pfam-A and, if you don’t get results, to search using Pfam-B.

InterProScan does the same kind of thing, but it uses several tools to search within several databanks. In fact, when you use InterProScan, you already use hmmer: in the list of programs used by InterProScan, there is hmmpfam, which is the ancestor of hmmscan (in hmmer 2).

How to use it?

Hmmscan is available on our platform:

Hmmscan on Mobyle

Hmmscan SOAP webservice

Alternatively, the whole hmmer suite is available from command line on genocluster2: just issue the command “source /local/env/envhmmer-3.0” (or “. /local/env/envhmmer-3.0.sh” if you use bash) and you’ll get access to the 12 hmmer tools.

Searching for sequences using a HMM: hmmsearch

Hmmsearch does the opposite of hmmscan: you start from an HMM and then search sequence databanks for sequences matching the HMM you’re interested in.

Using it is quite simple: just give an HMM and select a (protein) sequence databank to search in. Instead of the sequence databank, you can also give a fasta file containing some specific sequences.

The result is a list of matches with corresponding scores.

How to use it?

Hmmsearch is available on our platform:

Hmmsearch on Mobyle

Hmmsearch SOAP webservice

Alternatively, the whole hmmer suite is available from command line on genocluster2: just issue the command “source /local/env/envhmmer-3.0” (or “. /local/env/envhmmer-3.0.sh” if you use bash) and you’ll get access to the 12 hmmer tools.

HMM retrieving: hmmfetch

Hmmfetch is another tool which lets you retrieve a HMM from a databank (Pfam for example). You just have to give a list of HMM identifiers and hmmfetch will give you the whole HMM file. Identifiers can be for example PF00045 or Caudal_act.

How to use it?

Hmmfetch is available on our platform:

Hmmfetch on Mobyle

Hmmfetch SOAP webservice

Alternatively, the whole hmmer suite is available from command line on genocluster2: just issue the command “source /local/env/envhmmer-3.0” (or “. /local/env/envhmmer-3.0.sh” if you use bash) and you’ll get access to the 12 hmmer tools.

Other tools

Other hmmer tools are also available on the platform (web interfaces on Mobyle and SOAP webservices on Opal), though they are not very useful for most analyses.

That’s it for today!

Sep 22 10

Some news: hmmer 3 and meme

by A. Bretaudeau

Hi all,

September has been a busy month (and it’s not finished…)! But finally, I found some time to work on two things:

Hmmer 3

Copyright © Michael Jastremski

Hmmer is a suite of tools for the analysis of HMM motifs (remember my post about HMM?).

Hmmer 3 was officially released on 28 March 2010 and we have installed it on our cluster. If you have an account, you just have to launch the command “source /local/env/envhmmer-3.0”, and you’re ready to use it.

Hmmer is composed of the following programs: hmmalign, hmmbuild, hmmconvert, hmmemit, hmmfetch, hmmscan, hmmsearch, hmmsim, hmmstat, jackhmmer and phmmer.

Now hmmer3 is also available directly from our mobyle server. And I have deployed all the tools of this suite as SOAP webservices on our Opal server too.

I’ll try to show you how to use these tools later, but you can already start to play with it!

You can also have a look at the official documentation, which presents each tool. It is available here.

MEME

MEME is another suite of tools for pattern matching and discovery. For more info, see the official website.

It is composed of a few main tools: MEME, MAST, TOMTOM, GOMO, GLAM2, GLAM2SCAN, FIMO and MCAST. Smaller utilities are available only from command line.

All of them are available from our cluster (source /local/env/envmeme). I have installed the default web interface too, and it is available here. Webservices are also available, as usual, from our Opal server.

Like the hmmer suite, I’ll try to describe MEME features later, but you can already try to use it following the documentation available from the web interface.

These tools should also be accessible directly from our mobyle server soon.

That’s all for today…

Aug 13 10

The technical architecture

by A. Bretaudeau

Let’s talk about the technical architecture we’re working on for the Dr Motifs project!

The goal of Dr Motifs is to provide easy access to bioinformatics tools dedicated to pattern study. We would like it to be useful for both beginners and power users, so we’re planning to deploy a user-friendly web interface and a SOAP-based web-service for each tool.

Web interfaces: Mobyle

Mobyle is a web portal “aimed at the integration of bioinformatics software and databanks”.

Concretely we have installed it on our servers and you can access it at mobyle.genouest.org. And it looks like this:

How does it work? On the left side of the portal, you have a list of all the programs you can access directly from your browser. You can search for a specific program (blast or clustalw for example). Click on a program and you get a web form where you can select a few options and upload your input data. Now click on the ‘Run’ button at the top and the program will be launched.

An advantage of using Mobyle is that you can reuse your data from one tool to the next. For example, you can upload a multifasta file to create a multiple alignment using clustalw, then directly use this alignment to construct an HMM with hmmbuild and finally use this HMM to search within a database for similar sequences using hmmsearch. And all of this without the need to download then upload intermediate files.

Mobyle keeps a trace of all the jobs you run, and you can access any of them simply by clicking on it on the left side of the page.

With this, you can build small workflows to work with your data directly in your browser. And there will be a great graphical workflow editor in the next Mobyle release.

From our side as providers, all we have to do to add a tool to the mobyle interface is to write an XML file describing how the program can be launched; the interface is automatically generated from it. It really eases our work, so don’t hesitate to tell us if you would like us to add a tool to our mobyle portal.

Web-services: Opal

Mobyle is great for providing a user-friendly web interface. But some of you might be interested in accessing our tools programmatically. That’s why we’re deploying SOAP-based web-services using Opal.

With this kind of web-services, you can launch any of our tools directly from your code (Java, Perl, Python, etc) or from applications like Taverna. I’ll try to write some help/tutorials on this some day.

Opal is a toolkit developed at the NBCR (California, USA). We also contributed some code to add some features we were missing. The principle is simple: when we add a tool to our Opal installation, a WSDL file is automatically created which can be used by any user willing to access this tool programmatically.

All our services currently deployed are listed on the Opal dashboard. You can also get access directly to the WSDL file list here.

As with Mobyle, adding a new tool is as simple as writing an XML file describing the program, so let us know if you want a tool to be added.

What else?

All of this is cool, but that’s not all! What about users that don’t know which pattern matching or discovery tool to use with their data? Based on this technical architecture, we are going to develop an even more user-friendly web interface for people having difficulties choosing the right program for the right usage. You’ll learn more about this soon. (Yes, this is some kind of teasing!)

Jul 28 10

Hidden Markov Models (HMM)

by A. Bretaudeau

The life of Andrey M.

Do you know Andrey Markov?

(source: wikimedia)

As written in Wikipedia, “Andrey Andreevich Markov was born in 1856 in Ryazan as the son of the secretary of the public forest management of Ryazan, Andrey Grigorevich Markov, and his first wife Nadezhda Petrovna Markova“.

Apart from having a beautiful son, he also gave birth to what is now known as Markov chains.

Ok, why am I talking about him here?

HMM: Hidden Markov models

So far, I have described 2 main ways to describe a motif: patterns and profiles. Hidden Markov models (HMM) are the third (and last) main method to represent a motif.

HMM are based on Markov chains, as described by Mr Markov during his brilliant career as a mathematician. Briefly, they’re statistical models that represent a motif based on the probability of finding a given letter after another one. Let’s have a look at an example to try to understand this. If we have this set of aligned sequences containing a motif:

AATACT
GA-AGT
ATTAGA
GCTAGT

We can represent the corresponding motif with the following HMM:

How do you read that? It is not that difficult: each circle is a position in the motif. For each of these positions, the probability of finding each letter of the alphabet is written (there is a probability of 0.25 of finding a T in position 2).

You also get the probability assigned to the transition from a position to another: for example, in position 2, you have the choice to go to the position 3 or directly to position 4 (which occurs when there’s a gap).
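If you want to recompute these numbers yourself, here is a minimal Python sketch. It is a simplification made for this example: each alignment column is treated as one position, gaps are ignored when counting letters, and the only transition considered is skipping the next position when a sequence has a gap there. Real HMM tools like hmmer use richer models, so take this as an illustration only; the function names are mine.

```python
from collections import Counter

ALIGNMENT = ["AATACT", "GA-AGT", "ATTAGA", "GCTAGT"]


def emission_probabilities(alignment):
    """For each column, the probability of each letter (gaps are ignored)."""
    result = []
    for column in zip(*alignment):
        residues = [letter for letter in column if letter != "-"]
        counts = Counter(residues)
        result.append({letter: n / len(residues) for letter, n in counts.items()})
    return result


def skip_probabilities(alignment):
    """Entry i: probability of jumping from position i+1 straight to position
    i+3 (1-based), i.e. the fraction of sequences with a gap at position i+2."""
    columns = list(zip(*alignment))
    return [column.count("-") / len(alignment) for column in columns[1:]]


emissions = emission_probabilities(ALIGNMENT)
skips = skip_probabilities(ALIGNMENT)
print(emissions[1])   # letters seen at position 2
print(skips[1])       # chance of going from position 2 directly to position 4
```

Here position 2 gives a probability of 0.25 for T, and one sequence out of four has a gap at position 3, so the jump from position 2 to position 4 gets 0.25 and the normal step to position 3 gets the remaining 0.75.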

The image above is one possibility to view a HMM, but you can also use more eye candy visualization like logos:

In logos, the bigger the letters are, the more conserved they are in the motif. When no letter is shown, it simply means that no significant letter is expected at the corresponding position. If you want to create your own logos, you can use the WebLogo service.
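Where do the letter sizes come from? In the usual sequence logo convention, the total height of a column is its information content in bits, and each letter gets a share of that height proportional to its probability. The Python sketch below computes this for a DNA alphabet. It leaves out the small-sample correction that logo tools usually apply, and the function names are only for this example.

```python
import math


def information_content(column, alphabet_size=4):
    """Information content of one column in bits: log2(alphabet) - entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=4):
    """Height of each letter: its probability times the column information."""
    total = information_content(column, alphabet_size)
    return {letter: p * total for letter, p in column.items()}


print(information_content({"T": 1.0}))             # perfectly conserved column
print(information_content({"A": 0.5, "G": 0.5}))   # two equally likely letters
print(information_content({"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}))
```

A perfectly conserved DNA position reaches the maximum of 2 bits, while a position where all four letters are equally likely carries no information, which is why nothing is drawn there.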

As noted for profiles, this example is based on information provided by the observation of an alignment of sequences containing the motif. But you can also add more biological knowledge to this kind of motif, particularly when using protein sequences: you just have to modify the probabilities to be more tolerant to some substitutions between related amino acids.

Using HMM to describe motifs makes it possible to reuse many algorithms written for similar problems, related or not to bioinformatics. HMM are used for example for speech recognition, artificial intelligence or pattern analysis in non-biological data.

In bioinformatics, using HMM motifs is mainly helpful for complex motifs. It brings the ability to find motif occurrences in divergent sequences.

Conclusion on motif formats

Now that we’ve talked about HMM, you know about the 3 main motif formats used in bioinformatics. The formats differ mainly in their expressivity and their performance. Choose PROSITE patterns if you want to discover or match simple patterns with low divergence. If your motif is more complex, choose profiles or HMM.

Jul 1 10

Profiles, PWM, PSWM, PSSM

by A. Bretaudeau

Remember the problem we faced some time ago?

The problem is that with the PROSITE syntax, you can’t write patterns with subtleties like “in position X, there’s a G most of the time, but sometimes a C”.

This kind of situation is very frequent in biological sequences. Interactions between proteins and/or sequences do not always follow the all-or-none law: if a protein binds to a specific DNA sequence, there are good chances that it will also bind to slightly degenerated sequences too. The interaction may be a bit weaker, but it can be sufficient for the biological function to occur.

Now all the art of pattern matching and discovery is to be tolerant to degenerated versions of a motif, without accepting random sequences that don’t have the biological function.

Profiles

To address this issue, a category of motifs is commonly used: they’re named profiles. First, have a look at this alignment:

AATAGTCGC
GGTAGTCTA
ATTAGTCGA
GCTAGTCGG

And now, the corresponding profile:

   1     2     3     4     5     6     7     8     9
A: 0.5   0.25  0     1     0     0     0     0     0.5
T: 0     0.25  1     0     0     1     0     0.25  0
G: 0.5   0.25  0     0     1     0     0     0.75  0.25
C: 0     0.25  0     0     0     0     1     0     0.25

Columns correspond to the position in the motif, and rows are the possible letters (DNA can only contain A, T, G or C). This array gives the probability to find a letter at a given position in the motif.
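Building this array by hand is easy for four sequences, but a few lines of Python do it just as well. This is a minimal sketch which assumes the sequences are already aligned, have the same length and contain no gaps; the function name is just for the example.

```python
def build_profile(sequences, alphabet="ATGC"):
    """Return {letter: [frequency at position 1, position 2, ...]}."""
    length = len(sequences[0])
    counts = {letter: [0] * length for letter in alphabet}
    for sequence in sequences:
        for position, letter in enumerate(sequence):
            counts[letter][position] += 1
    # Turn the counts into frequencies by dividing by the number of sequences
    return {
        letter: [count / len(sequences) for count in row]
        for letter, row in counts.items()
    }


sequences = ["AATAGTCGC", "GGTAGTCTA", "ATTAGTCGA", "GCTAGTCGG"]
profile = build_profile(sequences)
for letter, row in profile.items():
    print(letter, row)
```

Each printed row should match the corresponding row of the array above.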

In literature, you’ll find several terms that correspond to profiles: PWM (Position Weight Matrix), PSSM (Position-Specific Scoring Matrix), PSWM (Position-Specific Weight Matrix).

Matching profiles

Pattern matching with a profile is simple. Let’s try to test the following sequence with the motif above:

ATGAGTCGA

All we have to do is add up the scores assigned to each letter of the sequence, position by position, in the matrix above:

0.5 + 0.25 + 0 + 1 + 1 + 1 + 1 + 0.75 + 0.5 = 6 points

So with profiles it is possible to compute a score to evaluate if a sequence contains a given motif. The higher the score is, the higher is the probability that the sequence contains the searched motif. This is a quantitative result, whereas with PROSITE syntax you only get a qualitative result (the sequence contains the motif or not).
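The same computation can be written as a short Python sketch. The profile is copied from the array above; the scan function, which slides the profile along a longer sequence, and the longer test sequence are additions made for the illustration.

```python
PROFILE = {
    "A": [0.5, 0.25, 0, 1, 0, 0, 0, 0, 0.5],
    "T": [0, 0.25, 1, 0, 0, 1, 0, 0.25, 0],
    "G": [0.5, 0.25, 0, 0, 1, 0, 0, 0.75, 0.25],
    "C": [0, 0.25, 0, 0, 0, 0, 1, 0, 0.25],
}


def score(window, profile=PROFILE):
    """Add up the profile value of each letter at its position."""
    return sum(profile[letter][i] for i, letter in enumerate(window))


def scan(sequence, profile=PROFILE):
    """Score every window of the profile's width along a longer sequence."""
    width = len(profile["A"])
    return [
        (start, score(sequence[start:start + width], profile))
        for start in range(len(sequence) - width + 1)
    ]


print(score("ATGAGTCGA"))

# Hypothetical longer sequence containing the tested sequence at offset 2
hits = scan("CCATGAGTCGATT")
print(max(hits, key=lambda hit: hit[1]))
```

The first call gives the same 6 points as the manual addition. When scanning, you would typically keep only the windows whose score is above a threshold that you choose for your motif.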

Discovering profiles

In the above example, we built our profile using an alignment of sequences. So the scores written in the matrix are calculated from observations of real sequences.

There are other ways to build such matrices, especially for protein sequences. This way, you can increase the expressivity of the motif with biological knowledge.

A common technique is to assign higher scores to some amino acids that are rarely seen in biological sequences or that have important biochemical properties. Suppose you know there’s a conserved phenylalanine (F) and leucine (L) in your motif. Phenylalanine (F) is much more meaningful than a simple leucine (L) due to their chemical properties. So you can increase the score assigned to the phenylalanine in the profile. This way a degenerated sequence that contains the phenylalanine (F) but not the leucine(L) will get a higher score than a sequence that contains only the leucine.

Another technique is also related to the biochemical properties of amino acids: if you have a conserved leucine (L) in your motif, you can also assign a significant score to other amino acids that are similar, for example isoleucine (I). And you can set a negative score for completely different amino acids, for example proline (P).

This is a Venn diagram representing the different families of amino-acids based on their biochemical properties:

Amino Acids Venn Diagram

source: wikimedia

As you can see leucine and isoleucine are close to each other which means they have similar properties.

BLOSUM, PAM: substitution matrices

Another approach is used to improve profile generation. Think about the leucine/isoleucine case above: it is possible to assign a significant score to the isoleucine if there’s a conserved leucine at a position in the motif. But which score should we assign? 1 point, 10 points, more?

To address this issue, substitution matrices have been designed, the most famous ones being BLOSUM or PAM. The principle is simple: take a big set of related sequences with a homogeneous degree of divergence, then compute the frequency of substitution of one amino-acid by another. You end up with a matrix that looks like that:

BLOSUM62 substitution matrix (source: Hannes Röst, wikimedia)

For each amino acid on the left, you get the score corresponding to the substitution by another amino acid at the bottom. For example, if a leucine is replaced by a glycine, there’s a penalty of -4 points. But if a tryptophan is conserved, the score is very high (11 points). If you find an isoleucine instead of leucine, you’ll get 2 points which is still positive.

This kind of matrix is used by BLAST to assign a score to an alignment of proteins with substitutions.
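As a tiny illustration of how such a matrix is used, the Python sketch below scores an ungapped alignment by looking up each pair of letters. It only contains the three BLOSUM62 values quoted above, so it is an excerpt for demonstration and not a usable matrix; the two short sequences are made up.

```python
# Excerpt only: the three BLOSUM62 values quoted in the text.
# A real matrix has an entry for every pair of the 20 amino acids.
BLOSUM62_EXCERPT = {
    ("L", "G"): -4,
    ("W", "W"): 11,
    ("L", "I"): 2,
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a pair of amino acids; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq1, seq2, matrix=BLOSUM62_EXCERPT):
    """Sum the substitution scores of an ungapped alignment."""
    return sum(substitution_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(alignment_score("LWL", "IWG"))  # 2 + 11 - 4
```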

Conclusion

To sum up the situation:

– PROSITE syntax is great for describing very conserved patterns. It gives a yes/no response when doing pattern matching. This kind of pattern is based on observation of sequence alignments.

– Profiles (PWM/PSWM, PSSM) are useful for less conserved motifs. Pattern matching with profiles gives a quantitative result (a score). You can improve your profiles with biological data.

We now have 2 ways to describe a motif! But this is not finished. There are HMMs too. We’ll talk about that later…

Jun 1 10

InterProScan: new web form and web service

by A. Bretaudeau

Remember the InterProScan how-to?

I’ve just made available 2 new ways to run InterProScan on our platform:

A new web interface!

InterProScan on Mobyle

I’ve added InterProScan in the Mobyle platform. If you don’t know Mobyle yet, I will present it later. It brings a nicer web interface to run InterProScan, and the ability to launch many other tools in an integrated platform.

A web service!

InterProScan is now also available as a standard SOAP web service. You can find its WSDL on our Opal server, together with other web services already available on our platform.

Have fun playing with it!

May 18 10

Motifs, patterns, profiles…

by A. Bretaudeau

As you know, this site is dedicated to patterns! And motifs! And profiles too! And we will surely talk about Mr Markov!

Wait… Before you get lost, I think it’s time for some definitions!

Patterns or motifs?

We will use the words “patterns” and “motifs” interchangeably to talk about “a recurring thing in something”.

In the bioinformatics context, something is usually a sequence: a nucleic one for genes, genomes, or RNA. It can also be a protein primary sequence (i.e. amino acids).

And thing can represent a functional region in the genome, like a promoter site where some transcription factors bind. For proteins, it can be a catalytic site common to several proteins.

These patterns help to find functions for new sequences or to group them in families or subfamilies.

Discovery and matching

There are two complementary ways to study motifs in biological sequences:

The first one is motif discovery: briefly, you have a set of sequences with a common function (same catalytic function, or known to bind some transcription factor for example) and you want to know the exact parts of the sequences that are common to all sequences and that are responsible for the function. A simple way to discover motifs is to align the sequences and see if there are some conserved parts.

When you already know a motif (because you or anybody else has discovered it), you can switch to pattern matching methods. In this case, you have a good description of a pattern, and you want to find some new sequences containing it. This is very useful when you want to find new members of a protein family for example, or if you want to annotate a gene.

InterProScan is a good example of a pattern matching tool: it performs a pattern matching using many well-known patterns and searching in a given protein sequence.

Pattern formats

An important point when studying motifs is to understand their different formats. Let’s have a look at a few sequences containing a motif:

...AATAGTCGC...
...GGTAGTCTA...
...ATTAGTCGA...
...GCTAGTCGG...

As you can see, these 4 sequences share an identical portion in the middle: TAGTC.
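You can check this by eye, or with a few lines of Python. This is only a minimal sketch, which assumes the sequences are already aligned and have the same length; the function name is made up for the example.

```python
def conserved_columns(sequences):
    """Return the shared letter where every sequence agrees, '.' elsewhere."""
    consensus = []
    for column in zip(*sequences):
        letters = set(column)
        consensus.append(column[0] if len(letters) == 1 else ".")
    return "".join(consensus)


sequences = ["AATAGTCGC", "GGTAGTCTA", "ATTAGTCGA", "GCTAGTCGG"]
print(conserved_columns(sequences))
```

The dots mark the positions where the sequences differ, and the letters in between are the conserved portion.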

There are many ways to represent a sequence pattern. One of the easiest ways is to use regular expressions (or regex if you prefer). In fact, regex are a family of formats, and in bioinformatics the most popular one is PROSITE.

Regular expression: PROSITE syntax

PROSITE patterns represent simple motifs in a format easily readable by humans. This syntax has been standardized by the PROSITE databank, which stores many known motifs in different formats available to the scientific community. With the previous example, the motif would be represented like this:

T-A-G-T-C

Each position in the sequence is represented by a letter: A, T, G or C in the case of nucleic sequences. You can also use x which is a special letter meaning ‘at this position, there can be any letter‘. This is very helpful when you have gaps in a motif. Each position in the sequence is separated by ‘-‘.

Now, suppose you have a new sequence related to the 4 above:

...AATAGTCGC...
...GGTAGTCTA...
...ATTAGTCGA...
...GCTAGTCGG...
...CTTACTCGG...

With these sequences, you can write the following pattern:

T-A-[GC]-T-C

This means that the 3rd letter can be a G or a C.

With the PROSITE syntax you can also write more complex motifs like repeats of letters with variable length. For example, A(2,4) means that the letter A can be repeated from 2 to 4 times.
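Since PROSITE patterns are a kind of regular expression, they are easy to translate into the regex syntax of a programming language. Here is a small Python sketch that handles only the elements described above (the ‘-’ separator, the x wildcard, the [..] alternatives and the (n) or (n,m) repeats). The function name is mine and the rest of the PROSITE syntax is deliberately left out.

```python
import re


def prosite_to_regex(pattern):
    """Translate a small subset of the PROSITE syntax into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Separate the letter(s) from an optional repeat such as (2,4)
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."  # x means any letter
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("T-A-[GC]-T-C")
print(regex)
for sequence in ["AATAGTCGC", "CTTACTCGG", "CTTAATCGG"]:
    print(sequence, bool(re.search(regex, sequence)))
```

The first two sequences come from the example above and both match. The third one is a made-up variant with an A in the 3rd position of the motif, which the pattern rejects.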

So PROSITE syntax is very helpful to represent simple motifs. Many motifs, like catalytic sites for example, are well-known and are made available in databanks like PROSITE. You can view a real example of pattern there.

Unfortunately, biology is complex

However, there are some cases where using PROSITE syntax is too limited to represent a complex motif: if you look at the previous example, we have written that a G or a C can be present in the 3rd position. But, it’s not that simple! In fact we encountered only one sequence with a C and four with a G.

This problem is quite common in biological motifs as perfectly conserved motifs are not that common.

Fortunately, there are other motif syntaxes to address this issue: profiles and Hidden Markov Models. We will talk about this in a future post!

Apr 27 10

How to use InterProScan

by A. Bretaudeau

Let’s begin a series of how-tos on the usage of tools dedicated to pattern analyses. Today, let’s talk about InterProScan which is a very powerful tool for pattern matching.

The InterPro database

As said on InterPro website:

InterPro is an integrated documentation resource for protein families, domains, regions and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

Let’s summarize it in a few words: InterPro is a big database combining data from several databases available around the world (PROSITE or Pfam for example). You can see InterPro as a huge list of known motifs in different formats: simple regex patterns, profiles, …

These motifs represent catalytic sites or functional sites found in proteins. They can be used for example to group proteins in families.

InterProScan

A very useful tool associated with InterPro is InterProScan. Its goal is to answer a simple question:

I have a new protein sequence, and I don’t know anything about it. Is there some known motif in it that would help me assign a function to the protein?

Now, how does it work? InterProScan takes a protein sequence (or nucleic, see below) and tries to match it against each of the motifs registered in its database. In fact, as motifs are registered in different formats, it launches several programs (blastprodom, fprintscan, hamap, and others: see full list below) that will compute the pattern matching. InterProScan then integrates all the results returned by each program to construct a full list of known motifs found in the input sequence.

Each motif is returned with the corresponding identifier in the InterPro member database. To be clearer: you can get for example the motif with id PF01623 from Pfam database.

If you enable the InterPro lookup, you can also get the InterPro identifier corresponding to the same motif: for example, the same motif is known as PF01623 in Pfam and as IPR002568 in InterPro.

InterProScan also offers the possibility to automatically retrieve the Gene Ontology terms associated with each motif found in the input sequence. This eases the interpretation of the results. For example, for the motif IPR002568, the GO term GO:0003676 is returned. This term means that the found motif is related to the nucleic acid binding function.

Finally, depending on which member database returned a found motif, you can get a score to help you estimate the pertinence of the results.

Hands on InterProScan!

Web forms

Now that you know everything about InterProScan, it’s time to use it! You have the possibility to use it directly from EBI web submission form:

InterProScan submission form @ EBI

On this form, all you need to do is:

  • To specify your email if you want to receive the results in your mailbox
  • To select the programs you want to be run with your sequence. Leaving everything checked is generally a good option if you don’t know which ones to select.
  • To paste or upload a protein sequence.

The same form is also available at the GenOuest platform if you want to use the GenOuest cluster (aka genocluster).

Command line

The web forms are fine if you want to submit one sequence, and only protein sequences. But if you have many sequences to test or if your sequence is a nucleic one, using the command line is the best solution.

Command line can be useful too if you submit long sequences, as web forms usually limit the size of input sequences.

To use InterProScan on the GenOuest cluster, you have to create an account if you don’t already have one.

Before launching InterProScan, you have to define some environment variables: this is automatically done by launching the following line:

source /local/env/enviprscan

Now, you’re ready to launch InterProScan! To get some help on the available options, simply run:

iprscan -cli

This will show you something like that:

usage: iprscan -cli [-email <addr>] [-appl <name> ...] [-nocrc] [-altjobs] [-seqtype p|n] [-trlen <N>] [-trtable <table>]
 [-iprlookup] [-goterms] -i <seqfile> [-o <output file>]

 -i <seqfile>      Your sequence file (mandatory).
 -o <output file>  The output file where to write results (optional), default is STDOUT.
 -email <addr>     Submitter email address (required for non-interactive).
 -appl <name>      Application(s) to run (optional), default is all.
 Possible values (dependent on set-up):
 blastprodom
 fprintscan
 hamap
 hmmpfam
 hmmpir
 hmmpanther
 hmmtigr
 hmmsmart
 superfamily
 gene3d
 patternscan
 profilescan
 seg
 coils
 [tmhmm]
 [signalp]
 -nocrc            Don't perform CRC64 check, default is on.
 -altjobs          Launch jobs alternatively, chunk after chunk. Default is off.
 -seqtype <type>   Sequence type: n for DNA/RNA, p for protein (default).
 -trlen <n>        Transcript length threshold (20-150).
 -trtable <table>  Codon table number.
 -goterms          Show GO terms if iprlookup option is also given.
 -iprlookup        Switch on the InterPro lookup for results.
 -format <format>  Output results format (raw, txt, html, xml(default), ebixml(EBI header on top of xml), gff)
 -verbose          Print messages during run

OK, this looks a little bit complicated. But here’s an example of a command you can write:

iprscan -cli -i ~/your_input_sequence.fasta -o ~/output_iprscan.txt -iprlookup -goterms -email your.email@example.com -format txt -seqtype p

This is equivalent to the command that is launched when you use the EBI or GenOuest web form.

Do more with the command line

Nucleic sequences

Now, what if your input sequence is a nucleic acid sequence? You just have to switch the -seqtype option from p to n. InterProScan will then search for patterns in the translations of all six reading frames (you can specify a translation table with -trtable if you want). Be careful, as this uses many more resources on the cluster.
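As a hedged illustration of this switch, the sketch below assembles a nucleotide submission using only options that appear in the help output above. The function name build_nucleic_cmd, the file names, the e-mail address and the codon table value are placeholders chosen for this example. The command is printed rather than executed, so it can be checked before it is run on the cluster.

```shell
#!/bin/sh
# Sketch only: build an iprscan command line for a nucleotide input.
# build_nucleic_cmd is a hypothetical helper, not part of InterProScan.
build_nucleic_cmd() {
    infile=$1    # FASTA file containing the nucleotide sequence(s)
    outfile=$2   # file where iprscan should write its results
    trtable=$3   # codon table number given to -trtable (placeholder value)
    printf 'iprscan -cli -i %s -o %s -seqtype n -trtable %s -iprlookup -goterms -email %s -format txt\n' \
        "$infile" "$outfile" "$trtable" "your.email@example.com"
}

# Print the command for inspection; nothing is submitted at this point.
build_nucleic_cmd input_dna.fasta output_iprscan.txt 1
```

Once the printed line looks right, it can be pasted into a shell where the environment script has already been sourced.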

Multiple sequences

Submitting many sequences through web forms is not really convenient. The easiest solution is to use the iterator Perl script distributed with InterProScan. To use it, simply create a multi-FASTA file containing all the sequences you would like to submit (don’t mix protein and nucleic acid sequences, of course). Then, on the genocluster, run a command like this (make sure you have issued the source command first):

iterator.pl -i ~/your_input_sequences.fasta -o ~/output_iprscan.txt -c "iprscan -cli -i %infile -iprlookup -goterms -email your.email@example.com -format txt -seqtype p"

What does it mean? Simply that the iprscan command specified with the -c option will be applied to each sequence found in ~/your_input_sequences.fasta. The file ~/output_iprscan.txt will contain the full list of motifs found by InterProScan in each sequence.

The important points are to use %infile for the -i option of the iprscan command, and to leave out the -o option so that the iterator script can capture the results.
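To make the idea concrete, here is a minimal sketch of what "apply a command to each sequence" means. It is not the actual implementation of iterator.pl: the functions split_fasta and apply_to_each, and the seq_N.fasta naming, are assumptions made for this illustration. The multi-FASTA file is cut into one file per sequence, then the given command is run on each file in input order, with its output left on STDOUT so that it can be collected in a single file.

```shell
#!/bin/sh
# Sketch only: a hand-made equivalent of the "one command per sequence" idea.

# Split a multi-FASTA file into seq_1.fasta, seq_2.fasta, ... inside outdir.
split_fasta() {
    infile=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ { if (out) close(out); n++; out = dir "/seq_" n ".fasta" }
        out  { print > out }   # lines before the first header are ignored
    ' "$infile"
}

# Run a command once per split file, appending the file name as last argument.
apply_to_each() {
    dir=$1
    shift
    i=1
    while [ -e "$dir/seq_$i.fasta" ]; do
        "$@" "$dir/seq_$i.fasta"
        i=$((i + 1))
    done
}

# Possible use on the cluster (after the source command), all results in one file:
#   split_fasta ~/your_input_sequences.fasta ~/parts
#   apply_to_each ~/parts iprscan -cli -iprlookup -goterms \
#       -email your.email@example.com -format txt -seqtype p -i > ~/output_iprscan.txt
```

Because no -o option is given, each iprscan run writes to STDOUT (the default according to the help output), which is what lets the results be gathered by a single redirection.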

Long sequences

If you want to submit long sequences to InterProScan (up to 1000 bp for nucleic sequences), you can use a dedicated InterProScan version on the genocluster. To use it, all you have to do is source an alternative environment script before issuing the usual commands:

source /local/env/enviprscan_local
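To find out which sequences of a file are long enough to need this dedicated version, one option is to measure them first. The sketch below is only an illustration: fasta_lengths and longer_than are hypothetical helper names, and the length limit is passed as a parameter because the right value depends on the limits that apply to you.

```shell
#!/bin/sh
# Sketch only: report sequence lengths in a FASTA file.

# Print "identifier<TAB>length" for every sequence in the file.
fasta_lengths() {
    awk '
        /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { if (id != "") print id "\t" len }
    ' "$1"
}

# Print the identifiers of the sequences strictly longer than the given limit.
longer_than() {
    fasta_lengths "$1" | awk -F '\t' -v max="$2" '$2 > max { print $1 }'
}

# Possible use (the limit of 1000 is a placeholder):
#   longer_than ~/your_input_sequences.fasta 1000
```

If the second command prints nothing, no sequence in the file exceeds the chosen limit.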

Web service

An alternative to the command line is to use web services. If you don’t know what they are, don’t worry, I’ll write an article about them soon!

The EBI offers a SOAP webservice on its website, but nucleic sequences are not allowed. GenOuest doesn’t provide a webservice yet, but one should be available in a few weeks…

Apr 14 10

Announcing the Dr Motifs project

by A. Bretaudeau

Hello,

Welcome to the inaugural post of the Dr Motifs project! I think this is the perfect place to present the aim of this project.

What’s this?

So the first question you might ask is: what is this project?

The basic idea of Dr Motifs is to build a platform that eases access to bioinformatics tools dedicated to pattern matching and pattern discovery.

Main targets are:

  • To offer the most standard pattern matching and discovery tools
  • To offer innovative software developed by the INRIA Symbiose team (or other teams worldwide)
  • To integrate all these tools into a common platform, allowing workflow creation
  • To create pre-designed workflows that beginners can use for common analyses
  • To create a website to publish news and reviews from the patterns world (great, you’ve found this website!)
  • To provide some help to people who want to learn more about patterns but don’t know where to start.

Why this project?

Dr Motifs is a response to a situation that we believe is problematic: in the last few years, more and more software has been developed in the field of pattern discovery and matching. However, this field of research is still quite obscure for many people, most notably for potential users of these tools.

The usual questions are: “Which tool is suited to finding motifs of type X?”, “Is tool A better than tool B?”, “Tool C was cool 5 years ago, is there anything better now?”, …

Even for a specialist, it is often difficult to have a complete view of all the software and algorithms available. For a biologist, it is even more complicated to find one’s way in the world of patterns. This is becoming more acute with the growth of high-throughput technologies, where people have to run many analyses and choose software based on the quality of its results and its speed of execution.

We would like Dr Motifs to become a comprehensive website to help people:

  • to launch common analyses in a simple manner
  • to make choices when they want to run new analyses
  • to keep up with news from the patterns world.

GenOuest and the Symbiose team have been active in the field of pattern analysis over the last few years, and we believe we can build such a website.

Who’s working on that?

The original idea for Dr Motifs grew at the GenOuest bioinformatics platform in Rennes (France), led by Olivier Collin. The project is funded by the IBiSA organization.

Development is done by Anthony Bretaudeau, developer at the GenOuest bioinformatics platform.

Other people from GenOuest or the INRIA Symbiose team (the team associated with the GenOuest platform) will contribute to this blog from time to time.

You are encouraged to take part in this project by commenting on our work, sharing your thoughts on the project, …

Why is it called “Dr Motifs”?

Choosing a good name is not the easiest part of a project. This name came up after a very long and intensive brainstorming, or rather a little thinking, about the theme of the project: “pattern discovery and matching”.

In French, this is written “Découverte et recherche de Motifs”. Tada! We found the perfect name: “Dr Motifs”.

Ok, what’s next?

At the time of writing, I’m taking an inventory of the tools already available on the platform. Most of them are currently available from our website, in the “Pattern discovery” and “Pattern matching and models” sections. It is also time to do some reading of the literature! Development will begin soon, but we’ll discuss that later.