How to use InterProScan

by A. Bretaudeau on April 27, 2010

Let’s begin a series of how-tos on using tools dedicated to pattern analysis. Today, let’s talk about InterProScan, a very powerful tool for pattern matching.

The InterPro database

As stated on the InterPro website:

InterPro is an integrated documentation resource for protein families, domains, regions and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

Let’s summarize it in a few words: InterPro is a big database combining data from several databases available around the world (PROSITE or Pfam, for example). You can see InterPro as a huge list of known motifs stored in different formats: simple regex patterns, profiles, and so on.

These motifs represent catalytic sites or other functional sites found in proteins. They can be used, for example, to group proteins into families.
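To make the idea of a "simple regex pattern" concrete, here is a small sketch. It uses the PROSITE N-glycosylation site pattern (PS00001, written N-{P}-[ST]-{P} in PROSITE syntax), translated by hand into an extended regular expression. The helper name and the test sequence are made up for the illustration, and this is not how InterProScan itself runs its searches.

```shell
# PROSITE pattern PS00001 (N-glycosylation site): N-{P}-[ST]-{P}
# Hand-translated to an extended regex: {P} means "any residue except P".
NGLYC_REGEX='N[^P][ST][^P]'

# Print every non-overlapping match of the pattern in a protein sequence.
# find_nglyc_sites is a hypothetical helper name.
find_nglyc_sites() {
  printf '%s\n' "$1" | grep -oE "$NGLYC_REGEX"
}

# Made-up example sequence, for illustration only.
find_nglyc_sites "MKNASTLLPNGTQ"
```

On this made-up sequence the pattern matches twice (NAST and NGTQ). Profiles and HMMs are much richer than this, which is why InterProScan delegates them to dedicated programs.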


A very useful tool associated with InterPro is InterProScan. Its goal is to answer a simple question:

I have a new protein sequence, and I don’t know anything about it. Is there some known motif in it that would help me assign a function to the protein?

Now, how does it work? InterProScan takes a protein sequence (or a nucleic acid sequence, see below) and tries to match it against each of the motifs registered in its database. In fact, as motifs are registered in different formats, it launches several programs (blastprodom, fprintscan, hamap, and others: see full list below) that perform the pattern matching. InterProScan then integrates all the results returned by each program to build a full list of known motifs found in the input sequence.

Each motif is returned with its identifier in the corresponding InterPro member database. For example, you can get the motif with ID PF01623 from the Pfam database.

If you enable the InterPro lookup, you can also get the InterPro identifier corresponding to the same motif: for example, the same motif is known as PF01623 in Pfam and as IPR002568 in InterPro.

InterProScan can also automatically retrieve the Gene Ontology terms associated with each motif found in the input sequence. This makes the results easier to interpret. For example, for the motif IPR002568, the GO term GO:0003676 is returned. This term means that the motif is related to the nucleic acid binding function.

Finally, depending on which member database returned a motif, you can get a score to help you estimate the relevance of the result.
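The three kinds of identifiers mentioned above (member database ID, InterPro ID, GO term) can be pulled out of a result file with a few lines of shell. The sketch below assumes a tab-separated raw output with the member database accession in column 5, the InterPro accession in column 12 and the GO annotation in column 14. That column layout is my assumption, so check it against your own output before relying on it. The helper name and the sample line are made up; only the three identifiers come from this article.

```shell
# Assumed raw layout (tab-separated): col 5 = member database accession,
# col 12 = InterPro accession, col 14 = GO annotation text.
# summarize_raw_hits is a hypothetical helper name.
summarize_raw_hits() {
  awk -F '\t' '{
    go = "-"
    # Pull the first GO identifier out of the annotation column, if any.
    if (match($14, /GO:[0-9]+/)) go = substr($14, RSTART, RLENGTH)
    print $5 "\t" $12 "\t" go
  }' "$@"
}

# Made-up result line reusing the identifiers quoted in the text.
printf 'seq1\tCRC\t120\tHMMPfam\tPF01623\texample\t10\t95\t1e-20\tT\t27-Apr-2010\tIPR002568\texample\tMolecular Function: nucleic acid binding (GO:0003676)\n' |
  summarize_raw_hits
```

Called with a file name instead of a pipe, the same function summarizes a whole result file, one line per hit.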

Hands on InterProScan!

Web forms

Now that you know everything about InterProScan, it’s time to use it! You can use it directly from the EBI web submission form:

InterProScan submission form @ EBI

On this form, all you need to do is:

  • Specify your email address if you want to receive the results in your mailbox.
  • Select the programs you want to run on your sequence. Leaving everything checked is generally a good option if you don’t know which ones to select.
  • Paste or upload a protein sequence.

The same form is also available at the GenOuest platform if you want to use the GenOuest cluster (aka genocluster).

Command line

The web forms are fine if you want to submit a single protein sequence. But if you have many sequences to test, or if your sequence is a nucleic acid sequence, the command line is the best solution.

The command line is also useful if you submit long sequences, as web forms usually limit the size of input sequences.

To use InterProScan on the GenOuest cluster, you have to create an account if you don’t already have one.

Before launching InterProScan, you have to define some environment variables. This is done automatically by running the following line:

source /local/env/enviprscan
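If you are not sure whether the environment is set up in your current shell, a quick check like the one below can help. It only tests whether an iprscan command can be found on your PATH. The helper name is made up for the illustration.

```shell
# Return success if the named command can be found on PATH.
# have_command is a hypothetical helper name.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

if have_command iprscan; then
  echo "iprscan is available at: $(command -v iprscan)"
else
  echo "iprscan not found: source the environment script first"
fi
```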

Now, you’re ready to launch InterProScan! To get some help on the available options, simply run:

iprscan -cli

This will show you something like this:

usage: iprscan -cli [-email <addr>] [-appl <name> ...] [-nocrc] [-altjobs] [-seqtype p|n] [-trlen <N>] [-trtable <table>]
 [-iprlookup] [-goterms] -i <seqfile> [-o <output file>]

 -i <seqfile>      Your sequence file (mandatory).
 -o <output file>  The output file where to write results (optional), default is STDOUT.
 -email <addr>     Submitter email address (required for non-interactive).
 -appl <name>      Application(s) to run (optional), default is all.
 Possible values (dependent on set-up):
 -nocrc            Don't perform CRC64 check, default is on.
 -altjobs          Launch jobs alternatively, chunk after chunk. Default is off.
 -seqtype <type>   Sequence type: n for DNA/RNA, p for protein (default).
 -trlen <n>        Transcript length threshold (20-150).
 -trtable <table>  Codon table number.
 -goterms          Show GO terms if iprlookup option is also given.
 -iprlookup        Switch on the InterPro lookup for results.
 -format <format>  Output results format (raw, txt, html, xml(default), ebixml(EBI header on top of xml), gff)
 -verbose          Print messages during run

OK, this looks a little complicated. But here’s an example of a command you can write:

iprscan -cli -i ~/your_input_sequence.fasta -o ~/output_iprscan.txt -iprlookup -goterms -email your_email_address -format txt -seqtype p

This is the equivalent of the command which is launched when you use the EBI or GenOuest web form.
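If you run this kind of command often, a small helper that assembles it for you can avoid typos. The sketch below only prints the command (it does not run it) and rejects output formats that are not in the help text above. The function name and its argument order are my own choices, not part of InterProScan, and file names containing spaces are not handled.

```shell
# Print (without running) an iprscan command line for a protein sequence file.
# Arguments: input file, output file, email address, output format (default txt).
# print_iprscan_cmd is a hypothetical helper name.
print_iprscan_cmd() {
  ipr_format=${4:-txt}
  # Accept only the formats listed in the iprscan help text.
  case "$ipr_format" in
    raw|txt|html|xml|ebixml|gff) ;;
    *) echo "unknown format: $ipr_format" >&2; return 1 ;;
  esac
  printf 'iprscan -cli -i %s -o %s -iprlookup -goterms -email %s -format %s -seqtype p\n' \
    "$1" "$2" "$3" "$ipr_format"
}

print_iprscan_cmd "$HOME/your_input_sequence.fasta" "$HOME/output_iprscan.txt" your_email_address
```

Once the printed line looks right, you can copy it into your terminal or into a job script.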

Do more with the command line

Nucleic sequences

Now, what if your input sequence is a nucleic acid sequence? You just have to switch the -seqtype option from p to n. InterProScan will then search for patterns in the six reading frames of the sequence (you can specify a translation table if you want). Be careful, as this uses many more resources on the cluster.
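The help text gives a range of 20-150 for the -trlen option, so it can be worth checking the value before submitting a job. Here is a minimal sketch that validates it and prints the options to add to the command for a nucleic acid run. The function name is made up, and the codon table number used in the example call is only a placeholder: check which numbering your installation expects.

```shell
# Print the extra options for a nucleic acid run, or fail if the transcript
# length threshold is outside the 20-150 range given in the help text.
# Arguments: transcript length threshold, codon table number.
# nucleic_options is a hypothetical helper name.
nucleic_options() {
  trlen=$1
  trtable=$2
  # Reject empty or non-numeric thresholds.
  case "$trlen" in
    ''|*[!0-9]*) echo "trlen must be a whole number" >&2; return 1 ;;
  esac
  if [ "$trlen" -lt 20 ] || [ "$trlen" -gt 150 ]; then
    echo "trlen must be between 20 and 150" >&2
    return 1
  fi
  printf '%s\n' "-seqtype n -trlen $trlen -trtable $trtable"
}

# Placeholder values, for illustration only.
nucleic_options 50 1
```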

Multiple sequences

Submitting many sequences through web forms is not really convenient. The easiest solution is to use the iterator Perl script distributed with InterProScan. To use it, simply create a multi-FASTA file containing all the sequences you would like to submit (don’t mix protein and nucleic acid sequences, of course). Now, on the genocluster, run the iterator script with options like these (make sure you have issued the source command first):

-i ~/your_input_sequences.fasta -o ~/output_iprscan.txt -c "iprscan -cli -i %infile -iprlookup -goterms -email your_email_address -format txt -seqtype p"

What does it mean? Simply that the iprscan command specified with the -c option will be applied to each sequence found in ~/your_input_sequences.fasta. The file ~/output_iprscan.txt will contain the full list of motifs found by InterProScan in each sequence.

The important points in the iprscan command are to use %infile for the -i option, and to leave out the -o option so that the iterator script can capture the results.
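To see what "applying the command to each sequence" means, here is a rough shell equivalent of the idea. It is not the iterator script itself, just a sketch: it splits a multi-FASTA file into one file per sequence, then runs iprscan on each of them and appends what iprscan writes to standard output to a single result file. All function and file names are made up for the illustration.

```shell
# Split a multi-FASTA file into one file per sequence.
# Arguments: multi-FASTA file, existing output directory.
# split_fasta is a hypothetical helper name.
split_fasta() {
  awk -v dir="$2" '
    /^>/ { if (f) close(f); n++; f = sprintf("%s/seq_%04d.fasta", dir, n) }
    f    { print > f }
  ' "$1"
}

# Run iprscan once per sequence and append all results to one file.
# Arguments: multi-FASTA file, result file, email address.
# scan_each_sequence is a hypothetical helper name.
scan_each_sequence() {
  tmpdir=$(mktemp -d) || return 1
  split_fasta "$1" "$tmpdir"
  for seq in "$tmpdir"/seq_*.fasta; do
    [ -e "$seq" ] || continue
    # No -o option: results go to standard output and are appended to $2.
    iprscan -cli -i "$seq" -iprlookup -goterms -email "$3" -format txt -seqtype p >> "$2"
  done
  rm -rf "$tmpdir"
}

# Example call (not executed here):
#   scan_each_sequence ~/your_input_sequences.fasta ~/output_iprscan.txt your_email_address
```

In practice the iterator script shipped with InterProScan is the better choice on the cluster. The loop above is only meant to show the logic.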

Long sequences

If you want to submit long sequences to InterProScan (up to 1000 bp for nucleic sequences), you can use a dedicated InterProScan version on the genocluster. To use it, all you need to do is source an alternative environment script before issuing the usual commands:

source /local/env/enviprscan_local
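To decide which version you need, it helps to know how long your sequences are. The sketch below prints the length of every sequence in a FASTA file and lists the ones above a threshold that you choose yourself. The function names are made up, and the identifier is taken to be the first word of each header line.

```shell
# Print "identifier<TAB>length" for every sequence in a FASTA file.
# fasta_lengths is a hypothetical helper name.
fasta_lengths() {
  awk '
    /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id "\t" len }
  ' "$@"
}

# List the identifiers of sequences longer than a threshold.
# Arguments: FASTA file, threshold (number of residues or bases).
# long_sequences is a hypothetical helper name.
long_sequences() {
  fasta_lengths "$1" | awk -F '\t' -v max="$2" '$2 > max { print $1 }'
}

# Example call (not executed here):
#   long_sequences ~/your_input_sequences.fasta 1000
```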

Web service

An alternative to the command line is to use web services. If you don’t know what they are, don’t worry, I’ll write an article about that soon!

The EBI offers a SOAP web service on its website, but nucleic acid sequences are not allowed. GenOuest doesn’t provide a web service yet, but one should be available in a few weeks…


  1. shefali

    i have a query regarding interproscan results?? what IPR unintegrated term represent in results ?waiting for quick response ..

  2. IPR unintegrated means that InterProScan found a match on your sequence with a motif that doesn’t have a corresponding InterPro ID. This motif is only available in the member database with the given specific ID.
