Motifs, patterns, profiles…

by A. Bretaudeau on May 18, 2010

As you know, this site is dedicated to patterns! And motifs! And profiles too! And we will surely talk about Mr Markov!

Wait… Before you get lost, I think it’s time for some definitions!

Patterns or motifs?

We will use the words “patterns” and “motifs” interchangeably to mean “a recurring thing in something”.

In the bioinformatics context, the “something” is usually a sequence: a nucleic acid sequence for genes, genomes, or RNA. It can also be a protein primary sequence (i.e. a chain of amino acids).

And the “thing” can be a functional region of the genome, such as a promoter site where transcription factors bind. For proteins, it can be a catalytic site common to several proteins.

These patterns help to find functions for new sequences or to group them in families or subfamilies.

Discovery and matching

There are two complementary ways to study motifs in biological sequences:

The first one is motif discovery: briefly, you have a set of sequences with a common function (same catalytic function, or known to bind some transcription factor for example) and you want to know the exact parts of the sequences that are common to all sequences and that are responsible for the function. A simple way to discover motifs is to align the sequences and see if there are some conserved parts.

When you already know a motif (because you or anybody else has discovered it), you can switch to pattern matching methods. In this case, you have a good description of a pattern, and you want to find some new sequences containing it. This is very useful when you want to find new members of a protein family for example, or if you want to annotate a gene.

InterProScan is a good example of a pattern matching tool: it searches a given protein sequence against many well-known patterns.

Pattern formats

An important point when studying motifs is to understand the different formats used to describe them. Let’s have a look at a few sequences containing a motif:

...AATAGTCGC...
...GGTAGTCTA...
...ATTAGTCGA...
...GCTAGTCGG...

As you can see, these 4 sequences share an identical portion in the middle: TAGTC.
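To make the "align and look for conserved parts" idea concrete, here is a minimal Python sketch. It assumes the four fragments are already aligned, have the same length and contain no gaps. The function name `conserved_columns` is made up for this illustration and does not come from any library.

```python
# The four fragments from the example above, without the flanking "...".
sequences = ["AATAGTCGC", "GGTAGTCTA", "ATTAGTCGA", "GCTAGTCGG"]

def conserved_columns(seqs):
    """Return one character per column: the shared letter if every
    sequence agrees at that column, '.' otherwise.

    Assumes the sequences are already aligned and of equal length.
    """
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*seqs)
    )

mask = conserved_columns(sequences)
print(mask)  # ..TAGTC..
```

Real motif discovery tools have to deal with unaligned sequences, gaps and imperfect conservation, so this only illustrates the simplest case.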

There are many ways to represent a sequence pattern. One of the easiest ways is to use regular expressions (or regex if you prefer). In fact, regular expressions are a family of formats, and in bioinformatics one of the most popular is the PROSITE syntax.

Regular expression: PROSITE syntax

PROSITE patterns make it possible to represent simple motifs in a format that humans can read easily. This syntax has been standardized by the PROSITE databank, which stores many known motifs in different formats and makes them available to the scientific community. With the previous example, the motif would be represented like this:

T-A-G-T-C

Each position in the sequence is represented by a letter: A, T, G or C in the case of nucleic sequences. You can also use x, a special letter meaning “at this position, there can be any letter”. This is very helpful when you have gaps in a motif. Positions are separated from each other by “-”.

Now, suppose you have a new sequence related to the 4 above:

...AATAGTCGC...
...GGTAGTCTA...
...ATTAGTCGA...
...GCTAGTCGG...
...CTTACTCGG...

With these sequences, you can write the following pattern:

T-A-[GC]-T-C

This means that the 3rd letter can be a G or a C.

With the PROSITE syntax you can also write more complex motifs like repeats of letters with variable length. For example, A(2,4) means that the letter A can be repeated from 2 to 4 times.
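As a rough illustration, the few PROSITE elements described above can be translated into a Python regular expression. This is only a sketch under clear assumptions: it handles plain letters, x, bracketed choices such as [GC] and repetitions such as (2,4), and nothing else (real PROSITE patterns have additional features that are not handled here). The function name `prosite_to_regex` and the third test sequence are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE syntax into a Python regex.

    Supported: letters, x (any letter), [..] choices and (n) or (n,m)
    repetitions. Anything else raises ValueError.
    """
    parts = []
    for element in pattern.split("-"):
        # Split the element into its core and an optional "(n)" / "(n,m)".
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = match.groups()
        core = "." if core == "x" else core  # x means "any letter"
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)

regex = re.compile(prosite_to_regex("T-A-[GC]-T-C"))  # becomes TA[GC]TC
# Two fragments from the post and one made-up sequence without the motif.
for seq in ["AATAGTCGC", "CTTACTCGG", "AATTTTCGC"]:
    hit = regex.search(seq)
    print(seq, hit.group() if hit else None)
```

The first two sequences match (TAGTC and TACTC), while the made-up third one does not. Python character classes happen to use the same bracket notation as PROSITE, which is why [GC] can be passed through unchanged.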

So the PROSITE syntax is very helpful to represent simple motifs. Many motifs, like catalytic sites for example, are well-known and are made available in databanks like PROSITE. You can view a real example of a pattern there.

Unfortunately, biology is complex

However, the PROSITE syntax is sometimes too limited to represent a complex motif. In the previous example, we wrote that a G or a C can be present in the 3rd position. But it’s not that simple! In fact, we encountered only one sequence with a C and four with a G, and the pattern T-A-[GC]-T-C does not record that difference.
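The imbalance at the 3rd position is easy to check by counting letters column by column. The sketch below assumes that the motif starts at offset 2 (counting from 0) in each of the five fragments and is 5 letters long; the variable names are only illustrative.

```python
from collections import Counter

# The five fragments from the example above, without the flanking "...".
sequences = ["AATAGTCGC", "GGTAGTCTA", "ATTAGTCGA", "GCTAGTCGG", "CTTACTCGG"]

# Assumption: the motif occupies the same 5 columns in every fragment.
motif_start, motif_length = 2, 5
motif_instances = [s[motif_start:motif_start + motif_length] for s in sequences]

# One Counter per motif position, built by reading the instances column-wise.
counts_per_position = [Counter(column) for column in zip(*motif_instances)]

for position, counts in enumerate(counts_per_position, start=1):
    print(position, dict(counts))
# Position 3 shows G four times and C once; the other positions are unanimous.
```

These per-position counts are exactly the information that a pattern like T-A-[GC]-T-C throws away.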

This problem is quite common in biological motifs, as perfectly conserved motifs are rare.

Fortunately, there are other motif representations that address this issue: profiles and hidden Markov models. We will talk about them in a future post!
