
Profiles, PWM, PSWM, PSSM

by A. Bretaudeau on July 1, 2010

Remember the problem we faced some time ago?

The problem is that with PROSITE syntax, you can’t write patterns with subtleties like “in position X, there’s a G most of the time, but sometimes a C”.

This kind of situation is very common in biological sequences. Interactions between proteins and/or sequences do not always follow an all-or-none law: if a protein binds to a specific DNA sequence, there is a good chance that it will also bind to slightly degenerate versions of that sequence. The interaction may be a bit weaker, but it can be sufficient for the biological function to occur.

The whole art of pattern matching and discovery is to tolerate degenerate versions of a motif without accepting random sequences that lack the biological function.

Profiles

To address this issue, a category of motifs called profiles is commonly used. First, have a look at this alignment:

AATAGTCGC
GGTAGTCTA
ATTAGTCGA
GCTAGTCGG

And now, the corresponding profile:

   1     2     3     4     5     6     7     8     9
A: 0.5   0.25  0     1     0     0     0     0     0.5
T: 0     0.25  1     0     0     1     0     0.25  0
G: 0.5   0.25  0     0     1     0     0     0.75  0.25
C: 0     0.25  0     0     0     0     1     0     0.25

Columns correspond to positions in the motif, and rows to the possible letters (DNA contains only A, T, G or C). Each cell gives the frequency of a letter at a given position in the alignment, which we use as an estimate of the probability of finding that letter at that position in the motif.
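The construction described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming an ungapped alignment of equal-length DNA sequences; the names `ALPHABET` and `build_profile` are made up for this example.

```python
from collections import Counter

ALPHABET = "ATGC"  # row order used in the table above


def build_profile(alignment):
    """Return {letter: [frequency at each position]} for an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    profile = {letter: [] for letter in ALPHABET}
    # zip(*alignment) walks through the alignment column by column
    for column in zip(*alignment):
        counts = Counter(column)
        for letter in ALPHABET:
            profile[letter].append(counts[letter] / len(alignment))
    return profile


alignment = ["AATAGTCGC", "GGTAGTCTA", "ATTAGTCGA", "GCTAGTCGG"]
profile = build_profile(alignment)
for letter in ALPHABET:
    print(letter, profile[letter])
```

Running it on the four sequences of the alignment gives the same numbers as the table.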

In the literature, you’ll find several terms for profiles: PWM (Position Weight Matrix), PSSM (Position-Specific Scoring Matrix), PSWM (Position-Specific Weight Matrix).

Matching profiles

Pattern matching with a profile is simple. Let’s test the following sequence against the motif above:

ATGAGTCGA

All we have to do is add up, position by position, the score that the matrix above assigns to each letter of the sequence:

0.5 + 0.25 + 0 + 1 + 1 + 1 + 1 + 0.75 + 0.5 = 6 points

So with profiles it is possible to compute a score that evaluates whether a sequence contains a given motif. The higher the score, the more likely it is that the sequence contains the motif (for this profile, the best possible score is 7 points). This is a quantitative result, whereas with PROSITE syntax you only get a qualitative result (the sequence contains the motif or not).
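Here is a small Python sketch of this additive scoring, using the profile from the table above. The function names `score` and `scan` are hypothetical, and sliding the motif along a longer sequence is just one simple way to use the score, not a reference implementation.

```python
# Profile copied from the table above (rows: letters, columns: positions 1-9)
PROFILE = {
    "A": [0.5, 0.25, 0, 1, 0, 0, 0, 0, 0.5],
    "T": [0, 0.25, 1, 0, 0, 1, 0, 0.25, 0],
    "G": [0.5, 0.25, 0, 0, 1, 0, 0, 0.75, 0.25],
    "C": [0, 0.25, 0, 0, 0, 0, 1, 0, 0.25],
}
MOTIF_LENGTH = 9


def score(window):
    """Add up the profile value of each letter at its position."""
    if len(window) != MOTIF_LENGTH:
        raise ValueError("window must have the same length as the motif")
    return sum(PROFILE[letter][i] for i, letter in enumerate(window))


def scan(sequence):
    """Yield (start, score) for every motif-length window of a longer sequence."""
    for start in range(len(sequence) - MOTIF_LENGTH + 1):
        yield start, score(sequence[start:start + MOTIF_LENGTH])


print(score("ATGAGTCGA"))  # 6.0

# Best-scoring window in a longer, made-up sequence
best = max(scan("CCATGAGTCGATT"), key=lambda hit: hit[1])
print(best)
```

In practice you would then pick a score threshold above which a window is reported as a match; where to put that threshold depends on the motif and is not covered here.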

Discovering profiles

In the above example, we built our profile using an alignment of sequences. So the scores written in the matrix are calculated from observations of real sequences.

There are other ways to build such matrices, especially for protein sequences. They let you make the motif more expressive by adding biological knowledge.

A common technique is to assign higher scores to amino acids that are rarely seen in biological sequences or that have important biochemical properties. Suppose you know there’s a conserved phenylalanine (F) and a conserved leucine (L) in your motif. Because of its chemical properties, a phenylalanine (F) is much more meaningful than a simple leucine (L). So you can increase the score assigned to the phenylalanine in the profile. This way, a degenerate sequence that contains the phenylalanine (F) but not the leucine (L) will get a higher score than a sequence that contains only the leucine.

Another technique also relies on the biochemical properties of amino acids: if you have a conserved leucine (L) in your motif, you can also assign a significant score to similar amino acids, for example isoleucine (I). And you can assign a negative score to completely different amino acids, for example proline (P).

This is a Venn diagram representing the different families of amino-acids based on their biochemical properties:

Amino Acids Venn Diagram

source: wikimedia

As you can see, leucine and isoleucine are close to each other, which means they have similar properties.
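The two hand-tuning techniques described above can be illustrated with a toy two-position protein motif (a conserved F followed by a conserved L). All the scores below are invented for the example, and `POSITION_SCORES`, `DEFAULT_SCORE` and `score` are hypothetical names; real profiles use carefully derived values.

```python
# One dict per motif position; amino acids not listed get DEFAULT_SCORE.
# All values are invented for illustration only.
POSITION_SCORES = [
    {"F": 5},                    # conserved phenylalanine, deliberately boosted
    {"L": 2, "I": 1, "P": -2},   # conserved leucine; isoleucine tolerated, proline penalized
]
DEFAULT_SCORE = 0


def score(peptide):
    """Add up the score of each residue at its position in the motif."""
    if len(peptide) != len(POSITION_SCORES):
        raise ValueError("peptide must have the same length as the motif")
    return sum(
        column.get(residue, DEFAULT_SCORE)
        for column, residue in zip(POSITION_SCORES, peptide)
    )


for peptide in ["FL", "FI", "FA", "FP", "AL"]:
    print(peptide, score(peptide))
```

With these made-up numbers, "FA" (phenylalanine kept, leucine lost) scores higher than "AL" (leucine kept, phenylalanine lost), and "FI" scores higher than "FP".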

BLOSUM, PAM: substitution matrices

Another approach is used to improve profile generation. Think about the leucine/isoleucine case above: it is possible to assign a significant score to the isoleucine if there’s a conserved leucine at a position in the motif. But which score should we assign? 1 point, 10 points, more?

To address this issue, substitution matrices have been designed, the most famous ones being BLOSUM and PAM. The principle is simple: take a big set of related sequences with a homogeneous degree of divergence, then compute how frequently each amino acid is substituted by each other one. You end up with a matrix that looks like this:

BLOSUM62 substitution matrix (source: Hannes Röst, wikimedia)

For each amino acid on the left, you get the score corresponding to its substitution by another amino acid at the bottom. For example, if a leucine is replaced by a glycine, the score is -4 points, which is a penalty. But if a tryptophan is conserved, the score is very high (11 points). If you find an isoleucine instead of a leucine, you’ll get 2 points, which is still positive.

This kind of matrix is used by BLAST to assign a score to an alignment of proteins with substitutions.
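To make the idea concrete, here is a minimal Python sketch that scores an ungapped alignment of two short peptides with a handful of BLOSUM62 entries. It is not how BLAST itself is implemented (BLAST also deals with gaps and search heuristics). The three substitution scores quoted above come from the text; the L/L, I/I and G/G values are copied from the standard BLOSUM62 table. The names `BLOSUM62_SUBSET`, `substitution_score` and `alignment_score` are made up for this example.

```python
# A few BLOSUM62 entries, keyed by the alphabetically sorted pair of residues.
# The full matrix covers all 20 amino acids; this subset is only for illustration.
BLOSUM62_SUBSET = {
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("G", "G"): 6,
    ("W", "W"): 11,
    ("I", "L"): 2,
    ("G", "L"): -4,
}


def substitution_score(a, b):
    """Look up the score of a pair; the matrix is symmetric, so order is ignored."""
    key = tuple(sorted((a, b)))
    return BLOSUM62_SUBSET[key]  # raises KeyError for pairs outside the subset


def alignment_score(seq1, seq2):
    """Sum the substitution scores of an ungapped alignment, column by column."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("WLL", "WIG"))  # 11 + 2 - 4 = 9
```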

Conclusion

To sum up the situation:

– PROSITE syntax is great for describing very conserved patterns. It gives a yes/no response when doing pattern matching. This kind of pattern is based on observation of sequence alignments.

– Profiles (PWM/PSWM, PSSM) are useful for less conserved motifs. Pattern matching with profiles gives a quantitative result (a score). You can improve your profiles with biological data.

We now have 2 ways to describe a motif! But we’re not done yet: there are HMMs too. We’ll talk about that later…
