
Weight matrix - (Jun/30/2006)

Hi everyone!
I want to construct a weight matrix from multiple alignment results.
But I can't find a simple tool to do this. Can you suggest an easier method than the neural network learning algorithm?
Thanks!
Yours truly,
bo yang

-boyang-

Hello.

Erm, can you elaborate a bit? By weight matrix, do you mean something like the BLOSUM or PAM matrices? If not, what exactly do you mean? Second, why would you even consider a NN, given that you are not trying to 'learn' anything?

-perlmunky-

Thanks for perlmunky's reply!
I'm sorry my post was not detailed enough. What I need to do is construct a position weight matrix from the conserved region in the multiple alignment results, because it may be a conserved binding site.
The weight matrix would then be used to predict other possible sites in the genome.
I could only find reference articles on the NN learning approach, but I have to spend most of my time doing experiments. My major was biology, so NN learning is also too hard for me because of my poor mathematics skills. But I think it may be a good chance to improve my skills.
Many thanks!
Thanks for your enthusiastic help!

-boyang-

Oh right!

You don't want a weight matrix - if you are talking in terms of machine learning, which you are, a weight matrix is something different! If you wish to measure conservation then you could use something like Shannon entropy from information theory (see Wikipedia for the mathematical definition - it is simple but I can't recall it at the moment (lack of sleep!)). Now take that idea and throw it away - you don't want it or any other silly measure of conservation!
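
For reference, the definition is short: for one alignment column with residue frequencies p_i, the Shannon entropy is H = sum of p_i * log2(1/p_i), so a fully conserved column scores 0 bits. A minimal Python sketch, assuming the alignment is a list of equal-length strings and that gap characters are simply ignored (both are assumptions for illustration; `column_entropy` and `alignment_entropies` are hypothetical helper names):

```python
import math
from collections import Counter

def column_entropy(column, gap_chars="-."):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    # H = sum over residues of p * log2(1/p)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(residues).values())

def alignment_entropies(alignment):
    """Entropy for every column of an alignment given as equal-length strings."""
    return [column_entropy("".join(col)) for col in zip(*alignment)]

if __name__ == "__main__":
    # Toy alignment, made up for illustration only.
    toy = ["ACGT", "ACGA", "ACTA", "ACTT"]
    for pos, h in enumerate(alignment_entropies(toy), start=1):
        print(pos, round(h, 3))
```

Low values mark conserved columns, high values mark variable ones.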

The problem you are trying to solve is an interesting one; indeed, a lot of work has been done using machine learning to solve biological problems - promoter identification, prediction of NLS (nuclear localisation signals), contact number, solvent accessibility, secondary structure - the list goes on.

An ideal starting point for your work is PSI-BLAST! If you take your sequences and PSI-BLAST them, you can - with the right command switches - get something called a PSSM (position-specific scoring matrix). This profile details the occurrence of each residue at every position in a multiple sequence alignment (MSA). You will need a parser to get just the information you want (they exist online - if you can't find one then I have one you can copy).

Now for each residue in our initial (seed) sequence we have a vector describing the percentage likelihood of seeing each of the other residues at that position (this is pretty much your conservation/weight matrix). Congratulations!
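
If the conserved region is a short DNA binding site, as in the original question, a matrix of this kind can also be built directly from the aligned sites without PSI-BLAST. A minimal sketch, assuming gap-free aligned DNA sites of equal length, a uniform background of 0.25 per base and a pseudocount of 1 (all assumptions for illustration; `build_pwm` and `scan` are hypothetical helper names, not part of any tool mentioned in this thread):

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds position weight matrix: one {base: score} dict per column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for column in zip(*(s.upper() for s in sites)):
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(b in BASES for b in window):  # skip windows containing N etc.
            hits.append((start, sum(pwm[i][b] for i, b in enumerate(window))))
    return hits

if __name__ == "__main__":
    # Made-up sites and sequence, for illustration only.
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
    pwm = build_pwm(sites)
    best = max(scan("GGCTATAATGCC", pwm), key=lambda hit: hit[1])
    print(best)
```

High-scoring windows are candidate sites; where to put the score cutoff is a separate question that this sketch does not address.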

Now, with a set of characterised proteins, you can label each position as either being part of an active site or not (1 = true = active site; 0 = false = not active site). Now link those labels to the vectors:
.
.
.
1 (VECTOR)
1 (VECTOR)
0 (VECTOR)
.
.
.

The question is: is a single residue and its associated vector enough to predict whether the residue is part of the active site? The answer is probably not! You need to sample the sequence space around that residue - forming a window. The next questions are: how big should this window be? Should I use a pseudo-position for terminal regions? Should I remain in the sequence space, using different length windows at terminal regions, so I DO NOT HAVE TO MAKE A PSEUDO position? The list goes on and I am lacking imagination!

The answer to these questions is not simple - much work has been done on optimising window length - generally the default is a window of 15 residues centred on a core residue (the one you know about). Then, to make matters worse, you need to consider other features that may aid your identification of said biological feature... would secondary structure, solvent accessibility, contact number etc. help? If they do have an impact, is it statistically significant?
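
The windowing idea can be sketched as follows. This is only an illustration, assuming each residue already has a numeric profile vector (for example one PSSM row), a window of 15, and all-zero vectors as the pseudo-positions beyond the termini; `window_features` is a hypothetical helper name.

```python
def window_features(profile, window=15):
    """Concatenate the profile vectors in a window centred on each residue.

    profile: list of equal-length numeric vectors, one per residue.
    Positions past either terminus are filled with all-zero pseudo-positions.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a centre residue")
    if not profile:
        return []
    half = window // 2
    pad = [0.0] * len(profile[0])
    features = []
    for centre in range(len(profile)):
        row = []
        for pos in range(centre - half, centre + half + 1):
            row.extend(profile[pos] if 0 <= pos < len(profile) else pad)
        features.append(row)
    return features

if __name__ == "__main__":
    # Toy 4-residue profile with 2 values per residue (real PSSM rows have 20).
    toy_profile = [[1, 0], [0, 1], [1, 1], [0, 0]]
    toy_labels = [1, 1, 0, 0]  # 1 = active site, 0 = not (made up)
    for label, vector in zip(toy_labels, window_features(toy_profile, window=3)):
        print(label, vector)
```

Each row, paired with its 0/1 label, is one training example for a neural net or SVM. Zero padding is just one of the options raised above; shorter windows at the termini are the alternative.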

My suggestion to you would be to have a look at the JOONE software for creating a neural net, or be a sheep and follow the latest bioinformatics trend and use a support vector machine (SVM)... both are mathematically challenging, but if I can use them then so can anyone else!

Post back with any questions and I will try to help / clarify my poor explanations / ramblings

-perlmunky-

QUOTE (perlmunky @ Jul 1 2006, 08:38 AM)

Thanks for your generous explanation.

I downloaded a parser named blast-parser, which is written in Python, but it seems that it can't get the PSSM from the PSI-BLAST results. Could you send me a copy of yours? My mail address is boyang@163.com
I will post other questions after I look at the JOONE software and other resources.
Many thanks!

-boyang-

Here is some example Perl code you should look at:
It might work. You need to give it the full path to the directory <DIR> (including the trailing slash) and then the filename <FILE>.
It prints to standard output, which you can redirect to a file, or you can change the script to write to another file yourself... something you can do.

CODE
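#!/usr/bin/perl

use warnings;
use strict;

# Check the arguments before using them.
if (@ARGV != 2) { die "usage: <DIR> <FILE>\n"; }

my ($dir, $file) = @ARGV;

# 'or' rather than '||': '||' binds to the filename, so die would never run.
open(my $fh, '<', $dir.$file) or die "Can't open $file :: $!\n";

while (<$fh>) {
  chomp;
  # Keep only data rows (position number, then a one-letter residue).
  # This skips the header, column titles, blank lines and footer statistics.
  next unless /^\s*\d+\s+[A-Za-z]\s/;
  s/^\s+//;  # drop leading whitespace so the position is field 0

  # Assumed fields: position, residue, 20 scores, 20 percentages, 2 extras.
  my @line = split /\s+/;
  next if @line < 42;
  my $acid = $line[1];         # residue letter (not printed)
  my @vect = @line[22 .. 41];  # the 20 weighted observed percentages
  print "@vect\n";
}

close $fh;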
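
Since the parser mentioned earlier in the thread was a Python one, here is a rough Python sketch of the same extraction. It assumes the ASCII PSSM layout that the Perl script indexes into: position, residue, 20 log-odds scores, then 20 weighted observed percentages, then two further columns. `parse_pssm_percentages` is a hypothetical name, and the layout should be checked against your own file before trusting the output.

```python
import re
import sys

# A data row starts with a position number followed by a one-letter residue.
ROW = re.compile(r"^\s*\d+\s+[A-Za-z]\s")

def parse_pssm_percentages(lines):
    """Return (residue, percentages) for every data row of an ASCII PSSM."""
    rows = []
    for line in lines:
        if not ROW.match(line):
            continue  # header, column titles, blank lines, footer statistics
        fields = line.split()
        if len(fields) < 42:
            continue  # not a full row; skip rather than guess
        # Fields 22..41 are assumed to be the 20 integer percentages.
        rows.append((fields[1], [int(x) for x in fields[22:42]]))
    return rows

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <path to ASCII PSSM file>
    with open(sys.argv[1]) as handle:
        for residue, percentages in parse_pssm_percentages(handle):
            print(residue, *percentages)
```

Unlike the Perl script, this also returns the residue letter, which makes it easier to line the vectors up with active-site labels later.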

-perlmunky-

Many thanks!

-boyang-