Develop a way to retrieve frequent protein segments with a suitable algorithm - program, ideas and so on (Aug/02/2006)

Hi! I am Rod.

Each protein sequence can be expressed as a number of fragments.
A set of proteins can then be classified into categories for prediction of protein families, subcellular localizations, etc.

More specifically, I am first going to develop a program that constructs a tree of these fragments by feeding it the protein sequences one at a time.

Upon the completion of tree construction, all fragments with significant frequencies can be retrieved as features.
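
A minimal sketch of this idea in Python, under stated assumptions: it uses an uncompressed suffix trie with a capped fragment length (`max_len`) rather than a true linear-time generalized suffix tree, and it treats "significant frequency" as "occurs in at least `min_support` sequences". The class name, parameters and thresholds are all hypothetical choices for illustration.

```python
class SuffixTrie:
    """Uncompressed generalized suffix trie (illustrative, not a real GST)."""

    def __init__(self, max_len=5):
        self.max_len = max_len   # cap on fragment length, to bound memory
        self.root = {}           # residue -> {"ids": set, "children": dict}
        self.n_sequences = 0

    def add(self, sequence):
        """Insert every suffix of one sequence, truncated to max_len."""
        seq_id = self.n_sequences
        self.n_sequences += 1
        for start in range(len(sequence)):
            children = self.root
            for residue in sequence[start:start + self.max_len]:
                node = children.setdefault(
                    residue, {"ids": set(), "children": {}}
                )
                node["ids"].add(seq_id)  # this sequence contains the fragment
                children = node["children"]

    def frequent(self, min_support, min_len=2):
        """Return {fragment: number of sequences that contain it}."""
        found = {}
        stack = [("", self.root)]
        while stack:
            prefix, children = stack.pop()
            for residue, node in children.items():
                fragment = prefix + residue
                support = len(node["ids"])
                if support < min_support:
                    continue  # extensions of a rare fragment are also rare
                if len(fragment) >= min_len:
                    found[fragment] = support
                stack.append((fragment, node["children"]))
        return found


# Feed the sequences one at a time, then retrieve the frequent fragments.
trie = SuffixTrie(max_len=4)
for seq in ["MKVLAA", "GKVLTT", "MKVQQ"]:  # made-up toy sequences
    trie.add(seq)
frequent_fragments = trie.frequent(min_support=2)
```

Memory grows with the number of residues times `max_len`, so for large protein sets a compressed suffix tree or a suffix array would likely be a better fit than this sketch.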

How should I approach the next step? With a generalized suffix tree (GST), or not?
Please give me some suggestions, or share your experience.
Thank you so much!

-rod-

Hello Rod,

>Each protein sequence can be expressed as a number of fragments.
OK, windows I assume. This is common practice - generally windows of 15 residues around a central residue are used.
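
As a rough illustration of the windowing, here is a minimal Python sketch. The function name `residue_windows` and the choice to pad the sequence ends with `X` (so that every residue gets a full-width window) are assumptions for the example, not something fixed by the approach.

```python
def residue_windows(sequence, width=15, pad="X"):
    """Return one window of `width` residues centred on each residue."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a centre")
    half = width // 2
    # Pad both ends so the first and last residues also get full windows.
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


windows = residue_windows("ACDEF", width=3)
# windows is ["XAC", "ACD", "CDE", "DEF", "EFX"]
```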

>A set of proteins can then be classified into categories for prediction of protein families, subcellular localizations, etc.
Erm, not sure what you mean here. Classically, proteins are divided up by their secondary structure content - either as a SCOP- or CATH-like family or by overall fold type. Classifiers are then trained on data from, say, SCOP all-alpha proteins.

>More specifically, I am going to develop the program first so as to construct such tree by feeding it with the protein sequences one by one each time.
So do you mean an 'online' learning algorithm?
What does this tree do - what do you want it to do? Please explain a bit more.

>Upon the completion of tree construction, all fragments with significant frequencies can be retrieved as features.
Is a tree the best way to do something like this? What do your feature vectors look like?
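
For illustration, one simple shape such feature vectors could take is a fixed-length vector with one entry per retrieved fragment, holding how often that fragment occurs in the protein. The sketch below assumes this encoding; the helper names and the toy fragment list are hypothetical.

```python
def count_overlapping(sequence, fragment):
    """Count occurrences of fragment in sequence, overlaps included."""
    count = 0
    start = sequence.find(fragment)
    while start != -1:
        count += 1
        start = sequence.find(fragment, start + 1)
    return count


def feature_vector(sequence, fragments):
    """One count per fragment, in the same order as `fragments`."""
    return [count_overlapping(sequence, f) for f in fragments]


fragments = ["KV", "MKV", "VL"]  # e.g. fragments judged frequent beforehand
vec_a = feature_vector("MKVLAA", fragments)  # [1, 1, 1]
vec_b = feature_vector("MKVQQ", fragments)   # [1, 1, 0]
```

Vectors of this kind have the same length for every protein, so they can be passed directly to a classifier.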

>How should I approach the next step? With a generalized suffix tree (GST), or not?
I have never used a GST; it looks a lot like a decision tree (from a brief scan of the Wikipedia page). It is hard to guess which route is best, given that I don't understand what you are going to try.

>please give me some suggestions, or your experience.
I have been using SVMs, ANNs, Bayesian theory and various distance-based methods for some time... you may want to consider these.

>thank you so much!
Thanks. If you hadn't posted this, I wouldn't have come across GSTs... I will now be reading up on them.

-perlmunky-