Sciencemadness Discussion Board

Anyone doing Protein Folding ?

franklyn - 25-7-2006 at 07:01

Contribute to the advancement of biological science by donating your computer's unused CPU cycles. Join the distributed computing effort to resolve the protein folding dilemma.

"In 1969 Cyrus Levinthal introduced the concept, later to be known
as the Levinthal paradox, that a protein cannot find its native state
by a random search through all its possible conformations, because
such an exhaustive enumeration would in theory require eons,
while proteins fold in a fraction of a second."

Determining how proteins fold and misfold picks up where the Human Genome Project left off. Some would say the real work begins now.

"We have about a billion DNA bases in our genome. That's less than
1 GB of hard drive storage! The genetic differences between humans
and chimps are fewer than a million DNA bases --
small enough to easily fit on a 1.4 MB floppy!"

See the project home page for more info here _

An excellent "how to" resource for installing and managing the software.


Ramiel - 25-7-2006 at 09:24

Yeah, I've been using <a href="">BOINC</a> to devote all of my idle time to predictors and [shamefully perhaps] SETI for about 6 months now. It's a good idea no matter how you look at it.

p.s. I apologise for not posting much in the way of science, people; I've been away from my lovely chemicals for a couple of months.

Polverone - 25-7-2006 at 17:24

Yes, I let my PC do some folding when idle. It's my little way of saying "thank you" to the Pande group for porting the Amber force fields to Gromacs, which I use daily for molecular dynamics simulations.

chemoleo - 25-7-2006 at 18:05

Protein folding... yeah :)
It sounds so interesting, and so much like the final answer to biology, but in reality predicting structure from sequence is next to impossible.
The Levinthal paradox is always taught at uni, but it is inherently faulty: it simply counts all possible conformations (roughly three backbone angles per amino acid, so at 100 amino acids that makes 3^100 possibilities), while ignoring folds where the chain would overlap with itself. There are of course many fewer, as there may be internal clashes, and many are energetically unfavourable. That's why the view of a 'folding funnel' was invoked, where some low-energy structures fold further and further until the lowest-energy one is found. Incrementality, or iteration, is the key to this.
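A quick back-of-the-envelope sketch of that 3^100 figure shows why exhaustive search is hopeless, even assuming (generously) that the chain could sample one conformation per picosecond:

```python
# Back-of-the-envelope Levinthal estimate: time to enumerate every
# conformation of a 100-residue chain with ~3 states per residue,
# sampling one conformation per picosecond.
conformations = 3 ** 100                      # ~5.2e47 states
rate = 1e12                                   # conformations sampled per second
seconds = conformations / rate
years = seconds / (3600 * 24 * 365)
age_of_universe = 1.4e10                      # years, rough figure

print(f"{conformations:.1e} conformations")
print(f"{years:.1e} years to enumerate")
print(f"{years / age_of_universe:.1e} times the age of the universe")
```

The numbers come out around 10^28 years, which is the whole point of the paradox: real proteins clearly do not search this way.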

Anyway... I really wonder what algorithm they use to calculate structures from sequence. To my knowledge no such algorithm exists; instead, force-field approximations are made (all Newtonian, not quantum mechanical). They sometimes work, but mostly they don't, and often structures are predicted *with* knowledge of the experimentally determined structures, making the predictions biased.
All the computing power is great, but if the algorithm is incorrect, then it is useless. Get the right algorithm, and the Nobel and eternal fame will be yours, truly.

nitro-genes - 26-7-2006 at 04:48

I wonder if the perfect algorithm is even possible. Some proteins are heavily dependent on the action of chaperones during folding. These chaperones may give the protein a different folding energy distribution than can be predicted by any computer algorithm. Completely ab initio calculation of protein structure may not be possible in the near future, but algorithms like ROSETTA, with its Monte Carlo sampling, are quite promising, avoiding the "local minimum problem" that plagued some of the first algorithms. They can assist in fine-tuning NMR- and crystallography-derived structures to a higher resolution and may give a good indication of the structure of an unknown protein. I'm sure this way of looking at protein structures will become more and more important as the technique matures...
Reasons enough for me to join the "Dutch Power Cows" in their epic battle to become champion of the peptides! :P

A good, understandable book, btw, that explains protein structure prediction from sequence is Introduction to Bioinformatics by Lesk...:)

[Edited on 26-7-2006 by nitro-genes]

Polverone - 26-7-2006 at 17:41

The folding@home project is as much about testing methods as actually trying to fold proteins. I don't think there's any reason, in principle, why Newtonian methods can't handle most folding problems; excited states and the making/breaking of covalent bonds aren't typically involved, are they? It's conformational changes and weak interactions within the molecule and with solvent and ions, right?

There are 3 big obstacles to protein folding with force field based methods, as I see them:
1) The force fields may be too simple to reproduce the path that leads to experimental structures. For example, maybe it turns out that explicit polarization is necessary and the old AMBER and CHARMM workhorses would never, ever give rise to correct folding for most proteins.
2) The force fields may be theoretically adequate to represent the process, but are not appropriately parameterized to accomplish it in practice. Water models, ion models, and biopolymer models all get parameterized based on different criteria. Parameterizations use substantial shortcuts because of shortfalls in electronic structure calculations/experimental data that are available. Force field parameters explicitly dedicated to protein folding might be necessary for good results.
3) If I'm to believe Wikipedia, proteins of a few hundred amino acids fold on a time scale of milliseconds. That's an enormous length of time for a molecular dynamics simulation. Nanosecond simulations are routine, but microseconds are still quite exotic. Milliseconds are staggering, and the slowness with which millisecond-length simulations can be produced will likely impede the iterative improvement of points 1 and 2.
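The gap in point 3 is easy to quantify. Assuming a typical 2 fs integration timestep, the step counts per simulated timescale work out roughly as:

```python
# Number of MD integration steps needed per simulated timescale,
# assuming a typical 2 fs timestep.
timestep_fs = 2.0
for label, duration_fs in [("1 ns", 1e6), ("1 us", 1e9), ("1 ms", 1e12)]:
    steps = duration_fs / timestep_fs
    print(f"{label}: {steps:.0e} steps")
```

So a millisecond trajectory needs on the order of 10^11 to 10^12 force evaluations, roughly a million times more than a routine nanosecond run.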

For all we know, the algorithm(s) needed to properly fold proteins may already exist: we just can't tell because our computers are too slow and/or the parameters chosen so far have been poor.

chemoleo - 27-7-2006 at 17:02

From what I understand, the biggest problem resides in 2): the presence of molecules *other* than the protein itself. Most simulations, to my knowledge, model folding in vacuo, that is, in the absence of water, ions, cofactors, or other proteins, without which the protein in question cannot adopt the correct fold. How are we ever going to account for the water molecules around a given protein molecule? We cannot even calculate the dynamic hydration cage around a simple small amphipathic molecule, for instance. How can this then work for large molecules such as proteins? Furthermore, many proteins simply do not fold up under conditions of low salt. Physiological ionic strength within cells is around 150 mM, with all sorts of counterions: K+, Na+, Ca2+ and Cl-, SO4(2-), PO4(3-), etc. How can you calculate the essentially infinite number of highly dynamic 'conformations' of the solvent surrounding the protein, which keep it stably in solution?

This is why proteomics-based approaches are essentially doomed to fail. No one can purify proteins and quantify them in their folded states in simple chip assays, unlike DNA. Each protein is essentially a new chemical in its own right. General purification protocols are only applicable to simple, small, soluble proteins, but fail as soon as complexes or larger proteins are involved.

Are you aware of programs that, apart from empirical parameterisation, rigorously and physically account for the presence of solvent molecules?

We can't even do it for simple molecules, so what makes you think it will be possible to fold large proteins in silico? Re. problems that are non-Newtonian in nature: from NMR I know that the environment has huge effects on things such as the electron clouds, which is seen through the observable effect on the spin resonances (chemical shifts). No one can even attempt to predict how solvents will affect the resonances; how then can we hope to predict the folding of a protein altogether? Then there is the three-body problem. Here you have thousands of 'bodies', making a closed-form solution impossible. This, I suppose, is why such large computational power is required: to attempt solutions iteratively...

Polverone - 27-7-2006 at 17:37

MD simulations routinely use explicit water and counterions. Vacuum simulations of DNA fail horribly after a short time, so I always include water. Are you sure you can't see hydration cages around small molecules? I've written little programs to analyze hydrogen bonding networks among and residence times of water molecules around altered DNAs, and you can definitely see enhanced hydration effects around newly introduced polar groups (this is all from analysis of MD simulations I've run).

We don't need exact or analytical solutions to the folding problem; iterative and/or stochastic solutions are fine as long as the later stages are close enough to real protein behavior. Being able to get arbitrarily close to the solution, as with other many-body problems, should be fine.
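The iterative/stochastic idea can be illustrated with a toy Metropolis Monte Carlo search on a made-up one-dimensional energy landscape; this is a minimal sketch of the bare algorithm, not a real force field:

```python
import math
import random

def metropolis(energy, x0, steps=20000, step_size=0.5, kT=1.0, seed=42):
    """Toy Metropolis sampler: always accept downhill moves,
    accept uphill moves with probability exp(-dE/kT)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        x_new = x + rng.uniform(-step_size, step_size)
        e_new = energy(x_new)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Double-well "landscape": wells near x = +1 and x = -1, with the
# global minimum in the x = -1 well because of the tilt term 0.2*x.
double_well = lambda t: (t**2 - 1)**2 + 0.2 * t
x, e = metropolis(double_well, x0=1.0)   # start in the wrong well
print(f"found minimum near x = {x:.2f}, E = {e:.3f}")
```

The thermal acceptance of uphill moves is what lets the walker escape the local minimum it started in, which is exactly the "local minimum problem" mentioned earlier in the thread.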

As I said before, I don't know if current force field models are capable of handling folding well, but I wouldn't dismiss the entire approach so quickly. At the same time I would remain cautious because it will take decades of Moore's law, or at least many years of development of dedicated hardware, to make millisecond-length MD simulations routine, and that's regardless of whether or not current force fields are good enough. Fancier force fields are generally slower as well, though over time that's offset by improved implementation techniques.

chemoleo - 27-7-2006 at 18:20

Ok, I confess I am not up to date on the inclusion of solvent in MD simulations. At the time of my undergrad, we were certainly taught that this was an insurmountable problem. I wonder how it is accounted for. I shall ask around in the lab later.

Also, how does DNA fail horribly in vacuo? Does it simply not form the double helix? Are the positions of the water molecules realistic; do they overlap with the H2O's in e.g. crystal structures?
By the way, what I am talking about is the precise locations of the water molecules, which are of course extremely dynamic but presumably 'average out' over time.

Anyway, experimentally observed folding of proteins is still far from showing decisive overlap with simulated folding, although there are a couple of reports that claim this (and of which I am doubtful; see Valerie Daggett's papers).

Polverone - 27-7-2006 at 20:44

Explicit solvent has been available in popular MD codes for decades. Proteins and small molecules don't do such awful things as nucleic acids do in vacuo, though, so maybe the simplified for-undergraduates story you heard was "we can't do solvated simulations" when they meant "we don't want to spend the computer time to do solvated simulations."

Some popular water models are/were SPC, SPCE, and TIP3P. Since water molecules are very numerous in solvated systems, and people generally care more about solute behavior than solvent behavior, they've generally been computationally simple models parameterized on (for example) reproducing the density of liquid water. I wouldn't be surprised at all if they fail to match many experimental hydration effects. Anecdotally, I have simulated one DNA starting from an experimental structure that had a single water molecule explicitly included (in the "hole" left by an abasic site), and water molecules diffused in and out of that location over the course of the whole simulation.
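To illustrate just how simple these models are: TIP3P is fully specified by three point charges on a rigid geometry, and its dipole moment follows directly from those numbers. A sketch using the standard published TIP3P parameters (the conversion 1 e·Å ≈ 4.803 D):

```python
import math

# TIP3P rigid water: three point charges on a fixed geometry.
q_H = 0.417                          # H partial charge (e); O carries -0.834
r_OH = 0.9572                        # O-H bond length (angstrom)
theta = math.radians(104.52)         # H-O-H angle

# With O at the origin, the dipole is the charge-weighted sum of the
# atom positions; by symmetry it lies along the H-O-H bisector.
dipole_eA = 2 * q_H * r_OH * math.cos(theta / 2)   # in e*angstrom
dipole_D = dipole_eA * 4.803                        # 1 e*angstrom ~ 4.803 debye
print(f"TIP3P dipole ~ {dipole_D:.2f} D (vs ~1.85 D for real gas-phase water)")
```

The enlarged dipole relative to gas-phase water is deliberate: it crudely mimics the average polarization of water in the liquid, which is exactly the kind of shortcut that makes these models fast but limits how well they reproduce hydration effects.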

Sometimes you get lucky: for example, with the right choices the SPC water model does an excellent job of reproducing the relative free energies of hydration of chloride and bromide ions, and it's about as simple as water models get. But this is coincidence, not design. Michael Shirts has written some very interesting papers on highly accurate, precise, reproducible determination of relative and absolute solvation free energies using molecular dynamics (the folding@home project produced at least some of the data that he used for this). He found that current water models are all pretty bad at reproducing solvation free energies, but now that simulation on the necessary grand scale is available, it's possible to reparameterize models with the explicit goal of correcting these free energies.

It's possible in many packages to use non-water solvents also, but water is by far the most common and its performance is specially optimized in e.g. Gromacs and probably other packages too; I could solvate my systems in methanol or THF, but it'd be considerably slower.

In vacuum, electrostatic repulsion between the backbones will rapidly disintegrate the double helix. You can fake it for a while (a few hundred picoseconds) if you use a short cutoff for coulombic interactions, so the charges are rarely "seen" by the paired strands, but I think even those simulations eventually fall apart. For stability we add explicit water, explicit Na+ counterions, and model long-range coulombic interactions with particle mesh Ewald or fast multipole method (this isn't a way to "cheat" the charges out of the system, just a way to speed up simulation and keep the computational expense scaling under control).
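A minimal sketch of why a short cutoff "hides" that repulsion (reduced units, with hypothetical charges and distances chosen only for illustration):

```python
import math

def coulomb_energy(charges, positions, cutoff=None):
    """Pairwise Coulomb sum in reduced units (k = 1). A cutoff, if given,
    simply discards pairs beyond it -- the crude truncation that PME and
    fast multipole methods replace with a proper long-range treatment."""
    e = 0.0
    for i in range(len(charges)):
        for j in range(i + 1, len(charges)):
            r = math.dist(positions[i], positions[j])
            if cutoff is None or r <= cutoff:
                e += charges[i] * charges[j] / r
    return e

# Two like charges 12 units apart, standing in for the phosphate backbones:
q = [-1.0, -1.0]
pos = [(0.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
print(coulomb_energy(q, pos))              # full sum: repulsive (+1/12)
print(coulomb_energy(q, pos, cutoff=9.0))  # truncated: 0, repulsion unseen
```

With the cutoff the strands never "see" each other's charge, which is why truncated vacuum simulations can limp along for a few hundred picoseconds before falling apart.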

unionised - 1-8-2006 at 09:19

I think it's a safe bet we are all doing protein folding. Otherwise we would be dead.

franklyn - 5-8-2006 at 21:20

Originally posted by unionised
I think it's a safe bet we are all doing protein folding. Otherwise we would be dead.


Nerro - 6-8-2006 at 04:21

Wouldn't it be much easier to just save up a lot of models of proteins yet to be folded and to let a supercomputer do it in a few hours? Or is the combined power of the people on the web (those that want to participate that is, not all people on the web) equal to a supercomputer already?

Also, since DNA has four possible "symbols" per base rather than the two states of a hard-drive bit, every base would have to be numbered using 2 "hard-drive bits": 00, 01, 10, or 11. A billion bases then comes to 2 gigabits, i.e. about 250 MB :)
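A minimal sketch of that 2-bit encoding (a hypothetical helper, not any real tool; note that at 2 bits per base, a billion bases comes to 2 gigabits, i.e. about 250 MB):

```python
# Pack a DNA sequence at 2 bits per base.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    n = 0
    for base in seq:
        n = (n << 2) | CODE[base]   # append 2 bits per base
    return n.to_bytes((2 * len(seq) + 7) // 8, "big")

print(pack("ACGT").hex())     # 4 bases fit in 1 byte: 0b00011011 -> "1b"
print(len(pack("A" * 1000)))  # 1000 bases -> 250 bytes
```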

franklyn - 6-8-2006 at 09:20

Originally posted by Nerro
Wouldn't it be much easier to just save up a lot of models of proteins yet to be folded and to let a supercomputer do it in a few hours?

The problem is to determine the spatial configuration of a given protein chain.
It's called folding in a sense similar to origami, in which a sheet of paper can be made into countless different shapes.
A very good analogy is to dangle a chain of the kind used for jewelry and, as it is set down, observe the particular pattern it forms as it bundles in a heap. This can be repeated indefinitely and the pattern will never repeat.
This illustrates the problem: each protein has a definite characteristic shape. How does it achieve that, out of an almost literal infinity of possibilities?


franklyn - 6-9-2006 at 08:02


Nerro - 6-9-2006 at 09:43

Well, it's not exactly an infinite supply of possibilities. After all, the energies vary with every way the protein is folded, and certain sequences always form helices or beta-sheets. Proteins may have similar structures, or partially similar sequences.

Are you really telling me it takes so long to do a large number of rough approximations? When those are done, the most probable ones could be calculated properly later...

Sorry for the critical look; it just seems like a lot of people are making elephants out of mice :P

unionised - 6-9-2006 at 11:40

"Well, it's not exactly an infinite supply of possibilities."
Are you sure?
The angle between two parts of the protein might be 180 degrees, it might be 90, or it might be any of an infinite variety of angles in between.
Anyway, supercomputers are expensive, whereas getting people to donate computer time via screensavers a la SETI is cheap.
Presumably, if it is a computationally simple problem, then as soon as someone downloads the software and gives it a few minutes it will all be over and done with.

neutrino - 6-9-2006 at 12:37

Supercomputers really aren’t as powerful as people think.

A supercomputer is nothing more than a large number of smaller computers in an array. Put together a bunch of desktops and you can rival the most powerful supercomputers in existence. In fact, Virginia Tech did this a few years ago:

From Wikipedia:

System X is a supercomputer assembled by Virginia Tech in the summer of 2003, comprising 1,100 Apple PowerMac G5 computers….On November 16, 2003, it was ranked by the TOP500 list as the third-fastest supercomputer in the world.

A large network of ordinary desktops really does possess an awe-inspiring amount of computing power.

Polverone - 6-9-2006 at 13:37

Originally posted by neutrino
Supercomputers really aren’t as powerful as people think.

A supercomputer is nothing more than a large number of smaller computers in an array. Put together a bunch of desktops and you can rival the most powerful supercomputers in existence.

A large network of ordinary desktops really does possess an awe-inspiring amount of computing power.

This is only true for some problems. F@H and the other @home projects work because they can break their research into many small sub-problems that execute independently. These are "embarrassingly parallel" problems that are trivial to scale up. Many problems need rapid and frequent communication between processors to get any parallel speedup, and that's where multiprocessors, clusters, and "true" supercomputers shine a lot brighter than a big pile of PCs with special screensavers.

The line between cluster and "true" supercomputer is blurred, but both of them are almost sure to have very high speed connections between the nodes of the computer. Gigabit ethernet is entry-level nowadays, but faster and more specialized setups like Infiniband, Myrinet, or Quadrics are needed to get decent scaling on many problems. When inter-node communication gets too slow, the processors waste their time waiting for information from their neighbors. The larger the number of processors, the more likely (in general) that there will be significant slowdown due to communication bottlenecks. Eventually you reach a point where adding new processors does no good at all. Making communication-heavy problems take advantage of a large number of processors requires a careful mix of software and hardware capabilities, and is very difficult.
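The communication-bottleneck argument above is essentially Amdahl's law: if a fixed fraction of the work is serial, or spent waiting on the network, speedup saturates no matter how many processors you add. A simplified sketch (real MD scaling is messier than this single-parameter model):

```python
def amdahl_speedup(n_procs, serial_fraction):
    """Amdahl's law: speedup on n processors when a fixed fraction
    of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Even 5% serial/communication overhead caps the speedup at 20x,
# no matter how many nodes are thrown at the problem.
for n in (8, 64, 512, 4096):
    print(n, round(amdahl_speedup(n, 0.05), 1))
```

Going from 512 to 4096 processors buys almost nothing here, which is the "adding new processors does no good at all" regime described above.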

Wolfram - 23-2-2007 at 05:35

Are you sure the known protein structures are right? Could it not be that the protein structure gets modified when proteins form crystals?

nitro-genes - 17-5-2007 at 10:15

No, because both crystallography and NMR can be used to determine protein structure. The advantage of NMR is that the protein can be studied in solution, so there is no need to grow perfect crystals of the protein you want to study, which is not always easy. Crystallography does give higher resolution, but generally the NMR- and crystallography-determined structures are not far apart...

Re: anyone doing protein folding?

Pyridinium - 17-5-2007 at 11:16

This might be a huge oversimplification, but I always figured proteins "know" how to fold in the same way any other molecules "know" to form certain compounds or stacked complexes in solution. It's just a difference of scale.

I'm looking in Proteins by T.E. Creighton (great book). It says a typical globular protein has maybe 30% of its amino acids at turns where they don't participate in folding bonds. I guess some passable folding algorithms for a given protein could be made by putting together tidbits like these... although computational chemistry was never my favorite subject.

Creighton's book hints that intracellular proteins are actually folding and unfolding all the time in a cell. I think I remember the profs telling us the same thing. Strange, but I guess it makes sense if they are only H-bonded and not pegged together with disulfides.

They'd still have to spend more time folded than they would unfolded in order to do anything useful, correct?

EDIT: I should have mentioned that it's mainly the structural / extracellular proteins that have disulfide bonds. Intracellular proteins are more often strictly H bonded. Yes I'm sure you could find some exceptions to this.

[Edited on 18-5-2007 by Pyridinium]

Taaie-Neuskoek - 22-7-2007 at 13:59

Originally posted by chemoleo
This is why proteomics-based approaches are essentially doomed to fail. No one can purify proteins and quantify them in their folded states in simple chip assays, unlike DNA. Each protein is essentially a new chemical in its own right. General purification protocols are only applicable to simple, small, soluble proteins, but fail as soon as complexes or larger proteins are involved.

As someone working in a group where people do a lot of proteomics, I tend to disagree. Ok, the proteome is still a really big black box, but there are some really interesting developments... now there's the ChIP-chip array, so you can fish out transcription factors more easily, and there are people who have made protein arrays onto which you can dump your favourite MAPK and see which proteins get phosphorylated, etc. Proteomics people themselves freely admit that the field is still in its very beginning and that a hell of a lot still needs to be done, but I believe they'll give scientists some great tools... and at least they're honest about it, unlike metabolomics people, who promise a hell of a lot more than they can actually deliver.

In our group we do some very basic proteomics work, mainly 2D gels, and try to find differences between genotypes (talking plants here), which seems to work well. Sometimes even in-gel assays for native proteins are done successfully. This technique has also been automated and is delivering quite good results...

My BOINC project

pantone159 - 10-9-2007 at 12:06

I am about to start doing some of this stuff. I will be working on a distributed computing project, also based on BOINC, in my case the goal is to study cell adhesion and migration by molecular modeling of membrane proteins.

I don't understand the science much yet, at first I will be focused on the programming tasks of getting a placeholder computing system working, which will take some time, and after that will start to look at distributing our existing codes. (I don't know much about the specific models/codes that have been used so far. For example, I don't know how they compare to the Folding@home and Rosetta@home models.)

However, I see that some of the people here are really familiar with these kinds of models. I'm interested in pointers on where I can learn how these models work.

Regarding distributed computing systems... It seems like there are two main distributed computing frameworks that people use: BOINC (Berkeley Open Infrastructure for Network Computing), which is used by Seti@home and Rosetta@home and will be used by our stuff, and Cosm (perhaps also named Mithril), which is used by Folding@home.
We picked BOINC as it seems to be more heavily used, so under more active development, more feature-filled, and more general. (I think that the exact same code that you downloaded to run Seti@home could be used to run our stuff, just by pointing it at a different website to get instructions from.)

At some point (not yet), we will be interested in volunteers, mainly for basic testing purposes at first.

So, my main questions are:
1 - Where should I go to read about the models used? I know little specifically about molecular dynamics models, although I do generally understand the underlying physics.
2 - Any advice regarding distributed computing?