[Xplor-nih] PASD for structure determination of proteins from solid state data

Thu Nov 23 08:23:56 EST 2006

Dear Xplor-NIH developers,

Some time ago I posted some questions about using the PASD/MARVIN facility to
determine the structure of a protein from solid state data. Thank you, your
answers helped a lot. I am now ready to start the calculations, but I have some
more specific questions about data formats and about options that do not
currently exist in PASD. Is there any documentation on PASD? If not, I hope you
can help me.

I think PASD is a very elegant and robust method for handling ambiguous data and
false peaks, and I would prefer to use it. However, for the solid state data I
work with, the ambiguity is very pronounced. For your case of cyanovirin-N you
have an assignment degeneracy >=5 in 5% of the cases and a degeneracy >=10 in
less than 2% of the cases (as judged from the output of “initialMatch3dC.tcl”),
whereas for my solid state data I have a degeneracy >=5 in ca. 2/3 of the cases,
a degeneracy >=10 in 1/3 of the cases, and some degeneracies close to 100. This
is because the data are only 2D and have large line widths. Altogether, this
means that the fraction of inconsistent long-range restraints is higher than
80%.

I would like to try PASD anyway. The restraints that can be derived from the
data I use are between carbon pairs. How does that fit the data format of the
“.shifts” and “.PCK” files? Does the initialMatch3d.tcl script need any
modification to match C-C chemical shift assignment possibilities with peak
positions? I hope to have 3D spectra available later to reduce the
degeneracy.
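
As a rough illustration of how degeneracy fractions like the ones quoted above
could be tallied, the following sketch assumes the number of candidate
assignments per peak has already been collected into a plain list. The function
name and the example numbers are hypothetical and are not taken from PASD or
from real data.

```python
def degeneracy_fractions(degeneracies, thresholds=(5, 10)):
    """Return {threshold: fraction of peaks with degeneracy >= threshold}."""
    if not degeneracies:
        raise ValueError("need at least one peak")
    n = len(degeneracies)
    return {t: sum(d >= t for d in degeneracies) / n for t in thresholds}


# Hypothetical degeneracies for six peaks (illustration only, not real data).
example = [1, 2, 6, 12, 97, 3]
fractions = degeneracy_fractions(example)
for threshold, fraction in fractions.items():
    print(f"degeneracy >= {threshold}: {fraction:.2f} of peaks")
```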

For the moment, I would rather do the initial matching step myself and proceed
directly to the SA steps. I have developed a purpose-designed method that
filters out unlikely assignments in each restraint, based on an inferential
approach using prior probabilities derived from the database of structures and
information on secondary structure. I plan to use this inferential restraint
assignment (IRA) method to reduce the degeneracies of the restraints (at the
risk of neglecting some information).

I am trying to build the required “.noes” and “.shiftAssignments” files. Could
you explain to me the precise syntax of these files?

excerpt from “cvn_3dc_pass1.noes”:
------------------------------------------------------
restraint 3dc1003
   -bounds  3.30  1.80
   -intensity -2230000.000000
   -fromProtonShift 3.882600
   -toProtonShift 0.082600
   -fromHeavyatomShift 63.505000
   -note from file 3dc_capp_def.PCK, peak 1003
   -note sequential
   -note degeneracy 2
   -note Good.
   assign 3dc1003_0 3dc128_from(38_SER_HB#) 3dc492_to(39_VAL_HG1#)
      -upBoundCorrection   0.50
      -good
      -unfoldedFromProtonPeakPosition 3.882600
      -unfoldedFromHeavyatomPeakPosition 63.505000
      -unfoldedToProtonPeakPosition 0.082600
      -note upper bound increased because selections involve 1 methyls
      -note Symmetry partner is restraint 3dc1228 assignment 3dc1228_4
      -note sequential
      -note Good.
   end
   assign 3dc1003_1 3dc287_from(82_SER_HB#) 3dc492_to(39_VAL_HG1#)
      -upBoundCorrection   0.50
      -unfoldedFromProtonPeakPosition 3.882600
      -unfoldedFromHeavyatomPeakPosition 63.505000
      -unfoldedToProtonPeakPosition 0.082600
      -note upper bound increased because selections involve 1 methyls
      -note Symmetry partner is restraint 3dc1228 assignment 3dc1228_6
      -note long range
      -note Bad.  Violation in reference struct is 10.574321
   end
end
----------------------------------------- end of excerpt
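
For inspecting such a file outside of Xplor-NIH, a minimal reader can be
sketched from the layout visible in the excerpt alone: a "restraint NAME" block
containing "-option value" lines and nested "assign NAME FROM TO" blocks, each
closed by "end". This is an assumption drawn from the excerpt, not a format
specification, and parse_noes is a hypothetical helper that is not part of
PASD.

```python
def parse_noes(text):
    """Parse restraint/assign blocks laid out like the excerpt above.

    Assumes (from the excerpt only) that blocks are closed by 'end' and
    that every option line starts with '-'.
    """
    restraints = []
    restraint = None  # restraint block currently open, if any
    assign = None     # assign block currently open, if any
    for raw in text.splitlines():
        words = raw.split()
        if not words:
            continue
        if words[0] == "restraint":
            restraint = {"name": words[1], "options": [], "assignments": []}
        elif words[0] == "assign" and restraint is not None:
            assign = {"name": words[1], "from": words[2],
                      "to": words[3], "options": []}
        elif words[0] == "end":
            if assign is not None:          # closes the inner assign block
                restraint["assignments"].append(assign)
                assign = None
            elif restraint is not None:     # closes the restraint block
                restraints.append(restraint)
                restraint = None
        elif words[0].startswith("-"):
            target = assign if assign is not None else restraint
            if target is not None:
                target["options"].append((words[0][1:], " ".join(words[1:])))
    return restraints


# Shortened copy of the excerpt above.
sample = """\
restraint 3dc1003
   -bounds  3.30  1.80
   -note degeneracy 2
   assign 3dc1003_0 3dc128_from(38_SER_HB#) 3dc492_to(39_VAL_HG1#)
      -upBoundCorrection   0.50
      -good
   end
   assign 3dc1003_1 3dc287_from(82_SER_HB#) 3dc492_to(39_VAL_HG1#)
      -upBoundCorrection   0.50
   end
end
"""
parsed = parse_noes(sample)
for r in parsed:
    print(r["name"], [a["name"] for a in r["assignments"]])
```

The reader keeps every option as a (name, value) pair without interpreting it,
so it makes no claim about which entries are required.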

In particular, could you explain:
What is the syntax of the numbering in the names, e.g.
"assign 3dc1003_1 3dc287_from..."?
Which entries in the restraint class are required? (I guess “-note” is not.)
All restraints in my case are between carbons. What do “proton” and
“heavyAtom” mean in my context?

The SA scripts used by MARVIN also require shift input. I thought that the
initialMatch step made all the possible assignments; isn't that true? Or is the
matching of peak positions and shifts repeated at later stages? In any case,
could you explain the syntax to me:

excerpt from “cvn_3dc_pass1.shiftAssignments”:
--------------------------------------------------------
shiftAssignment 3dc0_from(1_LEU_HA)
   -protonSelection (resid 1 and resn LEU and name HA)
   -heavyatomSelection (resid 1 and resn LEU and name CA)
   -from
   -protonShift 3.865200
   -heavyatomShift 54.047100
   -symmetricShiftAssignment 3dc355_to(1_LEU_HA)
   -note ref struct completeness 0.750000
end
---------------------------------------------------------- end of excerpt
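
The earlier reply quoted at the end of this message notes that every
shiftAssignment, peak and peak assignment needs a unique name. When writing
these files by hand, a quick check for repeated names could look like the
sketch below. It assumes only that each block starts with its keyword followed
by the name, as in the excerpts; duplicate_names is a hypothetical helper, and
the sample blocks are stripped down to their names.

```python
import re
from collections import Counter


def duplicate_names(text, keyword="shiftAssignment"):
    """Return block names that follow `keyword` at a line start more than once."""
    pattern = re.compile(r"^\s*" + re.escape(keyword) + r"\s+(\S+)", re.MULTILINE)
    counts = Counter(pattern.findall(text))
    return sorted(name for name, n in counts.items() if n > 1)


# Stripped-down blocks; the first name is deliberately repeated.
sample = """\
shiftAssignment 3dc0_from(1_LEU_HA)
   -symmetricShiftAssignment 3dc355_to(1_LEU_HA)
end
shiftAssignment 3dc355_to(1_LEU_HA)
end
shiftAssignment 3dc0_from(1_LEU_HA)
end
"""
duplicates = duplicate_names(sample)
print(duplicates)
```

Passing keyword="restraint" or keyword="assign" applies the same check to the
names in a “.noes” file.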

Again, I only have the carbons assigned. What would the proton and heavy atom
entries mean in that case?

Another thing I would like to do is to use the assignment likelihoods output by
IRA as input prior assignment likelihoods in PASD, i.e. use these estimates in
place of lambda_p(i,j) in eq. (6) of your JACS 2004 paper, and then include all
of the possible assignments in the restraint (i.e. keep the very high
degeneracy). This means that I would like to use a different annealing protocol
that starts with w0=1. One way to do that could be to start straight away with
the pass2 step. Is it possible to use user-provided values for the prior
assignment likelihoods lambda_p(i,j)? I would very much like to try it!

I also have a question regarding the calculation of the overall assignment
likelihood in PASD (eq. 5 in the JACS 2004 paper). If one used Bayes' theorem
to calculate the posterior probability that a given assignment in a restraint
is true, one would calculate the product of the prior probability (the prior
likelihood in your paper) and the likelihood (the instantaneous likelihood in
your paper). In PASD you use the sum of the two probabilities rather than the
product. Why did you prefer the sum? When, say, w0=0.5, the overall assignment
likelihood would be almost equal for assignments with priors of 0.0001, 0.001
and 0.1. I am just curious whether you had any considerations of that kind, for
example whether you use the sum so as not to get trapped in structures biased
by the priors.
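
To make the comparison concrete, here is a small numerical sketch. It assumes
that the overall likelihood is a weighted sum of the form
w0*prior + (1 - w0)*instantaneous, which is only an assumed reading of eq. 5
and is not copied from the paper, and it sets the instantaneous likelihood to
1.0 purely for illustration. The product is the unnormalized Bayesian
posterior.

```python
def weighted_sum(prior, instantaneous, w0):
    """Assumed form of the overall likelihood: a w0-weighted sum."""
    return w0 * prior + (1.0 - w0) * instantaneous


def bayes_product(prior, instantaneous):
    """Unnormalized posterior from Bayes' theorem: prior times likelihood."""
    return prior * instantaneous


priors = [0.0001, 0.001, 0.1]   # the three priors mentioned above
w0 = 0.5
instantaneous = 1.0             # illustrative value, same for all three

sums = [weighted_sum(p, instantaneous, w0) for p in priors]
products = [bayes_product(p, instantaneous) for p in priors]

for p, s, q in zip(priors, sums, products):
    print(f"prior={p:<7} sum={s:.5f} product={q:.5f}")
```

Under these assumptions the three sums differ by about 10% at most, whereas
the three products span a factor of 1000, which is the contrast the question
is about.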

Sorry, that was a lot of questions. I hope you can find time to answer them.
Nonetheless, I am sure automation would be very much appreciated by the solid
state NMR community. By the way, I think it is possible to develop IRA to also
assign resonances in combination with PASD...

If you need me to specify some more details, don't hesitate to contact me.

Quoting John Kuszewski <johnk at mail.nih.gov>:

> Hi,
> 
> On Sep 13, 2006, at 10:05 AM, jtn at chem.au.dk wrote:
> 
> > Dear Xplor-NIH developers,
> >
> > I plan to use the PASD/MARVIN facility in the Xplor-NIH structure
> > determination package to calculate the structure of a protein from solid
> > state data. PASD seems ideal for that task, since the experimental data
> > from solid state spectra are very sparse and noisy, presenting a
> > challenging problem. The main source of information from solid state data
> > is a 13C-13C dipolar correlation experiment containing cross peaks between
> > carbon pairs close in space (conceptually similar to NOESY). I plan to use
> > this to derive broad distance bounds.
> >
> 
> Marvin is set up to handle ordinary NOESY cases smoothly, but could probably
> be made to work with your data without too much difficulty.
> 
> > My question is: can I use this type of 2D CC data in PASD? What would be
> > the syntax of the "NOE-files"?
> > I cannot find any example of NOE data in the eginput/marvin/*
> > subdirectories of the Xplor-NIH distribution. Where should I look for it?
> 
> Marvin ordinarily begins with a chemical shift table and a NOESY peak
> location table. These can be in one of several formats.  In the
> eginput/marvin/* examples, we include PIPP formatted shift and peak tables
> (this is NIH, after all!).  The shift tables end in .shifts, and the peak
> location tables end in .PCK.
> 
> The first step is to run the initialMatch TCL scripts, which read the shift
> and peak tables for a given experiment, match up the peak locations to shift
> table entries within given chemical shift tolerances, estimate distance
> bounds, do some pre-filtering of the resulting peak assignments, and then
> write out two tables for the rest of Marvin to use--a shiftAssignments file,
> which largely reproduces the information in the input shift table, and a
> peaks file, which includes the location of each peak and any possible
> assignments that were found during the initialMatch script.
> 
> Depending on the precise nature of your data, you might just be able to use
> those scripts unchanged.  Alternatively, you could write your own
> .shiftAssignments and .peaks files and just use the rest of Marvin to
> calculate the structures.
> 
> Those .shiftAssignments and .peaks files are intended to be human-readable,
> and the syntax is fairly obvious. One issue that might not be obvious,
> though, is that every single shiftAssignment must have a unique name given
> to it.  Similarly, every peak and peak assignment must also have a unique
> name. If you repeat a name, you'll get an error when you try to read in the
> tables.
> 
> > Does a python interface to PASD exist?
> 
> No.  It's all scripted with TCL.  Fortunately, TCL is a trivial language to
> learn.  And I don't really use too many TCL-isms outside of the code for the
> pre-filtering.  The code that runs the structure calculations is reasonably
> straightforward.
> 
> >
> > One of the main challenges in solid state structure determination is the
> > presence of intermolecular contacts (i.e. CC interactions). This leads to
> > cross peaks that are not consistent with a monomer structure. Furthermore,
> > these contacts would tend to be systematic rather than random, which might
> > result in non-averaged "noise" peaks. How do you think PASD will perform?
> 
> Hard to say--it depends on how great a fraction of the peaks correspond to
> those intermolecular contacts, how many assignments we end up with for each
> peak (both intra- and inter-molecular), how well we can model the overall
> system, and so forth.
> 
> >
> > I also plan to perform calculations for a dimer model system (modeling
> > monomer interactions in the solid state). For this system PASD should be
> > able to identify which contacts are intra- and intermolecular, I hope.
> > Can I also use an NCS potential energy term (to enforce identity of the
> > monomers) from the tcl interface?
> > What about a distance difference potential to enforce NMR equivalence of
> > the monomers?
> 
> You can add any xplor-nih potentials you like during a Marvin calculation.
> I've done structures with NCS and distance symmetry.  It's just a matter of
> setting up the potentials at the beginning of the sa_pass*.tcl scripts, using
> the classic xplor language enclosed in a TCL XplorCommand " ... " call.  When
> you're ready to try, let me know and I'll post in more detail.
> 
> --JK
>