pccx | phylogeny based coverage calculation and extension

Overview

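pccx (phylogeny based coverage calculation and extension) is a simple application for target selection for structural genomics. Given a phylogeny and the external nodes that are already covered (for example, sequences with a known structure), it calculates a coverage score and can list the external nodes that would optimally extend that coverage.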

It is implemented in Java as part of the FORESTER package. Currently, three scoring methods are implemented: branch counting based (the default), 1/distance based (-d), and -ln(distance) based (-ld).

Download

» forester.jar version 4.0

FORESTER at sourceforge.net

Usage

java -cp path/to/forester.jar org.forester.tools.pccx [options] <phylogen(y|ies) infile> [external node name 1] [name 2] ... [name n]

Options:

-d: 1/distance based scoring method (instead of branch counting based)
-ld: -ln(distance) based scoring method (instead of branch counting based)
-x[=<n>]: optimally extend coverage by <n> external nodes. Omit <n>, or use 0 or a negative value, for complete coverage extension.
-o=<file>: write output to <file>
-i=<file>: read (new-line separated) external node names from <file>
-p=<file>: write output as annotated phylogeny to <file> (only first phylogeny in phylogenies infile is used)

Annotated phylogenies can be viewed with ATV (version 4.00 ALPHA 5 or greater). Branches are colored as follows: green - maximum coverage, red - minimum coverage, black - arithmetic mean of coverage scores. To ensure that the colored branches are displayable, please use an appropriate configuration file for ATV (»example configuration file).
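
When many external node names are involved, the -i option listed above avoids a very long command line. The following is a minimal sketch of how the names file and the matching command line might be prepared from Java. The class name, method names, and the output file name Ldh_2_scores.txt are hypothetical; only the options -i and -o and the class org.forester.tools.pccx are taken from the usage text above. Combining -i with -o in one call is assumed to be allowed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PccxCommandSketch {

    // Writes one external node name per line (the new-line separated
    // format that -i is described as reading).
    static void writeNames(Path file, List<String> names) throws IOException {
        Files.write(file, names);
    }

    // Builds the argument list; options come before the phylogeny infile,
    // following the order given in the usage line.
    static List<String> buildCommand(String jar, String phylogeny,
                                     Path namesFile, Path output) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(jar);
        cmd.add("org.forester.tools.pccx");
        cmd.add("-i=" + namesFile);
        cmd.add("-o=" + output);
        cmd.add(phylogeny);
        return cmd;
    }

    public static void main(String[] args) throws IOException {
        Path names = Paths.get("covered_names.txt"); // hypothetical file name
        writeNames(names, Arrays.asList("DLGD_ECOLI", "COMC_METJA", "MDH_PYRHO"));
        List<String> cmd = buildCommand("path/to/forester.jar", "Ldh_2.nhx",
                                        names, Paths.get("Ldh_2_scores.txt"));
        System.out.println(String.join(" ", cmd));
        // To actually launch pccx (requires forester.jar to be present):
        // new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

The sketch only prints the command; the ProcessBuilder call is left commented out so that nothing is executed unless forester.jar is actually available.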

Examples

For the examples, a phylogeny based on the Malate/L-lactate dehydrogenase alignment from Pfam 21.0 is used (»Ldh_2.nhx).

As of 2007-05-25, the following seven sequences from this family have a structure in PDB: 1s20, 1nxu (DLGD_ECOLI); 1rfm (COMC_METJA); 1v9n (MDH_PYRHO); 1vbi (Q746L8_THET2); 1wtj, 2cwf (Q4U331_PSESM); 1xrh (ALLD_ECOLI); and 1z2i (Q7CRW4_AGRT5).

To calculate a coverage score for a given phylogeny using a "sum of 1/branch-segment-sum" (default) scoring method:

% java -cp path/to/forester.jar org.forester.tools.pccx Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5 -p=Ldh_2_b7.nhx

Output:

Options: scoring method: sum of 1/branch-segment-sum

Normalized score: 0.1497663297543091
Raw score       : 33.84719052447385

Wrote annotated phylogeny to "Ldh_2_b7.nhx"

In this annotated phylogeny, branches are colored according to coverage: green - maximum coverage, red - minimum coverage, black - arithmetic mean of coverage scores.

To calculate a coverage score for a given phylogeny using a "sum of 1/branch-length-sum" scoring method:

% java -cp path/to/forester.jar org.forester.tools.pccx -d Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5

Output:

Options: scoring method: sum of 1/branch-length-sum [for self: 1/branch-length] [min branch length: 0.0010]

Normalized score: 0.12868805358848912
Raw score       : 7623.40971285036
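
The outputs above print a raw and a normalized score without stating how the two are related. As a small check on the printed numbers, the sketch below divides one by the other. For the default method the ratio comes out at 226 (to within rounding), while for -d it is far larger (on the order of 59,000), so the normalizing constant evidently depends on the scoring method. Reading that constant as the highest raw score attainable for this tree is an assumption made here, not something stated by pccx. The class and method names are illustrative.

```java
public class NormalizationCheck {

    // Implied normalizing constant: raw score divided by normalized score.
    static double ratio(double raw, double normalized) {
        return raw / normalized;
    }

    public static void main(String[] args) {
        // Values copied from the two example outputs above.
        double branchCounting = ratio(33.84719052447385, 0.1497663297543091);
        double inverseDistance = ratio(7623.40971285036, 0.12868805358848912);

        System.out.printf("default (branch counting): raw/normalized = %.4f%n",
                          branchCounting);
        System.out.printf("-d (1/distance):           raw/normalized = %.4f%n",
                          inverseDistance);
    }
}
```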

To optimally extend coverage by 10 more sequences:

% java -cp path/to/forester.jar org.forester.tools.pccx -x=10 Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5 -p=Ldh_2_b7_x10.nhx

Output:

Options: scoring method: sum of 1/branch-segment-sum

Printing 10 names to extend coverage in an optimal manner:

 before:
Normalized score: 0.1497663297543091
Raw score       : 33.84719052447385

0       Q3PGX6_PARDE    0.16718837131297096
1       Q6D702_ERWCT    0.18360557829584423
2       Q1V2K0_9RICK    0.1942380873796807
3       Q7PI68_ANOGA    0.2046462755533554
4       Q5QTW6_IDILO    0.21426464391066183
5       Q2T3J0_BURTA    0.22353129908439665
6       Q5WAN1_BACSK    0.23244837758112122
7       Q8UIX7_AGRT5    0.2404779463407785
8       Q323Z3_SHIBS    0.2481616097766543
9       Q8YB95_BRUME    0.25576625930608254

 after:
Normalized score: 0.25576625930608254
Raw score       : 57.803174603174654

Wrote annotated phylogeny to "Ldh_2_b7_x10.nhx"

In this annotated phylogeny, branches are colored according to coverage: green - maximum coverage, red - minimum coverage, black - arithmetic mean of coverage scores.
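
The stepwise output above (one name per step, each with a higher normalized score) is what a greedy selection would produce: at every step, add the external node that raises the score the most. The following toy sketch illustrates that idea only. It is not pccx's code, and whether pccx works this way is an assumption. The scoring rule used here (each external node contributes 1 if selected, otherwise 1 divided by the number of branch segments to its closest selected node) is likewise an assumed reading of "sum of 1/branch-segment-sum". The five-leaf tree ((A,B),(C,(D,E))) and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GreedyCoverageSketch {

    static final String[] NAMES = {"A", "B", "C", "D", "E"};

    // Branch segments between leaves of the toy tree ((A,B),(C,(D,E))).
    static final int[][] TOY_SEGMENTS = {
        {0, 2, 4, 5, 5},
        {2, 0, 4, 5, 5},
        {4, 4, 0, 3, 3},
        {5, 5, 3, 0, 2},
        {5, 5, 3, 2, 0}
    };

    // Assumed toy score: 1 for a selected node, else 1/segments to the
    // closest selected node, summed over all external nodes.
    static double rawScore(int[][] segments, Set<Integer> selected) {
        double sum = 0.0;
        for (int i = 0; i < segments.length; i++) {
            double best = 0.0;
            for (int j : selected) {
                double s = (i == j) ? 1.0 : 1.0 / segments[i][j];
                best = Math.max(best, s);
            }
            sum += best;
        }
        return sum;
    }

    // Greedily adds up to n nodes; n <= 0 means "extend until complete".
    static List<Integer> greedyExtend(int[][] segments, Set<Integer> covered, int n) {
        Set<Integer> selected = new LinkedHashSet<>(covered);
        List<Integer> added = new ArrayList<>();
        int total = segments.length;
        int steps = (n <= 0) ? total : n;
        for (int step = 0; step < steps && selected.size() < total; step++) {
            int bestNode = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < total; c++) {
                if (selected.contains(c)) {
                    continue;
                }
                selected.add(c);
                double s = rawScore(segments, selected);
                selected.remove(c);
                // Small tolerance so that ties go to the lower index.
                if (s > bestScore + 1e-12) {
                    bestScore = s;
                    bestNode = c;
                }
            }
            selected.add(bestNode);
            added.add(bestNode);
        }
        return added;
    }

    public static void main(String[] args) {
        Set<Integer> covered = Collections.singleton(0); // leaf A is covered
        for (int pick : greedyExtend(TOY_SEGMENTS, covered, 2)) {
            System.out.println(NAMES[pick]);
        }
    }
}
```

With leaf A covered, the sketch first picks D (far from A and adjacent to E) and then C. A greedy choice is not guaranteed to be globally optimal for an arbitrary scoring function, which is one reason this should be read as an illustration rather than a description of pccx.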

Comparison of scoring methods currently implemented in pccx

As in the examples above, a phylogeny based on the Malate/L-lactate dehydrogenase alignment from Pfam 21.0 is used.

The graph was produced with gnuplot.

Background

Brenner S.E. (2000). Target selection for structural genomics. Nature Structural Biology, 7, 967 - 969. [Nature Structural Biology]

Rodrigues A.P.C., Grant B.J., and Hubbard R.E. (2006). sgTarget: a target selection resource for structural genomics. Nucleic Acids Research, 34, W225-W230. [Nucleic Acids Research]

Contact

Christian M Zmasek
Burnham Institute for Medical Research | cmzmasek yahoo com

Copyright © 2007 Christian M Zmasek | Last updated 2007-05-02
