Adding custom residues/ligands to ENCoM

ENCoM uses the system of 8 atom types from Sobolev *et al.* The following table lists these types and shows which pairs of types can form legitimate contacts:

Legitimate (1) or illegitimate (0) contacts between the ENCoM atom types
LIGIN class	Hydrophilic	Acceptor	Donor	Hydrophobic	Aromatic	Neutral	Neutral-donor	Neutral-acceptor
Hydrophilic	1	1	1	0	1	1	1	1
Acceptor	1	0	1	0	1	1	1	0
Donor	1	1	0	0	1	1	0	1
Hydrophobic	0	0	0	1	1	1	1	1
Aromatic	1	1	1	1	1	1	1	1
Neutral	1	1	1	1	1	1	1	1
Neutral-donor	1	1	0	1	1	1	0	1
Neutral-acceptor	1	0	1	1	1	1	1	0
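The table above can be encoded as a small lookup structure. The following Python sketch is for illustration only: the names `CLASSES`, `CONTACTS` and `is_legitimate` are hypothetical and are not part of ENCoM.

```python
# Class names in the same order as the rows and columns of the table above.
CLASSES = [
    "Hydrophilic", "Acceptor", "Donor", "Hydrophobic",
    "Aromatic", "Neutral", "Neutral-donor", "Neutral-acceptor",
]

# Rows and columns follow CLASSES; 1 = legitimate contact, 0 = illegitimate.
CONTACTS = [
    [1, 1, 1, 0, 1, 1, 1, 1],  # Hydrophilic
    [1, 0, 1, 0, 1, 1, 1, 0],  # Acceptor
    [1, 1, 0, 0, 1, 1, 0, 1],  # Donor
    [0, 0, 0, 1, 1, 1, 1, 1],  # Hydrophobic
    [1, 1, 1, 1, 1, 1, 1, 1],  # Aromatic
    [1, 1, 1, 1, 1, 1, 1, 1],  # Neutral
    [1, 1, 0, 1, 1, 1, 0, 1],  # Neutral-donor
    [1, 0, 1, 1, 1, 1, 1, 0],  # Neutral-acceptor
]


def is_legitimate(class_a, class_b):
    """Return True if a contact between the two named classes is legitimate."""
    i = CLASSES.index(class_a)
    j = CLASSES.index(class_b)
    return CONTACTS[i][j] == 1
```

For example, `is_legitimate("Donor", "Acceptor")` is true while `is_legitimate("Hydrophobic", "Hydrophilic")` is false. The matrix is symmetric, so the order of the two arguments does not matter.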

Each residue that is to be considered by ENCoM needs to have every heavy atom assigned to one of these 8 types. This is accomplished by configuration text files with the .atomtypes extension. The default files already cover the standard amino acids and nucleotides, in addition to some common modified nucleotides. Here are the contents of the amino_acids.atomtypes file:

YCM | N:3, C:6, O:2, CA:7, CB:4, SG:4, CD:4, CE:6, OZ1:1, NZ2:1
ALA | N:3, CA:7, C:6, O:2, CB:4
ARG | N:3, CA:7, C:6, O:2, CB:4, CG:4, CD:7, NE:3, CZ:6, NH:3, NH1:3, NH2:3
ASN | N:3, CA:7, C:6, O:2, CB:4, CG:6, OD:2, OD1:2, ND:3, ND2:3
ASP | N:3, CA:7, C:6, O:2, CB:4, CG:6, OD:2, OD1:2, OD2:2
CYS | N:3, CA:7, C:6, O:2, CB:4, SG:6
GLN | N:3, CA:7, C:6, O:2, CB:4, CG:4, CD:6, OE:2, OE1:2, NE:3, NE2:3
GLU | N:3, CA:7, C:6, O:2, CB:4, CG:4, CD:6, OE:2, OE1:2, OE2:2
GLY | N:3, CA:6, C:6, O:2
HIS | N:3, CA:7, C:6, O:2, CB:4, CG:5, ND:1, ND1:1, ND2:1, CD:5, CD1:5, CD2:5, CE:5, CE1:5, CE2:5, NE:1, NE1:1, NE2:1
ILE | N:3, CA:7, C:6, O:2, CB:4, CG:4, CG1:4, CG2:4, CD:4, CD1:4
LEU | N:3, CA:7, C:6, O:2, CB:4, CG:4, CD:4, CD1:4, CD2:4
LYS | N:3, CA:7, C:6, O:2, CB:4, CG:4, CD:4, CE:7, NZ:3
MET | N:3, CA:7, C:6, O:2, CB:4, CG:4, SD:8, CE:4
PHE | N:3, CA:7, C:6, O:2, CB:4, CG:5, CD:5, CD1:5, CD2:5, CE:5, CE1:5, CE2:5, CZ:5
PRO | N:6, CA:4, C:6, O:2, CB:4, CG:4, CD:4
SER | N:3, CA:7, C:6, O:2, CB:6, OG:1
THR | N:3, CA:7, C:6, O:2, CB:6, OG:1, OG1:1, CG:4, CG1:4, CG2:4
TRP | N:3, CA:7, C:6, O:2, CB:4, CG:5, CD:5, CD1:5, CD2:5, NE:3, NE1:3, CE:5, CE1:5, CE2:5, CE3:5, CZ:5, CZ1:5, CZ2:5, CZ3:5, CH:5, CH2:5
TYR | N:3, CA:7, C:6, O:2, CB:4, CG:5, CD:5, CD1:5, CD2:5, CE:5, CE1:5, CE2:5, CZ:5, OH:1
VAL | N:3, CA:7, C:6, O:2, CB:4, CG:4, CG1:4, CG2:4
ADD_TO_ABOVE | OXT:1

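As you can see, the syntax is very simple. Each residue occupies one line. The residue name comes first, then the | separator, followed by a list of atom name:atom type pairs separated by commas. The atom type is an integer from 1 to 8; the numbers are consistent with the row order of the table above (1 for Hydrophilic through 8 for Neutral-acceptor).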

The special instruction ADD_TO_ABOVE can be used to add atoms that are common to all residues listed above that line. It is handy for residues such as nucleotides, which share many atoms across all species.
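To make the syntax concrete, here is a minimal Python sketch of a reader for this format. It is not ENCoM's own parser: the function name `parse_atomtypes` is hypothetical, and the choice that ADD_TO_ABOVE never overwrites an atom already listed for a residue is an assumption made for this example.

```python
def parse_atomtypes(text):
    """Parse .atomtypes-style text into {residue: {atom_name: atom_type}}."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Residue name comes before the | separator, the pairs after it.
        name, _, pairs = line.partition("|")
        name = name.strip()
        atoms = {}
        for pair in pairs.split(","):
            pair = pair.strip()
            if not pair:
                continue
            atom, _, atype = pair.rpartition(":")
            atype = int(atype)
            if not 1 <= atype <= 8:
                raise ValueError(f"atom type out of range in {name}: {pair}")
            atoms[atom.strip()] = atype
        if name == "ADD_TO_ABOVE":
            # Assumption: shared atoms are added to every residue seen so far,
            # without overwriting atoms the residue already defines.
            for residue_atoms in table.values():
                for atom, atype in atoms.items():
                    residue_atoms.setdefault(atom, atype)
        else:
            table[name] = atoms
    return table


# Three lines taken from the amino_acids.atomtypes listing above.
example = """ALA | N:3, CA:7, C:6, O:2, CB:4
GLY | N:3, CA:6, C:6, O:2
ADD_TO_ABOVE | OXT:1
"""
types = parse_atomtypes(example)
```

After parsing, both `types["ALA"]` and `types["GLY"]` contain the shared `OXT` atom, and `ADD_TO_ABOVE` itself does not appear as a residue.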

The second configuration file required by ENCoM is the mass definition file, another text file with the .masses extension. When only one mass per residue is needed, this file is very simple. Here are the contents of the amino_acids.masses file:

CONNECT: N -> C
N_MASSES: 1
CENTER: CA
NAME: CA

The CONNECT field tells ENCoM which atoms are connected when the residues form polymers. It allows the automatic inference of covalent connections between residues. In the case of ligands which do not form polymers, you can put any two atoms there. When 1 mass per residue is used, N_MASSES is always set to 1.

The CENTER field defines which atom will be selected as the position of the mass representing the whole residue. The NAME field is the name given to that mass and is usually the same as the CENTER field in the case of 1 mass per residue.
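As an illustration of the four fields, the Python sketch below builds the text of a one-mass definition file. The helper `single_mass_definition` is hypothetical and not part of ENCoM; it only assembles the fields described above.

```python
def single_mass_definition(connect_from, connect_to, center, name=None):
    """Return the text of a .masses file describing one mass per residue."""
    if name is None:
        # The mass name is usually the same as the center atom.
        name = center
    lines = [
        f"CONNECT: {connect_from} -> {connect_to}",
        "N_MASSES: 1",
        f"CENTER: {center}",
        f"NAME: {name}",
    ]
    return "\n".join(lines) + "\n"


# Reproduces the amino_acids.masses contents shown above.
amino_acids = single_mass_definition("N", "C", "CA")
```

For a ligand that does not form polymers, any two of its atoms could be passed as `connect_from` and `connect_to`, as noted above.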

When using multiple masses per residue, the mass definition file becomes a little more complex. Here are the contents of the ribonucleic_acids.masses file:

CONNECT_INIT: O3' -> P
REDEFINE_RESIS: (n-1) -> (n) | O3'
CONNECT: C3' -> O3'
N_MASSES: 3
MASS_1:
        NAME: P
        ATOMS: P, O5', OP1, OP2, OP3, O3'
        CENTER: P
MASS_2:
        NAME: S
        ATOMS: C5', C4', O4', C3', C2', O2', C1'
        CENTER: C1'
MASS_3:
        NAME: B
        ATOMS: C10, C11, C12, C13, C14, C15, C16, C19, C2, C21, C24, C4, C5, C5M, C6, C8, CM1, CM2, CM5, CM7, N1, N2, N20, N3, N4, N6, N7, N9, O17, O18, O2, O22, O23, O4, O6
        CENTER: C2

The CONNECT_INIT and REDEFINE_RESIS records are optional and must precede the CONNECT record. They are needed in cases where an atom from a residue needs to be placed in the residue following it before the residue is divided into masses. In the above example, the O3' atom is moved up in the chain of residues to allow for one mass to represent the whole phosphate group. Such movements usually change the connectivity between residues, so the CONNECT_INIT record defines the connectivity before the redefinition and the CONNECT record defines it after redefinition.

For each mass from 1 to N, where N is the value of N_MASSES, a MASS_i record is needed, followed by 3 indented fields: NAME, ATOMS and CENTER. Each mass needs a unique name, a list of the atoms that are part of that mass and the name of the center atom from which the mass will inherit coordinates.
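The rules for multiple masses can be checked before handing a file to ENCoM. The Python sketch below is a minimal reader and checker written for this purpose; `parse_masses` and `check_masses` are hypothetical names, not ENCoM functions, and the checks cover only the requirements stated above for files that use MASS_i records. The ATOMS list of MASS_3 is shortened here for readability.

```python
def parse_masses(text):
    """Split .masses-style text into (header fields, list of mass records)."""
    header, masses = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key.startswith("MASS_"):
            masses.append({})  # start a new MASS_i record
        elif masses and key in ("NAME", "ATOMS", "CENTER"):
            if key == "ATOMS":
                value = [atom.strip() for atom in value.split(",") if atom.strip()]
            masses[-1][key] = value
        else:
            header[key] = value
    return header, masses


def check_masses(header, masses):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if int(header["N_MASSES"]) != len(masses):
        problems.append("N_MASSES does not match the number of MASS_i records")
    for index, mass in enumerate(masses, start=1):
        for field in ("NAME", "ATOMS", "CENTER"):
            if field not in mass:
                problems.append(f"MASS_{index} is missing {field}")
    names = [mass.get("NAME") for mass in masses]
    if len(set(names)) != len(names):
        problems.append("mass names are not unique")
    return problems


example = """CONNECT_INIT: O3' -> P
REDEFINE_RESIS: (n-1) -> (n) | O3'
CONNECT: C3' -> O3'
N_MASSES: 3
MASS_1:
        NAME: P
        ATOMS: P, O5', OP1, OP2, OP3, O3'
        CENTER: P
MASS_2:
        NAME: S
        ATOMS: C5', C4', O4', C3', C2', O2', C1'
        CENTER: C1'
MASS_3:
        NAME: B
        ATOMS: C2, C4, C5, C6, N1, N3
        CENTER: C2
"""
header, masses = parse_masses(example)
problems = check_masses(header, masses)
```

With this input, `problems` is an empty list: three MASS_i records are found, each has its three fields, and the names P, S and B are unique.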

Note

See the ENM and ENCoM detailed documentation for details on how to pass these configuration files to the ENCoM object.