Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in PQR parser #1855

Open
VonBoss opened this issue Mar 29, 2024 · 12 comments
Open

Bug in PQR parser #1855

VonBoss opened this issue Mar 29, 2024 · 12 comments
Assignees

Comments

@VonBoss
Copy link

VonBoss commented Mar 29, 2024

Summary

The Prody PQR parser fails to parse PQR files that do not have white space between their data fields. I have encountered this issue when the XYZ coordinates have 3 digits in front of the decimal place and are negative.

Example: lines from a file that failed to parse

ATOM      1    C STP     1     -26.417  62.269-123.283    0.00     3.30
ATOM      2    O STP     1     -30.495  65.669-122.669    0.00     3.08
ATOM      3    O STP     1     -25.792  61.516-124.085    0.00     3.32
ATOM      4    O STP     1     -25.262  61.506-124.230    0.00     3.05
ATOM      5    O STP     1     -30.521  65.070-122.439    0.00     3.05

There is no whitespace between the Y and Z coordinates in this file, so Prody fails to parse it.

Fix

Currently, Prody is parsing PQR files differently than PDB files by splitting each line on whitespaces. (prody/proteins/pdbfile.py)

  if not isPDB:
      fields = line.split()
  ......
  coordinates[acount, 0] = fields[6]
  coordinates[acount, 1] = fields[7]
  coordinates[acount, 2] = fields[8]

To avoid parsing issues Prody should parse PQR files by position/index which conforms to the file specification for a PQR file and matches other PQR parsers such as Biopython.

x = float(line[30:38])
y = float(line[38:46])
z = float(line[46:54])

Caveat

It appears that different softwares write the charge and radius information in different positions. (This may be the reason why Prody chose to parse on whitespace from the beginning.)

12345678901234567890123456789012345678901234567890123456789012345678901234
                Resn  Resi           X       Y       Z       Q       R

From https://pipe.rcc.fsu.edu/transcomp/PQRformat.htm
ATOM      1  N   VAL     1      16.783  48.812  26.447  0.0577   1.550
ATOM      2  H1  VAL     1      15.848  48.422  26.463  0.2272   1.200
ATOM      3  H2  VAL     1      16.734  49.803  26.251  0.2272   1.200
ATOM      4  H3  VAL     1      17.195  48.663  27.359  0.2272   1.200
ATOM      5  CA  VAL     1      17.591  48.101  25.416 -0.0054   1.700

From fpocket 4.0
ATOM      1    C STP     1     -26.417  62.269-123.283    0.00     3.30
ATOM      2    O STP     1     -30.495  65.669-122.669    0.00     3.08
ATOM      3    O STP     1     -25.792  61.516-124.085    0.00     3.32
ATOM      4    O STP     1     -25.262  61.506-124.230    0.00     3.05
ATOM      5    O STP     1     -30.521  65.070-122.439    0.00     3.05

From PyMol 2.5
ATOM     29  P     G    11     -17.189  -6.642 -23.827  0.00000000   0.000
ATOM     30  C5'   G    11     -15.744  -5.986 -25.911  0.00000000   0.000
ATOM     31  O5'   G    11     -16.783  -5.642 -25.005  0.00000000   0.000
ATOM     32  C4'   G    11     -15.722  -5.012 -27.074  0.00000000   0.000
ATOM     33  O4'   G    11     -16.947  -5.133 -27.838  0.00000000   0.000
@jamesmkrieger
Copy link
Contributor

Thank you. So, we should probably just use ranges for coordinates where they might lose the space and keep the field splitting with spaces for the rest.

@jamesmkrieger
Copy link
Contributor

However, it seems like this could be a problem depending on the program too:
https://ics.uci.edu/~dock/manuals/apbs/html/user-guide/a2566.html#:~:text=The%20PQR%20format%20is%20the,with%20standard%20molecular%20graphics%20programs says "all fields are whitespace-delimited, thereby allowing coordinates which are larger/smaller than ± 999 Å"

@jamesmkrieger
Copy link
Contributor

I suppose we need some kind of check for how many fields we get by splitting with whitespace delimiting and what we get if we use the standard columns

@jamesmkrieger
Copy link
Contributor

ok, I think I have a fix. Please try checking out the pqr_fix branch from https://github.com/jamesmkrieger/ProDy.git and testing a few files. I have tried some example lines from the ones you provided and they work.

@jamesmkrieger
Copy link
Contributor

jamesmkrieger commented Apr 17, 2024

I haven't merged this branch as I don't have any unit tests yet, but I guess at some point I can make them from the example snippets you gave here.

@VonBoss
Copy link
Author

VonBoss commented Apr 17, 2024

This looks like an appropriate fix to me.

@jamesmkrieger
Copy link
Contributor

thank you

@VonBoss
Copy link
Author

VonBoss commented Apr 21, 2024

paresPQR() Test

I was able to test your fix this weekend. Your correction can't handle lines starting with HEADER, TER, CONECT, etc. because they don't have the same number of fields.

Input PQR file

HEADER
HEADER File generated using PyMol 2.5.
HETATM    1  N   LYS     0       1.422   1.796   0.198  0.00000000   0.000
HETATM    2  CA  LYS     0       1.394   0.355   0.484  0.00000000   0.000
HETATM    3  C   LYS     0       2.657  -0.284  -0.032  0.00000000   0.000
HETATM    4  O   LYS     0       3.316   0.275  -0.876  0.00000000   0.000
HETATM    5  CB  LYS     0       0.184  -0.278  -0.206  0.00000000   0.000
HETATM    6  CG  LYS     0      -1.102   0.282   0.407  0.00000000   0.000
HETATM    7  CD  LYS     0      -2.313  -0.351  -0.283  0.00000000   0.000
HETATM    8  CE  LYS     0      -3.598   0.208   0.329  0.00000000   0.000
HETATM    9  NZ  LYS     0      -4.761  -0.400  -0.332  0.00000000   0.000
HETATM   10  OXT LYS     0       3.050  -1.476   0.446  0.00000000   0.000
HETATM   11  H   LYS     0       1.489   1.891  -0.804  0.00000000   0.000
HETATM   12  H2  LYS     0       0.521   2.162   0.464  0.00000000   0.000
HETATM   13  HA  LYS     0       1.322   0.200   1.560  0.00000000   0.000
HETATM   14  HB2 LYS     0       0.210  -0.047  -1.270  0.00000000   0.000
HETATM   15  HB3 LYS     0       0.211  -1.359  -0.068  0.00000000   0.000
HETATM   16  HG2 LYS     0      -1.128   0.050   1.471  0.00000000   0.000
HETATM   17  HG3 LYS     0      -1.130   1.363   0.269  0.00000000   0.000
HETATM   18  HD2 LYS     0      -2.287  -0.120  -1.348  0.00000000   0.000
HETATM   19  HD3 LYS     0      -2.285  -1.432  -0.145  0.00000000   0.000
HETATM   20  HE2 LYS     0      -3.625  -0.023   1.394  0.00000000   0.000
HETATM   21  HE3 LYS     0      -3.626   1.289   0.192  0.00000000   0.000
HETATM   22  HZ1 LYS     0      -4.736  -0.185  -1.318  0.00000000   0.000
HETATM   23  HZ2 LYS     0      -4.735  -1.400  -0.205  0.00000000   0.000
HETATM   24  HZ3 LYS     0      -5.609  -0.031   0.071  0.00000000   0.000
HETATM   25  HXT LYS     0       3.861  -1.886   0.115  0.00000000   0.000
END

Error message

    [540]  if not isPDB:
    [541]       fields = line.split()
--> [542]     if fields[5].find('.') != -1:
    [543]           # coords too early as no child
    [544]           fields.insert(4, '')
    [545]       if len(fields) != 11:


IndexError: list index out of range

Fix

I moved your field splitting code so that it is only applied to ATOM and HETATM lines.

while i < stop:
    line = lines[i]
    startswith = line[0:6].strip()

    if startswith == 'ATOM' or startswith == 'HETATM':
        if isPDB:
            atomname = line[12:16].strip()
            if long_resname:
                resname = line[17:21].strip()
            else:
                resname = line[17:20].strip()
                
        else:
            fields = line.split()
            if fields[5].find('.') != -1:
            # coords too early as no chid
                fields.insert(4, '')
            if len(fields) != 11:
                try:
                    fields = fields[:6] + [line[30:38].strip(), line[38:46].strip(), line[46:54].strip()] + line[54:].split()
                except:
                    LOGGER.warn('wrong number of fields for PQR format at line %d'%i)
                    i += 1
                    continue
            
            atomname= fields[2]
            resname = fields[3]

New fix applied to an example with no XYZ whitespace

parsePQR() input file

HEADER
HEADER This is a pqr format file writen by the program fpocket.                 
HEADER It contains all the pocket vertices found by fpocket.                    
ATOM      1    C STP     1     -26.417  62.269-123.283    0.00     3.30
ATOM      2    O STP     1     -30.495  65.669-122.669    0.00     3.08
ATOM      3    O STP     1     -25.792  61.516-124.085    0.00     3.32
ATOM      4    O STP     1     -25.262  61.506-124.230    0.00     3.05
ATOM      5    O STP     1     -30.521  65.070-122.439    0.00     3.05
TER
END

parsePQR() --> writePQR() output file

ATOM       1  C    STP       1      -26.417   62.269 -123.283  0.0000  3.3000
ATOM       2  O    STP       1      -30.495   65.669 -122.669  0.0000  3.0800
ATOM       3  O    STP       1      -25.792   61.516 -124.085  0.0000  3.3200
ATOM       4  O    STP       1      -25.262   61.506 -124.230  0.0000  3.0500
ATOM       5  O    STP       1      -30.521   65.070 -122.439  0.0000  3.0500

Other bugs in pdbfile.py

I used the writePQR() function to do some of my testing, but this function only works if an alpha carbon is present in the atom group. I added a try block in the writePQRstream() function to quickly get around this issue. Requiring an alpha carbon to write a PQR file seems like a bug to me.

Quick fix writePQRstream():

    calphas = atoms.ca
    try:
        ssa = calphas.getSecstrs()
    except:
        ssa = None
    helix = []
    sheet = []

@jamesmkrieger
Copy link
Contributor

Thank you very much! I’ll look over these in a month or so when I’m less busy

@jamesmkrieger
Copy link
Contributor

(Closed it by accident)

@jamesmkrieger
Copy link
Contributor

It looks like the last one should have the calphas line included in the try block

@atbogetti atbogetti self-assigned this May 8, 2024
@jamesmkrieger
Copy link
Contributor

ok, I've just added these two fixes to my branch and pull request too

@atbogetti, it would still be great if you could check this more thoroughly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants