Background & Summary

The increasing spread of antibiotic resistance in clinics is causing a global health crisis. Mobile genetic element encoding for resistance genes can be transferred among bacterial populations, leading to the need to make more efficient the discovery of new antibiotics and molecules able to improve their efficacy1,2. Gram-negative bacteria, such as Escherichia coli, Pseudomonas aeruginosa and Acinetobacter baumannii, are particularly challenging due to the presence of an outer membrane which reduces the permeability of antimicrobials and therefore their efficacy3,4,5. A major obstacle is represented by efflux pumps that act in synergy with the outer membrane ejecting a plethora of compounds with various chemical-physical properties, among which are different classes of antibiotics6,7,8. In addition, inactivating enzymes such as β-lactamases9 contribute to exacerbate the problem. To date, both academia and industry struggle to identify new antibiotic classes and optimize available compounds10,11,12,13,14,15. Although several strategies have been adopted (e.g., drug repurposing16 or systematic exploitation of natural compounds17), holistic approaches able to take into account multiple factors contributing to antimicrobial resistance are lacking18. Previous works focused primarily on the role of chemical-physical properties of antimicrobial molecules in their accumulation profile, searching for general “rules”19,20. For example, O’Shea and Moser19 found that antibiotics effective towards Gram-negative bacteria are generally characterized by high molecular weight (MW, around 600 Da) and high polarity (as expressed by cLogD7.4 below 0). More recently, Richter et al.21 identified the presence of a primary amine, flexible bond number (5 or less) and globularity (describing molecular shape), as key features to predict the accumulation of antibiotics in Gram-negative bacteria22. Predictive rules of efflux inhibition and avoidance were also identified23. This latter study combined standard molecular descriptors to properties derived from structure-based analyses (e.g., interaction descriptors extracted from molecular docking), allowing for a more complete and multi-factorial view. Definition of general rules able to predict both the permeability and the activity of antimicrobial compounds can greatly benefit from the application of machine learning approaches able to speed up the drug discovery process24,25. The application of these methods requires collection of data for the learning phase26,27, that highlights the need for curated molecular databases providing ready-to-use features.

In this regards, over the years, several molecular databases containing standard descriptors (e.g., PubChem28, DrugBank29, ChEMBL30, ZINC1531) or quantum-mechanical (QM) properties32,33,34,35,36 have been reported. These data have been extensively used in quantitative-structure-activity-relationships (QSAR) studies37,38, and recent works exploited additional information coming from molecular dynamics (MD) simulations39,40,41,42,43, that represent an effective tool to address key structural and kinetic features of biological systems44,45,46,47. However, although freely available servers for the automatic generation of force-field (FF) parameters are available48,49,50,51,52, small molecule parameterization remains often a non-trivial task53.

Following a previous work54, we present a homogeneous database of accurate all-atom FF parameters of more than 300 antimicrobial compounds, together with μs-long MD trajectories and QM-related data (e.g., ground-state optimized geometries). We additionally provide molecular descriptors of different nature: i) classical parameters usually considered in QSAR studies (e.g., MW, atom/ring counts, LogP, …); ii) MD-derived properties (e.g., root-mean-square fluctuations, statistics of intra- and inter-molecular H-bonds, hydration-shells structure and dynamics, …); iii) QM-based parameters (e.g., energies of frontier molecular orbitals, electronic gap, electric dipole moment, …). The computational protocol adopted is schematically depicted in Fig. 1. The molecules considered, ranging in size from cycloserine (13 atoms, MW = 102.09 Da) to rifalazil (132 atoms, MW = 941.09 Da), cover 24 classes of antimicrobial compounds with different mechanisms of action, plus miscellaneous compounds (e.g., fluorescent dyes such as rhodamine 6G and HT33342). In particular, about 30% of compounds in the whole dataset are β-lactams, and 10% are inhibitors of key bacterial proteins. High MW compounds (>~1000 Da) such as polymyxin, glycol- and lipo-peptides were omitted from the selection, due to the high computational costs/convergence issues associated to the QM calculation. A schematic depiction of representative compounds showing the overall chemical variability of the sample is given in Fig. 2. The complete set of antimicrobial families and compounds is reported in Table 1.

Fig. 1
figure 1

Schematic view of the computational protocol adopted to generate AB-DB. The different steps reporting some representative molecular descriptors are highlighted: molecular characterization, QM calculations, FF parameters generation, MD simulations. For further technical details see Methods section.

Fig. 2
figure 2

Schematic representation of four representative antimicrobials belonging to different classes, namely (a) penicillins, (b) (fluoro)quinolones, (c) aminoglycosides, and (d) aminocoumarins. Graphics rendered with DiscoveryStudio101.

Table 1 List of families and antimicrobial compounds included in AB-DB.

To the best of our knowledge AB-DB is unique in supplying homogeneously-derived properties of antimicrobial compounds. The accurate FF parameters can be reused for further MD simulations of compounds either alone or interacting with their macromolecular target(s). The MD trajectories can be exploited for ligand- or structure-based studies, in particular for molecular docking. The successful application of this technique requires the knowledge of the bio-active conformation of ligands55,56,57,58, that is not always found by classical searching algorithms59,60. Our curated, homogeneous and not straightforward properties can feed machine learning models towards the discovery of new antimicrobials61,62. Input/output files are also supplied to ensure data reproducibility. In the near future we plan to update AB-DB including more compounds, covering additional antimicrobial classes.

Methods

For each antimicrobial compound we obtained the 3D structure data file (.sdf format) from the PubChem database, except for 13 compounds for which the 3D conformation is not available. In those cases, marked in italic in Table 1, the starting structure was taken from the ChEMBL database, or from available X-ray structures. We then used the ChemAxon’s Marvin suite of programs63 to calculate the dominant protonation states at physiological pH. For known uncertain cases (e.g., tetracyclines), for which several micro-species with similar population were predicted, the choice has been driven by available experimental data on pKa values. The comparison between the experimental and calculated pKa values of the ionizable groups of a representative set of challenging molecules is reported in Table 2. The large relative errors (even exceeding 100%) were expected for these classes of antimicrobials since pKa determination in Marvin ChemAxon is based on molecular charge distribution, and these molecules are characterized by a complex electronic structure (e.g., with several possible resonance states). For each molecule we then proceeded with QM calculations and FF generation followed by all-atom MD simulations. Properties generated in each step will be referred to as QSAR, MD, and QM descriptors, respectively. The three types of analysis performed in this study are graphically exemplified in Fig. 3 for a test-case molecule and described in details below.

Table 2 Average experimental and predicted (boldface) pKa values. Relative percentage error of computed pKa value is reported in parentheses.
Fig. 3
figure 3

Graphical exemplification of the three types of analysis performed to generate AB-DB. Puromycin molecule has been selected as a test-case. QSAR: number of rotatable bonds (left), QM and FF: atomic partial charges (center), from negative (red) to positive (blue) values, MD: conformations extracted from MD trajectories (right). Graphics rendered with PyMOL102.

Quantum-mechanical calculations and force-field generation

The 3D structures of the molecules downloaded from the PubChem or ChEMBL databases (see above) underwent quantum-chemical calculations at the Density Functional Theory (DFT) level64, using the Gaussian16 package65. We employed the hybrid B3LYP functional66, in conjunction with the split-valence 6–31G** Gaussian basis-set67. The combination B3LYP/6–31G** represents a good compromise between accuracy and computational cost and is widely used for small molecules68,69,70. In all cases we disabled molecular symmetry (Symmetry = None), adopted restrictive convergence criteria for self-consistent-field iterations (10−8 Ha, SCF(Conver = 8)), and used a pruned (99,590) grid (Int = UltraFine) for numerical integration. For the few cases for which convergence criteria were not reached, geometry optimization was first performed with a smaller basis-set (6–31G) and converged geometry and molecular orbitals were used as a starting point for the subsequent B3LYP/6–31G** step. For each compound we optimized the ground-state structure employing the Polarizable Continuum Model71 to mimic the effect of water solvent (SCRF = (PCM,Solvent = Water)), particularly to avoid formation of strong intra-molecular H-bonds. We then performed full vibrational analyses obtaining real frequencies in all cases, thus confirming the geometries obtained to be global minima. We processed the output of Gaussian16 with GaussSum72 to extract molecular orbital data. On the optimized geometry we then performed B3LYP/6–31G** single-point energy calculations in vacuum to generate the atomic partial charges fitting the molecular electrostatic potential. We used the Merz-Kollman scheme73 to construct a grid of points around the molecule under the constraint of reproducing the overall electric dipole moment of the molecule (Pop = (ESP,Dipole,Regular)). The two-step restrained electrostatic potential (RESP) method74 implemented in the Antechamber package75 was used to generate atomic partial charges at the DFT level, instead of the automatic AM1-BCC charges76. This step enabled the generation of the FF files using the General Amber Force Field 2 (GAFF2)77. In a single case, namely the siderophore enterobactin loaded with Fe(III)78, FF files were obtained using the metal center parameter builder module79 of the Amber18 package80, slightly modified accordingly to the QM settings described above.

Molecular dynamics simulations

All-atom MD simulations were performed in explicit water solution using Amber18. Systems were solvated within a box of TIP3P water model81 and K+/Cl counter ions82, to reach an ionic concentration of 0.1 M, using the program tleap of Amber1880. GAFF2 parameters obtained as described above were adopted for antimicrobial compounds. All systems underwent an energy minimization, a heating followed by a cooling phase, and a short productive dynamics to relax the simulation box. Finally the production 1 μs-long MD simulation was performed, under the NPT ensemble (1 Atm and 310 K) using the isotropic Berendsen barostat83 and the Langevin thermostat84. Further details on MD settings can be found in ref. 54.

Descriptors generation

From the output of QM and MD simulations we extracted all molecular descriptors (~80 in total for each compound, see list in Table S1). Most QSAR descriptors were computed on the QM optimzed geometries using the calculator plugin of the Marvin ChemAxon program63. Given the importance of octanol/water partition coefficient in drug design85, we provide an additional estimate of this parameter by means of the XLOGP3 program86. Furthermore, for each compound we derived the molecular properties associated with the “entry rules”, a series of guidelines that have been recently proposed to increase small-molecule accumulation in Gram-negative bacteria21. QM-based properties were obtained from the Gaussian16 output files of the implicit-solvent geometry optimization. Isotropic and anisotropic polarizabilities were derived from the polarizability tensor according to ref. 87. We additionally provided the molecular dipole moment in vacuum consistent with the atomic partial charges of the FF files, computed as described above. From the all-atom MD simulations we obtained structural and dynamical features by means of the CPPTRAJ program88. First and second water shells were extracted using a lower (upper) cutoff of 3.4 (5.0) Å. For the analysis of intra- and inter-molecular H-bonds we adopted angle and distance cutoffs of 135° (donor-hydrogen-acceptor angle) and 3.5 Å (donor-acceptor), respectively89. The number and population of structural clusters were determined using a hierarchical agglomerative algorithm90 and the molecule root-mean-squaredeviation (RMSD) value as a metric. To evaluate atomic root-mean-square fluctuations (RMSF) we used the utility g_rmsf of the GROMACS package91. During the MD runs we also monitored three morphology descriptors related to the gyration tensor, i.e., asphericity, acylindricity, and kappa2, as implemented in the PLUMED plugin92. Asphericity and acylindricity give a measure of the deviation of the mass distribution from spherical and cylindrical symmetry, respectively; the relative shape anisotropy kappa2 is limited between 0 and 1 and reflects both symmetry and dimensionality93. The minimal projection area (MPA) is the minimum of the circular areas projected perpendicularly to the principal axes of inertia of the molecule, calculated based on the atomic van der Waals radii (Å). The dynamical evolution of the MPA have been determined with the combined use of Open Babel94 and ChemAxon’s calculator plugin63.

Data Records

AB-DB is available on figshare95. The computed molecular descriptors are given in the comma separated file all-descriptors.csv. A compressed TAR archive for each family is provided (e.g., carbapenems.tgz). In turn, every archive contains sub-folders named after the compound and the net charge considered in the calculations (e.g., carbapenems/ertapenem_-1/). For 34 molecules the two protonation states most populated at pH = 7.4 were considered (see Methods section). In these cases two folders per compound are reported, with different values of the net charge (e.g., quinolones/ciprofloxacin_0/ where the compound is considered as zwitterionic, and quinolones/ciprofloxacin_-1/ where the nitrogen atom of the piperazine ring is considered in its neutral form). Each compound folder contains a 2D sketch of the molecule (2d.png), and a total of 20 files distributed into three sub-directories reporting QM (QM/), FF (FF/), and MD (MD/) data. Figure 4 shows a schematic representation of the database structure describing the path of all files provided.

Fig. 4
figure 4

Schematic representation of the AB-DB structure reporting filenames of each sub-directory.

Quantum-mechanical data

QM/ folders contain files derived from QM calculations (see Quantum-mechanical calculations and force-field generation section). In details, the opt-freq.com and opt-freq.log are the input and output files of the Gaussian16 geometry optimization and frequency analysis in implicit solvent. The minimization steps are collected in the optimization.xyz file and the final optimized structure is given in structure data file format as optimized.sdf. This file, generated with Open Babel94 from the corresponding .xyz file and carefully checked manually, is also provided for reproducibility purposes since it has been used to compute QSAR descriptors. We also collected the electronic structure and the harmonic vibrational frequencies into electronic.dat and vibrational.dat files, respectively. The elec-pot.com and elec-pot.log are respectively the input and output files of the Gaussian16 single-point energy calculation in vacuum, performed to derive atomic partial charges. The resulting electrostatic potential file is elec-pot.dat.

Force-field data

For each compound we supply in the corresponding FF/ folder the mol.mol2 and Amber mol.prep files, containing the optimized structure of the molecule with RESP partial charges. The Amber force-field modification file mol.frcmod with all parameters not included in the GAFF2 is also provided. For reproducibility purposes we make available the Amber parameter/topology mol_solv.parm7, and coordinate/restart mol_solv.rst7 files used to perform the MD simulation in explicit solvent. The corresponding mol_solv.pdb file generated using the ambpdb program80 is also provided.

Molecular dynamics data

MD/ folders store the μs-long MD trajectories performed in explicit water solution (see Molecular dynamics simulations section) in the file trajectory.pdb (100000 frames). The representatives of the ten most populated clusters extracted from the trajectory are given in clusters.pdb, and their corresponding fraction in clusters.dat. The statistics of intra- and inter-molecular H-bonds are collected in hbonds-intra.dat and hbonds-inter.dat, respectively.

Technical Validation

AB-DB is built making use of the different computational steps detailed above: molecular characterization, QM calculations, FF generation, MD simulations, and extraction of physico-chemical descriptors. Concerning the starting configurations used for the subsequent steps, we carefully checked the protonation state of all compounds at physiological pH, paying particular attention to uncertain cases with two major populated species (see Table 2 and Methods section for details). In details, for these ambiguous cases, we searched the literature for experimental values of pKa, that were consistently used as reference throughout the class. As for the QM calculations, the DFT level of theory adopted is routinely used and has proven to be reliable for small organic molecules, providing FF parameters compatible with available FFs for macromolecules50. B3LYP/6–31G** calculations represent a good compromise between accuracy and computational cost96. Therefore, no further validation is here provided for QM calculations. In the following we present a thorough justification of the reliability of our data through the comparison with experiments.

Descriptors validation

Calculation of classical parameters reporting the topological properties of molecules, such as the number of atoms or the count of aliphatic bonds, is quite straightforward. Most popular databases (e.g. PubChem, DrugBank) indeed employ ChemAxon’s tools to automatically compute these properties. In AB-DB, we likewise used the same programs to obtain the QSAR descriptors. However, for LogP, which is known to be a key feature for antimicrobial penetration kinetics61,97,98, we exploited another widely used method (XLOGP3, see Methods section). Note that accurate prediction of LogP is a well-known challenge in computational chemistry, and is also common to find severe disagreement through experimental results obtained for the same compound99,100. In order to assess the quality of our predictions, we collected available experimental LogP values for a subset of molecules. Table 3 compares the experimental data, falling in the range [−1.69, 5.15], with the computed ones, highlighting the differences between the two methods. In most cases the two predicted values are similar and agree with the experimental LogP. However, as expected, ambiguous situations were also found. Methicillin, for instance, was well predicted by XLOGP3 (computed 1.96 vs. experimental ~1.90) while cxcalc yielded a poor estimation (computed 0.79). On the contrary, the latter program agrees with experiments for lomefloxacin (computed −0.43 vs. experimental −0.47), whereas the former failed (computed 0.27).

Table 3 Experimental logP (LogP exp) and LogP values computed by XLOGP3 and cxcalc.

Validation of force-field parameters and molecular dynamics trajectories

To assess the reliability of the FF generated for all compounds we computed the RMSD between the QM B3LYP/6–31G** optimized geometry and the molecular mechanics minimum-energy structure obtained with the GAFF2 parameters of the database. Table 4 shows the good agreement between the two sets of structures, differing on average by less than 1 Å, with an overall mean value of 0.5 ± 0.1 Å. The registered low RMSDs prove the accuracy of the FF parameters presented in AB-DB and used for the MD simulations.

Table 4 Average RMSDs and standard deviations (Å) between the DFT and molecular mechanics optimized geometries for the classes of antimicrobial compounds included in AB-DB.

To give a measure of the quality of MD simulations, we compared the representative conformations extracted from MD trajectories (cluster representatives) of selected compounds with their 3D experimental structure available on the Protein Data Bank, in complex with biological targets. When multiple experimental structures were available for the same compound, we considered the one with the highest resolution. The total number of 85 experimental structures collected are listed in Table S2, reporting the corresponding PDB code and the weighted average RMSDs (<RMDS>w), obtained using cluster populations as weights. The average values associated to selected families are also given. The mean value of <RMDS>w considering all families is 1.8 ± 0.8 Å, with the highest and smallest value reached by aminocoumarins (3.6 ± 0.4 Å) and tetracyclines (0.9 ± 0.2 Å), respectively. As expected, bigger and more flexible molecules give rise to higher <RMDS>w, whereas smaller and more rigid compounds show lower values. Overall, the performed MD simulations based on GAFF2 parameters appear to be able to sample molecular conformations found in available experimental structures.

As emphasized in previous works54,60, MD simulations enable to go beyond a static picture of molecules, providing ranges of properties accounting for their dynamical nature and their impact on biological activity. Prominent examples are represented by MD simulations performed to differentiate the most active inhibitors of ERK2 kinase39 and Ptch1 multidrug efflux transporter58.