MolDB4

a free software package for creating a web-based, searchable molecular structure database

Norbert Haider, University of Vienna, 2007 
norbert.haider@univie.ac.at

ATTENTION: there is already a newer version of this software available! Please visit the MolDB5 homepage.

These instructions explain how to set up a web-based, fully searchable molecular structure database by importing data from an MDL SD file. This is shown by an example using the SD files which are freely available from the PubChem FTP site at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

For additional information, please visit
http://merian.pch.univie.ac.at/~nhaider/cheminf/moldb.html

MolDB4 is a collection of fully functional PHP scripts for running a structure database with search options for text, functional groups, and structure/substructure. Included are also Perl scripts for data import from an SD file.

Download: http://merian.pch.univie.ac.at/pch/download/chemistry/moldb/moldb4.tar.gz


Technical background

The system uses MySQL as the database engine, whereas its "chemical intelligence" is provided by the checkmol/matchmol program (or its socket-based server version, cmmmsrv). All relevant data are stored in MySQL tables, except the bitmap graphics of the molecules. Structure/substructure searches are performed in a two-stage process, consisting of an efficient pre-selection and a full atom-by-atom comparison of the remaining candidates. For pre-selection, a combination of molecular statistics with dictionary-based fingerprints and hash-based fingerprints is used. Optionally, this operation can be performed with memory-based MySQL tables instead of disk-based tables. This will make the search even faster. In a test installation on a standard PC (AMD Athlon 1.6 GHz CPU, 1.5 GB of memory), substructure searches in a database of 500.000 compounds are typically complete within a few seconds.

Demo database


Installation

Hardware requirements: 

any x86 PC (the faster, the better)

Software requirements:

===> these packages are standard components of every distribution

moreover:
Note: this package was developed and tested on a Linux machine, but with a few adaptations it should run also on a Windows PC (except for operating in socket mode in combination with cmmmsrv). There is also at least one successful implementation on a FreeBSD system and another port to MacOS X.


Where to start?

First, create the MySQL tables from your SD file, for example Compounds_00000001_00025000.sdf (after unpacking the .gz file which you downloaded from the PubChem FTP site).

If you have wget installed, you can retrieve the PubChem file very conveniently by entering e.g.

wget ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_00000001_00025000.sdf.gz

Prerequisite: the checkmol/matchmol program should be installed in /usr/local/bin (for more information, see
http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm.html). Alternatively, cmmmsrv should be up and running (by default, this daemon is listening on port 55624).

The following steps should be executed exactly in this order:

  1. Edit the configuration file "moldb.conf"; the only item which must be known at this time is the user-ID of the MySQL administrator (usually this is "root"), and this account should have a password (attention: this is not the same as the Linux administrator, "root"). Please check your MySQL documentation.
    Select a name for your database, UIDs for a proxy user with write privileges and another proxy user with read-only privileges (the example file "moldb.conf" contains already these entries which can be accepted or modified; please change at least the passwords!).
    If you want to use bitmap graphics of your molecules (recommended), please specify a directory where these bitmaps should be stored and make sure you have write permission there. This can be the final location of the .png files (somewhere within your web server's document root) or a temporary directory from where you can copy the files to their final location later on. If you don't want bitmaps, just set $bitmapdir to an empty string ("").

  2. From the Linux command line, run the shell script "initdb.sh" by typing "./initdb.sh". This will create the database and the accounts for the proxy users with appropriate privileges and passwords as specified in "moldb.conf". You will be prompted for the MySQL password of the MySQL administrator (see above). This password will never be stored in any file, so you have to enter it when you are prompted for it. The Perl script "initfpdef.pl", located in the "util" directory, will be called automatically in order to create the MySQL table holding the fragment dictionary for the binary fingerprints (these molecular structures are supplied in the file "fp01.sdf", please feel free to modify it, if necessary).

  3. Run the Perl script "sdfcheck.pl" with the name of the SD file to be imported as the command-line argument, e.g. "perl sdfcheck.pl Compound_00000001_00025000.sdf". This will create a definition file "sdf2moldb.def". By default, only the first 10000 records are inspected which should be sufficient in order to collect the necessary information about the file structure. 

  4. Load the file "sdf2moldb.def" (see above) into a text editor and inspect it. Here, you can change the types of   MySQL variables or their dimensions. In addition, you MUST rename the most descriptive field name (preferentially holding the molecule's trivial name or its IUPAC name) into "mol_name". If you do not do this, you will have to manually adjust all the PHP scripts in order to retrieve the correct field. In our PubChem example, change the MySQL field name (i.e., the 2nd column, between the first and the second ":") "pubchem_iupac_name" into "mol_name". Many numeric fields are assumed to be floating-point numbers (MySQL type "DOUBLE"), but in fact are integer numbers, so their type should be changed into (e.g.) "INT(11)". In our PubChem example, only two fields should be kept as "DOUBLE": "pubchem_openeye_mw" and "pubchem_cactvs_xlogp". If you do not need one or more of the recognized SDF fields in your MySQL table, simply delete the corresponding lines from the definition file.

  5. Import the content of your SD file into your MySQL database by running the Perl script "sdf2moldb.pl" with the name of the SD file to be imported as the command-line argument, e.g. with "perl sdf2moldb.pl Compounds_00000001_00025000.sdf". Depending on the number of molecules in the SD file, this operation may take some time (on my machine, approx. 1-2 h for 10000 records). It is recommended to do this operation in a background job, most conveniently in a "screen" session (screen is a utility program which is included in most Linux distributions).
    As an alternative to sdf2moldb.pl, the script sdf2moldb-cmmm.pl can be used if a cmmmsrv daemon is running on your machine. This will work somewhat faster than sdf2moldb.pl (see above).

  6. If you want to add new records, repeat step 5 with another SD file. All data will be appended. ATTENTION: all SD files must have the same format (names and types of data fields) as the first one. For example, you can sequentially import all of the SD files from the PubChem site (see above). (And don't forget to let me know your performance figures for a database with >15 million structures....)  ;-)



The directory "util" contains some Perl scripts:

The latter four scripts should be run whenever a new version of checkmol/matchmol (or cmmmsrv, respectively) is installed on your system.

Furthermore, there are some diagnostic scripts available (bfpstat.pl, hfpstat.pl, ssrtest.pl) as well as a startup script
(cmmm-server) for starting/stopping the cmmmsrv daemon.

The script cp2mem.pl will copy the persistent molstat, molhfp, and molbfp MySQL tables to heap-based MySQL tables, which are accessed faster than the disk-based tables. In order to make use of these tables, the moldbconf.php file has to be modified (as an example, the file moldbconf-memory.php is included in the wwwfiles directory). Do not forget to run cp2mem.pl after each reboot of your machine, otherwise the memory-based tables would be empty.


Next, the web server part can be installed:

  1. Determine your web server's document root directory. This might be something like "/var/www/html" or "/data/htdocs", etc. (in a typical Apache installation, it is specified in the configuration file "httpd.conf").
     
  2. Within the document root, create a new directory which will hold the MolDB4 PHP scripts, e.g. "moldb4".
     
  3. Copy the "index.php" file, the "moldbtxt.php" file, the "moldbss.php" file, and the "moldbfg.php" file into this directory, as well as any further PHP scripts (e.g., "showmol.php", "details.php") and the "moldb.css" example style sheet. All of these files can be found in the directory "wwwfiles" of this package.
    If you are using bitmap graphics, create the appropriate directory, e.g. as a subdirectory "bitmaps" and move  all the *.png files there (if you did not specify this directory already as the output directory before running the "sdf2moldb.pl" script). The location (URL) of this bitmap directory must be also specified in the configuration file, "moldbconf.php" (see below). Please note that for large databases, it may be better to store groups of bitmap files in subdirectories (using the $subdirdigits variable in moldb.conf and moldbconf.php). 
     
  4. The configuration file "moldbconf.php" should contain the same values for database name, usernames and passwords as the setup configuration file "moldb.conf". This file should be placed outside the web server's document root, somewhere in the PHP include path (check your php.ini file and/or the PHP docs). Keeping the "moldbconf.php" file within the document root is a security risk, as it contains plain-text passwords. Make sure the web server can read this file and revoke read permission for other users. If you are using bitmap graphics, make sure that the variable $bitmapURLdir points to the correct location, e.g. "/moldb4/bitmaps" (no trailing slash!).
     
  5. Make sure you have the JME Java applet installed. The file "JME.jar" should be placed in a directory "classes" directly beneath the document root.
     
  6. Structure/substructure search is done with moldbss.php, for which two different versions are included. The slower one is moldbss-sh.php, it uses the checkmol/matchmol command-line program via shell calls. If you are running cmmmsrv, it is recommended to copy or rename moldbss-sock.php to moldbss.php, as this version is significantly faster. 


Note: it is assumed that PHP is not running in "safe mode". If safe_mode is set to On in the php.ini file, you have to make sure that  checkmol/matchmol is allowed to be executed, this can be done by specifying the corresponding directory in the variable safe_mode_exec_dir in php.ini (please check your PHP documentation).


That's it. You may wish to adjust the HTML layout of the index.php and moldb.css files and the PHP scripts, replacing "MolDB" by something else. If you write any useful extensions, I would be grateful if you donate these scripts to the project. And a little link to the checkmol/matchmol homepage would also be appreciated.

NH, 2007-10-08; last update: 2009-03-10