a free software package for creating a web-based, searchable molecular
Norbert Haider, University of Vienna, 2007
ATTENTION: there is already a newer version of this software available! Please visit the MolDB5 homepage.
These instructions explain how to set up a web-based, fully searchable
molecular structure database by importing data from an MDL SD file.
This is shown by an example using the SD files which are freely
available from the PubChem FTP site at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
For additional information, please visit
MolDB4 is a collection of fully functional PHP scripts for running a
structure database with search options for text, functional groups, and
structure/substructure. Included are also Perl scripts for data import
from an SD file.
The system uses MySQL as the database engine, whereas its "chemical
intelligence" is provided by the checkmol/matchmol program (or its
socket-based server version, cmmmsrv). All relevant data are stored in
MySQL tables, except the bitmap graphics of the molecules.
Structure/substructure searches are performed in a two-stage process,
consisting of an efficient pre-selection and a full atom-by-atom
comparison of the remaining candidates. For pre-selection, a
combination of molecular statistics with dictionary-based fingerprints
and hash-based fingerprints is used. Optionally, this operation can be
performed with memory-based MySQL tables instead of disk-based tables.
This will make the search even faster. In a test installation on a
standard PC (AMD Athlon 1.6 GHz CPU, 1.5 GB of memory), substructure
searches in a database of 500.000 compounds are typically complete
within a few seconds.
any x86 PC (the faster, the better)
- Linux operating system (any distribution)
- Apache web server 1.x or 2.x
- MySQL 3.x or higher
- PHP 4 or higher
- Perl 5.x
===> these packages
are standard components of every distribution
Note: this package was developed and tested on a Linux machine, but
with a few adaptations it should run also on a Windows PC (except for
operating in socket mode in combination with cmmmsrv). There is also at
least one successful implementation on a FreeBSD system and another
port to MacOS X.
Where to start?
First, create the MySQL tables from your SD file, for example
(after unpacking the .gz file which you downloaded from the PubChem FTP
If you have wget installed, you can retrieve the PubChem file very
conveniently by entering e.g.
Prerequisite: the checkmol/matchmol program should be installed in
/usr/local/bin (for more information, see
Alternatively, cmmmsrv should be up and running (by default, this
daemon is listening on port 55624).
The following steps
should be executed exactly in this order:
- Edit the configuration file "moldb.conf"; the only item
which must be known at this time is the user-ID of
the MySQL administrator (usually this is "root"), and
this account should have a password (attention: this is
not the same as the Linux administrator, "root").
Please check your MySQL documentation.
Select a name for your database, UIDs for a proxy
user with write privileges and another proxy user with
read-only privileges (the example file "moldb.conf" contains
already these entries which can be accepted or modified;
please change at least the passwords!).
If you want to use bitmap graphics of your
molecules (recommended), please specify a directory where
these bitmaps should be stored and make sure you have
write permission there. This can be the final location of the
files (somewhere within your web server's document root) or a temporary
directory from where
you can copy the files to their final location later
on. If you don't want bitmaps, just set
an empty string ("").
- From the Linux command line, run the shell script
"initdb.sh" by typing "
./initdb.sh". This will create the
database and the accounts for the proxy users with
appropriate privileges and passwords as specified in
"moldb.conf". You will be prompted for the MySQL password of the
MySQL administrator (see above). This password will
never be stored in any file, so you have to enter it when you are
prompted for it. The Perl script "initfpdef.pl", located in
the "util" directory, will be called automatically in order
to create the MySQL table holding the fragment dictionary
for the binary fingerprints (these molecular structures
are supplied in the file "fp01.sdf", please feel free
to modify it, if necessary).
- Run the Perl script "sdfcheck.pl" with the name of the SD
file to be imported as the command-line
argument, e.g. "
perl sdfcheck.pl Compound_00000001_00025000.sdf". This
will create a definition file "sdf2moldb.def". By
default, only the first 10000 records are
inspected which should be sufficient in order to collect the necessary information about the file structure.
- Load the file "sdf2moldb.def" (see above) into a text
editor and inspect it. Here, you can change the
types of MySQL variables or their dimensions. In
you MUST rename the most descriptive field name (preferentially holding
the molecule's trivial name or its IUPAC
name) into "mol_name". If you do not do this, you will have
to manually adjust all the PHP scripts in order to retrieve
the correct field. In our PubChem example, change the MySQL field
name (i.e., the 2nd column, between the first and the second
":") "pubchem_iupac_name" into "mol_name". Many numeric fields are
assumed to be
floating-point numbers (MySQL type "DOUBLE"), but in fact are integer
numbers, so their type should be changed into (e.g.)
"INT(11)". In our PubChem example, only two fields should be kept as
"DOUBLE": "pubchem_openeye_mw" and "pubchem_cactvs_xlogp". If you do
not need one or more of the recognized
SDF fields in your MySQL table, simply delete the
corresponding lines from the definition file.
- Import the content of your SD file into your MySQL database
by running the Perl script "sdf2moldb.pl" with the
name of the SD file to be imported as the command-line
argument, e.g. with "
perl sdf2moldb.pl Compounds_00000001_00025000.sdf".
Depending on the number of molecules in the SD
file, this operation may take some time (on my machine,
approx. 1-2 h for 10000 records). It is recommended to do
this operation in a background job, most conveniently
in a "screen" session (screen is a utility program
which is included in most Linux distributions).
As an alternative to sdf2moldb.pl, the script
sdf2moldb-cmmm.pl can be used if a cmmmsrv daemon is running on your
machine. This will work somewhat faster than sdf2moldb.pl
- If you want to add new records, repeat step 5 with another
SD file. All data will be appended. ATTENTION: all
SD files must have the same format (names and types
of data fields) as the first one. For example, you can
sequentially import all of the SD files from the PubChem site
(see above). (And don't forget to let me know your performance
figures for a database with >15 million
The directory "util" contains some Perl scripts:
The latter four scripts should be run whenever a new version of
checkmol/matchmol (or cmmmsrv, respectively) is installed on your system.
- fetchmol.pl retrieves the MDL molfile of a given record
number and writes it to standard output
- updatemol.pl replaces a specified molecular structure in
the database with a new one which is supplied as a molfile.
- mkmolstat.pl (re-)generates the molstat table (which
contains the molecular "statistics") by retrieving all structures and
analyzing them with checkmol
- mkmolfgb.pl (re-)generates the molfgb table (which contains
information about the functional groups) by retrieving all structures
and analyzing them with checkmol
- mkmolbfp.pl (re-)generates the molbfp table (which contains
the molecular "binary fingerprints") by retrieving all
structures and comparing them with the fragment dictionary which is
in the fpdef table and in the fp01.sdf file; this is done with
matchmol, using the "-F" option
- mkmolhfp.pl (re-)generates the molhfp table (which contains
the molecular "hashed fingerprints") by retrieving all
structures and automatically generating fragments and assigning hash
to them; this is done with checkmol, using the "-H" option
Furthermore, there are some diagnostic scripts available (bfpstat.pl,
hfpstat.pl, ssrtest.pl) as well as a startup script
(cmmm-server) for starting/stopping the cmmmsrv daemon.
The script cp2mem.pl will copy the persistent molstat, molhfp, and
molbfp MySQL tables to heap-based MySQL tables, which are accessed
faster than the disk-based tables. In order to make use of these
tables, the moldbconf.php file has to be modified (as an example, the
file moldbconf-memory.php is included in the wwwfiles directory). Do
not forget to run cp2mem.pl after each reboot of your machine,
otherwise the memory-based tables would be empty.
Next, the web server part can be
Determine your web server's document root directory. This might be something like "/var/www/html" or
"/data/htdocs", etc. (in a typical Apache installation, it is
specified in the configuration file "httpd.conf").
- Within the document root, create a new directory which will hold the MolDB4 PHP scripts, e.g. "moldb4".
- Copy the "index.php" file, the "moldbtxt.php" file, the "moldbss.php" file, and the "moldbfg.php" file
into this directory, as well as any further PHP scripts
(e.g., "showmol.php", "details.php") and the "moldb.css"
example style sheet. All of these files can be found in
the directory "wwwfiles" of this package.
If you are using bitmap graphics, create the
appropriate directory, e.g. as a subdirectory "bitmaps" and
move all the *.png files there (if you did not specify
this directory already as the output directory before
running the "sdf2moldb.pl" script). The location (URL) of
this bitmap directory must be also specified in the
configuration file, "moldbconf.php" (see below). Please note
that for large databases, it may be better to store groups
of bitmap files in subdirectories (using the
variable in moldb.conf and moldbconf.php).
- The configuration file "moldbconf.php" should contain the same
values for database name, usernames and passwords as the setup
configuration file "moldb.conf". This file should be placed outside
the web server's document root, somewhere in the PHP include path
(check your php.ini file and/or the PHP docs). Keeping the
"moldbconf.php" file within the document root is a security risk, as it
contains plain-text passwords. Make sure the web server can read this
file and revoke read permission for other users. If you are using
bitmap graphics, make sure that the variable
$bitmapURLdir points to the correct location, e.g. "/moldb4/bitmaps" (no trailing slash!).
- Make sure you have the JME Java applet installed. The file "JME.jar" should be placed in a directory
"classes" directly beneath the document root.
- Structure/substructure search is done with moldbss.php, for which two different versions are included. The
slower one is moldbss-sh.php, it uses the
checkmol/matchmol command-line program via shell calls. If you are
running cmmmsrv, it is recommended to copy or rename
moldbss-sock.php to moldbss.php, as this version is significantly
Note: it is assumed that PHP is
not running in "safe mode". If safe_mode is set to On in the php.ini
file, you have to make sure that checkmol/matchmol is allowed to
be executed, this can be done by specifying the corresponding directory
in the variable
safe_mode_exec_dir in php.ini (please check your PHP
That's it. You may wish to adjust
the HTML layout of the index.php and moldb.css files and the PHP
scripts, replacing "MolDB" by something else. If you write any useful
extensions, I would be grateful
if you donate these scripts to the project. And a little link to the checkmol/matchmol homepage would also be appreciated.
NH, 2007-10-08; last update: 2009-03-10