MolDB5

a free software package for creating a web-based, searchable molecular structure database

Norbert Haider, University of Vienna, 2009 
norbert.haider@univie.ac.at

These instructions explain how to set up MolDB5 as a web-based, fully searchable molecular structure database. Structures and data can be added via a web interface or by import from an MDL SD file. The latter option is shown by an example using the SD files which are freely available from the PubChem FTP site at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

For some information about the original ideas, please visit
http://merian.pch.univie.ac.at/~nhaider/cheminf/moldb.html

The previous version, MolDB4, is documented at
http://merian.pch.univie.ac.at/~nhaider/cheminf/moldb4.html

MolDB5 is a collection of fully functional PHP scripts for running a structure database with search options for text, functional groups, and structure/substructure. Included are also some Perl scripts for database setup and data import from an SD file.

Download: http://merian.pch.univie.ac.at/pch/download/chemistry/moldb/moldb5.tar.gz


Content

Features
Technical background
Installation
Using MolDB5
Administration of MolDB5
Appendix A (permissions and privileges)
Appendix B (configuration file settings)
Appendix C (performance tuning)

Features


Technical background

The system uses MySQL as the database engine, whereas its "chemical intelligence" is provided by the checkmol/matchmol program (or its socket-based server version, cmmmsrv). All relevant data are stored in MySQL tables, except the bitmap graphics of the molecules. Structure/substructure searches are performed in a two-stage process, consisting of an efficient pre-selection and a full atom-by-atom comparison of the remaining candidates. For pre-selection, a combination of molecular statistics with dictionary-based fingerprints and hash-based fingerprints is used. Optionally, this operation can be performed with memory-based MySQL tables instead of disk-based tables. This will make the search even faster. In a test installation on a standard PC (AMD Athlon 1.6 GHz CPU, 1.5 GB of memory), substructure searches in a database of 500.000 compounds are typically complete within a few seconds.

Demo MolDB5 database


Installation

Hardware requirements: 

any x86 PC (the faster, the better)

Software requirements:

==> these packages are standard components of every Linux distribution

moreover:
Note: this package was developed and tested on a Linux machine, but with a few adaptations it should run also on a  FreeBSD system and (probably) under MacOS X. The Windows development and (some) testing was done on a virtualized Windows XP system with the XAMPP 1.7.0 package. Instead of the Perl component of XAMPP, ActivePerl 5.6.1 was used and the missing database driver was installed via the PPM utility (ppm install DBI; ppm install DBD::MySQL).

Where to start?

After downloading and unpacking the MolDB5 package in your working directory, rename one of the two sample configuration files (moldb5.conf.sample_linux or moldb5.conf.sample_windows) into "moldb5.conf". Load this file into a text editor and adjust the settings. A detailed overview of all parameters is given in Appendix B.

Select a name for your database, UIDs for a proxy user with write privileges and another proxy user with read-only privileges (the example file "moldb5.conf" contains already these entries which can be accepted or modified; please change at least the passwords!).

If you want to use bitmap graphics of your molecules (recommended), please specify a directory where these bitmaps should be stored and make sure you have write permission there. This should be the final location of the .png files (somewhere within your web server's document root, preferentiall as a subdirectory "bitmaps" of your MolDB5 web directory). If you don't want bitmaps, just set $bitmapdir to an empty string ("").

Initialize the MolDB5 database by running the Perl script "initdb.pl" (enter "perl initdb.pl"). You will be prompted for the password of the MySQL administrator (typically, this is the "root" user). The script will create the appropriate MySQL database and two tables within this database (one for metadata and one for the fragment dictionary). If everything goes well, the script will produce this output:
setup of database $database OK    (where $database displays the name you have specified)
setup of fragment dictionary OK



If you want to import structures+data from an SD file (*.sdf), proceed as follows:

As an example, we use Compounds_00000001_00025000.sdf (after unpacking the .gz file which you can download from the PubChem FTP site).

If you have wget installed, you can retrieve the PubChem file very conveniently by entering e.g.

wget ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_00000001_00025000.sdf.gz

Prerequisite: the checkmol/matchmol program must be installed (e.g., in /usr/local/bin, for more information, see
http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm.html).

These steps should be executed exactly in this order:
  1. Run the Perl script "sdfcheck.pl" with the name of the SD file to be imported as the command-line argument, e.g. "perl sdfcheck.pl Compound_00000001_00025000.sdf". This will create a definition file "sdf2moldb.def". By default, only the first 10000 records are inspected which should be sufficient in order to collect the necessary information about the file structure. 

  2. Load the file "sdf2moldb.def" (see above) into a text editor and inspect it. Here, you can change the types of   MySQL variables or their dimensions. In addition, you MUST rename the most descriptive field name (preferentially the one holding the molecule's trivial name or its IUPAC name) into "mol_name". If you do not do this, you will have to manually adjust all the PHP scripts in order to retrieve the correct field. In our PubChem example, change the MySQL field name (i.e., the 2nd column, between the first and the second ":") "pubchem_iupac_name" into "mol_name". Many numeric fields are assumed to be floating-point numbers (MySQL type "DOUBLE"), but in fact are integer numbers, so their type should be changed into (e.g.) "INT(11)". In our PubChem example, only two fields should be kept as "DOUBLE": "pubchem_openeye_mw" and "pubchem_cactvs_xlogp". If you do not need one or more of the recognized SDF fields in your MySQL table, simply delete the corresponding lines from the definition file. Make sure that there is no field named "mol_id", because this is a reserved name of the MolDB5 system.

  3. Import the content of your SD file into your MySQL database by running the Perl script "sdf2moldb.pl" with the name of the SD file to be imported as the command-line argument, e.g. with "perl sdf2moldb.pl Compounds_00000001_00025000.sdf". Depending on the number of molecules in the SD file, this operation may take some time (on a standard business PC, approx. 1-2 h for 10000 records). It is recommended to do this operation in a background job, most conveniently in a "screen" session (screen is a utility program which is included in most Linux distributions).

  4. If you want to add new records, repeat step 3 with another SD file. All data will be appended. ATTENTION: all SD files must have the same format (names and types of data fields) as the first one. For example, you can sequentially import all of the SD files from the PubChem site (see above). (And don't forget to let me know your performance figures for a database with >25 million structures....)  ;-) 
     
  5. If you want to create a new data collection, repeat the steps described above, but change the "db_id=1" line in the sdf2moldb.def file into "db_id=2" (or higher, if applicable).



Next, the web server part can be installed:

  1. Of course, you must know the location of your web server's document root directory. This might be something like "/var/www/htdocs" or "/xampp/htdocs", etc. (in a typical Apache installation, it is specified in the configuration file "httpd.conf").
     
  2. Within the document root, you should have already created a directory which will hold the MolDB5 PHP scripts, e.g. "moldb5". Usually, the "bitmaps" directory will be there as a subdirectory (e.g., "moldb5/bitmaps"). Create another subdirectory named "admin" (e.g., "moldb5/admin").
     
  3. Copy all the files from the directory "wwwfiles" of this package to the MolDB5 web directory, copy also the two scripts in "wwwfiles/admin" to the "admin" subdirectory of the web directory.
     
  4. The configuration file "moldb5conf.php" should contain exactly the same values for all parameters as the setup configuration file "moldb5.conf". In principle, "moldb5conf.php" is nothing else than an exact copy of "moldb5.conf" with just two extra lines (one at the beginning: "<?php ", and one at the end: "?>"). For testing, this file can be placed also in the MolDB5 web directory. In a publicly accessible production environment, however, this file should be placed outside the web server's document root, somewhere in the PHP include path (check your php.ini file and/or the PHP docs). Keeping the "moldb5conf.php" file within the document root is a security risk, as it contains plain-text passwords. Make sure the web server can read this file and revoke read permission for other users. If you are using bitmap graphics, make sure that the variable $bitmapURLdir points to the correct location, e.g. "/moldb5/bitmaps" (no trailing slash!).
     
  5. Make sure you have the JME Java applet installed. The file "JME.jar" should be placed in the same directory as the main MolDB5 PHP scripts. 


Note: it is assumed that PHP is not running in "safe mode". If safe_mode is set to On in the php.ini file, you have to make sure that  checkmol/matchmol is allowed to be executed, this can be done by specifying the corresponding directory in the variable safe_mode_exec_dir in php.ini (please check your PHP documentation).



The subdirectory "util" of the working directory contains some Perl scripts:

These scripts should be run whenever a new version of checkmol/matchmol (or cmmmsrv, respectively) is installed on your system.

The script cp2mem.pl will copy the persistent molstat and molcfp MySQL tables to memory-based MySQL tables, which are accessed faster than the disk-based tables. In order to make use of these tables, the moldb_meta table has to be modified (the "usemem" column must be set to 'T'), this can be done with the Perl script "setmem.pl" withan  appropriate command-line argument, e.g. "perl setmem.pl 1=T" (enabling memory-based tables for data collection 1). Do not forget to run cp2mem.pl after enabling memory-based tables and after after each reboot of your machine, otherwise the memory-based tables would be empty. Generally, this option is recommended only for larger read-only data collections.



That's it for the installation part. You may wish to adjust the HTML layout of the index.php and moldb.css files and the PHP scripts, replacing "MolDB5" by something else. If you write any useful extensions, I would be grateful if you donate these scripts to the project. And a little link to the checkmol/matchmol homepage would also be appreciated.

Using MolDB5

Start page
On the start page (index.php), the user can select the data collection(s) from the list box and clicking the "Apply selection" button. Multiple selections (if enabled in the configuration file) can be made by pressing the <Ctrl> key while clicking at the list item. All search options are available via the navigation bar at the top of the page. A link to the administration page is displayed at the bottom (it may be better to remove this link...).

Browse
All selected data collections are available for browsing on this page, using standard navigation elements. Compound structures are displayed graphically (using bitmap pictures, if available), together with compound names and the ID number (mol_id). The latter represents a hyperlink to a detailed view of this particular compound (provided by the script "details.php").

Text search
A very simple text search tool which by default only searches the "mol_name" field. Search terms are not case-sensitive, as a wildcard character you can use the "%" symbol. By selecting the option "include other searchable fields", you can extend the search to all text fields which have been defined as "searchable" by the administrator (see below).

Functional group search
This search option makes use of checkmol's capabilities to recognize approx. 200 different functional groups. Categorisation of compounds is performed already when they are entered into the database. The properties are stored in binary format in a special table (molfgb), and the search operation is extremely fast. Multiple selection of functional groups from the list box can be made by pressing the <Ctrl> key while clicking at the desired item.

Structure search
This is the central search facility, permitting exact search, substructure search, and similarity search. In the browser, Java and Javascript must be enabled in order to display the JME applet. The input structure must contain at least three atoms. Checkboxes are available for more specific search modes (strict atom/bond type comparison, geometry checks) if a search results in too many hits.

Administration of MolDB5

Data collection handling
Apart from data import from SD files, new entries can be added via the administration frontend in the "admin" subdirectory. The main page ("admin/index.php") is used for adding new data collections and editing/erasing  existing ones. Besides the (recommended) password protection of this directory, the "trustedIP" setting restricts access to these functioanlities (for a more detailed description of privileges, see below)..
On this page, only one data collection can be selected at a time. All operations (edit data collection properties, edit data field definitions, erase selected data collection) will apply to this data collection. Whereas "edit data collection properties" and "erase selected data collection" are self-explaining, the option "edit data field definitions" requires some more explanation: here, you can add new fields to the records of the selected data collection, determine their (MySQL) type, the name which should display on the "details" output, and (optionally) a field name for SDF export. A newly created data collection has only two data fields: "mol_id" (an integer number) and "mol_name" which cannot be changed. A data collection which has been imported from an SD file contains (probably) some more fields. Whenever you add a new field, you can choose its type from the drop-down listbox or enter any valid MySQL column type by selecting "other..." from the list and entering the correct type in the input box. Caution: there is no syntax checking done by MolDB5, wrong input will result in MySQL errors. If you need some special MySQL column types and options, you should use some more sophisticated administration software like phpMyAdmin.

Structure/data input
Using "Add/edit data records", you can enter new structures+data or edit existing ones. This is done by the script "editdata.php", again all actions apply to the selected data collection.

Add new entry
Structures can be entered either by drawing, using the JME applet, or by pasting text in MDL molfile format into the text area. After submission, the structure is displayed and can again be edited if desired. After confirmation, the structure is added to the data collection and all relevant entries (molstat, functional group patterns, fingerprints) are made in the appropriate tables. The data entry form should be completed (at least, the name of the molecule should be entered) and saved.
In the user frontend (browse, search, details), newly added structures are always displayed using the JME applet in "depict" mode, until the bitmap graphics have been created by running the "updatebitmap.pl" admin script. This should be done automatically by a Cron job. In the admin frontend (add/edit structure), JME is always used for structure display.

Edit/erase entry
"Edit entry no." and "erase entry no." require a valid ID number (mol_id) as input. If you do not know this number, you can find it out by doing a search or just browse the data collection. Whenever an entry is erased, it is removed from all MySQL tables, but the original bitmap file (if it exists) will not be deleted. Instead, the "updatebitmap.pl" script will rename it into something like "00123456.png.to_be_deleted" on its next run. You can then use some OS shell command in order to actually delete these files (or modify the updatebitmap.pl script to do a real deletion).

Appendix A

Permissions and privileges

MolDB5 does not implement its own user administration, but relies on two access conditions:
  1. The web server's standard mechanism to protect a directory. This is usually configured either in a .htaccess file in the directory in question or in the central configuration file (typically /etc/apache2/httpd.conf), using the <Directory> keyword. Usually, a combination of valid client domain and password protection (htpasswd) is specified there. Caution: .htaccess files are effective only if enabled in the httpd.conf file! Please refer to the Apache documentation. It is highly recommended to restrict access to the "admin" subdirectory by this mechanism.
     
  2. The IP address of the client machine. Up to 10 "trusted IP" addresses can be specified in the moldb5conf.php file: these addresses have all possible privileges within MolDB5 (provided the user meets the access condition defined in .htaccess, see above). These users have the status of an "administrator". In addition, "sub-administrators" can be specified on a per-data-collection basis, listing their client IP address in the moldb_meta table, field "trustedIP" (again, up to 10 IP addresses).
Who can do what?

Only the administrator(s) can
The administrator(s) and the sub-administrator(s) of a data collection can
Any user who can access the "admin" subdirectory can


Appendix B

Parameter settings in the moldb5.conf / moldb5conf.php files

The configuration file moldb5.conf in the working directory and the PHP script moldb5conf.php in the MolDB5 web directory are almost identical: the only difference are the standard PHP tags "<?php " and "?>" in the first and last line of moldb5conf.php.

$database      = "moldb5";      # name of the MySQL database
$hostname      = "localhost";   # hostname of MySQL server, probably "localhost"
$clientname    = "localhost";   # name of MySQL client, usually "localhost"
$mysql_admin   = "root";        # MySQL administrator, usually "root"
$rw_user       = "mdb5-admin";  # proxy user with CREATE and INSERT privileges
$rw_password   = "top-secret";  # (make sure this script is unreadable to others)
$ro_user       = "mdb5-user";   # proxy user with SELECT privilege
$ro_password   = "secret";      # (better avoid dollar signs etc.)
$drop_db       = "y";           # erase entire database before re-creating it?


In this section, MySQL-specific settings have to be made: a databse name has to be specified. If $drop_db is set to "y", this database will be erased and recreated every time you run the initdb.pl script. The machine on which the MySQL server is running is specified by $hostname. Usually, the MySQL server and the web server are running on the same machine, in this case $hostname is set to "localhost" and $clientname is also set to "localhost". The name of the MySQL administrator user must be specified, ususally this is "root" (not to be confused with the system "root" user under Linux/Unix systems). The two proxy user accounts in MySQL will be automatically created when initdb.pl is run.
$tweakmolfiles = "y";           # "y" or "n" ("y" gives better performance)

If this parameter is set to "y", the imported molfiles will be replaced by (slightly) modified molfiles which contain some extra information about aromaticity. These "tweaked" molfiles are generated by checkmol and enable matchmol to do atom-by-atom comparisons (in a substructure search) more efficiently (see Appendix C). Tweaked molfiles are completely compatible with other programs which can import MDL molfiles.

$prefix        = "";            # this allows to have different MolDB5 instances
                                # in one MySQL database; each MolDB5 instance can
                                # contain multiple data collections

As explained in the comment, you can use this extra prefix to keep more than one MolDB5 installation in a single MySQL database or you can place MolDB5 together with other database applications (such as bulletin boards, wikis, blogs, etc.) in a single MySQL database. This makes sense only in environments where you are restricted to use only one MySQL database.

$sitename      = "MolDB5 demo"; # appears in title and headline
$CHECKMOL      = "/usr/local/bin/checkmol"; 
$MATCHMOL      = "/usr/local/bin/matchmol";
$use_cmmmsrv   = "y";           # set to "y" if a cmmmsrv daemon is available
$cmmmsrv_addr  = "127.0.0.1";   # must be numeric
$cmmmsrv_port  = 55624;         # the default cmmmsrv port is 55624

The $sitename string will appear in the header of all MolDB5 pages.
In $CHECKMOL and $MATCHMOL, the executables of checkmol and matchmol are specified. Under Linux, the programs are usually installed in the /usr/local/bin directory (most conveniently, matchmol is just a hard link to checkmol), under Windows it can be installed, for example, in C:\bin (but do not include the drive letter in these values). Setting $CHECKMOL to "checkmol" would be also OK if the program's directory is included in the system's search path (beware: this must apply also to the search path of the web server process).
Whereas all the backend Perl scripts rely on checkmol/matchmol, the PHP portion of the package can also use the cmmmsrv daemon instead, if installed on your system. This server process provides the same functionality as the command-line programs (Checkmol/matchmol), but communicates with the web server via TCP sockets instead of shell calls. This gives somewhat better performance (see Appendix C). In principle, cmmmsrv can run on a different machine than the web server, in this case $cmmmsrv_addr must be set to the (numeric) IP address of this machine.


$MOL2PS        = "/usr/local/bin/mol2ps";
$GHOSTSCRIPT   = "/usr/bin/gs";
#$bitmapdir = "";
$bitmapdir     = "/var/www/htdocs/moldb5/bitmaps";  # this is the base directory
$bitmapURLdir  = "/moldb5/bitmaps";
$digits        = 8;      # filenames will be 00000001.png, etc.
$subdirdigits  = 4;      # uses the first x digits of $digits (0 = no subdirectories)
#$mol2psopt     = "--rotate=auto3Donly --hydrogenonmethyl=off --color=/usr/local/etc/color.conf"; # options for mol2ps, e.g. "--showmolname=on"
$mol2psopt     = "--rotate=auto3Donly --hydrogenonmethyl=off"; # options for mol2ps, e.g. "--showmolname=on"
$scalingfactor = 0.22;          # 0.22 gives good results

These options are relevant only if you want to use bitmap graphics for 2D depiction of your molecular structures, otherwise set $bitmapdir to an empty string ($bitmapdir = ""). $MOL2PS points to the mol2ps program, $GHOSTSCRIPT to the Ghostscript program (under Windows, this must be the console application "gswin32c.exe"). The directory specified in $bitmapdir must be writable to you, but not necessarily to the web server. Whenever new structures are entered via the wb frontend, bitmap files are not immediately created, but queued for creation by an auxiliary script "updatebitmap.pl" which should be run in regular intervals by the user who has write permission to $bitmapdir. Most conveniently, this is done by a cron job, e.g. once a day.
Within the bitmap directory, subdirectories are created automatically by updatebitmap.pl (as well as sdf2moldb.pl) for each data collection, using the number ("db_id") of that data collection as the directory name. For larger data collection (>10000 structures), it is advisable to further divide the directory hierarchy by specifying a value of (e.g.) 4 as $subdirdigits and (e.g.) 8 as $digits for the PNG file names. This would give paths to the actual graphics files like /var/www/htdocs/moldb5/bitmaps/1/0000/00000045.png (in an URL, this would be /moldb5/bitmaps/1/0000/00000045.png). Thus, the maximum number of PNG files in one directory can be limited, otherwise you might run into problems with some shell commands.
For all available options of mol2ps, please refer to the mol2ps homepage or just enter "mol2ps" on a command prompt.

# further settings=============================================================

$multiselect   = "y";           # allow selection of multiple data collections
$default_db    = 1;             # default db_id
$trustedIP     = "127.0.0.1, 192.168.0.10";   # max. 10 IP addresses
$fpdict_mode   = 1;      # 1 = auto adjust, 2 = force 64 bit, 3 = force 32 bit
$scratchdir    = "/data/moldb/moldb-scratch";  # needed by cp2mem.pl,
                                               # must be writeable by mysql UID, too

In this section, you can specify whether or not users can select only one data collection for searching/browsing or all available data collections. If there are more than one data collections, $default_db specifies the one which is automatically selected whenever a user visits your MolDB5 website.
In $trustedIP, you can specify up to 10 (numeric) IP addresses for extended administrative privileges. Besides protecting the "admin" subdirectory with a password (using the stanrd mechanisms of the Apache web server, e.g. .htaccess files), the $trustedIP directive gives extra protection. Only from a client machine listed in $trustedIP, you can change the settings of a data collection (e.g. from read-only to read/write) or erase an entire data collection. If you want to assign special administrative priveleges to just one data collection to somebody (e.g. for write access to a read-only data collection), use the analogous "trusted IP addres" specification in the web administration frontend.

The parameter $fpdictmode is rather technical and determines the format in which dictionary-based fingerprints are stored in the molcfp table (either as unsigned 32-bit integers or as 64-bit integers). In the first case, a MySQL column will store information about presence/absence of 31 structural elements from the dictionary, in the latter case, up the 62 bits will be used for this purpose. A setting of "1" is recommended.

The $scratchdir variable holds the name of a directory for temporary files which are used by "cp2mem.pl" when it copies the content of disk-absed tables (molstat and molcfp) to memory-based tables (see Appendix C). It is very important that not only the user who runs "cp2mem.pl" has write access to this directory, but also the MySQL server process (which usually runs as user "mysql").

The last part of the configuration file contains various definitions for table names and should not be changed, unless you have a very good reason to do so.


Appendix C

Performance tuning

There are three options which should be considered when maximum performance in substructre searches is desired:



NH, 2009-02-19; last update: 2009-03-10