The checkmol/matchmol Homepage


What is checkmol/matchmol?
Which input formats are supported by checkmol/matchmol?
How can checkmol/matchmol be used?
How to obtain checkmol/matchmol?
What are the requirements of checkmol/matchmol?
Compiling and installing checkmol/matchmol
Usage (command-line options):
Features
Windows DLL version
Linux server version
Links
Contact

What is checkmol/matchmol?

Checkmol is a command-line utility program which reads molecular structure files in different formats (see below) and analyzes the input molecule for the presence of various functional groups and structural elements. At present, approx. 200 different functional groups are recognized. Output can be either clear text (English or German), a bitstring or its ASCII representation, or a set of special 8-character codes. This output can be easily placed into a database table, permitting the creation of chemical databases with a functional group search option.

Here is a complete list of recognized functional groups (PDF).

Another output option of checkmol is a set of statistical values derived from a given molecule, which can also be used for quick retrieval from a database. These values include: the number of atoms, bonds, and rings, the number of differently hybridized carbon, oxygen, and nitrogen atoms, the number of C=O double bonds, the number of rings of different sizes, the number of rings containing nitrogen, oxygen, or sulfur, the number of aromatic rings, the number of heterocyclic rings, etc. The combination of all of these values for a given molecule represents a kind of "fingerprint", which is useful for rapid pre-selection in a database structure/substructure search prior to a full atom-by-atom match (see below). For a fully functional set of PHP scripts implementing such a web database (plus utility scripts for data import), please visit the MolDB5 homepage.
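The pre-selection idea described above can be sketched in a few lines of Python. This is only an illustration, not the way checkmol or MolDB5 implement it: the dictionary keys and the numbers are hypothetical, and the sketch assumes that every statistic used is a count that cannot be smaller in a superstructure than in its substructure (which may not hold for every value checkmol reports).

```python
from typing import Dict


def may_contain(candidate: Dict[str, int], query: Dict[str, int]) -> bool:
    """Return False only if the candidate can be ruled out by its counts.

    A True result does not prove a match; it only means the candidate
    should still be passed on to the full atom-by-atom comparison.
    """
    return all(candidate.get(key, 0) >= count for key, count in query.items())


# Hypothetical statistics for a query structure and two stored molecules.
query = {"n_atoms": 8, "n_rings": 1, "n_C=O": 2}
database = {
    "mol_a": {"n_atoms": 20, "n_rings": 2, "n_C=O": 2},
    "mol_b": {"n_atoms": 12, "n_rings": 1, "n_C=O": 1},  # too few C=O bonds
}

# Only the survivors of this cheap filter need the expensive substructure match.
candidates = [name for name, stats in database.items() if may_contain(stats, query)]
```

In a real database the same comparison would typically be expressed as a WHERE clause on indexed integer columns, so that only the remaining candidates are handed to matchmol.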

Matchmol complements the capabilities of checkmol. It compares two (or more) molecular structures and determines whether one of them is a substructure of the other one. This is done by a full atom-by-atom comparison of the input structures. Thus, matchmol can be used as a back-end program for structure/substructure search operations in chemical databases (see below).

More detailed information is available in this publication:
Haider, N., Functionality Pattern Matching as an Efficient Complementary Structure/Reaction Search Tool: an Open-Source Approach. Molecules, 15, 5079-5092 (2010).

Which input formats are supported by checkmol/matchmol?

As input files, MDL molfiles (*.mol; 2D and 3D), Alchemy molfiles (*.mol), and Sybyl mol2-files (*.mol2) are currently understood by checkmol/matchmol; the preferred format is the MDL molfile format. The matchmol utility can also process MDL SD-files, which can contain multiple molecular structures. At present, it is not intended to extend the number of supported input file formats, as there are powerful file format converters available, such as OpenBabel. A detailed description of the MDL file formats (molfile, SD-file) is available here.

How can checkmol/matchmol be used?

The main purpose of checkmol/matchmol is to permit the creation of fully searchable, web-based molecular structure databases entirely with free software. For example, a typical LAMP system (Linux, Apache, MySQL, PHP) can be easily extended with checkmol/matchmol into a chemical database with structure/substructure search options. A detailed description of how this can be done is given here.

Another application is batch-mode processing of data files containing multiple structures, in our case MDL SD files. For instance, one can search a large SD file such as the Maybridge screening collection for uracil-containing molecules and write the matching molecules into another SD file. This can be achieved with the following command:

matchmol -m uracil.mol maybridge-complete.sdf > maybridge-uracils.sdf

The -m option causes output of hits in MDL molfile format (including any additional fields of the input SD file); uracil.mol contains the query structure (the "needle") and maybridge-complete.sdf is the database file (the "haystack"). Since version 0.2g of checkmol/matchmol, there is no size limit for the "haystack" file.
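The same batch run can also be started from a script. The following Python sketch simply wraps the command shown above; the function names are hypothetical, matchmol is assumed to be on the PATH, and the exit status is deliberately not checked because its meaning is not described here.

```python
import subprocess
from typing import List


def matchmol_command(needle: str, haystack: str, options: str = "-m") -> List[str]:
    """Build the argument list for: matchmol <options> <needle> <haystack>."""
    return ["matchmol", options, needle, haystack]


def extract_hits(needle: str, haystack: str, out_path: str) -> None:
    """Write all matching molecules of the haystack SD file to out_path.

    This mirrors the shell redirection '> out_path' of the command above.
    """
    with open(out_path, "wb") as out:
        subprocess.run(matchmol_command(needle, haystack), stdout=out)
```

For example, extract_hits("uracil.mol", "maybridge-complete.sdf", "maybridge-uracils.sdf") corresponds to the command line given above.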

How to obtain checkmol/matchmol?

The two programs are in fact only one program which is invoked by two different names, i.e. there is only one source code. The utility is freely available under the terms of the GNU General Public License (GPL). For a detailed description of this license, please visit http://www.gnu.org/copyleft/gpl.html.

Download:
Please visit the download directory at http://merian.pch.univie.ac.at/pch/download/chemistry/checkmol/.
It contains the source code (checkmol.pas is a symbolic link to the latest source file) as well as pre-compiled binaries for various platforms (Windows, Mac OS X) in the "bin" subdirectory; there is also a socket-based server version for Un*x-like systems (cmmmsrv) in the "server" subdirectory.

For a brief description of the version history, please check the source code.

What are the requirements of checkmol/matchmol?

The software is available both as source code and as a binary compiled for Linux (x86 architecture). It is entirely written in Pascal and it was compiled with Free Pascal 1.0.11 or Free Pascal 2.4.0 (starting from v0.4c). The Free Pascal compiler is also freely available under the GPL, and there are versions for a variety of operating systems and computer architectures. For more information about Free Pascal, please visit the project homepage at http://www.freepascal.org. The binary executable of checkmol/matchmol was built on a SuSE 10.1 or on an Ubuntu 10.04 system, but it should run on any other x86 Linux distribution, as there are no special libraries required. Supported platforms also include MS Windows (NT, 2000, XP).

Compiling and installing checkmol/matchmol

Compile with fpc (Free Pascal, see above), using the -Sd or -S2 option (Delphi mode; this is IMPORTANT!)

Example for compilation and installation:
fpc checkmol.pas -S2 -O3 -Op3
Note: if you are running Mac OS X, do not use the compiler optimisation flags; use the following command instead, as described on the Macs in Chemistry website:
fpc checkmol.pas -S2 -Tdarwin

This produces the files "checkmol.o" and "checkmol". Then, as the "root" user, do the following:

cp checkmol /usr/local/bin    (or any other directory in your path)
cd /usr/local/bin
ln checkmol matchmol
          (ATTENTION: a symbolic link does not work!)

Note that checkmol and matchmol are the same executable, but the program behaves differently depending on the name it was invoked with. Of course, you can also copy "checkmol" to "matchmol" (instead of making a link), but then it takes twice as much disk space (under Windows, this is the only possibility, as there are no hard links available under this "OS").
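The general technique of one executable that selects its behaviour from the name it was invoked with can be illustrated as follows. This is a generic Python sketch, not a translation of the Pascal source; the function name is hypothetical, and the sketch does not explain why a symbolic link fails for the real program.

```python
import os


def mode_from_invocation(argv0: str) -> str:
    """Pick the operating mode from the name the program was started as."""
    name = os.path.basename(argv0).lower()
    if name.endswith(".exe"):  # tolerate a Windows-style suffix
        name = name[: -len(".exe")]
    # Anything not called "matchmol" falls back to checkmol behaviour
    # (an assumption made for this sketch).
    return "matchmol" if name == "matchmol" else "checkmol"
```

A program using this pattern would call mode_from_invocation(sys.argv[0]) once at start-up and then branch into the corresponding command-line parser.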

Usage (command-line options):

checkmol can be invoked with the following arguments
checkmol [options] <filename>
 where [options] can be:
    -l  print a list of fingerprint codes + explanation and exit
    -v  verbose output
    -r  force SSR (set of small rings) ring search mode
    -M  accept metal atoms as ring members
  and one of the following:
    -e  English text (common name of functional group; default)
    -d  German text (common name of functional group)
    -c  code (acronym-like code for functional group)
    -b  bitstring (in decimal format) representing the presence of each group
    -s  ASCII representation of the above bitstring (i.e., 0s and 1s)
    -p  lists the position of each functional group (atom number of key atom)
    -x  print molecular statistics (number of various atom types, bond types, ring sizes, etc.)
    -X  same as above, listing all records (even if zero) as comma-separated list

    -a  count charges in fingerprint

    -m  write MDL molfile (with special encoding for aromatic atoms/bonds)
    -h  hashed fingerprint mode with boolean output
    -H  hashed fingerprint mode with decimal output


Options can be combined (like -vc). <filename> specifies any file in the formats supported (MDL *.mol, Alchemy *.mol, Sybyl *.mol2); the filename "-" (without quotes) specifies standard input.

matchmol can be invoked with the following arguments
matchmol [options] <needle> <haystack>
 where <needle> and <haystack> are the two molecules to compare
 (supported formats: MDL *.mol, Alchemy *.mol, Sybyl *.mol2)
 options can be:
    -v  verbose output
    -x  exact match
    -s  strict comparison of atom and bond types
    -r  force SSR (set of small rings) ring search mode
    -m  write matching molecule as MDL molfile to standard output

    -M  accept metal atoms as ring members
    -n  additional output of atom numbers for matching atom pairs
    -N  like -n, but only for the first matching substructure found
    -g  check geometry of double bonds (E/Z)
    -G  check geometry of chiral centers (R/S)
    -a  check charges strictly
    -i  check isotopes strictly
    -d  check radicals strictly
    -f  fingerprint mode (1 haystack, multiple needles) with boolean output
    -F  fingerprint mode (1 haystack, multiple needles) with decimal output

Default output: the record number followed by ":T" for a hit or ":F" for a miss; i.e., if the haystack contains only one molecule, the result will be "1:T" or "1:F". The "haystack" can also be an MDL SD-file (containing multiple molecules). If matchmol is invoked with "-" as the file argument, both "needle" and "haystack" are read as a single SD-file from standard input, with the first entry in the SD-file taken as the "needle"; the output is again the entry number followed by ":F" (false) or ":T" (true).
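A frontend script has to turn this default output into a list of hits. The following Python sketch shows one way to do that; it assumes that matchmol prints one "number:T" or "number:F" result per line, and the function name and the sample text are hypothetical.

```python
from typing import List, Tuple


def parse_matchmol_output(text: str) -> List[Tuple[int, bool]]:
    """Convert lines such as '1:T' or '2:F' into (record_number, is_hit) pairs."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record, _, flag = line.partition(":")
        if flag not in ("T", "F"):
            raise ValueError(f"unexpected matchmol output line: {line!r}")
        results.append((int(record), flag == "T"))
    return results


# Made-up output for a haystack with three records.
sample = "1:F\n2:T\n3:T\n"
hits = [number for number, matched in parse_matchmol_output(sample) if matched]
```

The record numbers collected in hits can then be used to look up the corresponding entries of the SD file or database table.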

Features

At present, only smaller molecules are handled adequately, i.e. for each molecule the maximum number of atoms is 1024, the maximum number of bonds is 1024, the maximum ring size is 128 (i.e., rings larger than 128 members are treated as open-chain compounds), and the maximum number of rings is 1024. Checkmol/matchmol collects the "set of all rings" (SAR) instead of e.g. the "smallest set of smallest rings" (SSSR). Aromaticity is determined by application of the Hückel rule (4n + 2 pi electrons) without any geometry checks, but with adequate treatment of tautomeric/mesomeric structures where possible. For example, 1-methyl-2(1H)-pyridone is correctly recognized as aromatic, as well as cyclopentadienyl anion, tropylium cation, fulvene, tropone, etc.
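The electron-count criterion mentioned above is simple to state in code. The Python sketch below checks only the 4n + 2 condition for a given number of pi electrons; determining that number for a ring (including the treatment of tautomeric/mesomeric structures) is the difficult part and is not shown here. The function name is hypothetical.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """True if pi_electrons equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Textbook pi-electron counts for a few monocyclic systems.
examples = {
    "benzene": 6,
    "cyclopentadienyl anion": 6,
    "tropylium cation": 6,
    "cyclobutadiene": 4,
}
aromatic = {name: satisfies_huckel(count) for name, count in examples.items()}
```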
New in version 0.2: if a molecule contains more than 1024 rings, a fallback mechanism changes the ring search mode from SAR to SSR (set of small rings, which is defined as follows: ringsize <= 12 atoms, no ring is completely contained in another one). For additional information, please check the version history description in the source code.
Starting with versions 0.3d and 0.3f, matchmol supports stereospecific search operations, either globally or on a per-atom or per-bond basis. Geometric isomers of the E/Z type (aka cis/trans isomers) are recognized as well as isomers with chiral centers (R/S isomers). The latter type of isomer discrimination works with 3D molfiles (using the XYZ coordinates) and with 2D molfiles (using "up" and "down" bond notation) in any combination.
Starting with version 0.4, checkmol supports the generation of hash-based fingerprints for efficient pre-selection in structure databases. The default values are as follows: only linear fragments, minimum fragment length: 3 atoms, maximum fragment length: 8 atoms, 2 bits per fragment, total bitstring length: 512 bits.
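How such a hash-based fingerprint supports pre-selection can be sketched as follows. The Python code uses the default parameters quoted above (2 bits per fragment, 512 bits in total), but everything else is an assumption made for illustration: the hash function (SHA-256), the string notation for fragments, and all function names are hypothetical, and the enumeration of linear fragments of 3 to 8 atoms from a molecule is not shown.

```python
import hashlib
from typing import Iterable, List

FP_BITS = 512            # total bitstring length
BITS_PER_FRAGMENT = 2    # bits set for each fragment


def fragment_bits(fragment: str) -> List[int]:
    """Derive BITS_PER_FRAGMENT bit positions from a fragment string."""
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return [
        int.from_bytes(digest[4 * i: 4 * i + 4], "big") % FP_BITS
        for i in range(BITS_PER_FRAGMENT)
    ]


def fingerprint(fragments: Iterable[str]) -> int:
    """Combine all fragments of a molecule into one integer bitstring."""
    fp = 0
    for fragment in fragments:
        for bit in fragment_bits(fragment):
            fp |= 1 << bit
    return fp


def passes_screen(needle_fp: int, haystack_fp: int) -> bool:
    """A haystack can contain the needle only if it has all of the needle's bits."""
    return (needle_fp & haystack_fp) == needle_fp


# Hypothetical linear fragments written as plain strings.
needle_fp = fingerprint(["C-C-O"])
haystack_fp = fingerprint(["C-C-O", "C-C-N", "C-C-C-O"])
```

Because different fragments can hash to the same bits, a passed screen is only a hint; candidates that pass still need the full atom-by-atom comparison by matchmol.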
Starting with version 0.5, checkmol has an option (-p) to display all occurrences of all detected functional groups in a molecule by listing the corresponding "key atoms" (for a graphical representation of all functional groups with their key atoms, see the document fgtable.pdf).

Windows DLL version

Although the program can be smoothly compiled with Free Pascal on the Win32 platform as a console application, its encapsulation in a Windows Dynamic Link Library (DLL) would have specific advantages, such as seamless integration into database applications like MS Access (using VBA as the link). Alessandro Barozza from PROCOS had the idea for this DLL version and he also realized its implementation. Cited from Alessandro's code header:

WHY?
I needed substructure matching capability.
I needed a dll for using with visual basic or VBA (MS-Access).
I needed to pass the mol file as string (from a memo field in a database
and not as a molfile on the disk)
so... I've modified the original matchmol

A more detailed description of the features of this DLL and how to use it is given in the header of the source code (see download link below). Alternatively, you can use the Barsoi DLL, a library based on a C port of checkmol/matchmol which has been developed as a part of the pgchem::tigress project by Ernst-Georg Schmid (see below).

Download:
source code: MATCHMOLDLL.pas (286 KB)
compiled DLL: MatchMolDLL.dll (167 KB)

Linux server version

cmmmsrv, a socket-based server program providing checkmol/matchmol functionality, has been developed as a replacement for the checkmol/matchmol command-line program in web-based molecular structure databases and related applications. Communication of any frontend program (e.g., a PHP script) with cmmmsrv takes place via sockets instead of shell calls, thus saving a significant amount of time.

Download:
source code: cmmmsrv.pas (400 KB)
compiled Linux (i586) binary: cmmmsrv.gz 
documentation: readme.txt
examples for using cmmmsrv can be found in the MolDB5R package (e.g., in the script incss.php)

Links

Contact

Checkmol/matchmol was written by Norbert Haider, Department of Pharmaceutical Chemistry (now: Department of Drug and Natural Product Synthesis), University of Vienna, Austria. You can contact me by e-mail: norbert.haider@univie.ac.at (no spam, no viruses, no HTML mails, please).

N. Haider, 2003-12-01; last update: 2013-05-24