
Chemical Structure Representation and Search Systems

Lecture 1. Oct 28, 2003


John Barnard

Barnard Chemical Information Ltd


Chemical Informatics Software & Consultancy Services

Sheffield, UK
Purpose of my 7 lectures
How do you store chemical structures on computer?
What can you do with them there?
How do the computer systems used in chemical
informatics work?

Data Structures + Algorithms


Lecture topics
Oct 28 Introduction to structure representation;
Introduction to Graph theory [video link]
Oct 30 Problems of structure representation
[video link]
Nov 4 More graph theory; Structure analysis
and processing [video link]
Nov 11 Structure searching I [video link]
Nov 13 Structure searching II [video link]
Nov 18 Chemical similarity [Indianapolis]
Nov 20 Cluster analysis etc. [Bloomington]
John Barnard

B.Sc. in Biochemistry (Birmingham, UK)


M.Sc. and Ph.D in Information Studies (Sheffield,
UK)
Has run chemical informatics software
development and consultancy business since 1985
Barnard Chemical Information (BCI) Ltd
http://www.bci.gb.com
Adjunct Professor of Informatics at Indiana
University
Lecture 1: Topics to be Covered

Structure representations and computers


structure diagrams
nomenclature
line notations
connection tables
Introduction to Graph Theory
Representing a chemical structure
How much information do you want to include?
atoms present (e.g. molecular formula C9H11NO3)
connections between atoms
bond types (aromatic ring identification)
stereochemical configuration
charges
isotopes
3D-coordinates for atoms
[Figure: structure diagram of tyrosine, repeated on successive slides to illustrate each item in turn, e.g. a charged form with an H3N+ group and an isotope label 14]
2D structure diagram
chemists' natural language
used by most computer systems for display
shows topology, optionally stereochemistry
several commonly-used computer programs allow input/editing of structure diagrams
ISIS/Draw (MDL)
http://www.mdl.com/downloads/downloadable/index.jsp
ChemDraw (CambridgeSoft)
http://www.cambridgesoft.com/products/
GRINS/JavaGRINS (Daylight)
http://www.daylight.com/products/javatools.html
MarvinSketch
http://www.chemaxon.com/marvin/
2D structure diagram
provides 2D pictorial representation of chemical structure
display on screen
cut/paste/embed in Word document etc.
inter-convert with other forms for further processing
database searching
structure analysis
property prediction
database analysis
Chemical Nomenclature
name that can be used to identify a substance
potentially important for legislation
represents chemical structure as text string which can (sometimes) be pronounced
trivial names
usually short and easy to pronounce
do not usually give much information about structure
systematic names
usually long and difficult to pronounce
usually describe structure in considerable detail
Trivial and Systematic Names
[Figure: structure diagram of tyrosine]
Trivial name:
tyrosine
Systematic names:
β-(p-hydroxyphenyl)alanine
α-amino-p-hydroxyhydrocinnamic acid
Systematic Names
several systems under continual revision and extension
IUPAC
Chemical Abstracts (lecture from Dr Davis on Sep 9)
some special systems designed by individuals
not usually designed for computer processing
programs exist both to read (translate) and to generate
systematic names from computer formats
http://www.beilstein.com/products/autonom/anm2000.shtml
http://www.acdlabs.com/products/name_lab/
have arguably outlived their usefulness
IUPAC IChI (IUPAC Chemical Identifier) project
Registry Numbers
unique identifiers for compounds or substances
catalogue number
most chemical databases have them
Chemical Abstracts
Beilstein
private compound registries in pharmaceutical companies
usually just idiot numbers
no chemical information
may have hierarchical structure
parent compound → stereoisomer → salt → batch
need to decide what is a separate compound
Line Notations
represent structures as compact linear string of alphanumeric symbols
easily handled by computer
compact storage
easily transmitted over a network
allow rapid manual coding/decoding by trained users
much faster for input than using a structure drawing program
Line Notations: SMILES
Simplified Molecular Input Line Entry System
developed by Dave Weininger (Daylight)
[Figure: structure diagram of tyrosine]
OC(=O)C(N)CC1=CC=C(O)C=C1
Simplified SMILES encoding rules
atoms are shown by atomic symbols:
B, C, N, O, F, P, S, Cl, Br, I
hydrogen atoms are assumed to fill spare valencies
adjacent atoms are connected by single bonds
double bonds are shown by '=', triple bonds by '#'
branching is indicated by parentheses
ring closures are shown by pairs of matching digits

Full rules:
http://www.daylight.com/smiles/smiles-intro.html
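The simplified rules above are enough to write a very small reader. The following is a minimal sketch, not Daylight's implementation: the function name parse_simple_smiles is hypothetical, and the code assumes only the ten listed atomic symbols, '=' and '#', parentheses, and single-digit ring closures. It ignores aromatic (lower-case) atoms, charges, isotopes and stereochemistry, all of which the full rules cover.

```python
def parse_simple_smiles(smiles):
    """Parse a SMILES string that uses only the simplified rules above.

    Returns (atoms, bonds): atoms is a list of element symbols and bonds is a
    list of (atom_index, atom_index, bond_order) tuples with 0-based indices.
    """
    two_letter = ("Cl", "Br")
    one_letter = set("BCNOFPSI")
    orders = {"=": 2, "#": 3}

    atoms, bonds = [], []
    prev = None          # index of the atom the next atom will bond to
    pending = 1          # order of the next bond (single unless '=' or '#')
    branch_stack = []    # atoms to return to when ')' closes a branch
    open_rings = {}      # ring-closure digit -> (atom index, bond order)

    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if smiles[i:i + 2] in two_letter:
            symbol, step = smiles[i:i + 2], 2
        elif ch in one_letter:
            symbol, step = ch, 1
        else:
            symbol, step = None, 1

        if symbol is not None:
            atoms.append(symbol)
            current = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, current, pending))
            prev, pending = current, 1
        elif ch in orders:
            pending = orders[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the digit: close the ring.
                start, order = open_rings.pop(ch)
                bonds.append((start, prev, max(order, pending)))
            else:
                # First occurrence: remember where the ring opens.
                open_rings[ch] = (prev, pending)
            pending = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += step

    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


atoms, bonds = parse_simple_smiles("OC(=O)C(N)CC1=CC=C(O)C=C1")
print(len(atoms), len(bonds))  # 13 13
```

For the tyrosine string this gives 13 heavy atoms and 13 bonds; hydrogens are not stored because they are implied by the spare valencies.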
Other line notations
[Figure: tyrosine structure diagram with atoms numbered 1 to 13]
ROSDAL (Beilstein)
Representation Of Structure Diagram Arranged Linearly
1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
Sybyl Line Notation (Tripos)
OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1
Wiswesser Line Notation (WLN) (obsolete)
QVYZ1R DQ
Connection Tables (CTs)
main form of structure representation in computer systems
list atoms and bonds (and other data) as a table
many different formats
internal CTs (in memory)
algorithmic processing
external CTs (disk files)
archival storage
data exchange between programs
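Redundant Connection Table
[Figure: tyrosine structure diagram with atoms numbered 1 to 13]
Atom  Element  H count  Neighbours (atom number, bond order)
1     O        1        (2,1)
2     C        0        (1,1) (3,2) (4,1)
3     O        0        (2,2)
4     C        1        (2,1) (5,1) (6,1)
5     N        2        (4,1)
6     C        2        (4,1) (7,1)
7     C        0        (6,1) (8,2) (12,1)
8     C        1        (7,2) (9,1)
9     C        1        (8,1) (10,2)
10    C        0        (9,2) (11,1) (13,1)
11    C        1        (10,1) (12,2)
12    C        1        (11,2) (7,1)
13    O        1        (10,1)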
Internal Connection Table

usually redundant
every bond shown twice, once for each atom
implemented as array of records
record for each atom might store
atomic type
hydrogen count
formal charge
2D display co-ordinates
bonds to neighbouring atoms
etc.
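An array of records with redundant bond storage might be sketched as follows. This is an illustration only: the class name AtomRecord, its field names and the helper build_table are hypothetical, and the tyrosine data use the atom numbering of the redundant connection table slide, shifted to 0-based indices.

```python
from dataclasses import dataclass, field


@dataclass
class AtomRecord:
    element: str
    h_count: int = 0
    charge: int = 0
    xy: tuple = (0.0, 0.0)                          # 2D display co-ordinates
    neighbours: list = field(default_factory=list)  # (atom index, bond order)


def build_table(atoms, bonds):
    """Build a redundant table: each bond is stored on both of its atoms."""
    table = [AtomRecord(element, h_count) for element, h_count in atoms]
    for a, b, order in bonds:
        table[a].neighbours.append((b, order))
        table[b].neighbours.append((a, order))
    return table


# Tyrosine: (element, hydrogen count) per atom, then each bond listed once.
TYROSINE_ATOMS = [("O", 1), ("C", 0), ("O", 0), ("C", 1), ("N", 2), ("C", 2),
                  ("C", 0), ("C", 1), ("C", 1), ("C", 0), ("C", 1), ("C", 1),
                  ("O", 1)]
TYROSINE_BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1), (3, 5, 1),
                  (5, 6, 1), (6, 7, 2), (7, 8, 1), (8, 9, 2), (9, 10, 1),
                  (10, 11, 2), (11, 6, 1), (9, 12, 1)]

table = build_table(TYROSINE_ATOMS, TYROSINE_BONDS)
print(table[1].neighbours)  # [(0, 1), (2, 2), (3, 1)]
```

The redundancy costs memory but means the neighbours of any atom can be read directly from its own record, which is what most processing algorithms need.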
MDL Connection Table

proprietary file format developed by MDL


http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
de facto standard for exchange of datasets
several different flavours and versions
Molfile (single molecule)
SDfile (set of molecules and data)
RGfile (Markush structure)
Rxnfile (single reaction)
RDfile (set of reactions with data)
separates atoms and bonds into separate blocks
New MDL File Formats
Since this lecture was delivered on Oct 28, 2003, MDL have published details of a new file format called XDfile
XML-based data format for transferring structure/reaction information with associated data
built around existing MDL connection table formats
can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page)
can incorporate SMILES strings
Details available in MDL documentation at:
http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
MDL Connection Table

Header Block
data on molecule name and file origin
counts of atoms and bonds etc.

Tyrosine
-ISIS- 08220120432D

13 13 0 0 0 0 0 0 0 0999 V2000
MDL Connection Table

Atoms block
one line per atom
specifies X,Y,Z-coords, atom symbol, isotope, charge,
stereo code etc.
0.2459 -1.4736 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5815 -1.4724 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9944 -2.1872 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5810 -2.9037 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2495 -2.9008 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.6586 -2.1854 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4836 -2.1830 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.9042 -2.1792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1027 -2.1870 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
-3.1359 -1.1516 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.9070 -2.1847 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.4070 -2.6845 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-4.4989 -1.5618 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
MDL Connection Table
Bonds Block
one line per bond (each bond shown once)
specifies row numbers for atoms, and codes for bond
type, bond stereochemistry etc.

1 2 2 0 0 0 0
6 7 1 0 0 0 0
3 4 2 0 0 0 0
3 8 1 0 0 0 0
4 5 1 0 0 0 0
9 10 1 0 0 0 0
2 3 1 0 0 0 0
9 11 1 0 0 0 0
5 6 2 0 0 0 0
11 12 1 0 0 0 0
6 1 1 0 0 0 0
11 13 2 0 0 0 0
8 9 1 0 0 0 0
M END
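Reading the counts line, atoms block and bonds block can be sketched in a few lines. Note that the real file uses fixed-width columns, which the listings on these slides have lost. The column offsets below (three-character count and bond fields, ten-character coordinate fields, atom symbol starting at column 32) are an assumption based on the V2000 layout and should be checked against the MDL documentation. All function names are hypothetical, and isotope, charge and stereo fields are ignored.

```python
def parse_counts_line(line):
    """Return (atom count, bond count) from a V2000 counts line."""
    return int(line[0:3]), int(line[3:6])


def parse_atom_line(line):
    """Return (x, y, z, symbol) from one atoms block line."""
    x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
    return x, y, z, line[31:34].strip()


def parse_bond_line(line):
    """Return (first atom, second atom, bond type); atom numbers are 1-based."""
    return int(line[0:3]), int(line[3:6]), int(line[6:9])


def parse_ctab(lines):
    """Parse a counts line followed by the atoms block and the bonds block."""
    n_atoms, n_bonds = parse_counts_line(lines[0])
    atom_rows = lines[1:1 + n_atoms]
    bond_rows = lines[1 + n_atoms:1 + n_atoms + n_bonds]
    return ([parse_atom_line(row) for row in atom_rows],
            [parse_bond_line(row) for row in bond_rows])


def atom_row(x, y, z, symbol):
    """Format one atoms block line with exact column widths (example data)."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}" + " 0" + "  0" * 11


# A made-up two-atom fragment (C=O), not the tyrosine file from the slides.
example = [
    f"{2:3d}{1:3d}" + "  0" * 8 + "999 V2000",
    atom_row(0.0, 0.0, 0.0, "C"),
    atom_row(1.2, 0.0, 0.0, "O"),
    f"{1:3d}{2:3d}{2:3d}" + "  0" * 4,
    "M  END",
]
example_atoms, example_bonds = parse_ctab(example)
print(example_atoms)
print(example_bonds)
```

Slicing by column rather than splitting on spaces matters because adjacent fields can run together, as in the "0999" of the counts line shown on the header block slide.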
Standard Connection Table Formats
different vendors have proprietary CT formats
many attempts to establish agreed standard formats
no real general success
different user communities have failed to coordinate efforts
some standards exist in restricted areas
SMILES and MDL CT formats widely used
most popular programs read/write several different formats
Standard Connection Table Formats
Standard Molecular Data (SMD) format
never gained wide acceptance
Protein Data Bank (PDB) format
Crystallographic Information File (CIF/mmCIF)
Molecular Information File (MIF)
developed from SMD and compatible with CIF
Chemical Exchange Format (CXF)
Chemical Abstracts Service
Standard Connection Table Formats
Chemical Markup Language (CML)
uses principles of the eXtensible Markup Language (XML) protocol for data exchange using the Internet
http://www.xml-cml.org
Chemical EXchange (CEX)
exchange protocol for TCP/IP networks developed collaboratively by several organizations
http://www.cgl.ucsf.edu/cex
Chemical MIME
incorporates several popular formats into protocols for exchange of molecular structures as e-mail attachments
http://www.ch.ic.ac.uk/chemime/
IUPAC Chemical Identifier (IChI)
Project being undertaken by International Union of Pure and Applied Chemistry
Intended to provide unique identifier for compounds, but with chemical intelligence
based on connection table
canonicalised (see lecture 3 on November 4)
compacted to short alphanumerical string
http://www.iupac.org/projects/2000/2000-025-1-800.html
see also Dr Nicklaus's lecture on Oct 16
Topological Graph Theory
branch of mathematics
particularly useful in chemical informatics
and in computer science generally
study of graphs, which consist of
a set of nodes
a set of edges joining pairs of nodes
Properties of graphs
graphs are only about connectivity
spatial position of nodes is irrelevant
length of edges is irrelevant
crossing edges are irrelevant
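That only connectivity matters can be shown with a small sketch. The function name graph_from_drawing and the two example drawings are made up for illustration. A graph is stored as a set of nodes plus a set of unordered node pairs, and the drawing coordinates are simply thrown away.

```python
def graph_from_drawing(points, lines):
    """Keep only the connectivity of a drawing.

    points maps node label -> (x, y); lines lists pairs of node labels.
    The coordinates are discarded: only which nodes are joined survives.
    """
    nodes = frozenset(points)
    edges = frozenset(frozenset(pair) for pair in lines)
    return nodes, edges


# Two drawings of a four-node ring: a unit square, and a larger "bow-tie"
# in which edges a-b and c-d cross and the edge lengths differ.
square = graph_from_drawing(
    {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)},
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")],
)
bow_tie = graph_from_drawing(
    {"a": (0, 0), "b": (3, 3), "c": (3, 0), "d": (0, 3)},
    [("b", "a"), ("c", "b"), ("d", "c"), ("a", "d")],
)
print(square == bow_tie)  # True
```

The two drawings look quite different on paper but compare equal as graphs, because position, edge length and edge crossings play no part in the stored data.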
Properties of Graphs
nodes and edges can be coloured to distinguish them
[Figure: structure diagram of tyrosine]
Structure Diagrams as Graphs
2D structure diagrams are very like topological graphs
atoms correspond to nodes
bonds correspond to edges
terminal hydrogen atoms are not normally shown as separate nodes (implicit hydrogens)
reduces number of nodes by ~50%
hydrogen count information used to colour neighbouring heavy atom
separate nodes sometimes used for special hydrogens
deuterium, tritium
hydrogen bonded to more than one other atom
hydrogens attached to stereocentres
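Colouring heavy atoms with their hydrogen counts can be sketched as below. This is a simplification under stated assumptions: the valence table covers only neutral C, N and O with their usual valencies, the function name hydrogen_counts is hypothetical, and the tyrosine atoms and bonds follow the numbering of the redundant connection table slide (0-based here).

```python
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}   # assumed: neutral atoms only


def hydrogen_counts(elements, bonds):
    """Return the implicit hydrogen count for each heavy atom.

    Each atom's count is its default valency minus the sum of the orders of
    the bonds it takes part in.
    """
    used = [0] * len(elements)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [DEFAULT_VALENCE[e] - u for e, u in zip(elements, used)]


ELEMENTS = ["O", "C", "O", "C", "N", "C", "C", "C", "C", "C", "C", "C", "O"]
BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1), (3, 5, 1), (5, 6, 1),
         (6, 7, 2), (7, 8, 1), (8, 9, 2), (9, 10, 1), (10, 11, 2),
         (11, 6, 1), (9, 12, 1)]

counts = hydrogen_counts(ELEMENTS, BONDS)
node_colours = list(zip(ELEMENTS, counts))   # e.g. ("N", 2) for the NH2 group
print(len(ELEMENTS), sum(counts))  # 13 11
```

Thirteen heavy-atom nodes carry the information for eleven hydrogens, so the graph has 13 nodes instead of 24, roughly the halving mentioned above.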
Advantages of using graphs
mathematical theory is well understood
graphs can be easily represented in computers
many useful algorithms are known
identical graphs correspond to identical molecules
different graphs correspond to different molecules
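As one example of a standard graph result applied to a structure, the number of independent rings follows directly from counts of nodes, edges and connected components (the cyclomatic number). The sketch below is illustrative only; the function names are hypothetical and the tyrosine edge list reuses the atom numbering of the redundant connection table slide, 0-based.

```python
def count_components(n_nodes, edges):
    """Count connected components with a simple depth-first search."""
    adjacency = {i: [] for i in range(n_nodes)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, components = set(), 0
    for start in range(n_nodes):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def ring_count(n_nodes, edges):
    """Cyclomatic number: edges - nodes + connected components."""
    return len(edges) - n_nodes + count_components(n_nodes, edges)


TYROSINE_EDGES = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
                  (7, 8), (8, 9), (9, 10), (10, 11), (11, 6), (9, 12)]
print(ring_count(13, TYROSINE_EDGES))  # 1
```

For tyrosine, 13 edges minus 13 nodes plus 1 component gives the single benzene ring.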
Disadvantages of using graphs
analogy between chemical structures and graphs is not perfect
identical graphs do not always correspond to identical molecules
different graphs do not always correspond to different molecules
realities of chemical structures cause problems
aromaticity
tautomerism
multi-centre bonds
stereochemistry
coordination compounds
inorganic compounds
macromolecules
polymers
incompletely-defined substances
many graph algorithms are inherently slow
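To give a feel for why some graph algorithms are slow, here is a deliberately naive test of whether two graphs are identical apart from node numbering. It is a brute-force sketch chosen for illustration, not a method described in this lecture: the function name are_isomorphic is hypothetical, and node colours (elements) and bond orders are ignored.

```python
from itertools import permutations


def are_isomorphic(n_nodes, edges_a, edges_b):
    """Brute-force test: try every renumbering of graph A's nodes."""
    source = {frozenset(e) for e in edges_a}
    target = {frozenset(e) for e in edges_b}
    if len(source) != len(target):
        return False
    for mapping in permutations(range(n_nodes)):
        renumbered = {frozenset(mapping[node] for node in edge)
                      for edge in source}
        if renumbered == target:
            return True
    return False


chain = [(0, 1), (1, 2), (2, 3)]          # four nodes in a row
same_chain = [(1, 0), (0, 3), (3, 2)]     # the same chain, numbered differently
star = [(0, 1), (0, 2), (0, 3)]           # one central node with three arms

print(are_isomorphic(4, chain, same_chain))  # True
print(are_isomorphic(4, chain, star))        # False
```

With 4 nodes there are only 24 renumberings to try, but the number grows as n factorial: for the 13 heavy atoms of tyrosine it is already 13! = 6,227,020,800, which is why practical systems need cleverer methods.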
Lecture 1: Conclusions
There are lots of ways of storing a chemical structure in a computer
including different amounts of information
Most important ones are
line notations (e.g. SMILES)
connection tables (e.g. MDL Molfile)
nomenclature
Structure diagrams used for input/output
Chemical structures can be regarded as topological
graphs
Lecture 2: Topics to be Covered

Special problems of structure representation


aromaticity and tautomerism
multi-centre bonds
stereochemistry and coordination compounds
inorganic compounds
macromolecules and polymers
incompletely-defined substances
Markush structures
Further reading
A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003
J. Gasteiger and T. Engel, Chemoinformatics: a Textbook, Wiley-VCH, 2003
J. Gasteiger (ed.), Handbook of Chemoinformatics: From Data to Knowledge, Wiley-VCH, 2003
Vol 1, Chapter II (Representation of chemical compounds)
