Package rdkit :: Package Chem :: Package fmcs :: Module fmcs
[hide private]
[frames] | no frames]

Source Code for Module rdkit.Chem.fmcs.fmcs

   1  #!/usr/bin/env python 
   2   
   3  # This work was funded by Roche and generously donated to the free 
   4  # and open source cheminformatics community. 
   5   
   6  ## Copyright (c) 2012 Andrew Dalke Scientific AB 
   7  ## Andrew Dalke <dalke@dalkescientific.com> 
   8  ## 
   9  ## All rights reserved. 
  10  ## 
  11  ## Redistribution and use in source and binary forms, with or without 
  12  ## modification, are permitted provided that the following conditions are 
  13  ## met: 
  14  ## 
  15  ##   * Redistributions of source code must retain the above copyright 
  16  ##     notice, this list of conditions and the following disclaimer. 
  17  ## 
  18  ##   * Redistributions in binary form must reproduce the above copyright 
  19  ##     notice, this list of conditions and the following disclaimer in 
  20  ##     the documentation and/or other materials provided with the 
  21  ##     distribution. 
  22  ## 
  23  ## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  24  ## "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  25  ## LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
  26  ## A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 
  27  ## HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 
  28  ## SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  29  ## LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 
  30  ## DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
  31  ## THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  32  ## (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
  33  ## OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
  34  """FMCS - Find Maximum Common Substructure 
  35   
  36  This software finds the maximum common substructure of a set of 
  37  structures and reports it as a SMARTS strings. 
  38   
  39  This implements what I think is a new algorithm for the MCS problem. 
  40  The core description is: 
  41   
  42    best_substructure = None 
  43    pick one structure as the query, and other as the targets 
  44    for each substructure in the query graph: 
  45      convert it to a SMARTS string based on the desired match properties 
  46      if the SMARTS pattern exists in all of the targets: 
  47         then this is a common substructure 
  48         keep track of the maximum such common structure, 
  49   
  50  The SMARTS string depends on the desired match properties. For 
  51  example, if ring atoms are only allowed to match ring atoms then an 
  52  aliphatic ring carbon in the query is converted to the SMARTS "[C;R]", 
  53  and the double-bond ring bond converted to "=;@" while the respectice 
  54  chain-only version are "[C;!R]" and "=;!@". 
  55   
  56  The algorithm I outlined earlier will usually take a long time. There 
  57  are several ways to speed it up. 
  58   
  59  == Bond elimination == 
  60   
  61  As the first step, remove bonds which obviously cannot be part of the 
  62  MCS. 
  63   
  64  This requires atom and bond type information, which I store as SMARTS 
  65  patterns. A bond can only be in the MCS if its canonical bond type is 
  66  present in all of the structures. A bond type is string made of the 
  67  SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the 
  68  other atom. The canonical bond type is the lexographically smaller of 
  69  the two possible bond types for a bond. 
  70   
  71  The atom and bond SMARTS depend on the type comparison used. 
  72   
  73  The "ring-matches-ring-only" option adds an "@" or "!@" to the bond 
  74  SMARTS, so that the canonical bondtype for "C-C" becomes [#6]-@[#6] or 
  75  [#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms 
  76  are compared by element and bonds are compared by bondtype. (This 
  77  option does not add "R" or "!R" to the atom SMARTS because there 
  78  should be a single bond in the MCS of c1ccccc1O and CO.) 
  79   
  80  The result of all of this atom and bond typing is a "TypedMolecule" 
  81  for each input structure. 
  82   
  83  I then find which canonical bondtypes are present in all of the 
  84  structures. I convert each TypedMolecule into a 
  85  FragmentedTypedMolecule which has the same atom information but only 
  86  those bonds whose bondtypes are in all of the structures. This can 
  87  break a structure into multiple, disconnected fragments, hence the 
  88  name. 
  89   
  90  (BTW, I would like to use the fragmented molecules as the targets 
  91  because I think the SMARTS match would go faster, but the RDKit SMARTS 
  92  matcher doesn't like them. I think it's because the new molecule 
  93  hasn't been sanitized and the underlying data structure the ring 
  94  information doesn't exist. Instead, I use the input structures for the 
  95  SMARTS match.) 
  96   
  97  == Use the structure with the smallest largest fragment as the query == 
  98  == and sort the targets by the smallest largest fragment             == 
  99   
 100  I pick one of the FragmentedTypedMolecule instances as the source of 
 101  substructure enumeration. Which one? 
 102   
 103  My heuristic is to use the one with the smallest largest fragment. 
 104  Hopefully it produces the least number of subgraphs, but that's also 
 105  related to the number of rings, so a large linear graph will product 
 106  fewer subgraphs than a small fused ring system. I don't know how to 
 107  quantify that. 
 108   
 109  For each of the fragmented structures, I find the number of atoms in 
 110  the fragment with the most atoms, and I find the number of bonds in 
 111  the fragment with the most bonds. These might not be the same 
 112  fragment. 
 113   
 114  I sort the input structures by the number of bonds in the largest 
 115  fragment, with ties broken first on the number of atoms, and then on 
 116  the input order. The smallest such structure is the query structure, 
 117  and the remaining are the targets. 
 118   
 119  == Use a breadth-first search and a priority queue to    == 
 120  == enumerate the fragment subgraphs                      == 
 121   
 122  I extract each of the fragments from the FragmentedTypedMolecule into 
 123  a TypedFragment, which I use to make an EnumerationMolecule. An 
 124  enumeration molecule contains a pair of directed edges for each atom, 
 125  which simplifies the enumeration algorithm. 
 126   
 127  The enumeration algorithm is based around growing a seed. A seed 
 128  contains the current subgraph atoms and bonds as well as an exclusion 
 129  set of bonds which cannot be used for future grown. The initial seed 
 130  is the first bond in the fragment, which may potentially grow to use 
 131  the entire fragment. The second seed is the second bond in the 
 132  fragment, which is excluded from using the first bond in future 
 133  growth. The third seed starts from the third bond, which may not use 
 134  the first or second bonds during growth, and so on. 
 135   
 136   
 137  A seed can grow along bonds connected to an atom in the seed but which 
 138  aren't already in the seed and aren't in the set of excluded bonds for 
 139  the seed. If there are no such bonds then subgraph enumeration ends 
 140  for this fragment. Given N bonds there are 2**N-1 possible ways to 
 141  grow, which is just the powerset of the available bonds, excluding the 
 142  no-growth case. 
 143   
 144  This breadth-first growth takes into account all possibilties of using 
 145  the available N bonds so all of those bonds are added to the exclusion 
 146  set of the newly expanded subgraphs. 
 147   
 148  For performance reasons, the bonds used for growth are separated into 
 149  'internal' bonds, which connect two atoms already in the subgraph, and 
 150  'external' bonds, which lead outwards to an atom not already in the 
 151  subgraph. 
 152   
 153  Each seed growth can add from 0 to N new atoms and bonds. The goal is 
 154  to maximize the subgraph size so the seeds are stored in a priority 
 155  queue, ranked so the seed with the most bonds is processed first. This 
 156  turns the enumeration into something more like a depth-first search. 
 157   
 158   
 159  == Prune seeds which aren't found in all of the structures == 
 160   
 161  At each stage of seed growth I check that the new seed exists in all 
 162  of the original structures. (Well, all except the one which I 
 163  enumerate over in the first place; by definition that one will match.) 
 164  If it doesn't match then there's no reason to include this seed or any 
 165  larger seeds made from it. 
 166   
 167  The check is easy; I turn the subgraph into its corresponding SMARTS 
 168  string and use RDKit's normal SMARTS matcher to test for a match. 
 169   
 170  There are three ways to generate a SMARTS string: 1) arbitrary, 2) 
 171  canonical, 3) hybrid. 
 172   
 173  I have not tested #1. During most of the development I assumed that 
 174  SMARTS matches across a few hundred structures would be slow, so that 
 175  the best solution is to generate a *canonical* SMARTS and cache the 
 176  match information. 
 177   
 178  Well, it turns out that my canonical SMARTS match code takes up most 
 179  of the FMCS run-time. If I drop the canonicalization step then the 
 180  code averages about 5-10% faster. This isn't the same as #1 - I still 
 181  do the initial atom assignment based on its neighborhood, which is 
 182  like a circular fingerprint of size 2 and *usually* gives a consistent 
 183  SMARTS pattern, which I can then cache. 
 184   
 185  However, there are times when the non-canonical SMARTS code is slower. 
 186  Obviously one is if there are a lot of structures, and another if is 
 187  there is a lot of symmetry. I'm still working on characterizing this. 
 188   
 189   
 190  == Maximize atoms? or bonds? == 
 191   
 192  The above algorithm enumerates all subgraphs of the query and 
 193  identifies those subgraphs which are common to all input structures. 
 194   
 195  It's trivial then to keep track of the current "best" subgraph, which 
 196  can defined as having the subgraph with the most atoms, or the most 
 197  bonds. Both of those options are implemented. 
 198   
 199  It would not be hard to keep track of all other subgraphs which are 
 200  the same size. 
 201   
 202  == --complete-ring-only implementation == 
 203   
 204  The "complete ring only" option is implemented by first enabling the 
 205  "ring-matches-ring-only" option, as otherwise it doesn't make sense. 
 206   
 207  Second, in order to be a "best" subgraph, all bonds in the subgraph 
 208  which are ring bonds in the original molecule must also be in a ring 
 209  in the subgraph. This is handled as a post-processing step. 
 210   
 211  (Note: some possible optimizations, like removing ring bonds from 
 212  structure fragments which are not in a ring, are not yet implemented.) 
 213   
 214   
 215  == Prune seeds which have no potential for growing large enough  == 
 216   
 217  Given a seed, its set of edges available for growth, and the set of 
 218  excluded bonds, figure out the maximum possible growth for the seed. 
 219  If this maximum possible is less than the current best subgraph then 
 220  prune. 
 221   
 222  This requires a graph search, currently done in Python, which is a bit 
 223  expensive. To speed things up, I precompute some edge information. 
 224  That is, if I know that a given bond is a chain bond (not in a ring) 
 225  then I can calculate the maximum number of atoms and bonds for seed 
 226  growth along that bond, in either direction. However, precomputation 
 227  doesn't take into account the excluded bonds, so after a while the 
 228  predicted value is too high. 
 229   
 230  Again, I'm still working on characterizing this, and an implementation 
 231  in C++ would have different tradeoffs. 
 232  """ 
 233   
 234  __version__ = "1.1" 
 235  __version_info = (1, 1, 0) 
 236   
 237  import sys 
 238   
 239  try: 
 240    from rdkit import Chem 
 241    from rdkit.six import next 
 242    from rdkit.six.moves import range 
 243  except ImportError: 
 244    sys.stderr.write("Please install RDKit from http://www.rdkit.org/\n") 
 245    raise 
 246   
 247  import copy 
 248  import itertools 
 249  import re 
 250  import weakref 
 251  from heapq import heappush, heappop, heapify 
 252  from itertools import chain, combinations, product 
 253  import collections 
 254  from collections import defaultdict 
 255  import time 
 256   
 257  ### A place to set global options 
 258  # (Is this really useful?) 
 259   
 260   
261 -class Default(object):
262 timeout = None 263 timeoutString = "none" 264 maximize = "bonds" 265 atomCompare = "elements" 266 bondCompare = "bondtypes" 267 matchValences = False 268 ringMatchesRingOnly = False 269 completeRingsOnly = False
270 271 ####### Atom type and bond type information ##### 272 273 # Lookup up the atomic symbol given its atomic number 274 _get_symbol = Chem.GetPeriodicTable().GetElementSymbol 275 276 277 # Lookup table to get the SMARTS for an atom given its element 278 # This uses the '#<n>' notation for atoms which may be aromatic. 279 # Eg, '#6' for carbon, instead of 'C,c'. 280 # Use the standard element symbol for atoms which can't be aromatic.
281 -class AtomSmartsNoAromaticity(dict):
282
283 - def __missing__(self, eleno):
284 value = _get_symbol(eleno) 285 self[eleno] = value 286 return value
287 288 289 _atom_smarts_no_aromaticity = AtomSmartsNoAromaticity() 290 # Initialize to the ones which need special treatment 291 # RDKit supports b, c, n, o, p, s, se, and te. 292 # Daylight and OpenSMILES don't 'te' but do support 'as' 293 # I don't want 'H'-is-hydrogen to get confused with 'H'-as-has-hydrogens. 294 # For better portability, I use the '#' notation for all of them. 295 for eleno in (1, 5, 6, 7, 8, 15, 16, 33, 34, 52): 296 _atom_smarts_no_aromaticity[eleno] = "#" + str(eleno) 297 assert _atom_smarts_no_aromaticity[6] == "#6" 298 assert _atom_smarts_no_aromaticity[2] == "He" 299 300 301 # Match any atom
302 -def atom_typer_any(atoms):
303 return ["*"] * len(atoms)
304 305 306 # Match atom by atomic element; usually by symbol
307 -def atom_typer_elements(atoms):
308 return [_atom_smarts_no_aromaticity[atom.GetAtomicNum()] for atom in atoms]
309 310 # Match atom by isotope number. This depends on the RDKit version 311 if hasattr(Chem.Atom, "GetIsotope"): 312
313 - def atom_typer_isotopes(atoms):
314 return ["%d*" % atom.GetIsotope() for atom in atoms]
315 else: 316 # Before mid-2012, RDKit only supported atomic mass, not isotope. 317 # [12*] matches atoms whose mass is 12.000 +/- 0.5/1000 318 # This generally works, excepting elements which have no 319 # Tc, Pm, Po, At, Rn, Fr, Ra, Ac, Np, Pu, Am, Cm, 320 # Bk, Cf, Es, Fm, Md, No, Lr 321 # natural abundance; [98Tc] is the same as [Tc], etc. 322 # This leads to problems because I don't have a way to 323 # define the SMARTS for "no defined isotope." In SMILES/SMARTS 324 # that's supposed to be through isotope 0. 325 # The best I can do is force the non-integer masses to 0 and 326 # use isotope 0 to match them. That's clumsy, but it gives 327 # the expected result.
328 - def atom_typer_isotopes(atoms):
329 atom_smarts_types = [] 330 for atom in atoms: 331 mass = atom.GetMass() 332 int_mass = int(round(mass * 1000)) 333 if int_mass % 1000 == 0: 334 # This is close enough that RDKit's match will work 335 atom_smarts = "%d*" % (int_mass // 1000) 336 else: 337 # Probably in natural abundance. In any case, 338 # there's no SMARTS for this pattern, so force 339 # everything to 0. 340 atom.SetMass(0.0) # XX warning; in-place modification of the input! 341 atom_smarts = "0*" 342 atom_smarts_types.append(atom_smarts) 343 return atom_smarts_types
344 345 346 # Match any bond
347 -def bond_typer_any(bonds):
348 return ["~"] * len(bonds)
349 350 # Match bonds based on bond type, including aromaticity 351 352
353 -def bond_typer_bondtypes(bonds):
354 # Aromaticity matches are important 355 bond_smarts_types = [] 356 for bond in bonds: 357 bond_term = bond.GetSmarts() 358 if not bond_term: 359 # The SMILES "", means "single or aromatic" as SMARTS. 360 # Figure out which one. 361 if bond.GetIsAromatic(): 362 bond_term = ':' 363 else: 364 bond_term = '-' 365 bond_smarts_types.append(bond_term) 366 367 return bond_smarts_types
368 369 370 atom_typers = { 371 "any": atom_typer_any, 372 "elements": atom_typer_elements, 373 "isotopes": atom_typer_isotopes, 374 } 375 376 bond_typers = { 377 "any": bond_typer_any, 378 "bondtypes": bond_typer_bondtypes, 379 } 380 default_atom_typer = atom_typers[Default.atomCompare] 381 default_bond_typer = bond_typers[Default.bondCompare] 382 383 ####### Support code for handling user-defined atom classes 384 385 # User-defined atom classes are handled in a round-about fashion. The 386 # fmcs code doesn't know atom classes, but it can handle isotopes. 387 # It's easy to label the atom isotopes and do an "isotopes" atom 388 # comparison. The hard part is if you want to get the match 389 # information back using the original structure data, without the 390 # tweaked isotopes. 391 392 # My solution uses "save_isotopes" and "save_atom_classes" to store 393 # the old isotope information and the atom class assignments (both 394 # ordered by atom position), associated with the molecule. 395 396 # Use "restore_isotopes()" to restore the molecule's isotope values 397 # from the saved values. Ise "get_selected_atom_classes" to get the 398 # atom classes used by specified atom indices. 399 400 if hasattr(Chem.Atom, "GetIsotope"): 401
402 - def get_isotopes(mol):
403 return [atom.GetIsotope() for atom in mol.GetAtoms()]
404
405 - def set_isotopes(mol, isotopes):
406 if mol.GetNumAtoms() != len(isotopes): 407 raise ValueError("Mismatch between the number of atoms and the number of isotopes") 408 for atom, isotope in zip(mol.GetAtoms(), isotopes): 409 atom.SetIsotope(isotope)
410 411 else: 412 # Backards compatibility. Before mid-2012, RDKit only supported atomic mass, not isotope.
413 - def get_isotopes(mol):
414 return [atom.GetMass() for atom in mol.GetAtoms()]
415
416 - def set_isotopes(mol, isotopes):
417 if mol.GetNumAtoms() != len(isotopes): 418 raise ValueError("Mismatch between the number of atoms and the number of isotopes") 419 for atom, isotope in zip(mol.GetAtoms(), isotopes): 420 atom.SetMass(isotope)
421 422 423 _isotope_dict = weakref.WeakKeyDictionary() 424 _atom_class_dict = weakref.WeakKeyDictionary() 425 426
427 -def save_isotopes(mol, isotopes):
428 _isotope_dict[mol] = isotopes
429 430
431 -def save_atom_classes(mol, atom_classes):
432 _atom_class_dict[mol] = atom_classes
433 434
435 -def get_selected_atom_classes(mol, atom_indices):
436 atom_classes = _atom_class_dict.get(mol, None) 437 if atom_classes is None: 438 return None 439 return [atom_classes[index] for index in atom_indices]
440 441
442 -def restore_isotopes(mol):
443 try: 444 isotopes = _isotope_dict[mol] 445 except KeyError: 446 raise ValueError("no isotopes to restore") 447 set_isotopes(mol, isotopes)
448 449
450 -def assign_isotopes_from_class_tag(mol, atom_class_tag):
451 try: 452 atom_classes = mol.GetProp(atom_class_tag) 453 except KeyError: 454 raise ValueError("Missing atom class tag %r" % (atom_class_tag, )) 455 fields = atom_classes.split() 456 if len(fields) != mol.GetNumAtoms(): 457 raise ValueError( 458 "Mismatch between the number of atoms (#%d) and the number of atom classes (%d)" % ( 459 mol.GetNumAtoms(), len(fields))) 460 new_isotopes = [] 461 for field in fields: 462 if not field.isdigit(): 463 raise ValueError("Atom class %r from tag %r must be a number" % (field, atom_class_tag)) 464 isotope = int(field) 465 if not (1 <= isotope <= 10000): 466 raise ValueError("Atom class %r from tag %r must be in the range 1 to 10000" % 467 (field, atom_class_tag)) 468 new_isotopes.append(isotope) 469 470 save_isotopes(mol, get_isotopes(mol)) 471 save_atom_classes(mol, new_isotopes) 472 set_isotopes(mol, new_isotopes)
473 474 ### Different ways of storing atom/bond information about the input structures ### 475 476 # A TypedMolecule contains the input molecule, unmodified, along with 477 # atom type, and bond type information; both as SMARTS fragments. The 478 # "canonical_bondtypes" uniquely charactizes a bond; two bonds will 479 # match if and only if their canonical bondtypes match. (Meaning: 480 # bonds must be of equivalent type, and must go between atoms of 481 # equivalent types.) 482 483
484 -class TypedMolecule(object):
485
486 - def __init__(self, rdmol, rdmol_atoms, rdmol_bonds, atom_smarts_types, bond_smarts_types, 487 canonical_bondtypes):
488 self.rdmol = rdmol 489 490 # These exist as a performance hack. It's faster to store the 491 # atoms and bond as a Python list than to do GetAtoms() and 492 # GetBonds() again. The stage 2 TypedMolecule does not use 493 # these. 494 495 self.rdmol_atoms = rdmol_atoms 496 self.rdmol_bonds = rdmol_bonds 497 498 # List of SMARTS to use for each atom and bond 499 self.atom_smarts_types = atom_smarts_types 500 self.bond_smarts_types = bond_smarts_types 501 502 # List of canonical bondtype strings 503 self.canonical_bondtypes = canonical_bondtypes
504 505 # Question: Do I also want the original_rdmol_indices? With 506 # the normal SMARTS I can always do the substructure match 507 # again to find the indices, but perhaps this will be needed 508 # when atom class patterns are fully implemented. 509 510 # Start with a set of TypedMolecules. Find the canonical_bondtypes 511 # which only exist in all them, then fragment each TypedMolecule to 512 # produce a FragmentedTypedMolecule containing the same atom 513 # information but containing only bonds with those 514 # canonical_bondtypes. 515 516
517 -class FragmentedTypedMolecule(object):
518
519 - def __init__(self, rdmol, rdmol_atoms, orig_atoms, orig_bonds, atom_smarts_types, 520 bond_smarts_types, canonical_bondtypes):
521 self.rdmol = rdmol 522 self.rdmol_atoms = rdmol_atoms 523 self.orig_atoms = orig_atoms 524 self.orig_bonds = orig_bonds 525 # List of SMARTS to use for each atom and bond 526 self.atom_smarts_types = atom_smarts_types 527 self.bond_smarts_types = bond_smarts_types 528 529 # List of canonical bondtype strings 530 self.canonical_bondtypes = canonical_bondtypes
531 532 # A FragmentedTypedMolecule can contain multiple fragments. Once I've 533 # picked the FragmentedTypedMolecule to use for enumeration, I extract 534 # each of the fragments as the basis for an EnumerationMolecule. 535 536
537 -class TypedFragment(object):
538
539 - def __init__(self, rdmol, orig_atoms, orig_bonds, atom_smarts_types, bond_smarts_types, 540 canonical_bondtypes):
541 self.rdmol = rdmol 542 self.orig_atoms = orig_atoms 543 self.orig_bonds = orig_bonds 544 self.atom_smarts_types = atom_smarts_types 545 self.bond_smarts_types = bond_smarts_types 546 self.canonical_bondtypes = canonical_bondtypes
547 548 # The two possible bond types are 549 # atom1_smarts + bond smarts + atom2_smarts 550 # atom2_smarts + bond smarts + atom1_smarts 551 # The canonical bond type is the lexically smaller of these two. 552 553
554 -def get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types):
555 canonical_bondtypes = [] 556 for bond, bond_smarts in zip(bonds, bond_smarts_types): 557 atom1_smarts = atom_smarts_types[bond.GetBeginAtomIdx()] 558 atom2_smarts = atom_smarts_types[bond.GetEndAtomIdx()] 559 if atom1_smarts > atom2_smarts: 560 atom1_smarts, atom2_smarts = atom2_smarts, atom1_smarts 561 canonical_bondtypes.append("[%s]%s[%s]" % (atom1_smarts, bond_smarts, atom2_smarts)) 562 return canonical_bondtypes
563 564 # Create a TypedMolecule using the element-based typing scheme 565 566 # TODO: refactor this. It doesn't seem right to pass boolean flags. 567 568
569 -def get_typed_molecule(rdmol, atom_typer, bond_typer, matchValences=Default.matchValences, 570 ringMatchesRingOnly=Default.ringMatchesRingOnly):
571 atoms = list(rdmol.GetAtoms()) 572 atom_smarts_types = atom_typer(atoms) 573 574 # Get the valence information, if requested 575 if matchValences: 576 new_atom_smarts_types = [] 577 for (atom, atom_smarts_type) in zip(atoms, atom_smarts_types): 578 valence = atom.GetImplicitValence() + atom.GetExplicitValence() 579 valence_str = "v%d" % valence 580 if "," in atom_smarts_type: 581 atom_smarts_type += ";" + valence_str 582 else: 583 atom_smarts_type += valence_str 584 new_atom_smarts_types.append(atom_smarts_type) 585 atom_smarts_types = new_atom_smarts_types 586 587 # Store and reuse the bond information because I use it twice. 588 # In a performance test, the times went from 2.0 to 1.4 seconds by doing this. 589 bonds = list(rdmol.GetBonds()) 590 bond_smarts_types = bond_typer(bonds) 591 if ringMatchesRingOnly: 592 new_bond_smarts_types = [] 593 for bond, bond_smarts in zip(bonds, bond_smarts_types): 594 if bond.IsInRing(): 595 if bond_smarts == ":": 596 # No need to do anything; it has to be in a ring 597 pass 598 else: 599 if "," in bond_smarts: 600 bond_smarts += ";@" 601 else: 602 bond_smarts += "@" 603 else: 604 if "," in bond_smarts: 605 bond_smarts += ";!@" 606 else: 607 bond_smarts += "!@" 608 609 new_bond_smarts_types.append(bond_smarts) 610 bond_smarts_types = new_bond_smarts_types 611 612 canonical_bondtypes = get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types) 613 return TypedMolecule(rdmol, atoms, bonds, atom_smarts_types, bond_smarts_types, 614 canonical_bondtypes)
615 616 # Create a TypedMolecule using the user-defined atom classes (Not implemented!) 617 618
619 -def get_specified_types(rdmol, atom_types, ringMatchesRingOnly):
620 raise NotImplementedError("not tested!") 621 # Make a copy because I will do some destructive edits 622 rdmol = copy.copy(rdmol) 623 624 atom_smarts_types = [] 625 atoms = list(mol.GetAtoms()) 626 for atom, atom_type in zip(atoms, atom_types): 627 atom.SetAtomicNum(0) 628 atom.SetMass(atom_type) 629 atom_term = "%d*" % (atom_type, ) 630 if ringMatchesRingOnly: 631 if atom.IsInRing(): 632 atom_term += "R" 633 else: 634 atom_term += "!R" 635 atom_smarts_types.append('[' + atom_term + ']') 636 637 bonds = list(rdmol.GetBonds()) 638 bond_smarts_types = get_bond_smarts_types(mol, bonds, ringMatchesRingOnly) 639 canonical_bondtypes = get_canonical_bondtypes(mol, bonds, atom_smarts_types, bond_smarts_types) 640 641 return TypedMolecule(mol, atoms, bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)
642 643
644 -def convert_input_to_typed_molecules(mols, atom_typer, bond_typer, matchValences, 645 ringMatchesRingOnly):
646 typed_mols = [] 647 for molno, rdmol in enumerate(mols): 648 typed_mol = get_typed_molecule(rdmol, atom_typer, bond_typer, matchValences=matchValences, 649 ringMatchesRingOnly=ringMatchesRingOnly) 650 typed_mols.append(typed_mol) 651 652 return typed_mols
653 654
655 -def _check_atom_classes(molno, num_atoms, atom_classes):
656 if num_atoms != len(atom_classes): 657 raise ValueError("mols[%d]: len(atom_classes) must be the same as the number of atoms" % 658 (molno, )) 659 for atom_class in atom_classes: 660 if not isinstance(atom_class, int): 661 raise ValueError("mols[%d]: atom_class elements must be integers" % (molno, )) 662 if not (1 <= atom_class < 1000): 663 raise ValueError("mols[%d]: atom_class elements must be in the range 1 <= value < 1000" % 664 (molno, ))
665 666 ############################################# 667 668 # This section deals with finding the canonical bondtype counts and 669 # making new TypedMolecule instances where the atoms contain only the 670 # bond types which are in all of the structures. 671 672 # In the future I would like to keep track of the bond types which are 673 # in the current subgraph. If any subgraph bond type count is ever 674 # larger than the maximum counts computed across the whole set, then 675 # prune. But so far I don't have a test set which drives the need for 676 # that. 677 678 679 # Return a dictionary mapping iterator item to occurence count
680 -def get_counts(it):
681 d = defaultdict(int) 682 for item in it: 683 d[item] += 1 684 return dict(d)
685 686 687 # Merge two count dictionaries, returning the smallest count for any 688 # entry which is in both.
689 -def intersect_counts(counts1, counts2):
690 d = {} 691 for k, v1 in counts1.iteritems(): 692 if k in counts2: 693 v = min(v1, counts2[k]) 694 d[k] = v 695 return d
696 697 698 # Figure out which canonical bonds SMARTS occur in every molecule
699 -def get_canonical_bondtype_counts(typed_mols):
700 overall_counts = defaultdict(list) 701 for typed_mol in typed_mols: 702 bondtype_counts = get_counts(typed_mol.canonical_bondtypes) 703 for k, v in bondtype_counts.items(): 704 overall_counts[k].append(v) 705 return overall_counts
706 707 # If I know which bondtypes exist in all of the structures, I can 708 # remove all bonds which aren't in all structures. RDKit's Molecule 709 # class doesn't let me edit in-place, so I end up making a new one 710 # which doesn't have unsupported bond types. 711 712
713 -def remove_unknown_bondtypes(typed_mol, supported_canonical_bondtypes):
714 emol = Chem.EditableMol(Chem.Mol()) 715 716 # Copy all of the atoms, even those which don't have any bonds. 717 for atom in typed_mol.rdmol_atoms: 718 emol.AddAtom(atom) 719 720 # Copy over all the bonds with a supported bond type. 721 # Make sure to update the bond SMARTS and canonical bondtype lists. 722 orig_bonds = [] 723 new_bond_smarts_types = [] 724 new_canonical_bondtypes = [] 725 for bond, bond_smarts, canonical_bondtype in zip( 726 typed_mol.rdmol_bonds, typed_mol.bond_smarts_types, typed_mol.canonical_bondtypes): 727 if canonical_bondtype in supported_canonical_bondtypes: 728 orig_bonds.append(bond) 729 new_bond_smarts_types.append(bond_smarts) 730 new_canonical_bondtypes.append(canonical_bondtype) 731 emol.AddBond(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType()) 732 733 new_mol = emol.GetMol() 734 return FragmentedTypedMolecule(new_mol, list(new_mol.GetAtoms()), typed_mol.rdmol_atoms, 735 orig_bonds, typed_mol.atom_smarts_types, new_bond_smarts_types, 736 new_canonical_bondtypes)
737 738 # The molecule at this point has been (potentially) fragmented by 739 # removing bonds with unsupported bond types. The MCS cannot contain 740 # more atoms than the fragment of a given molecule with the most 741 # atoms, and the same for bonds. Find those upper limits. Note that 742 # the fragment with the most atoms is not necessarily the one with the 743 # most bonds. 744 745
746 -def find_upper_fragment_size_limits(rdmol, atoms):
747 max_num_atoms = max_twice_num_bonds = 0 748 for atom_indices in Chem.GetMolFrags(rdmol): 749 num_atoms = len(atom_indices) 750 if num_atoms > max_num_atoms: 751 max_num_atoms = num_atoms 752 753 # Every bond is connected to two atoms, so this is the 754 # simplest way to count the number of bonds in the fragment. 755 twice_num_bonds = 0 756 for atom_index in atom_indices: 757 # XXX Why is there no 'atom.GetNumBonds()'? 758 twice_num_bonds += sum(1 for bond in atoms[atom_index].GetBonds()) 759 if twice_num_bonds > max_twice_num_bonds: 760 max_twice_num_bonds = twice_num_bonds 761 762 return max_num_atoms, max_twice_num_bonds // 2
763 764 ####### Convert the selected TypedMolecule into an EnumerationMolecule 765 766 # I convert one of the typed fragment molecules (specifically, the one 767 # with the smallest largest fragment score) into a list of 768 # EnumerationMolecule instances. Each fragment from the typed molecule 769 # gets turned into an EnumerationMolecule. 770 771 # An EnumerationMolecule contains the data I need to enumerate all of 772 # its subgraphs. 773 774 # An EnumerationMolecule contains a list of 'Atom's and list of 'Bond's. 775 # Atom and Bond indices are offsets into those respective lists. 776 # An Atom has a list of "bond_indices", which are offsets into the bonds. 777 # A Bond has a 2-element list of "atom_indices", which are offsets into the atoms. 778 779 EnumerationMolecule = collections.namedtuple("Molecule", "rdmol atoms bonds directed_edges") 780 Atom = collections.namedtuple("Atom", "real_atom atom_smarts bond_indices is_in_ring") 781 Bond = collections.namedtuple("Bond", 782 "real_bond bond_smarts canonical_bondtype atom_indices is_in_ring") 783 784 # A Bond is linked to by two 'DirectedEdge's; one for each direction. 785 # The DirectedEdge.bond_index references the actual RDKit bond instance. 786 # 'end_atom_index' is the index of the destination atom of the directed edge 787 # This is used in a 'directed_edges' dictionary so that 788 # [edge.end_atom_index for edge in directed_edges[atom_index]] 789 # is the list of all atom indices connected to 'atom_index' 790 DirectedEdge = collections.namedtuple("DirectedEdge", "bond_index end_atom_index") 791 792 # A Subgraph is a list of atom and bond indices in an EnumerationMolecule 793 Subgraph = collections.namedtuple("Subgraph", "atom_indices bond_indices") 794 795
796 -def get_typed_fragment(typed_mol, atom_indices):
797 rdmol = typed_mol.rdmol 798 rdmol_atoms = typed_mol.rdmol_atoms 799 800 # I need to make a new RDKit Molecule containing only the fragment. 801 # XXX Why is that? Do I use the molecule for more than the number of atoms and bonds? 802 803 # Copy over the atoms 804 emol = Chem.EditableMol(Chem.Mol()) 805 atom_smarts_types = [] 806 atom_map = {} 807 for i, atom_index in enumerate(atom_indices): 808 atom = rdmol_atoms[atom_index] 809 emol.AddAtom(atom) 810 atom_smarts_types.append(typed_mol.atom_smarts_types[atom_index]) 811 atom_map[atom_index] = i 812 813 # Copy over the bonds. 814 orig_bonds = [] 815 bond_smarts_types = [] 816 new_canonical_bondtypes = [] 817 for bond, orig_bond, bond_smarts, canonical_bondtype in zip( 818 rdmol.GetBonds(), typed_mol.orig_bonds, typed_mol.bond_smarts_types, 819 typed_mol.canonical_bondtypes): 820 begin_atom_idx = bond.GetBeginAtomIdx() 821 end_atom_idx = bond.GetEndAtomIdx() 822 count = (begin_atom_idx in atom_map) + (end_atom_idx in atom_map) 823 # Double check that I have a proper fragment 824 if count == 2: 825 bond_smarts_types.append(bond_smarts) 826 new_canonical_bondtypes.append(canonical_bondtype) 827 emol.AddBond(atom_map[begin_atom_idx], atom_map[end_atom_idx], bond.GetBondType()) 828 orig_bonds.append(orig_bond) 829 elif count == 1: 830 raise AssertionError("connected/disconnected atoms?") 831 return TypedFragment(emol.GetMol(), 832 [typed_mol.orig_atoms[atom_index] for atom_index in atom_indices], 833 orig_bonds, atom_smarts_types, bond_smarts_types, new_canonical_bondtypes)
834 835
836 -def fragmented_mol_to_enumeration_mols(typed_mol, minNumAtoms=2):
837 if minNumAtoms < 2: 838 raise ValueError("minNumAtoms must be at least 2") 839 840 fragments = [] 841 for atom_indices in Chem.GetMolFrags(typed_mol.rdmol): 842 # No need to even look at fragments which are too small. 843 if len(atom_indices) < minNumAtoms: 844 continue 845 846 # Convert a fragment from the TypedMolecule into a new 847 # TypedMolecule containing only that fragment. 848 849 # You might think I could merge 'get_typed_fragment()' with 850 # the code to generate the EnumerationMolecule. You're 851 # probably right. This code reflects history. My original code 852 # didn't break the typed molecule down to its fragments. 853 typed_fragment = get_typed_fragment(typed_mol, atom_indices) 854 rdmol = typed_fragment.rdmol 855 atoms = [] 856 for atom, orig_atom, atom_smarts_type in zip(rdmol.GetAtoms(), typed_fragment.orig_atoms, 857 typed_fragment.atom_smarts_types): 858 bond_indices = [bond.GetIdx() for bond in atom.GetBonds()] 859 #assert atom.GetSymbol() == orig_atom.GetSymbol() 860 atom_smarts = '[' + atom_smarts_type + ']' 861 atoms.append(Atom(atom, atom_smarts, bond_indices, orig_atom.IsInRing())) 862 863 directed_edges = collections.defaultdict(list) 864 bonds = [] 865 for bond_index, (bond, orig_bond, bond_smarts, canonical_bondtype) in enumerate( 866 zip(rdmol.GetBonds(), typed_fragment.orig_bonds, typed_fragment.bond_smarts_types, 867 typed_fragment.canonical_bondtypes)): 868 atom_indices = [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()] 869 bonds.append(Bond(bond, bond_smarts, canonical_bondtype, atom_indices, orig_bond.IsInRing())) 870 871 directed_edges[atom_indices[0]].append(DirectedEdge(bond_index, atom_indices[1])) 872 directed_edges[atom_indices[1]].append(DirectedEdge(bond_index, atom_indices[0])) 873 874 fragment = EnumerationMolecule(rdmol, atoms, bonds, dict(directed_edges)) 875 fragments.append(fragment) 876 877 # Optimistically try the largest fragments first 878 fragments.sort(key=lambda fragment: len(fragment.atoms), reverse=True) 879 return fragments
880 881 ####### Canonical SMARTS generation using Weininger, Weininger, and Weininger's CANGEN 882 883 # CANGEN "combines two separate algorithms, CANON and GENES. The 884 # first stage, CANON, labels a molecualr structure with canonical 885 # labels. ... Each atom is given a numerical label on the basis of its 886 # topology. In the second stage, GENES generates the unique SMILES 887 # ... . [It] selects the starting atom and makes branching decisions 888 # by referring to the canonical labels as needed." 889 890 # CANON is based on the fundamental theorem of arithmetic, that is, 891 # the unique prime factorization theorem. Which means I need about as 892 # many primes as I have atoms. 893 894 895 # I could have a fixed list of a few thousand primes but I don't like 896 # having a fixed upper limit to my molecule size. I modified the code 897 # Georg Schoelly posted at http://stackoverflow.com/a/568618/64618 . 898 # This is one of many ways to generate an infinite sequence of primes.
899 -def gen_primes():
900 d = defaultdict(list) 901 q = 2 902 while 1: 903 if q not in d: 904 yield q 905 d[q * q].append(q) 906 else: 907 for p in d[q]: 908 d[p + q].append(p) 909 del d[q] 910 q += 1
911 912 913 _prime_stream = gen_primes() 914 915 # Code later on uses _primes[n] and if that fails, calls _get_nth_prime(n) 916 _primes = [] 917 918
919 -def _get_nth_prime(n):
920 # Keep appending new primes from the stream until I have enough. 921 current_size = len(_primes) 922 while current_size <= n: 923 _primes.append(next(_prime_stream)) 924 current_size += 1 925 return _primes[n]
926 927 # Prime it with more values then will likely occur 928 _get_nth_prime(1000) 929 930 ### 931 932 # The CANON algorithm is documented as: 933 # (1) Set atomic vector to initial invariants. Go to step 3. 934 # (2) Set vector to product of primes corresponding to neighbors' ranks. 935 # (3) Sort vector, maintaining stability over previous ranks. 936 # (4) Rank atomic vector. 937 # (5) If not invariants partitioning, go to step 2. 938 # (6) On first pass, save partitioning as symmetry classes [fmcs doesn't need this] 939 # (7) If highest rank is smaller than number of nodes, break ties, go to step 2 940 # (8) ... else done. 941 942 # I track the atom information as a list of CangenNode instances. 943 944
945 -class CangenNode(object):
946 # Using __slots__ improves get_initial_cangen_nodes performance by over 10% 947 # and dropped my overall time (in one benchmark) from 0.75 to 0.73 seconds 948 __slots__ = ["index", "atom_smarts", "value", "neighbors", "rank", "outgoing_edges"] 949
950 - def __init__(self, index, atom_smarts):
951 self.index = index 952 self.atom_smarts = atom_smarts # Used to generate the SMARTS output 953 self.value = 0 954 self.neighbors = [] 955 self.rank = 0 956 self.outgoing_edges = []
957 958 # The outgoing edge information is used to generate the SMARTS output 959 # The index numbers are offsets in the subgraph, not in the original molecule 960 OutgoingEdge = collections.namedtuple( 961 "OutgoingEdge", "from_atom_index bond_index bond_smarts other_node_idx other_node") 962 963 964 # Convert a Subgraph of a given EnumerationMolecule into a list of 965 # CangenNodes. This contains the more specialized information I need 966 # for canonicalization and for SMARTS generation.
967 -def get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, 968 do_initial_assignment=True):
969 # The subgraph contains a set of atom and bond indices in the enumeration_mol. 970 # The CangenNode corresponds to an atom in the subgraph, plus relations 971 # to other atoms in the subgraph. 972 # I need to convert from offsets in molecule space to offset in subgraph space. 973 974 # Map from enumeration mol atom indices to subgraph/CangenNode list indices 975 atom_map = {} 976 977 cangen_nodes = [] 978 atoms = enumeration_mol.atoms 979 canonical_labels = [] 980 for i, atom_index in enumerate(subgraph.atom_indices): 981 atom_map[atom_index] = i 982 cangen_nodes.append(CangenNode(i, atoms[atom_index].atom_smarts)) 983 canonical_labels.append([]) 984 985 # Build the neighbor and directed edge lists 986 987 for bond_index in subgraph.bond_indices: 988 bond = enumeration_mol.bonds[bond_index] 989 from_atom_index, to_atom_index = bond.atom_indices 990 from_subgraph_atom_index = atom_map[from_atom_index] 991 to_subgraph_atom_index = atom_map[to_atom_index] 992 993 from_node = cangen_nodes[from_subgraph_atom_index] 994 to_node = cangen_nodes[to_subgraph_atom_index] 995 from_node.neighbors.append(to_node) 996 to_node.neighbors.append(from_node) 997 998 canonical_bondtype = bond.canonical_bondtype 999 canonical_labels[from_subgraph_atom_index].append(canonical_bondtype) 1000 canonical_labels[to_subgraph_atom_index].append(canonical_bondtype) 1001 1002 from_node.outgoing_edges.append( 1003 OutgoingEdge(from_subgraph_atom_index, bond_index, bond.bond_smarts, to_subgraph_atom_index, 1004 to_node)) 1005 to_node.outgoing_edges.append( 1006 OutgoingEdge(to_subgraph_atom_index, bond_index, bond.bond_smarts, from_subgraph_atom_index, 1007 from_node)) 1008 1009 if do_initial_assignment: 1010 # Do the initial graph invariant assignment. (Step 1 of the CANON algorithm) 1011 # These are consistent only inside of the given 'atom_assignment' lookup. 1012 for atom_index, node, canonical_label in zip(subgraph.atom_indices, cangen_nodes, 1013 canonical_labels): 1014 # The initial invariant is the sorted canonical bond labels 1015 # plus the atom smarts, separated by newline characters. 1016 # 1017 # This is equivalent to a circular fingerprint of width 2, and 1018 # gives more unique information than the Weininger method. 1019 canonical_label.sort() 1020 canonical_label.append(atoms[atom_index].atom_smarts) 1021 label = "\n".join(canonical_label) 1022 1023 # The downside of using a string is that I need to turn it 1024 # into a number which is consistent across all of the SMARTS I 1025 # generate as part of the MCS search. Use a lookup table for 1026 # that which creates a new number of the label wasn't seen 1027 # before, or uses the old one if it was. 1028 node.value = atom_assignment[label] 1029 1030 return cangen_nodes
1031 1032 1033 # Rank a sorted list (by value) of CangenNodes
1034 -def rerank(cangen_nodes):
1035 rank = 0 # Note: Initial rank is 1, in line with the Weininger paper 1036 prev_value = -1 1037 for node in cangen_nodes: 1038 if node.value != prev_value: 1039 rank += 1 1040 prev_value = node.value 1041 node.rank = rank
1042 1043 1044 # Given a start/end range in the CangenNodes, sorted by value, 1045 # find the start/end for subranges with identical values
1046 -def find_duplicates(cangen_nodes, start, end):
1047 result = [] 1048 prev_value = -1 1049 count = 0 1050 for index in range(start, end): 1051 node = cangen_nodes[index] 1052 if node.value == prev_value: 1053 count += 1 1054 else: 1055 if count > 1: 1056 # New subrange containing duplicates 1057 result.append((start, index)) 1058 count = 1 1059 prev_value = node.value 1060 start = index 1061 if count > 1: 1062 # Last elements were duplicates 1063 result.append((start, end)) 1064 return result
1065 1066 1067 #@profile
1068 -def canon(cangen_nodes):
1069 # Precondition: node.value is set to the initial invariant 1070 # (1) Set atomic vector to initial invariants (assumed on input) 1071 1072 # Do the initial ranking 1073 cangen_nodes.sort(key=lambda node: node.value) 1074 rerank(cangen_nodes) 1075 1076 # Keep refining the sort order until it's unambiguous 1077 master_sort_order = cangen_nodes[:] 1078 1079 # Find the start/end range for each stretch of duplicates 1080 duplicates = find_duplicates(cangen_nodes, 0, len(cangen_nodes)) 1081 1082 PRIMES = _primes # micro-optimization; make this a local name lookup 1083 1084 while duplicates: 1085 # (2) Set vector to product of primes corresponding to neighbor's ranks 1086 for node in cangen_nodes: 1087 try: 1088 node.value = PRIMES[node.rank] 1089 except IndexError: 1090 node.value = _get_nth_prime(node.rank) 1091 for node in cangen_nodes: 1092 # Apply the fundamental theorem of arithmetic; compute the 1093 # product of the neighbors' primes 1094 p = 1 1095 for neighbor in node.neighbors: 1096 p *= neighbor.value 1097 node.value = p 1098 1099 # (3) Sort vector, maintaining stability over previous ranks 1100 # (I maintain stability by refining ranges in the 1101 # master_sort_order based on the new ranking) 1102 cangen_nodes.sort(key=lambda node: node.value) 1103 1104 # (4) rank atomic vector 1105 rerank(cangen_nodes) 1106 1107 # See if any of the duplicates have been resolved. 1108 new_duplicates = [] 1109 unchanged = True # This is buggy? Need to check the entire state XXX 1110 for (start, end) in duplicates: 1111 # Special case when there's only two elements to store. 1112 # This optimization sped up cangen by about 8% because I 1113 # don't go through the sort machinery 1114 if start + 2 == end: 1115 node1, node2 = master_sort_order[start], master_sort_order[end - 1] 1116 if node1.value > node2.value: 1117 master_sort_order[start] = node2 1118 master_sort_order[end - 1] = node1 1119 else: 1120 subset = master_sort_order[start:end] 1121 subset.sort(key=lambda node: node.value) 1122 master_sort_order[start:end] = subset 1123 1124 subset_duplicates = find_duplicates(master_sort_order, start, end) 1125 new_duplicates.extend(subset_duplicates) 1126 if unchanged: 1127 # Have we distinguished any of the duplicates? 1128 if not (len(subset_duplicates) == 1 and subset_duplicates[0] == (start, end)): 1129 unchanged = False 1130 1131 # (8) ... else done 1132 # Yippee! No duplicates left. Everything has a unique value. 1133 if not new_duplicates: 1134 break 1135 1136 # (5) If not invariant partitioning, go to step 2 1137 if not unchanged: 1138 duplicates = new_duplicates 1139 continue 1140 1141 duplicates = new_duplicates 1142 1143 # (6) On first pass, save partitioning as symmetry classes 1144 pass # I don't need this information 1145 1146 # (7) If highest rank is smaller than number of nodes, break ties, go to step 2 1147 # I follow the Weininger algorithm and use 2*rank or 2*rank-1. 1148 # This requires that the first rank is 1, not 0. 1149 for node in cangen_nodes: 1150 node.value = node.rank * 2 1151 1152 # The choice of tie is arbitrary. Weininger breaks the first tie. 1153 # I break the last tie because it's faster in Python to delete 1154 # from the end than the beginning. 1155 start, end = duplicates[-1] 1156 cangen_nodes[start].value -= 1 1157 if end == start + 2: 1158 # There were only two nodes with the same value. Now there 1159 # are none. Remove information about that duplicate. 1160 del duplicates[-1] 1161 else: 1162 # The first N-1 values are still duplicates. 1163 duplicates[-1] = (start + 1, end) 1164 rerank(cangen_nodes) 1165 1166 # Restore to the original order (ordered by subgraph atom index) 1167 # because the bond information used during SMARTS generation 1168 # references atoms by that order. 1169 cangen_nodes.sort(key=lambda node: node.index)
1170 1171
1172 -def get_closure_label(bond_smarts, closure):
1173 if closure < 10: 1174 return bond_smarts + str(closure) 1175 else: 1176 return bond_smarts + "%%%02d" % closure
1177 1178 # Precompute the initial closure heap. *Overall* performance went from 0.73 to 0.64 seconds! 1179 _available_closures = list(range(1, 101)) 1180 heapify(_available_closures) 1181 1182 # The Weininger paper calls this 'GENES'; I call it "generate_smiles." 1183 1184 # I use a different algorithm than GENES. It's still use two 1185 # passes. The first pass identifies the closure bonds using a 1186 # depth-first search. The second pass builds the SMILES string. 1187 1188
1189 -def generate_smarts(cangen_nodes):
1190 start_index = 0 1191 best_rank = cangen_nodes[0].rank 1192 for i, node in enumerate(cangen_nodes): 1193 if node.rank < best_rank: 1194 best_rank = node.rank 1195 start_index = i 1196 node.outgoing_edges.sort(key=lambda edge: edge.other_node.rank) 1197 1198 visited_atoms = [0] * len(cangen_nodes) 1199 closure_bonds = set() 1200 1201 ## First, find the closure bonds using a DFS 1202 stack = [] 1203 atom_idx = start_index 1204 stack.extend(reversed(cangen_nodes[atom_idx].outgoing_edges)) 1205 visited_atoms[atom_idx] = True 1206 1207 while stack: 1208 edge = stack.pop() 1209 if visited_atoms[edge.other_node_idx]: 1210 closure_bonds.add(edge.bond_index) 1211 else: 1212 visited_atoms[edge.other_node_idx] = 1 1213 for next_edge in reversed(cangen_nodes[edge.other_node_idx].outgoing_edges): 1214 if next_edge.other_node_idx == edge.from_atom_index: 1215 # Don't worry about going back along the same route 1216 continue 1217 stack.append(next_edge) 1218 1219 available_closures = _available_closures[:] 1220 unclosed_closures = {} 1221 1222 # I've identified the closure bonds. 1223 # Use a stack machine to traverse the graph and build the SMARTS. 1224 # The instruction contains one of 4 instructions, with associated data 1225 # 0: add the atom's SMARTS and put its connections on the machine 1226 # 1: add the bond's SMARTS and put the other atom on the machine 1227 # 3: add a ')' to the SMARTS 1228 # 4: add a '(' and the bond SMARTS 1229 1230 smiles_terms = [] 1231 stack = [(0, (start_index, -1))] 1232 while stack: 1233 action, data = stack.pop() 1234 if action == 0: 1235 # Add an atom. 1236 1237 # The 'while 1:' emulates a goto for the special case 1238 # where the atom is connected to only one other atom. I 1239 # don't need to use the stack machinery for that case, and 1240 # can speed up this function by about 10%. 1241 while 1: 1242 # Look at the bonds starting from this atom 1243 num_neighbors = 0 1244 atom_idx, prev_bond_idx = data 1245 smiles_terms.append(cangen_nodes[atom_idx].atom_smarts) 1246 outgoing_edges = cangen_nodes[atom_idx].outgoing_edges 1247 for outgoing_edge in outgoing_edges: 1248 bond_idx = outgoing_edge.bond_index 1249 1250 # Is this a ring closure bond? 1251 if bond_idx in closure_bonds: 1252 # Have we already seen it before? 1253 if bond_idx not in unclosed_closures: 1254 # This is new. Add as a ring closure. 1255 closure = heappop(available_closures) 1256 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure)) 1257 unclosed_closures[bond_idx] = closure 1258 else: 1259 closure = unclosed_closures[bond_idx] 1260 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure)) 1261 heappush(available_closures, closure) 1262 del unclosed_closures[bond_idx] 1263 else: 1264 # This is a new outgoing bond. 1265 if bond_idx == prev_bond_idx: 1266 # Don't go backwards along the bond I just came in on 1267 continue 1268 if num_neighbors == 0: 1269 # This is the first bond. There's a good chance that 1270 # it's the only bond. 1271 data = (outgoing_edge.other_node_idx, bond_idx) 1272 bond_smarts = outgoing_edge.bond_smarts 1273 else: 1274 # There are multiple bonds. Can't shortcut. 1275 if num_neighbors == 1: 1276 # Capture the information for the first bond 1277 # This direction doesn't need the (branch) characters. 1278 stack.append((0, data)) 1279 stack.append((1, bond_smarts)) 1280 1281 # Add information for this bond 1282 stack.append((3, None)) 1283 stack.append((0, (outgoing_edge.other_node_idx, bond_idx))) 1284 stack.append((4, outgoing_edge.bond_smarts)) 1285 1286 num_neighbors += 1 1287 if num_neighbors != 1: 1288 # If there's only one item then goto action==0 again. 1289 break 1290 smiles_terms.append(bond_smarts) 1291 elif action == 1: 1292 # Process a bond which does not need '()'s 1293 smiles_terms.append(data) # 'data' is bond_smarts 1294 continue 1295 1296 elif action == 3: 1297 smiles_terms.append(')') 1298 elif action == 4: 1299 smiles_terms.append('(' + data) # 'data' is bond_smarts 1300 else: 1301 raise AssertionError 1302 1303 return "".join(smiles_terms)
1304 1305 1306 # Full canonicalization is about 5% slower unless there are well over 100 structures 1307 # in the data set, which is not expected to be common. 1308 # Commented out the canon() step until there's a better solution (eg, adapt based 1309 # in the input size.)
1310 -def make_canonical_smarts(subgraph, enumeration_mol, atom_assignment):
1311 cangen_nodes = get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, True) 1312 #canon(cangen_nodes) 1313 smarts = generate_smarts(cangen_nodes) 1314 return smarts
1315 1316 ## def make_semicanonical_smarts(subgraph, enumeration_mol, atom_assignment): 1317 ## cangen_nodes = get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, True) 1318 ## # There's still some order because of the canonical bond typing, but it isn't perfect 1319 ## #canon(cangen_nodes) 1320 ## smarts = generate_smarts(cangen_nodes) 1321 ## return smarts 1322 1323
1324 -def make_arbitrary_smarts(subgraph, enumeration_mol, atom_assignment):
1325 cangen_nodes = get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, False) 1326 # Use an arbitrary order 1327 for i, node in enumerate(cangen_nodes): 1328 node.value = i 1329 smarts = generate_smarts(cangen_nodes) 1330 return smarts
1331 1332 ############## Subgraph enumeration ################## 1333 1334 # A 'seed' is a subgraph containing a subset of the atoms and bonds in 1335 # the graph. The idea is to try all of the ways in which to grow the 1336 # seed to make a new seed which contains the original seed. 1337 1338 # There are two ways to grow a seed: 1339 # - add a bond which is not in the seed but where both of its 1340 # atoms are in the seed 1341 # - add a bond which is not in the seed but where one of its 1342 # atoms is in the seed (and the other is not) 1343 1344 # The algorithm takes the seed, and finds all of both categories of 1345 # bonds. If there are N total such bonds then there are 2**N-1 1346 # possible new seeds which contain the original seed. This is simply 1347 # the powerset of the possible bonds, excepting the case with no 1348 # bonds. 1349 1350 # Generate all 2**N-1 new seeds. Place the new seeds back in the 1351 # priority queue to check for additional growth. 1352 1353 # I place the seeds in priority queue, sorted by score (typically the 1354 # number of atoms) to preferentially search larger structures first. A 1355 # simple stack or deque wouldn't work because the new seeds have 1356 # between 1 to N-1 new atoms and bonds. 1357 1358 # Some useful preamble code 1359 1360 1361 # Taken from the Python documentation
1362 -def powerset(iterable):
1363 "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)" 1364 s = list(iterable) 1365 return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
1366 1367 1368 # Same as the above except the empty term is not returned
1369 -def nonempty_powerset(iterable):
1370 "nonempty_powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)" 1371 s = list(iterable) 1372 it = chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)) 1373 next(it) 1374 return it
1375 1376 1377 # Call this to get a new unique function. Used to break ties in the 1378 # priority queue. 1379 #tiebreaker = itertools.count().next
1380 -def _Counter():
1381 c = itertools.count() 1382 return lambda: next(c)
1383 1384 1385 tiebreaker = _Counter() 1386 1387 ### The enumeration code 1388 1389 1390 # Given a set of atoms, find all of the ways to leave those atoms. 1391 # There are two possibilities: 1392 # 1) bonds; which connect two atoms which are already in 'atom_indices' 1393 # 2) directed edges; which go to atoms that aren't in 'atom_indices' 1394 # and which aren't already in visited_bond_indices. These are external 1395 # to the subgraph. 1396 # The return is a 2-element tuple containing: 1397 # (the list of bonds from (1), the list of directed edges from (2))
1398 -def find_extensions(atom_indices, visited_bond_indices, directed_edges):
1399 internal_bonds = set() 1400 external_edges = [] 1401 for atom_index in atom_indices: 1402 for directed_edge in directed_edges[atom_index]: 1403 # Skip outgoing edges which have already been evaluated 1404 if directed_edge.bond_index in visited_bond_indices: 1405 continue 1406 1407 if directed_edge.end_atom_index in atom_indices: 1408 # case 1: This bond goes to another atom which is already in the subgraph. 1409 internal_bonds.add(directed_edge.bond_index) 1410 else: 1411 # case 2: This goes to a new (external) atom 1412 external_edges.append(directed_edge) 1413 1414 # I don't think I need the list() 1415 return list(internal_bonds), external_edges
1416 1417 # Given the 2-element tuple (internal_bonds, external_edges), 1418 # construct all of the ways to combine them to generate a new subgraph 1419 # from the old one. This is done via a powerset. 1420 # This generates a two-element tuple containing: 1421 # - the set of newly added atom indices (or None) 1422 # - the new subgraph 1423 1424
1425 -def all_subgraph_extensions(enumeration_mol, subgraph, visited_bond_indices, internal_bonds, 1426 external_edges):
1427 #print "Subgraph", len(subgraph.atom_indices), len(subgraph.bond_indices), "X", enumeration_mol.rdmol.GetNumAtoms() 1428 #print "subgraph atoms", subgraph.atom_indices 1429 #print "subgraph bonds", subgraph.bond_indices 1430 #print "internal", internal_bonds, "external", external_edges 1431 # only internal bonds 1432 if not external_edges: 1433 #assert internal_bonds, "Must have at least one internal bond" 1434 it = nonempty_powerset(internal_bonds) 1435 for internal_bond in it: 1436 # Make the new subgraphs 1437 bond_indices = set(subgraph.bond_indices) 1438 bond_indices.update(internal_bond) 1439 yield None, Subgraph(subgraph.atom_indices, frozenset(bond_indices)), 0, 0 1440 return 1441 1442 # only external edges 1443 if not internal_bonds: 1444 it = nonempty_powerset(external_edges) 1445 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges))) 1446 for external_ext in it: 1447 new_atoms = frozenset(ext.end_atom_index for ext in external_ext) 1448 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms)) 1449 bond_indices = frozenset( 1450 chain(subgraph.bond_indices, (ext.bond_index for ext in external_ext))) 1451 num_possible_atoms, num_possible_bonds = find_extension_size(enumeration_mol, new_atoms, 1452 exclude_bonds, external_ext) 1453 1454 #num_possible_atoms = len(enumeration_mol.atoms) - len(atom_indices) 1455 #num_possible_bonds = len(enumeration_mol.bonds) - len(bond_indices) 1456 yield new_atoms, Subgraph(atom_indices, bond_indices), num_possible_atoms, num_possible_bonds 1457 return 1458 1459 # Both internal bonds and external edges 1460 internal_powerset = list(powerset(internal_bonds)) 1461 external_powerset = powerset(external_edges) 1462 1463 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges))) 1464 1465 for external_ext in external_powerset: 1466 if not external_ext: 1467 # No external extensions. Must have at least one internal bond. 1468 for internal_bond in internal_powerset[1:]: 1469 bond_indices = set(subgraph.bond_indices) 1470 bond_indices.update(internal_bond) 1471 yield None, Subgraph(subgraph.atom_indices, bond_indices), 0, 0 1472 else: 1473 new_atoms = frozenset(ext.end_atom_index for ext in external_ext) 1474 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms)) 1475 # no_go_bond_indices = set(chain(visited_bond_indices, extern 1476 1477 bond_indices = frozenset( 1478 chain(subgraph.bond_indices, (ext.bond_index for ext in external_ext))) 1479 num_possible_atoms, num_possible_bonds = find_extension_size(enumeration_mol, atom_indices, 1480 exclude_bonds, external_ext) 1481 #num_possible_atoms = len(enumeration_mol.atoms) - len(atom_indices) 1482 for internal_bond in internal_powerset: 1483 bond_indices2 = frozenset(chain(bond_indices, internal_bond)) 1484 #num_possible_bonds = len(enumeration_mol.bonds) - len(bond_indices2) 1485 yield new_atoms, Subgraph(atom_indices, 1486 bond_indices2), num_possible_atoms, num_possible_bonds
1487 1488
1489 -def find_extension_size(enumeration_mol, known_atoms, exclude_bonds, directed_edges):
1490 num_remaining_atoms = num_remaining_bonds = 0 1491 visited_atoms = set(known_atoms) 1492 visited_bonds = set(exclude_bonds) 1493 #print "start atoms", visited_atoms 1494 #print "start bonds", visited_bonds 1495 #print "Along", [directed_edge.bond_index for directed_edge in directed_edges] 1496 for directed_edge in directed_edges: 1497 #print "Take", directed_edge 1498 stack = [directed_edge.end_atom_index] 1499 1500 # simple depth-first search search 1501 while stack: 1502 atom_index = stack.pop() 1503 for next_edge in enumeration_mol.directed_edges[atom_index]: 1504 #print "Visit", next_edge.bond_index, next_edge.end_atom_index 1505 bond_index = next_edge.bond_index 1506 if bond_index in visited_bonds: 1507 #print "Seen bond", bond_index 1508 continue 1509 num_remaining_bonds += 1 1510 visited_bonds.add(bond_index) 1511 #print "New BOND!", bond_index, "count", num_remaining_bonds 1512 1513 next_atom_index = next_edge.end_atom_index 1514 if next_atom_index in visited_atoms: 1515 #print "Seen atom" 1516 continue 1517 num_remaining_atoms += 1 1518 #print "New atom!", next_atom_index, "count", num_remaining_atoms 1519 visited_atoms.add(next_atom_index) 1520 1521 stack.append(next_atom_index) 1522 1523 #print "==>", num_remaining_atoms, num_remaining_bonds 1524 return num_remaining_atoms, num_remaining_bonds
1525 1526 # Check if a SMARTS is in all targets. 1527 # Uses a dictionary-style API, but please only use matcher[smarts] 1528 # Caches all previous results. 1529 1530
1531 -class CachingTargetsMatcher(dict):
1532
1533 - def __init__(self, targets, required_match_count=None):
1534 self.targets = targets 1535 if required_match_count is None: 1536 required_match_count = len(targets) 1537 self.required_match_count = required_match_count 1538 self._num_allowed_errors = len(targets) - required_match_count 1539 super(dict, self).__init__()
1540
1541 - def shift_targets(self):
1542 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors) 1543 self.targets = self.targets[1:] 1544 self._num_allowed_errors = len(self.targets) - self.required_match_count
1545
1546 - def __missing__(self, smarts):
1547 num_allowed_errors = self._num_allowed_errors 1548 if num_allowed_errors < 0: 1549 raise AssertionError("I should never be called") 1550 self[smarts] = False 1551 return False 1552 1553 pat = Chem.MolFromSmarts(smarts) 1554 if pat is None: 1555 raise AssertionError("Bad SMARTS: %r" % (smarts, )) 1556 1557 num_allowed_errors = self._num_allowed_errors 1558 for target in self.targets: 1559 if not MATCH(target, pat): 1560 if num_allowed_errors == 0: 1561 # Does not match. No need to continue processing 1562 self[smarts] = False 1563 return False 1564 num_allowed_errors -= 1 1565 # Matches enough structures, which means it will always 1566 # match enough structures. (Even after shifting.) 1567 self[smarts] = True 1568 return True
1569 1570
1571 -class VerboseCachingTargetsMatcher(object):
1572
1573 - def __init__(self, targets, required_match_count=None):
1574 self.targets = targets 1575 if required_match_count is None: 1576 required_match_count = len(targets) 1577 self.cache = {} 1578 self.required_match_count = required_match_count 1579 self._num_allowed_errors = len(targets) - required_match_count 1580 self.num_lookups = self.num_cached_true = self.num_cached_false = 0 1581 self.num_search_true = self.num_search_false = self.num_matches = 0
1582
1583 - def shift_targets(self):
1584 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors) 1585 if self._num_allowed_errors > 1: 1586 self.targets = self.targets[1:] 1587 self._num_allowed_errors = len(self.targets) - self.required_match_count
1588
1589 - def __getitem__(self, smarts, missing=object()):
1590 self.num_lookups += 1 1591 x = self.cache.get(smarts, missing) 1592 if x is not missing: 1593 if x: 1594 self.num_cached_true += 1 1595 else: 1596 self.num_cached_false += 1 1597 return x 1598 1599 pat = Chem.MolFromSmarts(smarts) 1600 if pat is None: 1601 raise AssertionError("Bad SMARTS: %r" % (smarts, )) 1602 1603 for i, target in enumerate(self.targets): 1604 if not MATCH(target, pat): 1605 # Does not match. No need to continue processing 1606 self.num_search_false += 1 1607 self.num_matches += i + 1 1608 self.cache[smarts] = False 1609 N = len(self.targets) 1610 return False 1611 # TODO: should I move the mismatch structure forward 1612 # so that it's tested earlier next time? 1613 # Matches everything 1614 self.num_matches += i + 1 1615 self.num_search_true += 1 1616 self.cache[smarts] = True 1617 return True
1618
1619 - def report(self):
1620 print >> sys.stderr, "%d tests of %d unique SMARTS, cache: %d True %d False, search: %d True %d False (%d substructure tests)" % ( 1621 self.num_lookups, len(self.cache), self.num_cached_true, self.num_cached_false, 1622 self.num_search_true, self.num_search_false, self.num_matches)
1623 1624 1625 ##### Different maximization algorithms ######
1626 -def prune_maximize_bonds(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
1627 # Quick check if this is a viable search direction 1628 num_atoms = len(subgraph.atom_indices) 1629 num_bonds = len(subgraph.bond_indices) 1630 best_num_atoms, best_num_bonds = best_sizes 1631 1632 # Prune subgraphs which are too small can never become big enough 1633 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds 1634 if diff_bonds < 0: 1635 return True 1636 elif diff_bonds == 0: 1637 # Then we also maximize the number of atoms 1638 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms 1639 if diff_atoms <= 0: 1640 return True 1641 1642 return False
1643 1644
1645 -def prune_maximize_atoms(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
1646 # Quick check if this is a viable search direction 1647 num_atoms = len(subgraph.atom_indices) 1648 num_bonds = len(subgraph.bond_indices) 1649 best_num_atoms, best_num_bonds = best_sizes 1650 1651 # Prune subgraphs which are too small can never become big enough 1652 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms 1653 if diff_atoms < 0: 1654 return True 1655 elif diff_atoms == 0: 1656 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds 1657 if diff_bonds <= 0: 1658 return True 1659 else: 1660 #print "Could still have", diff_atoms 1661 #print num_atoms, num_remaining_atoms, best_num_atoms 1662 pass 1663 1664 return False
1665 1666 ##### Callback handlers for storing the "best" information #####x 1667 1668
1669 -class _SingleBest(object):
1670
1671 - def __init__(self, timer, verbose):
1672 self.best_num_atoms = self.best_num_bonds = -1 1673 self.best_smarts = None 1674 self.sizes = (-1, -1) 1675 self.timer = timer 1676 self.verbose = verbose
1677
1678 - def _new_best(self, num_atoms, num_bonds, smarts):
1679 self.best_num_atoms = num_atoms 1680 self.best_num_bonds = num_bonds 1681 self.best_smarts = smarts 1682 self.sizes = sizes = (num_atoms, num_bonds) 1683 self.timer.mark("new best") 1684 if self.verbose: 1685 dt = self.timer.mark_times["new best"] - self.timer.mark_times["start fmcs"] 1686 sys.stderr.write("Best after %.1fs: %d atoms %d bonds %s\n" % 1687 (dt, num_atoms, num_bonds, smarts)) 1688 return sizes
1689
1690 - def get_result(self, completed):
1691 return MCSResult(self.best_num_atoms, self.best_num_bonds, self.best_smarts, completed)
1692 1693
1694 -class MCSResult(object):
1695
1696 - def __init__(self, num_atoms, num_bonds, smarts, completed):
1697 self.num_atoms = num_atoms 1698 self.num_bonds = num_bonds 1699 self.smarts = smarts 1700 self.completed = completed
1701
1702 - def __nonzero__(self):
1703 return self.smarts is not None
1704 1705
1706 -class SingleBestAtoms(_SingleBest):
1707
1708 - def add_new_match(self, subgraph, mol, smarts):
1709 sizes = self.sizes 1710 1711 # See if the subgraph match is better than the previous best 1712 num_subgraph_atoms = len(subgraph.atom_indices) 1713 if num_subgraph_atoms < sizes[0]: 1714 return sizes 1715 1716 num_subgraph_bonds = len(subgraph.bond_indices) 1717 if num_subgraph_atoms == sizes[0]: 1718 if num_subgraph_bonds <= sizes[1]: 1719 return sizes 1720 1721 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1722 1723
1724 -class SingleBestBonds(_SingleBest):
1725
1726 - def add_new_match(self, subgraph, mol, smarts):
1727 sizes = self.sizes 1728 1729 # See if the subgraph match is better than the previous best 1730 num_subgraph_bonds = len(subgraph.bond_indices) 1731 if num_subgraph_bonds < sizes[1]: 1732 return sizes 1733 1734 num_subgraph_atoms = len(subgraph.atom_indices) 1735 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]: 1736 return sizes 1737 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1738 1739 ### Check if there are any ring atoms; used in --complete-rings-only 1740 1741 # This is (yet) another depth-first graph search algorithm 1742 1743
1744 -def check_completeRingsOnly(smarts, subgraph, enumeration_mol):
1745 #print "check", smarts, len(subgraph.atom_indices), len(subgraph.bond_indices) 1746 1747 atoms = enumeration_mol.atoms 1748 bonds = enumeration_mol.bonds 1749 1750 # First, are any of bonds in the subgraph ring bonds in the original structure? 1751 ring_bonds = [] 1752 for bond_index in subgraph.bond_indices: 1753 bond = bonds[bond_index] 1754 if bond.is_in_ring: 1755 ring_bonds.append(bond_index) 1756 1757 #print len(ring_bonds), "ring bonds" 1758 if not ring_bonds: 1759 # No need to check .. this is an acceptable structure 1760 return True 1761 1762 if len(ring_bonds) <= 2: 1763 # No need to check .. there are no rings of size 2 1764 return False 1765 1766 # Otherwise there's more work. Need to ensure that 1767 # all ring atoms are still in a ring in the subgraph. 1768 1769 confirmed_ring_bonds = set() 1770 subgraph_ring_bond_indices = set(ring_bonds) 1771 for bond_index in ring_bonds: 1772 #print "start with", bond_index, "in?", bond_index in confirmed_ring_bonds 1773 if bond_index in confirmed_ring_bonds: 1774 continue 1775 # Start a new search, starting from this bond 1776 from_atom_index, to_atom_index = bonds[bond_index].atom_indices 1777 1778 # Map from atom index to depth in the bond stack 1779 atom_depth = {from_atom_index: 0, to_atom_index: 1} 1780 bond_stack = [bond_index] 1781 backtrack_stack = [] 1782 prev_bond_index = bond_index 1783 current_atom_index = to_atom_index 1784 1785 while 1: 1786 # Dive downwards, ever downwards 1787 next_bond_index = next_atom_index = None 1788 this_is_a_ring = False 1789 for outgoing_edge in enumeration_mol.directed_edges[current_atom_index]: 1790 if outgoing_edge.bond_index == prev_bond_index: 1791 # Don't loop back 1792 continue 1793 if outgoing_edge.bond_index not in subgraph_ring_bond_indices: 1794 # Only advance along ring edges which are in the subgraph 1795 continue 1796 1797 if outgoing_edge.end_atom_index in atom_depth: 1798 #print "We have a ring" 1799 # It's a ring! Mark everything as being in a ring 1800 confirmed_ring_bonds.update(bond_stack[atom_depth[outgoing_edge.end_atom_index]:]) 1801 confirmed_ring_bonds.add(outgoing_edge.bond_index) 1802 if len(confirmed_ring_bonds) == len(ring_bonds): 1803 #print "Success!" 1804 return True 1805 this_is_a_ring = True 1806 continue 1807 1808 # New atom. Need to explore it. 1809 #print "we have a new bond", outgoing_edge.bond_index, "to atom", outgoing_edge.end_atom_index 1810 if next_bond_index is None: 1811 # This will be the immediate next bond to search in the DFS 1812 next_bond_index = outgoing_edge.bond_index 1813 next_atom_index = outgoing_edge.end_atom_index 1814 else: 1815 # Otherwise, backtrack and examine the other bonds 1816 backtrack_stack.append( 1817 (len(bond_stack), outgoing_edge.bond_index, outgoing_edge.end_atom_index)) 1818 1819 if next_bond_index is None: 1820 # Could not find a path to take. Might be because we looped back. 1821 if this_is_a_ring: 1822 #assert prev_bond_index in confirmed_ring_bonds, (prev_bond_index, confirmed_ring_bonds) 1823 # We did! That means we can backtrack 1824 while backtrack_stack: 1825 old_size, prev_bond_index, current_atom_index = backtrack_stack.pop() 1826 if bond_index not in confirmed_ring_bonds: 1827 # Need to explore this path. 1828 # Back up and start the search from here 1829 del bond_stack[old_size:] 1830 break 1831 else: 1832 # No more backtracking. We fail. Try next bond? 1833 # (If it had been sucessful then the 1834 # len(confirmed_ring_bonds) == len(ring_bonds) 1835 # would have return True) 1836 break 1837 else: 1838 # Didn't find a ring, nowhere to advance 1839 return False 1840 else: 1841 # Continue deeper 1842 bond_stack.append(next_bond_index) 1843 atom_depth[next_atom_index] = len(bond_stack) 1844 prev_bond_index = next_bond_index 1845 current_atom_index = next_atom_index
1846 1847 # If we reached here then try the next bond 1848 #print "Try again" 1849 1850
1851 -class SingleBestAtomsCompleteRingsOnly(_SingleBest):
1852
1853 - def add_new_match(self, subgraph, mol, smarts):
1854 sizes = self.sizes 1855 1856 # See if the subgraph match is better than the previous best 1857 num_subgraph_atoms = len(subgraph.atom_indices) 1858 if num_subgraph_atoms < sizes[0]: 1859 return sizes 1860 1861 num_subgraph_bonds = len(subgraph.bond_indices) 1862 if num_subgraph_atoms == sizes[0] and num_subgraph_bonds <= sizes[1]: 1863 return sizes 1864 1865 if check_completeRingsOnly(smarts, subgraph, mol): 1866 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts) 1867 return sizes
1868 1869
1870 -class SingleBestBondsCompleteRingsOnly(_SingleBest):
1871
1872 - def add_new_match(self, subgraph, mol, smarts):
1873 sizes = self.sizes 1874 1875 # See if the subgraph match is better than the previous best 1876 num_subgraph_bonds = len(subgraph.bond_indices) 1877 if num_subgraph_bonds < sizes[1]: 1878 return sizes 1879 1880 num_subgraph_atoms = len(subgraph.atom_indices) 1881 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]: 1882 return sizes 1883 1884 if check_completeRingsOnly(smarts, subgraph, mol): 1885 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts) 1886 return sizes
1887 1888 1889 _maximize_options = { 1890 ("atoms", False): (prune_maximize_atoms, SingleBestAtoms), 1891 ("atoms", True): (prune_maximize_atoms, SingleBestAtomsCompleteRingsOnly), 1892 ("bonds", False): (prune_maximize_bonds, SingleBestBonds), 1893 ("bonds", True): (prune_maximize_bonds, SingleBestBondsCompleteRingsOnly), 1894 } 1895 1896 ###### The engine of the entire system. Enumerate subgraphs and see if they match. ##### 1897 1898
1899 -def enumerate_subgraphs(enumeration_mols, prune, atom_assignment, matches_all_targets, hits, 1900 timeout, heappush, heappop):
1901 if timeout is None: 1902 end_time = None 1903 else: 1904 end_time = time.time() + timeout 1905 1906 seeds = [] 1907 1908 best_sizes = (0, 0) 1909 # Do a quick check for the not uncommon case where one of the input fragments 1910 # is the largest substructure or one off from the largest. 1911 for mol in enumeration_mols: 1912 atom_range = range(len(mol.atoms)) 1913 bond_set = set(range(len(mol.bonds))) 1914 subgraph = Subgraph(atom_range, bond_set) 1915 if not prune(subgraph, mol, 0, 0, best_sizes): 1916 # Micro-optimization: the largest fragment SMARTS doesn't 1917 # need to be canonicalized because there will only ever be 1918 # one match. It's also unlikely that the other largest 1919 # fragments need canonicalization. 1920 smarts = make_arbitrary_smarts(subgraph, mol, atom_assignment) 1921 if matches_all_targets[smarts]: 1922 best_sizes = hits.add_new_match(subgraph, mol, smarts) 1923 1924 for mol in enumeration_mols: 1925 directed_edges = mol.directed_edges 1926 # Using 20001 random ChEMBL pairs, timeout=15.0 seconds 1927 # 1202.6s with original order 1928 # 1051.9s sorting by (bond.is_in_ring, bond_index) 1929 # 1009.7s sorting by (bond.is_in_ring + atom1.is_in_ring + atom2.is_in_ring) 1930 # 1055.2s sorting by (if bond.is_in_ring: 2; else: -(atom1.is_in_ring + atom2.is_in_ring)) 1931 # 1037.4s sorting by (atom1.is_in_ring + atom2.is_in_ring) 1932 sorted_bonds = list(enumerate(mol.bonds)) 1933 1934 def get_bond_ring_score(bond_data, atoms=mol.atoms): 1935 bond_index, bond = bond_data 1936 a1, a2 = bond.atom_indices 1937 return bond.is_in_ring + atoms[a1].is_in_ring + atoms[a2].is_in_ring
1938 1939 sorted_bonds.sort(key=get_bond_ring_score) 1940 1941 visited_bond_indices = set() 1942 num_remaining_atoms = len(mol.atoms) - 2 1943 num_remaining_bonds = len(mol.bonds) 1944 for bond_index, bond in sorted_bonds: #enumerate(mol.bonds): # 1945 #print "bond_index", bond_index, len(mol.bonds) 1946 visited_bond_indices.add(bond_index) 1947 num_remaining_bonds -= 1 1948 subgraph = Subgraph(bond.atom_indices, frozenset([bond_index])) 1949 1950 # I lie about the remaining atom/bond sizes here. 1951 if prune(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes): 1952 continue 1953 # bond.canonical_bondtype doesn't necessarily give the same 1954 # SMARTS as make_canonical_smarts, but that doesn't matter. 1955 # 1) I know it's canonical, 2) it's faster, and 3) there is 1956 # no place else which generates single-bond canonical SMARTS. 1957 #smarts = make_canonical_smarts(subgraph, mol, atom_assignment) 1958 smarts = bond.canonical_bondtype 1959 if matches_all_targets[smarts]: 1960 best_sizes = hits.add_new_match(subgraph, mol, smarts) 1961 else: 1962 # This can happen if there's a threshold 1963 #raise AssertionError("This should never happen: %r" % (smarts,)) 1964 continue 1965 1966 a1, a2 = bond.atom_indices 1967 outgoing_edges = [ 1968 e for e in (directed_edges[a1] + directed_edges[a2]) 1969 if e.end_atom_index not in bond.atom_indices and e.bond_index not in visited_bond_indices 1970 ] 1971 1972 empty_internal = frozenset() 1973 if not outgoing_edges: 1974 pass 1975 else: 1976 # The priority is the number of bonds in the subgraph, ordered so 1977 # that the subgraph with the most bonds comes first. Since heapq 1978 # puts the smallest value first, I reverse the number. The initial 1979 # subgraphs have 1 bond, so the initial score is -1. 1980 heappush(seeds, (-1, tiebreaker(), subgraph, visited_bond_indices.copy(), empty_internal, 1981 outgoing_edges, mol, directed_edges)) 1982 1983 # I made so many subtle mistakes where I used 'subgraph' instead 1984 # of 'new_subgraph' in the following section that I finally 1985 # decided to get rid of 'subgraph' and use 'old_subgraph' instead. 1986 del subgraph 1987 1988 while seeds: 1989 if end_time: 1990 if time.time() >= end_time: 1991 return False 1992 1993 #print "There are", len(seeds), "seeds", seeds[0][:2] 1994 score, _, old_subgraph, visited_bond_indices, internal_bonds, external_edges, mol, directed_edges = heappop( 1995 seeds) 1996 1997 new_visited_bond_indices = visited_bond_indices.copy() 1998 new_visited_bond_indices.update(internal_bonds) 1999 ## for edge in external_edges: 2000 ## assert edge.bond_index not in new_visited_bond_indices 2001 new_visited_bond_indices.update(edge.bond_index for edge in external_edges) 2002 2003 for new_atoms, new_subgraph, num_remaining_atoms, num_remaining_bonds in \ 2004 all_subgraph_extensions(mol, old_subgraph, visited_bond_indices, internal_bonds, external_edges): 2005 if prune(new_subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes): 2006 #print "PRUNE", make_canonical_smarts(new_subgraph, mol, atom_assignment) 2007 continue 2008 smarts = make_canonical_smarts(new_subgraph, mol, atom_assignment) 2009 if matches_all_targets[smarts]: 2010 #print "YES", smarts 2011 best_sizes = hits.add_new_match(new_subgraph, mol, smarts) 2012 else: 2013 #print "NO", smarts 2014 continue 2015 2016 if not new_atoms: 2017 continue 2018 2019 new_internal_bonds, new_external_edges = find_extensions(new_atoms, new_visited_bond_indices, 2020 directed_edges) 2021 2022 if new_internal_bonds or new_external_edges: 2023 # Rank so the subgraph with the highest number of bonds comes first 2024 heappush(seeds, (-len(new_subgraph.bond_indices), tiebreaker(), new_subgraph, 2025 new_visited_bond_indices, new_internal_bonds, new_external_edges, mol, 2026 directed_edges)) 2027 2028 return True 2029 2030 2031 # Assign a unique identifier to every unique key
2032 -class Uniquer(dict):
2033
2034 - def __init__(self):
2035 self.counter = _Counter()
2036
2037 - def __missing__(self, key):
2038 self[key] = count = self.counter() 2039 return count
2040 2041 2042 # This is here only so I can see it in the profile statistics
2043 -def MATCH(mol, pat):
2044 return mol.HasSubstructMatch(pat)
2045 2046
2047 -class VerboseHeapOps(object):
2048
2049 - def __init__(self, trigger, verboseDelay):
2050 self.num_seeds_added = 0 2051 self.num_seeds_processed = 0 2052 self.verboseDelay = verboseDelay 2053 self._time_for_next_report = time.time() + verboseDelay 2054 self.trigger = trigger
2055
2056 - def heappush(self, seeds, item):
2057 self.num_seeds_added += 1 2058 return heappush(seeds, item)
2059
2060 - def heappop(self, seeds):
2061 if time.time() >= self._time_for_next_report: 2062 self.trigger() 2063 self.report() 2064 self._time_for_next_report = time.time() + self.verboseDelay 2065 self.num_seeds_processed += 1 2066 return heappop(seeds)
2067
2068 - def trigger_report(self):
2069 self.trigger() 2070 self.report()
2071
2072 - def report(self):
2073 print >> sys.stderr, " %d subgraphs enumerated, %d processed" % (self.num_seeds_added, 2074 self.num_seeds_processed)
2075 2076
2077 -def compute_mcs(fragmented_mols, typed_mols, minNumAtoms, threshold_count=None, 2078 maximize=Default.maximize, completeRingsOnly=Default.completeRingsOnly, 2079 timeout=Default.timeout, timer=None, verbose=False, verboseDelay=1.0):
2080 assert timer is not None 2081 assert 0 < threshold_count <= len(fragmented_mols), threshold_count 2082 assert len(fragmented_mols) == len(typed_mols) 2083 assert len(fragmented_mols) >= 2 2084 if threshold_count is None: 2085 threshold_count = len(fragmented_mols) 2086 else: 2087 assert threshold_count >= 2, threshold_count 2088 2089 atom_assignment = Uniquer() 2090 if verbose: 2091 if verboseDelay < 0.0: 2092 raise ValueError("verboseDelay may not be negative") 2093 matches_all_targets = VerboseCachingTargetsMatcher(typed_mols[1:], threshold_count - 1) 2094 heapops = VerboseHeapOps(matches_all_targets.report, verboseDelay) 2095 push = heapops.heappush 2096 pop = heapops.heappop 2097 end_verbose = heapops.trigger_report 2098 else: 2099 matches_all_targets = CachingTargetsMatcher(typed_mols[1:], threshold_count - 1) 2100 push = heappush 2101 pop = heappop 2102 end_verbose = lambda: 1 2103 2104 try: 2105 prune, hits_class = _maximize_options[(maximize, bool(completeRingsOnly))] 2106 except KeyError: 2107 raise ValueError("Unknown 'maximize' option %r" % (maximize, )) 2108 2109 hits = hits_class(timer, verbose) 2110 2111 remaining_time = None 2112 if timeout is not None: 2113 stop_time = time.time() + timeout 2114 2115 for query_index, fragmented_query_mol in enumerate(fragmented_mols): 2116 enumerated_query_fragments = fragmented_mol_to_enumeration_mols(fragmented_query_mol, 2117 minNumAtoms) 2118 2119 targets = typed_mols 2120 if timeout is not None: 2121 remaining_time = stop_time - time.time() 2122 success = enumerate_subgraphs(enumerated_query_fragments, prune, atom_assignment, 2123 matches_all_targets, hits, remaining_time, push, pop) 2124 if query_index + threshold_count >= len(fragmented_mols): 2125 break 2126 if not success: 2127 break 2128 matches_all_targets.shift_targets() 2129 2130 end_verbose() 2131 2132 result = hits.get_result(success) 2133 if result.num_atoms < minNumAtoms: 2134 return MCSResult(-1, -1, None, result.completed) 2135 return result
2136 2137 ########## Main driver for the MCS code 2138 2139
2140 -class Timer(object):
2141
2142 - def __init__(self):
2143 self.mark_times = {}
2144
2145 - def mark(self, name):
2146 self.mark_times[name] = time.time()
2147 2148
2149 -def _update_times(timer, times):
2150 if times is None: 2151 return 2152 for (dest, start, end) in ( 2153 ("fragment", "start fmcs", "end fragment"), ("select", "end fragment", "end select"), 2154 ("enumerate", "end select", "end fmcs"), ("best_found", "start fmcs", "new best"), 2155 ("mcs", "start fmcs", "end fmcs")): 2156 try: 2157 diff = timer.mark_times[end] - timer.mark_times[start] 2158 except KeyError: 2159 diff = None 2160 times[dest] = diff
2161 2162
2163 -def _get_threshold_count(num_mols, threshold):
2164 if threshold is None: 2165 return num_mols 2166 2167 x = num_mols * threshold 2168 threshold_count = int(x) 2169 if threshold_count < x: 2170 threshold_count += 1 2171 2172 if threshold_count < 2: 2173 # You can specify 0.00001 or -2.3 but you'll still get 2174 # at least one *common* substructure. 2175 threshold_count = 2 2176 2177 return threshold_count
2178 2179
2180 -def fmcs(mols, 2181 minNumAtoms=2, 2182 maximize=Default.maximize, 2183 atomCompare=Default.atomCompare, 2184 bondCompare=Default.bondCompare, 2185 threshold=1.0, 2186 matchValences=Default.matchValences, 2187 ringMatchesRingOnly=False, 2188 completeRingsOnly=False, 2189 timeout=Default.timeout, 2190 times=None, 2191 verbose=False, 2192 verboseDelay=1.0, ):
2193 2194 timer = Timer() 2195 timer.mark("start fmcs") 2196 2197 if minNumAtoms < 2: 2198 raise ValueError("minNumAtoms must be at least 2") 2199 if timeout is not None: 2200 if timeout <= 0.0: 2201 raise ValueError("timeout must be None or a positive value") 2202 2203 threshold_count = _get_threshold_count(len(mols), threshold) 2204 if threshold_count > len(mols): 2205 # Threshold is too high. No possible matches. 2206 return MCSResult(-1, -1, None, 1) 2207 2208 if completeRingsOnly: 2209 ringMatchesRingOnly = True 2210 2211 try: 2212 atom_typer = atom_typers[atomCompare] 2213 except KeyError: 2214 raise ValueError("Unknown atomCompare option %r" % (atomCompare, )) 2215 try: 2216 bond_typer = bond_typers[bondCompare] 2217 except KeyError: 2218 raise ValueError("Unknown bondCompare option %r" % (bondCompare, )) 2219 2220 # Make copies of all of the molecules so I can edit without worrying about the original 2221 typed_mols = convert_input_to_typed_molecules(mols, atom_typer, bond_typer, 2222 matchValences=matchValences, 2223 ringMatchesRingOnly=ringMatchesRingOnly) 2224 bondtype_counts = get_canonical_bondtype_counts(typed_mols) 2225 supported_bondtypes = set() 2226 for bondtype, count_list in bondtype_counts.items(): 2227 if len(count_list) >= threshold_count: 2228 supported_bondtypes.add(bondtype) 2229 # For better filtering, find the largest count which is in threshold 2230 # Keep track of the counts while building the subgraph. 2231 # The subgraph can never have more types of a given count. 2232 2233 fragmented_mols = [remove_unknown_bondtypes(typed_mol, bondtype_counts) 2234 for typed_mol in typed_mols] 2235 timer.mark("end fragment") 2236 2237 sizes = [] 2238 max_num_atoms = fragmented_mols[0].rdmol.GetNumAtoms() 2239 max_num_bonds = fragmented_mols[0].rdmol.GetNumBonds() 2240 ignored_count = 0 2241 for tiebreaker, (typed_mol, fragmented_mol) in enumerate(zip(typed_mols, fragmented_mols)): 2242 num_atoms, num_bonds = find_upper_fragment_size_limits(fragmented_mol.rdmol, 2243 fragmented_mol.rdmol_atoms) 2244 if num_atoms < minNumAtoms: 2245 # This isn't big enough to be in the MCS 2246 ignored_count += 1 2247 if ignored_count + threshold_count > len(mols): 2248 # I might be able to exit because enough of the molecules don't have 2249 # a large enough fragment to be part of the MCS 2250 timer.mark("end select") 2251 timer.mark("end fmcs") 2252 _update_times(timer, times) 2253 return MCSResult(-1, -1, None, True) 2254 else: 2255 if num_atoms < max_num_atoms: 2256 max_num_atoms = num_atoms 2257 if num_bonds < max_num_bonds: 2258 max_num_bonds = num_bonds 2259 sizes.append((num_bonds, num_atoms, tiebreaker, typed_mol, fragmented_mol)) 2260 2261 if len(sizes) < threshold_count: 2262 timer.mark("end select") 2263 timer.mark("end fmcs") 2264 _update_times(timer, times) 2265 return MCSResult(-1, -1, None, True) 2266 2267 assert min(size[1] for size in sizes) >= minNumAtoms 2268 2269 # Sort so the molecule with the smallest largest fragment (by bonds) comes first. 2270 # Break ties with the smallest number of atoms. 2271 # Break secondary ties by position. 2272 sizes.sort() 2273 #print "Using", Chem.MolToSmiles(sizes[0][4].rdmol) 2274 2275 timer.mark("end select") 2276 2277 # Extract the (typed mol, fragmented mol) pairs. 2278 fragmented_mols = [size_info[4] for size_info in sizes] # used as queries 2279 typed_mols = [size_info[3].rdmol for size_info in sizes] # used as targets 2280 2281 timer.mark("start enumeration") 2282 mcs_result = compute_mcs(fragmented_mols, typed_mols, minNumAtoms, 2283 threshold_count=threshold_count, maximize=maximize, 2284 completeRingsOnly=completeRingsOnly, timeout=timeout, timer=timer, 2285 verbose=verbose, verboseDelay=verboseDelay) 2286 timer.mark("end fmcs") 2287 _update_times(timer, times) 2288 return mcs_result
2289 2290 ######### Helper functions to generate structure/fragment output given an MCS match 2291 2292 # Given a Subgraph (with atom and bond indices) describing a 2293 # fragment, make a new molecule object with only that fragment 2294 2295
2296 -def subgraph_to_fragment(mol, subgraph):
2297 emol = Chem.EditableMol(Chem.Mol()) 2298 atom_map = {} 2299 for atom_index in subgraph.atom_indices: 2300 emol.AddAtom(mol.GetAtomWithIdx(atom_index)) 2301 atom_map[atom_index] = len(atom_map) 2302 2303 for bond_index in subgraph.bond_indices: 2304 bond = mol.GetBondWithIdx(bond_index) 2305 emol.AddBond(atom_map[bond.GetBeginAtomIdx()], atom_map[bond.GetEndAtomIdx()], 2306 bond.GetBondType()) 2307 2308 return emol.GetMol()
2309 2310 2311 # Convert a subgraph into a SMILES
2312 -def make_fragment_smiles(mcs, mol, subgraph, args=None):
2313 fragment = subgraph_to_fragment(mol, subgraph) 2314 new_smiles = Chem.MolToSmiles(fragment) 2315 return "%s %s\n" % (new_smiles, mol.GetProp("_Name"))
2316 2317
2318 -def _copy_sd_tags(mol, fragment):
2319 fragment.SetProp("_Name", mol.GetProp("_Name")) 2320 # Copy the existing names over 2321 for name in mol.GetPropNames(): 2322 if name.startswith("_"): 2323 continue 2324 fragment.SetProp(name, mol.GetProp(name))
2325 2326
2327 -def _MolToSDBlock(mol):
2328 # Huh?! There's no way to get the entire SD record? 2329 mol_block = Chem.MolToMolBlock(mol, kekulize=False) 2330 tag_data = [] 2331 for name in mol.GetPropNames(): 2332 if name.startswith("_"): 2333 continue 2334 value = mol.GetProp(name) 2335 tag_data.append("> <" + name + ">\n") 2336 tag_data.append(value + "\n") 2337 tag_data.append("\n") 2338 tag_data.append("$$$$\n") 2339 return mol_block + "".join(tag_data)
2340 2341
2342 -def _save_other_tags(mol, fragment, mcs, orig_mol, subgraph, args):
2343 if args.save_counts_tag is not None: 2344 if not mcs: 2345 line = "-1 -1 -1" 2346 elif mcs.num_atoms == 0: 2347 line = "0 0 0" 2348 else: 2349 line = "1 %d %d" % (mcs.num_atoms, mcs.num_bonds) 2350 mol.SetProp(args.save_counts_tag, line) 2351 2352 if args.save_smiles_tag is not None: 2353 if mcs and mcs.num_atoms > 0: 2354 smiles = Chem.MolToSmiles(fragment) 2355 else: 2356 smiles = "-" 2357 mol.SetProp(args.save_smiles_tag, smiles) 2358 2359 if args.save_smarts_tag is not None: 2360 if mcs and mcs.num_atoms > 0: 2361 smarts = mcs.smarts 2362 else: 2363 smarts = "-" 2364 mol.SetProp(args.save_smarts_tag, smarts)
2365 2366 2367 # Convert a subgraph into an SD file
2368 -def make_fragment_sdf(mcs, mol, subgraph, args):
2369 fragment = subgraph_to_fragment(mol, subgraph) 2370 Chem.FastFindRings(fragment) 2371 _copy_sd_tags(mol, fragment) 2372 2373 if args.save_atom_class_tag is not None: 2374 output_tag = args.save_atom_class_tag 2375 atom_classes = get_selected_atom_classes(mol, subgraph.atom_indices) 2376 if atom_classes is not None: 2377 fragment.SetProp(output_tag, " ".join(map(str, atom_classes))) 2378 2379 _save_other_tags(fragment, fragment, mcs, mol, subgraph, args) 2380 2381 return _MolToSDBlock(fragment)
2382 2383 2384 #
2385 -def make_complete_sdf(mcs, mol, subgraph, args):
2386 fragment = copy.copy(mol) 2387 _copy_sd_tags(mol, fragment) 2388 2389 if args.save_atom_indices_tag is not None: 2390 output_tag = args.save_atom_indices_tag 2391 s = " ".join(str(index) for index in subgraph.atom_indices) 2392 fragment.SetProp(output_tag, s) 2393 2394 _save_other_tags(fragment, subgraph_to_fragment(mol, subgraph), mcs, mol, subgraph, args) 2395 2396 return _MolToSDBlock(fragment)
2397 2398 2399 structure_format_functions = { 2400 "fragment-smiles": make_fragment_smiles, 2401 "fragment-sdf": make_fragment_sdf, 2402 "complete-sdf": make_complete_sdf, 2403 } 2404 2405
2406 -def make_structure_format(format_name, mcs, mol, subgraph, args):
2407 try: 2408 func = structure_format_functions[format_name] 2409 except KeyError: 2410 raise ValueError("Unknown format %r" % (format_name, )) 2411 return func(mcs, mol, subgraph, args)
2412 2413
2414 -def parse_num_atoms(s):
2415 num_atoms = int(s) 2416 if num_atoms < 2: 2417 raise argparse.ArgumentTypeError("must be at least 2, not %s" % s) 2418 return num_atoms
2419 2420
2421 -def parse_threshold(s):
2422 try: 2423 import fractions 2424 except ImportError: 2425 threshold = float(s) 2426 one = 1.0 2427 else: 2428 threshold = fractions.Fraction(s) 2429 one = fractions.Fraction(1) 2430 if not (0 <= threshold <= one): 2431 raise argparse.ArgumentTypeError("must be a value between 0.0 and 1.0, not %s" % s) 2432 return threshold
2433 2434
2435 -def parse_timeout(s):
2436 if s == "none": 2437 return None 2438 timeout = float(s) 2439 if timeout < 0.0: 2440 raise argparse.ArgumentTypeError("Must be a non-negative value, not %r" % (s, )) 2441 return timeout
2442 2443
2444 -class starting_from(object):
2445
2446 - def __init__(self, left):
2447 self.left = left
2448
2449 - def __contains__(self, value):
2450 return self.left <= value
2451 2452 2453 range_pat = re.compile(r"(\d+)-(\d*)") 2454 value_pat = re.compile("(\d+)") 2455 2456
2457 -def parse_select(s):
2458 ranges = [] 2459 start = 0 2460 while 1: 2461 m = range_pat.match(s, start) 2462 if m is not None: 2463 # Selected from 'left' to (and including) 'right' 2464 # Convert into range fields, starting from 0 2465 left = int(m.group(1)) 2466 right = m.group(2) 2467 if not right: 2468 ranges.append(starting_from(left - 1)) 2469 else: 2470 ranges.append(range(left - 1, int(right))) 2471 else: 2472 # Selected a single value 2473 m = value_pat.match(s, start) 2474 if m is not None: 2475 val = int(m.group(1)) 2476 ranges.append(range(val - 1, val)) 2477 else: 2478 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" % (start + 1, s)) 2479 start = m.end() 2480 # Check if this is the end of string or a ',' 2481 t = s[start:start + 1] 2482 if not t: 2483 break 2484 if t == ",": 2485 start += 1 2486 continue 2487 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" % (start + 1, s)) 2488 return ranges
2489 2490 2491 compare_shortcuts = { 2492 "topology": ("any", "any"), 2493 "elements": ("elements", "any"), 2494 "types": ("elements", "bondtypes"), 2495 } 2496 2497 2498 # RDKit's match function only returns the atom indices of the match. 2499 # To get the bond indices, I need to go through the pattern molecule.
2500 -def _get_match_bond_indices(pat, mol, match_atom_indices):
2501 bond_indices = [] 2502 for bond in pat.GetBonds(): 2503 mol_atom1 = match_atom_indices[bond.GetBeginAtomIdx()] 2504 mol_atom2 = match_atom_indices[bond.GetEndAtomIdx()] 2505 bond = mol.GetBondBetweenAtoms(mol_atom1, mol_atom2) 2506 assert bond is not None 2507 bond_indices.append(bond.GetIdx()) 2508 return bond_indices
2509 2510
2511 -def main(args=None):
2512 parser = argparse.ArgumentParser( 2513 description="Find the maximum common substructure of a set of structures", 2514 epilog="For more details on these options, see https://bitbucket.org/dalke/fmcs/") 2515 parser.add_argument("filename", nargs=1, help="SDF or SMILES file") 2516 2517 parser.add_argument("--maximize", choices=["atoms", "bonds"], default=Default.maximize, 2518 help="Maximize the number of 'atoms' or 'bonds' in the MCS. (Default: %s)" % 2519 (Default.maximize, )) 2520 parser.add_argument("--min-num-atoms", type=parse_num_atoms, default=2, metavar="INT", 2521 help="Minimimum number of atoms in the MCS (Default: 2)") 2522 2523 class CompareAction(argparse.Action): 2524 2525 def __call__(self, parser, namespace, value, option_string=None): 2526 atomCompare_name, bondCompare_name = compare_shortcuts[value] 2527 namespace.atomCompare = atomCompare_name 2528 namespace.bondCompare = bondCompare_name
2529 2530 parser.add_argument( 2531 "--compare", choices=["topology", "elements", "types"], default=None, action=CompareAction, 2532 help="Use 'topology' as a shorthand for '--atom-compare any --bond-compare any', " 2533 "'elements' is '--atom-compare elements --bond-compare any', " 2534 "and 'types' is '--atom-compare elements --bond-compare bondtypes' " 2535 "(Default: types)") 2536 2537 parser.add_argument( 2538 "--atom-compare", choices=["any", "elements", "isotopes"], default=None, help=( 2539 "Specify the atom comparison method. With 'any', every atom matches every " 2540 "other atom. With 'elements', atoms match only if they contain the same element. " 2541 "With 'isotopes', atoms match only if they have the same isotope number; element " 2542 "information is ignored so [5C] and [5P] are identical. This can be used to " 2543 "implement user-defined atom typing. " 2544 "(Default: elements)")) 2545 2546 parser.add_argument("--bond-compare", choices=["any", "bondtypes"], default="bondtypes", help=( 2547 "Specify the bond comparison method. With 'any', every bond matches every " 2548 "other bond. With 'bondtypes', bonds are the same only if their bond types " 2549 "are the same. (Default: bondtypes)")) 2550 2551 parser.add_argument( 2552 "--threshold", default="1.0", type=parse_threshold, 2553 help="Minimum structure match threshold. A value of 1.0 means that the common " 2554 "substructure must be in all of the input structures. A value of 0.8 finds " 2555 "the largest substructure which is common to at least 80%% of the input " 2556 "structures. (Default: 1.0)") 2557 2558 parser.add_argument( 2559 "--atom-class-tag", metavar="TAG", 2560 help="Use atom class assignments from the field 'TAG'. The tag data must contain a space " 2561 "separated list of integers in the range 1-10000, one for each atom. Atoms are " 2562 "identical if and only if their corresponding atom classes are the same. Note " 2563 "that '003' and '3' are treated as identical values. (Not used by default)") 2564 2565 ## parser.add_argument("--match-valences", action="store_true", 2566 ## help= 2567 ## "Modify the atom comparison so that two atoms must also have the same total " 2568 ## "bond order in order to match.") 2569 2570 parser.add_argument( 2571 "--ring-matches-ring-only", action="store_true", 2572 help="Modify the bond comparison so that ring bonds only match ring bonds and chain " 2573 "bonds only match chain bonds. (Ring atoms can still match non-ring atoms.) ") 2574 2575 parser.add_argument( 2576 "--complete-rings-only", action="store_true", 2577 help="If a bond is a ring bond in the input structures and a bond is in the MCS " 2578 "then the bond must also be in a ring in the MCS. Selecting this option also " 2579 "enables --ring-matches-ring-only.") 2580 2581 parser.add_argument( 2582 "--select", type=parse_select, action="store", default="1-", 2583 help="Select a subset of the input records to process. Example: 1-10,13,20,50- " 2584 "(Default: '1-', which selects all structures)") 2585 2586 parser.add_argument("--timeout", type=parse_timeout, metavar="SECONDS", default=Default.timeout, 2587 help="Report the best solution after running for at most 'timeout' seconds. " 2588 "Use 'none' for no timeout. (Default: %s)" % (Default.timeoutString, )) 2589 2590 parser.add_argument("--output", "-o", metavar="FILENAME", 2591 help="Write the results to FILENAME (Default: use stdout)") 2592 2593 parser.add_argument( 2594 "--output-format", choices=["smarts", "fragment-smiles", "fragment-sdf", "complete-sdf"], 2595 default="smarts", 2596 help="'smarts' writes the SMARTS pattern including the atom and bond criteria. " 2597 "'fragment-smiles' writes a matching fragment as a SMILES string. " 2598 "'fragment-sdf' writes a matching fragment as a SD file; see --save-atom-class for " 2599 "details on how atom class information is saved. " 2600 "'complete-sdf' writes the entire SD file with the fragment information stored in " 2601 "the tag specified by --save-fragment-indices-tag. (Default: smarts)") 2602 2603 parser.add_argument( 2604 "--output-all", action="store_true", 2605 help="By default the structure output formats only show an MCS for the first input structure. " 2606 "If this option is enabled then an MCS for all of the structures are shown.") 2607 2608 parser.add_argument( 2609 "--save-atom-class-tag", metavar="TAG", 2610 help="If atom classes are specified (via --class-tag) and the output format is 'fragment-sdf' " 2611 "then save the substructure atom classes to the tag TAG, in fragment atom order. By " 2612 "default this is the value of --atom-class-tag.") 2613 2614 parser.add_argument( 2615 "--save-counts-tag", metavar="TAG", 2616 help="Save the fragment count, atom count, and bond count to the specified SD tag as " 2617 "space separated integers, like '1 9 8'. (The fragment count will not be larger than " 2618 "1 until fmcs supports disconnected MCSes.)") 2619 2620 parser.add_argument("--save-atom-indices-tag", metavar="TAG", 2621 help="If atom classes are specified and the output format is 'complete-sdf' " 2622 "then save the MCS fragment atom indices to the tag TAG, in MCS order. " 2623 "(Default: mcs-atom-indices)") 2624 2625 parser.add_argument( 2626 "--save-smarts-tag", metavar="TAG", 2627 help="Save the MCS SMARTS to the specified SD tag. Uses '-' if there is no MCS") 2628 2629 parser.add_argument( 2630 "--save-smiles-tag", metavar="TAG", 2631 help="Save the fragment SMILES to the specified SD tag. Uses '-' if there is no MCS") 2632 2633 parser.add_argument("--times", action="store_true", help="Print timing information to stderr") 2634 parser.add_argument("-v", "--verbose", action="count", dest="verbosity", 2635 help="Print progress statistics to stderr. Use twice for higher verbosity.") 2636 parser.add_argument("--version", action="version", version="%(prog)s " + __version__) 2637 2638 args = parser.parse_args(args) 2639 2640 filename = args.filename[0] 2641 fname = filename.lower() 2642 if fname.endswith(".smi"): 2643 try: 2644 reader = Chem.SmilesMolSupplier(filename, titleLine=False) 2645 except IOError: 2646 raise SystemExit("Unable to open SMILES file %r" % (filename, )) 2647 elif fname.endswith(".sdf"): 2648 try: 2649 reader = Chem.SDMolSupplier(filename) 2650 except IOError: 2651 raise SystemExit("Unable to open SD file %r" % (filename, )) 2652 elif fname.endswith(".gz"): 2653 raise SystemExit("gzip compressed files not yet supported") 2654 else: 2655 raise SystemExit("Only SMILES (.smi) and SDF (.sdf) files are supported") 2656 2657 if args.minNumAtoms < 2: 2658 parser.error("--min-num-atoms must be at least 2") 2659 2660 if args.atomCompare is None: 2661 if args.atom_class_tag is None: 2662 args.atomCompare = "elements" # Default atom comparison 2663 else: 2664 args.atomCompare = "isotopes" # Assing the atom classes to the isotope fields 2665 else: 2666 if args.atom_class_tag is not None: 2667 parser.error("Cannot specify both --atom-compare and --atom-class-tag fields") 2668 2669 # RDKit uses special property names starting with "_" 2670 # It's dangerous to use some of them directly 2671 for name in ("atom_class_tag", "save_atom_class_tag", "save_counts_tag", "save_atom_indices_tag", 2672 "save_smarts_tag", "save_smiles_tag"): 2673 value = getattr(args, name) 2674 if value is not None: 2675 if value.startswith("_"): 2676 parser.error("--%s value may not start with a '_': %r" % (name.replace("_", "-"), value)) 2677 2678 # Set up some defaults depending on the output format 2679 atom_class_tag = args.atom_class_tag 2680 if args.output_format == "fragment-sdf": 2681 if atom_class_tag is not None: 2682 if args.save_atom_class_tag is None: 2683 args.save_atom_class_tag = atom_class_tag 2684 2685 if args.output_format == "complete-sdf": 2686 if (args.save_atom_indices_tag is None and args.save_counts_tag is None and 2687 args.save_smiles_tag is None and args.save_smarts_tag is None): 2688 parser.error("Using --output-format complete-sdf is useless without at least one " 2689 "of --save-atom-indices-tag, --save-smarts-tag, --save-smiles-tag, " 2690 "or --save-counts-tag") 2691 2692 t1 = time.time() 2693 structures = [] 2694 if args.verbosity > 1: 2695 sys.stderr.write("Loading structures from %s ..." % (filename, )) 2696 2697 for molno, mol in enumerate(reader): 2698 if not any(molno in range_ for range_ in args.select): 2699 continue 2700 if mol is None: 2701 print >> sys.stderr, "Skipping unreadable structure #%d" % (molno + 1, ) 2702 continue 2703 if atom_class_tag is not None: 2704 try: 2705 assign_isotopes_from_class_tag(mol, atom_class_tag) 2706 except ValueError as err: 2707 raise SystemExit("Structure #%d: %s" % (molno + 1, err)) 2708 structures.append(mol) 2709 if args.verbosity > 1: 2710 if len(structures) % 100 == 0: 2711 sys.stderr.write("\rLoaded %d structures from %s ..." % (len(structures), filename)) 2712 sys.stderr.flush() # not needed; it's stderr. But I'm cautious. 2713 2714 if args.verbosity > 1: 2715 sys.stderr.write("\r") 2716 2717 times = {"load": time.time() - t1} 2718 2719 if args.verbosity: 2720 print >> sys.stderr, "Loaded", len(structures), "structures from", filename, " " 2721 2722 if len(structures) < 2: 2723 raise SystemExit("Input file %r must contain at least two structures" % (filename, )) 2724 2725 mcs = fmcs(structures, 2726 minNumAtoms=args.minNumAtoms, 2727 maximize=args.maximize, 2728 atomCompare=args.atomCompare, 2729 bondCompare=args.bondCompare, 2730 threshold=args.threshold, 2731 #matchValences = args.matchValences, 2732 matchValences=False, # Do I really want to support this? 2733 ringMatchesRingOnly=args.ringMatchesRingOnly, 2734 completeRingsOnly=args.completeRingsOnly, 2735 timeout=args.timeout, 2736 times=times, 2737 verbose=args.verbosity > 1, 2738 verboseDelay=1.0, ) 2739 2740 msg_format = "Total time %(total).2f seconds: load %(load).2f fragment %(fragment).2f select %(select).2f enumerate %(enumerate).2f" 2741 times["total"] = times["mcs"] + times["load"] 2742 2743 if mcs and mcs.completed: 2744 msg_format += " (MCS found after %(best_found).2f)" 2745 2746 del mol 2747 2748 if args.output: 2749 outfile = open(args.output, "w") 2750 else: 2751 outfile = sys.stdout 2752 2753 if args.output_format == "smarts": 2754 if not mcs: 2755 outfile.write("No MCS found\n") 2756 else: 2757 if mcs.completed: 2758 status = "(complete search)" 2759 else: 2760 status = "(timed out)" 2761 outfile.write("%s %d atoms %d bonds %s\n" % (mcs.smarts, mcs.num_atoms, mcs.num_bonds, 2762 status)) 2763 2764 else: 2765 if mcs.smarts is None: 2766 # There is no MCS. Use something which can't match. 2767 pat = Chem.MolFromSmarts("[CN]") 2768 else: 2769 # Need to make a structure output 2770 pat = Chem.MolFromSmarts(mcs.smarts) 2771 for structure in structures: 2772 atom_indices = structure.GetSubstructMatch(pat) 2773 if not atom_indices: 2774 # The only time that a SMARTS shouldn't match an input 2775 # structure is if there's a threshold cutoff and this 2776 # structure didn't make it. 2777 assert args.threshold < 1, "No indices but should have matched everything!" 2778 continue 2779 bond_indices = _get_match_bond_indices(pat, structure, atom_indices) 2780 subgraph = Subgraph(atom_indices, bond_indices) 2781 if atom_class_tag: 2782 restore_isotopes(structure) 2783 2784 outfile.write(make_structure_format(args.output_format, mcs, structure, subgraph, args)) 2785 2786 if not args.output_all: 2787 break 2788 2789 if args.output: 2790 outfile.close() 2791 2792 if args.times or args.verbosity: 2793 print >> sys.stderr, msg_format % times 2794 2795 2796 if __name__ == "__main__": 2797 import argparse 2798 main(sys.argv[1:]) 2799