MolScribe

This is the repository for MolScribe, an image-to-graph model that translates a molecular image to its chemical structure. Try our demo on HuggingFace!

If you use MolScribe in your research, please cite our paper.

@article{
    MolScribe,
    title = {{MolScribe}: Robust Molecular Structure Recognition with Image-to-Graph Generation},
    author = {Yujie Qian and Jiang Guo and Zhengkai Tu and Zhening Li and Connor W. Coley and Regina Barzilay},
    journal = {Journal of Chemical Information and Modeling},
    publisher = {American Chemical Society ({ACS})},
    doi = {10.1021/acs.jcim.2c01480},
    year = 2023,
}

Please check out our subsequent works on parsing chemical diagrams:

Quick Start

Installation

Option 1: Install MolScribe with pip

pip install MolScribe

Option 2: Run the following commands to install the package and its dependencies from source

git clone git@github.com:thomas0809/MolScribe.git
cd MolScribe
python setup.py install

Example

Download the MolScribe checkpoint from HuggingFace Hub and predict molecular structures:

import torch
from molscribe import MolScribe
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth')

model = MolScribe(ckpt_path, device=torch.device('cpu'))
output = model.predict_image_file('assets/example.png', return_atoms_bonds=True, return_confidence=True)

The output is a dictionary with the following format:

{
    'smiles': 'Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1',
    'molfile': '***', 
    'confidence': 0.9175,
    'atoms': [{'atom_symbol': '[Ph]', 'x': 0.5714, 'y': 0.9523, 'confidence': 0.9127}, ... ],
    'bonds': [{'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.9999}, ... ]
}

Please refer to molscribe/interface.py and notebook/predict.ipynb for details and other available APIs.
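
As a rough illustration of how the returned dictionary might be consumed, the sketch below flags low-confidence atoms and builds an adjacency list from the bonds. The sample values, the 0.8 threshold, and the helper names low_confidence_atoms and adjacency are hypothetical. It also assumes that endpoint_atoms holds indices into the atoms list, as the example output above suggests.

```python
from collections import defaultdict

# Hypothetical prediction shaped like the documented output dictionary.
# The values are made up for illustration; they are not real model output.
output = {
    'smiles': 'CCO',
    'confidence': 0.91,
    'atoms': [
        {'atom_symbol': 'C', 'x': 0.10, 'y': 0.50, 'confidence': 0.99},
        {'atom_symbol': 'C', 'x': 0.45, 'y': 0.40, 'confidence': 0.97},
        {'atom_symbol': 'O', 'x': 0.80, 'y': 0.50, 'confidence': 0.62},
    ],
    'bonds': [
        {'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.99},
        {'bond_type': 'single', 'endpoint_atoms': [1, 2], 'confidence': 0.71},
    ],
}


def low_confidence_atoms(pred, threshold=0.8):
    """Return (index, symbol, confidence) for atoms scored below threshold."""
    return [(i, atom['atom_symbol'], atom['confidence'])
            for i, atom in enumerate(pred['atoms'])
            if atom['confidence'] < threshold]


def adjacency(pred):
    """Map each atom index to its (neighbor index, bond type) pairs.

    Assumes endpoint_atoms are indices into pred['atoms'].
    """
    neighbors = defaultdict(list)
    for bond in pred['bonds']:
        i, j = bond['endpoint_atoms']
        neighbors[i].append((j, bond['bond_type']))
        neighbors[j].append((i, bond['bond_type']))
    return dict(neighbors)


print(low_confidence_atoms(output))
print(adjacency(output)[1])
```

A check like this can help decide which predictions to review by hand before using the SMILES downstream.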

For development or reproducing the experiments, please follow the instructions below.

Experiments

Requirements

Install the required packages

pip install -r requirements.txt

Data

For training or evaluation, please download the corresponding datasets to data/.

Training data:

Datasets | Description
USPTO (Download) | Downloaded from USPTO, Grant Red Book.
PubChem (Download) | Molecules are downloaded from PubChem, and images are dynamically rendered during training.

Benchmarks:

Category | Datasets | Description
Synthetic (Download) | Indigo, ChemDraw | Images are rendered by Indigo and ChemDraw.
Realistic (Download) | CLEF, UOB, USPTO, Staker, ACS | CLEF, UOB, and USPTO are downloaded from https://github.com/Kohulan/OCSR_Review. Staker is downloaded from https://drive.google.com/drive/folders/16OjPwQ7bQ486VhdX4DWpfYzRsTGgJkSu. ACS is a new dataset that we collected ourselves.
Perturbed (Download) | CLEF, UOB, USPTO, Staker | Downloaded from https://github.com/bayer-science-for-a-better-life/Img2Mol/

Model

Our model checkpoints can be downloaded from Dropbox or HuggingFace Hub.

Model architecture:

  • Encoder: Swin Transformer, Swin-B.
  • Decoder: Transformer, 6 layers, hidden_size=256, attn_heads=8.
  • Input size: 384x384
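
The numbers in the list above imply a few derived shapes, which the sketch below works out. The class name ModelShape is hypothetical and is not part of the MolScribe code. The encoder_stride of 32 is an assumption based on the standard Swin Transformer design (patch size 4 followed by three 2x patch-merging stages), not something this README states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the hyperparameters listed above."""
    input_size: int = 384      # from the README: 384x384 input
    decoder_layers: int = 6    # from the README
    hidden_size: int = 256     # from the README
    attn_heads: int = 8        # from the README
    encoder_stride: int = 32   # ASSUMPTION: standard Swin total downsampling

    @property
    def head_dim(self):
        """Per-head dimension of the decoder attention."""
        if self.hidden_size % self.attn_heads != 0:
            raise ValueError('hidden_size must be divisible by attn_heads')
        return self.hidden_size // self.attn_heads

    @property
    def encoder_grid(self):
        """Side length of the final encoder feature map, under the stride assumption."""
        return self.input_size // self.encoder_stride


shape = ModelShape()
print(shape.head_dim)           # 32
print(shape.encoder_grid)       # 12
print(shape.encoder_grid ** 2)  # 144 positions, if only the last stage is used
```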

Download the model checkpoint to reproduce our experiments:

mkdir -p ckpts
wget -P ckpts https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth

Prediction

python predict.py --model_path ckpts/swin_base_char_aux_1m680k.pth --image_path assets/example.png

The MolScribe prediction interface is in molscribe/interface.py. See the Python script predict.py or the Jupyter notebook notebook/predict.ipynb for example usage.

Evaluate MolScribe

bash scripts/eval_uspto_joint_chartok_1m680k.sh

The script uses one GPU and a batch size of 64 by default. If more GPUs are available, update NUM_GPUS_PER_NODE and BATCH_SIZE for faster evaluation.

Train MolScribe

bash scripts/train_uspto_joint_chartok_1m680k.sh

The script uses four GPUs and a batch size of 256 by default. Training takes about one day on four A100 GPUs. During training, we use a modified version of Indigo (included in molscribe/indigo/).

Evaluation Script

We provide a standalone evaluation script, evaluate.py. Example usage:

python evaluate.py \
    --gold_file data/real/acs.csv \
    --pred_file output/uspto/swin_base_char_aux_1m680k/prediction_acs.csv \
    --pred_field post_SMILES

The predictions should be saved in a CSV file with an image_id column for the index (which must match the gold file) and a SMILES column for the predicted SMILES. If the prediction column has a different name, specify it with --pred_field.
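
A prediction file in this layout could be produced with Python's standard csv module, as in the minimal sketch below. The helper name write_predictions, the output file name, and the sample predictions are hypothetical. Only the image_id and SMILES column names come from the description above.

```python
import csv
import os
import tempfile


def write_predictions(path, predictions, smiles_field='SMILES'):
    """Write (image_id, smiles) pairs to a CSV file with a header row."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['image_id', smiles_field])
        for image_id, smiles in predictions:
            writer.writerow([image_id, smiles])


# Hypothetical predictions; the image_id values must match the gold file.
preds = [(0, 'CCO'), (1, 'c1ccccc1')]
pred_path = os.path.join(tempfile.gettempdir(), 'prediction_example.csv')
write_predictions(pred_path, preds)

# Read the file back to confirm the layout.
with open(pred_path, newline='') as f:
    rows = list(csv.DictReader(f))
print(rows)
```

If the SMILES column is written under another name (for example by passing a different smiles_field), the same name would need to be given to evaluate.py through --pred_field.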

The result contains three scores:

  • canon_smiles: our main metric, exact matching accuracy.
  • graph: graph exact matching accuracy, ignoring tetrahedral chirality.
  • chiral: exact matching accuracy on chiral molecules.
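
To make the idea of exact matching accuracy concrete, here is a dependency-free sketch. It is not the code in evaluate.py. The function name exact_match_accuracy, the sample data, and the choice to count a missing prediction as wrong are all assumptions. The canonicalize argument defaults to the identity function; in practice one would pass a cheminformatics-based canonicalizer so that different spellings of the same molecule compare equal.

```python
def exact_match_accuracy(gold, pred, canonicalize=lambda s: s):
    """Fraction of gold entries whose prediction matches after canonicalization.

    gold and pred map image_id -> SMILES. A missing prediction counts as wrong
    (an assumption of this sketch).
    """
    if not gold:
        return 0.0
    correct = 0
    for image_id, gold_smiles in gold.items():
        pred_smiles = pred.get(image_id)
        if pred_smiles is not None and canonicalize(pred_smiles) == canonicalize(gold_smiles):
            correct += 1
    return correct / len(gold)


# Hypothetical data: entry 1 is benzene written two ways, so plain string
# comparison treats it as a mismatch.
gold = {0: 'CCO', 1: 'c1ccccc1', 2: 'CC(=O)O'}
pred = {0: 'CCO', 1: 'C1=CC=CC=C1', 2: 'CC(=O)O'}

print(exact_match_accuracy(gold, pred))
```

With the identity canonicalizer the benzene entry is scored as an error even though both strings describe the same molecule, which is why the comparison is done on canonical SMILES.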
