
sanskrit-ocr

Note: This branch contains code for IndicOCR-v2. For IndicOCR-v1, kindly visit this branch.


This repository contains code for various OCR models for classical Sanskrit document images. For a quick overview of how to get the IndicOCR and CNN-RNN models up and running, kindly continue reading this Readme. For more detailed instructions, visit our Wiki page.

The IndicOCR and CNN-RNN models are best run on a GPU.

Please cite our paper if you use this code in your own research.

@InProceedings{Dwivedi_2020_CVPR_Workshops,
author = {Dwivedi, Agam and Saluja, Rohit and Kiran Sarvadevabhatla, Ravi},
title = {An OCR for Classical Indic Documents Containing Arbitrarily Long Words},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2020}
}

Results:

The following table shows the comparative results for the IndicOCR-v2 model against other state-of-the-art models.

Row | Dataset | Model                  | Training Config                  | CER (%) | WER (%)
----|---------|------------------------|----------------------------------|---------|--------
1   | new     | IndicOCR-v2            | C3: mix training + real finetune | 3.86    | 13.86
2   | new     | IndicOCR-v2            | C1: mix training                 | 4.77    | 16.84
3   | new     | CNN-RNN                | C3: mix training + real finetune | 3.77    | 14.38
4   | new     | CNN-RNN                | C1: mix training                 | 3.67    | 13.86
5   | new     | Google-OCR             | --                               | 6.95    | 34.64
6   | new     | Ind.senz               | --                               | 20.55   | 57.92
7   | new     | Tesseract (Devanagari) | --                               | 13.23   | 52.75
8   | new     | Tesseract (Sanskrit)   | --                               | 21.06   | 62.34

IndicOCR-v2:

Details:

The code is written using the TensorFlow framework.

Pre-Trained Models:

  • Download pre-trained C1 models from here

  • Download pre-trained C3 models from here

Setup:

In the model/attention-lstm directory, run the following commands:

conda create -n indicOCR python=3.6.10
conda activate indicOCR
conda install pip
pip install -r requirements.txt
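
To sanity-check the environment (an optional step, not part of the original instructions; the exact TensorFlow version is pinned in requirements.txt), you can confirm that TensorFlow imports:

python -c "import tensorflow as tf; print(tf.__version__)"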

Installation:

To install the aocr (attention-ocr) library, from the model/attention-lstm directory, run:

python setup.py install

tfrecords creation:

Create a .txt file in which every line has the following format:

path/to/image<space>annotation

ex: /user/sanskrit-ocr/datasets/train/1.jpg I am the annotated text

aocr dataset /path/to/txt/file /path/to/data.tfrecords
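
As an illustration, the following Python sketch builds such an annotation file from a directory of line images and a parallel list of transcriptions. The directory layout, file names, and labels file here are assumptions made for the example, not part of this repository:

import os

# Hypothetical inputs: a directory of line images, and a labels file with
# one transcription per line, in the same order as the sorted image names.
image_dir = "/user/sanskrit-ocr/datasets/train"
labels_path = "/user/sanskrit-ocr/datasets/train_labels.txt"
out_path = "/user/sanskrit-ocr/datasets/train_annotations.txt"

with open(labels_path, encoding="utf-8") as f:
    labels = [line.rstrip("\n") for line in f]

images = sorted(p for p in os.listdir(image_dir) if p.endswith(".jpg"))
assert len(images) == len(labels), "expected one transcription per image"

with open(out_path, "w", encoding="utf-8") as out:
    for img, text in zip(images, labels):
        # aocr expects: path/to/image<space>annotation
        out.write(f"{os.path.join(image_dir, img)} {text}\n")

The resulting annotation file can then be converted with the aocr dataset command above.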

Train:

To train on the data.tfrecords file created as described above, run the following command:

CUDA_VISIBLE_DEVICES=0 aocr train /path/to/tfrecords/file --batch-size <batch-size> --max-width <max-width> --max-height <max-height> --max-prediction <max-predicted-label-length> --num-epoch <num-epoch>
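
For instance, a training run might look like the following; the numeric values are illustrative placeholders, not settings from the paper:

CUDA_VISIBLE_DEVICES=0 aocr train ./datasets/train.tfrecords --batch-size 32 --max-width 1600 --max-height 150 --max-prediction 400 --num-epoch 20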

Validate:

To validate multiple checkpoints, run:

python ./model/evaluate/attention_predictions.py <initial_ckpt_no> <final_ckpt_step> <steps_per_checkpoint>
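
For example, to evaluate every checkpoint from step 1000 to step 20000, saved every 1000 steps (illustrative values):

python ./model/evaluate/attention_predictions.py 1000 20000 1000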

This will create a val_preds.txt file in the model/attention-lstm/logs folder.

Test:

To test a single checkpoint, run the following command:

CUDA_VISIBLE_DEVICES=0 aocr test /path/to/test.tfrecords --batch-size <batch-size> --max-width <max-width> --max-height <max-height> --max-prediction <max-predicted-label-length> --model-dir ./modelss

Note: To test multiple evenly spaced checkpoints, use the method described in the Validate section.

Computing Error Rates:

To compute the CER and WER of the predictions, run the following command:

python ./model/evaluate/get_errorrates.py <predicted_file_name>

ex: python model/evaluate/get_errorrates.py val_preds.txt

The error rates will be written to an output.json file in the visualize directory.
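
For reference, CER and WER are edit-distance-based metrics. The following self-contained Python sketch shows one standard way to compute them; it illustrates the metrics and is not the implementation in get_errorrates.py:

def edit_distance(a, b):
    # Levenshtein distance via dynamic programming over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(pred, truth):
    # Character error rate in percent.
    return 100.0 * edit_distance(pred, truth) / max(len(truth), 1)

def wer(pred, truth):
    # Word error rate in percent, over whitespace-split tokens.
    return 100.0 * edit_distance(pred.split(), truth.split()) / max(len(truth.split()), 1)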

CNN-RNN:

Details:

The code is written using the TensorFlow framework.

Pre-Trained Models:

To download the best CNN-RNN model, kindly visit this page.

Setup:

In the model/CNN-RNN directory, run the following commands:

conda create -n crnn python=3.6.10
conda activate crnn
conda install pip
pip install -r requirements.txt

tfrecords creation:

Create a .txt file in which every line has the following format:

path/to/image<space>annotation

ex: /user/sanskrit-ocr/datasets/train/1.jpg I am the annotated text

python model/CRNN/create_tfrecords.py /path/to/txt/file ./model/CRNN/data/tfReal/data.tfrecords

Train:

To train on the data.tfrecords file created as described above, run the following command:

python model/CRNN/train.py <training tfrecords filename> <train_epochs> <path_to_previous_saved_model> <steps-per_checkpoint>

ex: python ./model/CRNN/train.py train_feature.tfrecords 20 model/CRNN/model/shadownet/shadownet_-40 200

Note: If you are training from scratch, set the <path_to_previous_saved_model> argument to 0.

ex: python model/CRNN/train.py data.tfrecords 100 0 <steps-per_checkpoint>

Validate:

To validate multiple checkpoints, run:

python ./model/evaluate/crnn_predictions.py <tfrecords_file_name> <initial_step> <final_step> <steps_per_checkpoint> <out_file>

This will create the specified <out_file> in the model/CRNN/logs folder.

Note: <tfrecords_file_name> should be given relative to the model/CRNN/data/tfReal/ directory.
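
For example, to evaluate checkpoints from step 200 to step 4000, saved every 200 steps (illustrative values, not settings from the paper):

python ./model/evaluate/crnn_predictions.py data.tfrecords 200 4000 200 val_preds.txt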

Test:

Testing follows the same procedure as validation above; point crnn_predictions.py at the test tfrecords file.

Computing Error Rates:

To compute the CER and WER of the predictions, run the following command:

python model/evaluate/get_errorrates_crnn.py <path_to_predicted_file>

The same command works for both validation and test predictions.

ex: python model/evaluate/get_errorrates_crnn.py model/CRNN/logs/test_preds_final.txt

Creating Synthetic Data and Obtaining Results for Tesseract, Google-OCR, etc.:

Visit our Wiki page.


Other Analysis:

WA-ECR Plot:

To gain better insight into performance, we compute the word-averaged erroneous character rate (WA-ECR) for each word length L, defined as:

WA-ECR(L) = E_L / N_L

Where:

  • E_L: the number of erroneous characters across all words of length L
  • N_L: the number of words of length L in the test set
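
A minimal sketch of this computation, assuming the predicted and ground-truth words are already aligned pairwise and reusing the edit_distance helper from the error-rates sketch above (both are assumptions of the example):

from collections import defaultdict

def wa_ecr(pred_words, truth_words):
    # Accumulate erroneous characters (E_L) and word counts (N_L) per length L.
    errors = defaultdict(int)
    counts = defaultdict(int)
    for pred, truth in zip(pred_words, truth_words):
        L = len(truth)
        errors[L] += edit_distance(pred, truth)  # erroneous characters in this word
        counts[L] += 1
    # WA-ECR(L) = E_L / N_L for every word length seen in the test set.
    return {L: errors[L] / counts[L] for L in counts}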

Figure: Distribution of word-averaged erroneous character rate (WA-ECR) as a function of word length, for different models. Lower WA-ECR is better. A histogram of the test words by word length is also shown in the plot (red dots, log scale).


Sample Results:

Figure: Qualitative results for different models. Errors relative to the ground truth are highlighted in red. Blue highlighting indicates text missing from at least one of the OCRs; a larger amount of blue within a line for an OCR indicates better coverage relative to the other OCRs, and a smaller amount of red indicates fewer errors.
