|
1 | 1 | # Dedoc
|
2 | 2 |
|
| 3 | +[](http://www.apache.org/licenses/LICENSE-2.0.html) |
3 | 4 | [](https://dedoc.readthedocs.io/en/latest/?badge=latest)
|
| 5 | +[](https://github.com/ispras/dedoc/releases/) |
| 6 | +[](https://dedoc-readme.hf.space) |
| 7 | +[](https://hub.docker.com/r/dedocproject/dedoc/ "Docker Pulls") |
4 | 8 |
|
5 | 9 | 
|
6 | 10 |
|
@@ -39,52 +43,53 @@ In 2022, the system won a grant to support the development of promising AI proje
|
39 | 43 | ## Document format description
|
40 | 44 | The system processes different document formats. The main formats are listed below:
|
41 | 45 |
|
42 |
| -| Format group | Description | |
43 |
| -|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
44 |
| -| Office formats | DOCX, XLSX, PPTX and formats that canbe converted to them. Handling of these for-mats is held by analysis of format inner rep-resentation and using specialized libraries ([python-docx](https://python-docx.readthedocs.io/en/latest/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)) | |
45 |
| -| HTML, EML, MHTML | HTML documents are parsed using tagsanalysis, HTML handler is used for han-dling documents of other formats in thisgroup | |
46 |
| -| TXT | Only raw textual content is analyzed | |
47 |
| -| Archives | Attachments of the archive are analyzed | | |
48 |
| -| PDF,document images | Copyable PDF documents (with a textual layer) can be handled using [pdfminer-six](https://pdfminersix.readthedocs.io/en/latest/) library or [tabby](https://github.com/sunveil/ispras_tbl_extr) software. Non-copyable PDF documents or imagesare handled using [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract), machine learning methods (including neural network methods) and [image processing methods](https://opencv.org/) | |
| 46 | +| Format group | Description | |
| 47 | +|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 48 | +| Office formats | DOCX, XLSX, PPTX and formats that can be converted to them. These formats are handled by analyzing their internal representation and by using specialized libraries ([python-docx](https://python-docx.readthedocs.io/en/latest/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)) | |
| 49 | +| HTML, EML, MHTML | HTML documents are parsed by analyzing their tags. The HTML handler is also used for documents of the other formats in this group | |
| 50 | +| TXT | Only raw textual content is analyzed | |
| 51 | +| Archives | Attachments of the archive are analyzed | |
| 52 | +| PDF, document images | Copyable PDF documents (with a textual layer) can be handled using [pdfminer-six](https://pdfminersix.readthedocs.io/en/latest/) library or [tabby](https://github.com/sunveil/ispras_tbl_extr) software. Non-copyable PDF documents or images are handled using [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract), machine learning methods (including neural network methods) and [image processing methods](https://opencv.org/) | |
49 | 53 |
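
As a rough illustration of the format groups in the table above, the sketch below maps a file extension to its group. It is not dedoc's own dispatch logic: the `FORMAT_GROUPS` mapping and the `format_group` helper are hypothetical names, and the archive and image extensions are assumptions chosen only for the example.

```python
from pathlib import Path
from typing import Optional

# Group names come from the table above. The archive and image extensions
# are illustrative assumptions, not the lists dedoc actually supports.
FORMAT_GROUPS = {
    "Office formats": {".docx", ".xlsx", ".pptx"},
    "HTML, EML, MHTML": {".html", ".eml", ".mhtml"},
    "TXT": {".txt"},
    "Archives": {".zip", ".tar"},  # assumed extensions
    "PDF, document images": {".pdf", ".png", ".jpg"},  # image extensions assumed
}


def format_group(file_name: str) -> Optional[str]:
    """Return the format group for a file name, or None if it is not listed."""
    suffix = Path(file_name).suffix.lower()  # case-insensitive match
    for group, extensions in FORMAT_GROUPS.items():
        if suffix in extensions:
            return group
    return None
```

A call such as `format_group("report.DOCX")` looks up the lower-cased suffix, so upper-case extensions fall into the same group as lower-case ones.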
|
50 | 54 | ## Examples of processed scanned documents
|
51 | 55 | * Dedoc can only process scanned black and white documents, such as technical specifications, regulations, articles, etc.
|
52 |
| -<img src="docs/source/_static/doc_examples.png" alt="Document examples" style="width:800px;"/> |
53 |
| -<!--{:height="150px"}--> |
| 56 | +<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/doc_examples.png" alt="Document examples" style="width:800px;"/> |
| 57 | + |
54 | 58 | * In particular, dedoc recognizes tabular information only from tables with explicit boundaries. Here are examples of documents that can be processed by dedoc's image handler:
|
55 |
| -<img src="docs/source/_static/example_table.jpg" alt="Table parsing example" style="width:600px;"/> |
56 |
| -<!----> |
| 59 | +<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/example_table.jpg" alt="Table parsing example" style="width:600px;"/> |
| 60 | + |
57 | 61 | * The system also automatically detects and corrects the orientation of scanned documents
|
58 | 62 |
|
59 |
| -## Example of structure extractor |
60 |
| -<img src="docs/source/_static/str_ext_example_law.png" alt="Law structure example"/> |
61 |
| -<img src="docs/source/_static/str_ext_example_tz.png" alt="Tz structure example"/> |
| 63 | +## Examples of structure extractors |
| 64 | +<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/str_ext_example_law.png" alt="Law structure example"/> |
| 65 | +<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/str_ext_example_tz.png" alt="Tz structure example"/> |
62 | 66 |
|
63 | 67 |
|
64 | 68 | ## Impact
|
65 | 69 | This project may be useful as the first step of an automatic document analysis pipeline (e.g. before the NLP part).
|
66 | 70 | Dedoc is in demand for information analytic systems, information leak monitoring systems, as well as for natural language processing systems.
|
67 | 71 | The library is intended for developers of systems that automatically analyze and structure electronic documents, including systems that support subsequent search over those documents.
|
68 | 72 |
|
69 |
| -# Online-Documentation |
70 |
| -Relevant documentation of the dedoc is available [here](https://dedoc.readthedocs.io/en/latest/) |
| 73 | +# Documentation |
| 74 | +Relevant documentation of dedoc is available [here](https://dedoc.readthedocs.io/en/latest/) |
71 | 75 |
|
72 | 76 | # Demo
|
73 |
| -You can try dedoc's demo: https://dedoc-readme.hf.space. |
74 | 77 |
|
75 |
| -We have a video to demonstrate how to use the system: https://www.youtube.com/watch?v=ZUnPYV8rd9A. |
| 78 | +* You can try the [dedoc demo](https://dedoc-readme.hf.space) |
| 79 | +* You can watch a [video about dedoc](https://www.youtube.com/watch?v=ZUnPYV8rd9A) |
76 | 80 |
|
77 |
| - |
| 81 | + |
78 | 82 |
|
79 |
| - |
| 83 | + |
80 | 84 |
|
81 |
| -# Some our publications |
| 85 | +# Publications related to dedoc |
82 | 86 |
|
83 |
| -* Article on [Habr](https://habr.com/ru/companies/isp_ras/articles/779390/), where we describe our system in detail |
84 |
| -* [Our article](https://aclanthology.org/2022.fnp-1.13.pdf) from the FINTOC 2022 competition. We are the winners :smiley: :trophy:! |
| 87 | +* Article [ISPRAS@FinTOC-2022 shared task: Two-stage TOC generation model](https://aclanthology.org/2022.fnp-1.13.pdf) for the [FinTOC 2022 Shared Task](https://wp.lancs.ac.uk/cfie/fintoc2022/). We are the winners :smiley: :trophy:! |
| 88 | +* Article on habr.com [Dedoc: как автоматически извлечь из текстового документа всё и даже немного больше](https://habr.com/ru/companies/isp_ras/articles/779390/) in Russian (2023) |
| 89 | +* Article [Dedoc: A Universal System for Extracting Content and Logical Structure From Textual Documents](https://ieeexplore.ieee.org/abstract/document/10508151/) in English (2023) |
85 | 90 |
|
86 | 91 | # Installation instructions
|
87 |
| -**************************************** |
| 92 | + |
88 | 93 | This project has a REST API, and you can run it in a Docker container.
|
89 | 94 | Also, dedoc can be installed as a library via `pip`.
|
90 | 95 | There are two ways to install and run dedoc, as a web application or as a library; both are described below.
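
To make the REST option more concrete, here is a minimal sketch of how a client might prepare a file upload for a locally running dedoc container, using only the Python standard library. The URL `http://localhost:1231/upload`, the form field name `file`, and the helper `build_upload_request` are assumptions made for illustration; check the dedoc documentation for the actual endpoint and parameters. The function only builds the request and does not send it.

```python
import uuid
from urllib import request


def build_upload_request(
    file_name: str,
    content: bytes,
    url: str = "http://localhost:1231/upload",  # assumed endpoint and port
) -> request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one file."""
    boundary = uuid.uuid4().hex
    # One form part named "file" (assumed field name), then the closing boundary.
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return request.Request(url, data=body, headers=headers, method="POST")
```

Once the service is running, the prepared request could be passed to `urllib.request.urlopen` to send it. Building the request separately keeps the sketch usable without a running container.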
|
|