Commit 0acfb3d

update master (#433)
* TLDR-659 added references to nodes into HTML return format (#427)
* TLDR-660 fixes in article type (#428)
* TLDR-642 FinTOC benchmarks (#426)
* TLDR-657 remove other_fields from LineMetadata and DocumentMetadata (#430)
* Readme fixes (#431)
* new version 2.2.1 (#432)
1 parent 16747b0 commit 0acfb3d

74 files changed, +5067 -4142 lines changed

.flake8

Lines changed: 6 additions & 3 deletions
@@ -16,12 +16,15 @@ exclude =
1616
resources,
1717
venv,
1818
build,
19-
dedoc.egg-info
20-
docs/_build
19+
dedoc.egg-info,
20+
docs/_build,
21+
scripts/fintoc2022/metric.py
2122
2223
# ANN101 - type annotations for self
24+
# T201 - prints found
25+
# JS101 - Multi-line container not broken after opening character
2326
ignore =
2427
ANN101
2528
per-file-ignores =
2629
scripts/*:T201
27-
scripts/benchmark_pdf_performance*:JS101,T201
30+
scripts/benchmark_pdf_performance*:JS101

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ repos:
33
rev: 5.0.4
44
hooks:
55
- id: flake8
6-
exclude: \.github|.*__init__\.py|resources|docs|venv|build|dedoc\.egg-info
6+
exclude: \.github|.*__init__\.py|resources|docs|venv|build|dedoc\.egg-info|scripts/fintoc2022/metric.py
77
args:
88
- "--config=.flake8"
99
additional_dependencies: [
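
A minimal sketch of how the updated `exclude` pattern behaves, assuming pre-commit applies it with `re.search` to repository-relative paths, as its documentation describes. The helper `is_excluded` is hypothetical and exists only for this illustration. Note that the `.` in `metric.py` is not escaped (unlike `dedoc\.egg-info`), so it matches any character there; this is harmless for this path but slightly looser than intended.

```python
import re

# Pattern copied from the updated `exclude` line of the flake8 hook in this diff.
EXCLUDE = (
    r"\.github|.*__init__\.py|resources|docs|venv|build|"
    r"dedoc\.egg-info|scripts/fintoc2022/metric.py"
)


def is_excluded(path: str) -> bool:
    """Hypothetical helper: True if the path matches the exclude pattern."""
    # Assumption: the pattern is applied with re.search, not re.fullmatch.
    return re.search(EXCLUDE, path) is not None


if __name__ == "__main__":
    for candidate in ("scripts/fintoc2022/metric.py", "dedoc/api/api_utils.py"):
        print(candidate, is_excluded(candidate))
```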

README.md

Lines changed: 29 additions & 24 deletions
@@ -1,6 +1,10 @@
11
# Dedoc
22

3+
[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)
34
[![Documentation Status](https://readthedocs.org/projects/dedoc/badge/?version=latest)](https://dedoc.readthedocs.io/en/latest/?badge=latest)
5+
[![GitHub release](https://img.shields.io/github/release/ispras/dedoc.svg)](https://github.com/ispras/dedoc/releases/)
6+
[![Demo dedoc-readme.hf.space](https://img.shields.io/website-up-down-green-red/https/huggingface.co/spaces/dedoc/README.svg)](https://dedoc-readme.hf.space)
7+
[![Docker Hub](https://img.shields.io/docker/pulls/dedocproject/dedoc.svg)](https://hub.docker.com/r/dedocproject/dedoc/ "Docker Pulls")
48

59
![Dedoc](https://github.com/ispras/dedoc/raw/master/dedoc_logo.png)
610

@@ -39,52 +43,53 @@ In 2022, the system won a grant to support the development of promising AI proje
3943
## Document format description
4044
The system processes different document formats. The main formats are listed below:
4145

42-
| Format group | Description |
43-
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
44-
| Office formats | DOCX, XLSX, PPTX and formats that canbe converted to them. Handling of these for-mats is held by analysis of format inner rep-resentation and using specialized libraries ([python-docx](https://python-docx.readthedocs.io/en/latest/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)) |
45-
| HTML, EML, MHTML | HTML documents are parsed using tagsanalysis, HTML handler is used for han-dling documents of other formats in thisgroup |
46-
| TXT | Only raw textual content is analyzed |
47-
| Archives | Attachments of the archive are analyzed | |
48-
| PDF,document images | Copyable PDF documents (with a textual layer) can be handled using [pdfminer-six](https://pdfminersix.readthedocs.io/en/latest/) library or [tabby](https://github.com/sunveil/ispras_tbl_extr) software. Non-copyable PDF documents or imagesare handled using [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract), machine learning methods (including neural network methods) and [image processing methods](https://opencv.org/) |
46+
| Format group | Description |
47+
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
48+
| Office formats | DOCX, XLSX, PPTX and formats that can be converted to them. Handling of these formats is held by analysis of format inner representation and using specialized libraries ([python-docx](https://python-docx.readthedocs.io/en/latest/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)) |
49+
| HTML, EML, MHTML | HTML documents are parsed using tags analysis, HTML handler is used for handling documents of other formats in this group |
50+
| TXT | Only raw textual content is analyzed |
51+
| Archives | Attachments of the archive are analyzed | |
52+
| PDF, document images | Copyable PDF documents (with a textual layer) can be handled using [pdfminer-six](https://pdfminersix.readthedocs.io/en/latest/) library or [tabby](https://github.com/sunveil/ispras_tbl_extr) software. Non-copyable PDF documents or images are handled using [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract), machine learning methods (including neural network methods) and [image processing methods](https://opencv.org/) |
4953

5054
## Examples of processed scanned documents
5155
* Dedoc can only process scanned black and white documents, such as technical specifications, regulations, articles, etc.
52-
<img src="docs/source/_static/doc_examples.png" alt="Document examples" style="width:800px;"/>
53-
<!--![Document examples](docs/source/_static/doc_examples.png){:height="150px"}-->
56+
<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/doc_examples.png" alt="Document examples" style="width:800px;"/>
57+
5458
* In particular, dedoc recognizes tabular information only from tables with explicit boundaries. Here are examples of documents that can be processed by an dedoc's image handler:
55-
<img src="docs/source/_static/example_table.jpg" alt="Table parsing example" style="width:600px;"/>
56-
<!--![Table Example](docs/source/_static/example_table.jpg)-->
59+
<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/example_table.jpg" alt="Table parsing example" style="width:600px;"/>
60+
5761
* The system also automatically detects and corrects the orientation of scanned documents
5862

59-
## Example of structure extractor
60-
<img src="docs/source/_static/str_ext_example_law.png" alt="Law structure example"/>
61-
<img src="docs/source/_static/str_ext_example_tz.png" alt="Tz structure example"/>
63+
## Examples of structure extractors
64+
<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/str_ext_example_law.png" alt="Law structure example"/>
65+
<img src="https://github.com/ispras/dedoc/raw/master/docs/source/_static/str_ext_example_tz.png" alt="Tz structure example"/>
6266

6367

6468
## Impact
6569
This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part).
6670
Dedoc is in demand for information analytic systems, information leak monitoring systems, as well as for natural language processing systems.
6771
The library is intended for application use by developers of systems for automatic analysis and structuring of electronic documents, including for further search in electronic documents.
6872

69-
# Online-Documentation
70-
Relevant documentation of the dedoc is available [here](https://dedoc.readthedocs.io/en/latest/)
73+
# Documentation
74+
Relevant documentation of dedoc is available [here](https://dedoc.readthedocs.io/en/latest/)
7175

7276
# Demo
73-
You can try dedoc's demo: https://dedoc-readme.hf.space.
7477

75-
We have a video to demonstrate how to use the system: https://www.youtube.com/watch?v=ZUnPYV8rd9A.
78+
* You can try [dedoc demo](https://dedoc-readme.hf.space)
79+
* You can watch [video about dedoc](https://www.youtube.com/watch?v=ZUnPYV8rd9A)
7680

77-
![Web_interface](docs/source/_static/web_interface.png)
81+
![](https://github.com/ispras/dedoc/raw/master/docs/source/_static/web_interface.png)
7882

79-
![dedoc_demo](docs/source/_static/dedoc_short.gif)
83+
![](https://github.com/ispras/dedoc/raw/master/docs/source/_static/dedoc_short.gif)
8084

81-
# Some our publications
85+
# Publications related to dedoc
8286

83-
* Article on [Habr](https://habr.com/ru/companies/isp_ras/articles/779390/), where we describe our system in detail
84-
* [Our article](https://aclanthology.org/2022.fnp-1.13.pdf) from the FINTOC 2022 competition. We are the winners :smiley: :trophy:!
87+
* Article [ISPRAS@FinTOC-2022 shared task: Two-stage TOC generation model](https://aclanthology.org/2022.fnp-1.13.pdf) for the [FinTOC 2022 Shared Task](https://wp.lancs.ac.uk/cfie/fintoc2022/). We are the winners :smiley: :trophy:!
88+
* Article on habr.com [Dedoc: как автоматически извлечь из текстового документа всё и даже немного больше](https://habr.com/ru/companies/isp_ras/articles/779390/) in Russian (2023)
89+
* Article [Dedoc: A Universal System for Extracting Content and Logical Structure From Textual Documents](https://ieeexplore.ieee.org/abstract/document/10508151/) in English (2023)
8590

8691
# Installation instructions
87-
****************************************
92+
8893
This project has REST Api and you can run it in Docker container.
8994
Also, dedoc can be installed as a library via `pip`.
9095
There are two ways to install and run dedoc as a web application or a library that are described below.

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
2.2
1+
2.2.1

dedoc/api/api_args.py

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@
77
@dataclass
88
class QueryParameters:
99
# type of document structure parsing
10-
document_type: str = Form("", enum=["", "law", "tz", "diploma"], description="Document domain")
10+
document_type: str = Form("", enum=["", "law", "tz", "diploma", "article", "fintoc"], description="Document domain")
1111
structure_type: str = Form("tree", enum=["linear", "tree"], description="Output structure type")
1212
return_format: str = Form("json", enum=["json", "html", "plain_text", "tree", "collapsed_tree", "ujson", "pretty_json"],
1313
description="Response representation, most types (except json) are used for debug purposes only")
@@ -29,7 +29,7 @@ class QueryParameters:
2929
# pdf handling
3030
pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "auto_tabby", "tabby"],
3131
description="Extract text from a text layer of PDF or using OCR methods for image-like documents")
32-
language: str = Form("rus+eng", enum=["rus+eng", "rus", "eng"], description="Recognition language")
32+
language: str = Form("rus+eng", enum=["rus+eng", "rus", "eng", "fra", "spa"], description="Recognition language")
3333
pages: str = Form(":", description='Page numbers range for reading PDF or images, "left:right" means read pages from left to right')
3434
is_one_column_document: str = Form("auto", enum=["auto", "true", "false"],
3535
description='One or multiple column document, "auto" - predict number of page columns automatically')
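
The two changed fields widen the accepted values: `document_type` now also allows `article` and `fintoc`, and `language` also allows `fra` and `spa`. The sketch below shows one way a client could check values before sending a request. The value lists are copied from the diff; the helper `build_form_data` is hypothetical and does not call dedoc.

```python
from typing import Dict

# Allowed values copied from the updated Form(...) declarations in this diff.
DOCUMENT_TYPES = ["", "law", "tz", "diploma", "article", "fintoc"]
LANGUAGES = ["rus+eng", "rus", "eng", "fra", "spa"]


def build_form_data(document_type: str = "", language: str = "rus+eng") -> Dict[str, str]:
    """Hypothetical client-side helper: validate values and return form fields."""
    if document_type not in DOCUMENT_TYPES:
        raise ValueError(f"Unsupported document_type: {document_type!r}")
    if language not in LANGUAGES:
        raise ValueError(f"Unsupported language: {language!r}")
    return {"document_type": document_type, "language": language}


if __name__ == "__main__":
    print(build_form_data(document_type="fintoc", language="fra"))
```

The defaults mirror the ones in `QueryParameters` (`""` and `"rus+eng"`), so calling the helper without arguments yields the same values the API would assume.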

dedoc/api/api_utils.py

Lines changed: 20 additions & 9 deletions
@@ -3,6 +3,7 @@
33
from dedoc.data_structures import LineMetadata
44
from dedoc.data_structures.concrete_annotations.bold_annotation import BoldAnnotation
55
from dedoc.data_structures.concrete_annotations.italic_annotation import ItalicAnnotation
6+
from dedoc.data_structures.concrete_annotations.reference_annotation import ReferenceAnnotation
67
from dedoc.data_structures.concrete_annotations.strike_annotation import StrikeAnnotation
78
from dedoc.data_structures.concrete_annotations.subscript_annotation import SubscriptAnnotation
89
from dedoc.data_structures.concrete_annotations.superscript_annotation import SuperscriptAnnotation
@@ -116,7 +117,7 @@ def json2html(text: str, paragraph: TreeNode, tables: Optional[List[Table]], tab
116117
if table2id is None:
117118
table2id = {table.metadata.uid: table_id for table_id, table in enumerate(tables)}
118119

119-
ptext = __annotations2html(paragraph, table2id)
120+
ptext = __annotations2html(paragraph, table2id, tabs=tabs)
120121

121122
if paragraph.metadata.hierarchy_level.line_type in [HierarchyLevel.header, HierarchyLevel.root]:
122123
ptext = f"<strong>{ptext.strip()}</strong>"
@@ -125,7 +126,10 @@ def json2html(text: str, paragraph: TreeNode, tables: Optional[List[Table]], tab
125126
else:
126127
ptext = ptext.strip()
127128

128-
text += f'<p> {"&nbsp;" * tabs} {ptext} <sub> id = {paragraph.node_id} ; type = {paragraph.metadata.hierarchy_level.line_type} </sub></p>'
129+
ptext = f'<p> {"&nbsp;" * tabs} {ptext} <sub> id = {paragraph.node_id} ; type = {paragraph.metadata.hierarchy_level.line_type} </sub></p>'
130+
if hasattr(paragraph.metadata, "uid"):
131+
ptext = f'<div id="{paragraph.metadata.uid}">{ptext}</div>'
132+
text += ptext
129133

130134
for subparagraph in paragraph.subparagraphs:
131135
text = json2html(text=text, paragraph=subparagraph, tables=None, tabs=tabs + 4, table2id=table2id)
@@ -157,14 +161,17 @@ def __value2tag(name: str, value: str) -> str:
157161
if name == UnderlinedAnnotation.name:
158162
return "u"
159163

164+
if name == ReferenceAnnotation.name:
165+
return "a"
166+
160167
if value.startswith("heading "):
161168
level = value[len("heading "):]
162169
return "h" + level if level.isdigit() and int(level) < 7 else "strong"
163170

164171
return value
165172

166173

167-
def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int]) -> str:
174+
def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int], tabs: int = 0) -> str:
168175
indexes = dict()
169176

170177
for annotation in paragraph.annotations:
@@ -177,7 +184,7 @@ def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int]) -> str:
177184
SubscriptAnnotation.name,
178185
SuperscriptAnnotation.name,
179186
UnderlinedAnnotation.name]
180-
check_annotations = bool_annotations + ["table"]
187+
check_annotations = bool_annotations + ["table", "reference"]
181188
if name not in check_annotations and not value.startswith("heading "):
182189
continue
183190
elif name in bool_annotations and annotation.value == "False":
@@ -187,23 +194,27 @@ def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int]) -> str:
187194
indexes.setdefault(annotation.start, "")
188195
indexes.setdefault(annotation.end, "")
189196
if name == "table":
190-
indexes[annotation.start] += f'<p> <sub> <a href="#{tag}"> table#{table2id[tag]} </a> </sub> </p>'
197+
indexes[annotation.end] += f' (<a href="#{tag}">table {table2id[tag]}</a>)'
198+
elif name == "reference":
199+
indexes[annotation.start] += f'<{tag} href="#{value}">'
200+
indexes[annotation.end] = f"</{tag}>" + indexes[annotation.end]
191201
else:
192-
indexes[annotation.start] += "<" + tag + ">"
193-
indexes[annotation.end] = "</" + tag + ">" + indexes[annotation.end]
202+
indexes[annotation.start] += f"<{tag}>"
203+
indexes[annotation.end] = f"</{tag}>" + indexes[annotation.end]
194204

195205
insert_tags = sorted([(index, tag) for index, tag in indexes.items()], reverse=True)
196206
text = paragraph.text
197207

198208
for index, tag in insert_tags:
199209
text = text[:index] + tag + text[index:]
200210

201-
return text.replace("\n", "<br>")
211+
return text.replace("\n", f'<br>{"&nbsp;" * tabs}')
202212

203213

204214
def table2html(table: Table, table2id: Dict[str, int]) -> str:
205215
uid = table.metadata.uid
206-
text = f"<h4> table {table2id[uid]}:</h4>"
216+
table_title = f" {table.metadata.title}" if table.metadata.title else ""
217+
text = f"<h4> table {table2id[uid]}:{table_title}</h4>"
207218
text += f'<table border="1" id={uid} style="border-collapse: collapse; width: 100%;">\n<tbody>\n'
208219
for row in table.cells:
209220
text += "<tr>\n"
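
The changes to `__annotations2html` rely on one technique: tags are accumulated per character offset, then inserted starting from the largest offset so that earlier offsets stay valid. The sketch below is a simplified, standalone illustration of that idea, not dedoc's code. The function `insert_tags` and the tuple layout of `spans` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# (start, end, opening_tag, closing_tag); this layout is hypothetical.
Span = Tuple[int, int, str, str]


def insert_tags(text: str, spans: List[Span]) -> str:
    """Insert opening/closing tags at character offsets of `text`."""
    indexes: Dict[int, str] = {}
    for start, end, open_tag, close_tag in spans:
        indexes.setdefault(start, "")
        indexes.setdefault(end, "")
        # Opening tags are appended, closing tags are prepended,
        # so spans sharing the same offsets nest correctly.
        indexes[start] += open_tag
        indexes[end] = close_tag + indexes[end]

    # Insert from the end of the string so earlier offsets are not shifted.
    for index, tag in sorted(indexes.items(), reverse=True):
        text = text[:index] + tag + text[index:]
    return text


if __name__ == "__main__":
    # A reference becomes a link to an element whose id is the target uid.
    print(insert_tags("see note here", [(4, 8, '<a href="#uid1">', "</a>")]))
```

With this commit, `json2html` wraps each paragraph in `<div id="{uid}">`, which gives such `href="#..."` links a target to jump to within the generated HTML page.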

dedoc/api/schema/document_metadata.py

Lines changed: 0 additions & 3 deletions
@@ -1,5 +1,3 @@
1-
from typing import Optional
2-
31
from pydantic import BaseModel, Extra, Field
42

53

@@ -18,4 +16,3 @@ class Config:
1816
created_time: int = Field(description="Creation time of the document in the UnixTime format", example=1590579805)
1917
access_time: int = Field(description="File access time in the UnixTime format", example=1590579805)
2018
file_type: str = Field(description="Mime type of the file", example="application/vnd.oasis.opendocument.text")
21-
other_fields: Optional[dict] = Field(description="Other optional fields")

dedoc/api/schema/line_metadata.py

Lines changed: 0 additions & 1 deletion
@@ -13,4 +13,3 @@ class Config:
1313
paragraph_type: str = Field(description="Type of the document line/paragraph (header, list_item, list) and etc.", example="raw_text")
1414
page_id: int = Field(description="Page number of the line/paragraph beginning", example=0)
1515
line_id: Optional[int] = Field(description="Line number", example=1)
16-
other_fields: Optional[dict] = Field(description="Some other fields")
