TLDR-853 added info about GOST frame processing into docs #506

Merged
merged 2 commits into from
Nov 19, 2024
2 changes: 1 addition & 1 deletion dedoc/api/web/index.html
@@ -122,7 +122,7 @@ <h4>Tables handling </h4>

<div class="parameters">
<h4>PDF handling</h4>
<details><summary>pdf_with_text_layer, fast_textual_layer_detection, language, pages, is_one_column_document, document_orientation, need_header_footer_analysis, need_binarization</summary>
<details><summary>pdf_with_text_layer, fast_textual_layer_detection, language, pages, is_one_column_document, document_orientation, need_header_footer_analysis, need_binarization, need_gost_frame_analysis</summary>
<br>
<p>
<label>
4 changes: 4 additions & 0 deletions dedoc/readers/pdf_reader/pdf_image_reader/pdf_image_reader.py
@@ -3,6 +3,7 @@

from numpy import ndarray

from dedoc.data_structures.unstructured_document import UnstructuredDocument
from dedoc.readers.pdf_reader.data_classes.line_with_location import LineWithLocation
from dedoc.readers.pdf_reader.data_classes.pdf_image_attachment import PdfImageAttachment
from dedoc.readers.pdf_reader.data_classes.tables.scantable import ScanTable
@@ -53,6 +54,9 @@ def __init__(self, *, config: Optional[dict] = None) -> None:
        self.binarizer = AdaptiveBinarizer()
        self.ocr = OCRLineExtractor(config=self.config)

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        return super().read(file_path, parameters)

    def _process_one_page(self,
                          image: ndarray,
                          parameters: ParametersForParseDoc,
@@ -3,6 +3,7 @@
from dedocutils.data_structures import BBox
from numpy import ndarray

from dedoc.data_structures.unstructured_document import UnstructuredDocument
from dedoc.readers.pdf_reader.data_classes.line_with_location import LineWithLocation
from dedoc.readers.pdf_reader.data_classes.pdf_image_attachment import PdfImageAttachment
from dedoc.readers.pdf_reader.data_classes.tables.scantable import ScanTable
@@ -37,6 +38,9 @@ def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None,
        from dedoc.utils.parameter_utils import get_param_pdf_with_txt_layer
        return super().can_read(file_path=file_path, mime=mime, extension=extension) and get_param_pdf_with_txt_layer(parameters) == "true"

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        return super().read(file_path, parameters)

    def _process_one_page(self,
                          image: ndarray,
                          parameters: ParametersForParseDoc,
Binary file modified docs/source/_static/code_examples/test_dir/example.docx
Binary file not shown.
Binary file not shown.
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -38,6 +38,7 @@
("py:class", "abc.ABC"),
("py:class", "pydantic.main.BaseModel"),
("py:class", "scipy.stats._multivariate.dirichlet_multinomial_gen.cov"),
("py:class", "scipy.stats._multivariate.random_table_gen.rvs"),
("py:class", "pandas.core.series.Series"),
("py:class", "numpy.ndarray"),
("py:class", "pandas.core.frame.DataFrame"),
66 changes: 66 additions & 0 deletions docs/source/parameters/gost_frame_handling.rst
@@ -0,0 +1,66 @@
.. _gost_frame_handling:

GOST frame handling
====================

.. flat-table:: Parameters for GOST frame handling
:widths: 5 5 3 15 72
:header-rows: 1
:class: tight-table

* - Parameter
- Possible values
- Default value
- Where it can be used
- Description

* - need_gost_frame_analysis
- True, False
- False
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to enable GOST (Russian government standard "ГОСТ Р 21.1101") frame recognition for PDF documents or images.


In some technical documents, the content of each page is placed inside a special GOST frame (see :ref:`example_gost_frame`).
Such frames contain meta-information and are not part of the document's text content. For this reason, dedoc can ignore GOST frames in the following kinds of documents:

* Copyable PDF documents (:class:`dedoc.readers.PdfTxtlayerReader` and :class:`dedoc.readers.PdfTabbyReader`);
* Non-copyable PDF documents and images (:class:`dedoc.readers.PdfImageReader`).

If ``need_gost_frame_analysis=True``, the GOST frame itself is ignored and only the content inside the frame is extracted.
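
The readers listed above can also be called directly. The following is a minimal sketch, assuming that dedoc is installed, that the reader classes are importable from ``dedoc.readers`` and that the parameter accepts the string value ``"true"``; the helper name ``read_text_inside_gost_frame`` and its ``has_text_layer`` flag are hypothetical and are not part of dedoc.

```python
from typing import Dict, List

# Parameter from the table above; the string form "true" is an assumption
# borrowed from the API usage, not a documented requirement.
GOST_FRAME_PARAMETERS: Dict[str, str] = {"need_gost_frame_analysis": "true"}


def read_text_inside_gost_frame(file_path: str, has_text_layer: bool = True) -> List[str]:
    """Hypothetical helper: read a PDF and return the text of its lines."""
    # dedoc is imported lazily so that this module loads without it installed
    from dedoc.readers import PdfImageReader, PdfTxtlayerReader

    # copyable PDFs go to PdfTxtlayerReader, scanned PDFs and images to PdfImageReader
    reader = PdfTxtlayerReader() if has_text_layer else PdfImageReader()
    document = reader.read(file_path, parameters=GOST_FRAME_PARAMETERS)
    # each extracted line is assumed to keep its text in the ``line`` attribute
    return [line.line for line in document.lines]
```

With the parameter enabled, the returned lines are expected to come from inside the frame only, which can be checked by running the helper on the same file with and without the parameter.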

.. _example_gost_frame:

Example of a GOST frame
-----------------------
For example, suppose you send a :download:`PDF document with two pages <../_static/gost_frame_data/document_with_gost_frame.pdf>`:

.. image:: ../_static/gost_frame_data/page_with_gost_frame_1.png
:width: 30%
.. image:: ../_static/gost_frame_data/page_with_gost_frame_2.png
:width: 30%

Parameter usage
---------------

.. code-block:: python

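    import requests

    # path to the PDF to upload; replace it with your own file
    filename = "document_with_gost_frame.pdf"

    data = {
        "pdf_with_text_layer": "auto_tabby",
        "need_gost_frame_analysis": "true",  # ignore GOST frames, keep the content inside them
        "return_format": "html"
    }
    with open(filename, "rb") as file:
        files = {"file": (filename, file)}
        # the dedoc API is expected to be running locally on port 1231
        r = requests.post("http://localhost:1231/upload", files=files, data=data)
        result = r.content.decode("utf-8")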
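
The same parameters can be passed when dedoc is used as a Python library instead of a web service, since :meth:`dedoc.DedocManager.parse` is listed among the supported entry points. The sketch below assumes that dedoc is installed and that ``DedocManager`` can be created with its default configuration; the helper names ``build_gost_frame_parameters`` and ``parse_ignoring_gost_frame`` are hypothetical.

```python
from typing import Dict


def build_gost_frame_parameters(pdf_with_text_layer: str = "auto_tabby", enabled: bool = True) -> Dict[str, str]:
    """Hypothetical helper: build the parameters dictionary used in the request above."""
    return {
        "pdf_with_text_layer": pdf_with_text_layer,
        # string values mirror the API example; this is an assumption for library calls
        "need_gost_frame_analysis": "true" if enabled else "false",
    }


def parse_ignoring_gost_frame(file_path: str):
    """Parse a document with dedoc while ignoring GOST frames."""
    # imported lazily so that the parameter helper works without dedoc installed
    from dedoc import DedocManager

    manager = DedocManager()
    return manager.parse(file_path=file_path, parameters=build_gost_frame_parameters())
```

Keeping the parameters in one helper makes it easy to send identical settings to the web API and to the library.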

Request result
--------------

.. image:: ../_static/gost_frame_data/result_gost_frame.png
:width: 50%
31 changes: 18 additions & 13 deletions docs/source/parameters/pdf_handling.rst
@@ -62,7 +62,7 @@ PDF and images handling
- rus, eng, rus+eng, fra, spa
- rus+eng
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
* :meth:`dedoc.structure_extractors.FintocStructureExtractor.extract`
- Language of the document without a textual layer. The following values are available:
@@ -77,7 +77,7 @@ PDF and images handling
- :, start:, :end, start:end
- :
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- If you need to read a part of the PDF document, you can use page slice to define the reading range.
If the range is set like ``start_page:end_page``, document will be processed from ``start_page`` to ``end_page``
@@ -96,7 +96,7 @@ PDF and images handling
- true, false, auto
- auto
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option sets the number of columns for a PDF document without a textual layer when that number is known beforehand.
The following values are available:
@@ -111,7 +111,7 @@ PDF and images handling
- auto, no_change
- auto
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to control document orientation analysis for PDF documents without a textual layer.
The following values are available:
@@ -125,7 +125,7 @@ PDF and images handling
- True, False
- False
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to **remove** headers and footers of PDF documents from the output result.
If ``need_header_footer_analysis=False``, header and footer lines will be present in the output along with all other document lines.
@@ -134,7 +134,7 @@ PDF and images handling
- True, False
- False
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to clean background (binarize) for pages of PDF documents without a textual layer.
If the document's background is heterogeneous, this option may help to improve the result of document text recognition.
@@ -144,7 +144,7 @@ PDF and images handling
- True, False
- True
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to enable table recognition for PDF documents or images.
The table recognition method is used in :class:`dedoc.readers.PdfImageReader` and :class:`dedoc.readers.PdfTxtlayerReader`.
@@ -155,18 +155,17 @@ PDF and images handling
- True, False
- False
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to enable GOST (Russian government standard) frame recognition for PDF documents or images.
The GOST frame recognizer is used in :meth:`dedoc.readers.PdfBaseReader.read`. Its main function is to recognize and
ignore the GOST frame on the document. It allows :class:`dedoc.readers.PdfImageReader` and :class:`dedoc.readers.PdfTxtlayerReader`
to properly process the content of the document containing GOST frame.
It allows :class:`dedoc.readers.PdfImageReader`, :class:`dedoc.readers.PdfTxtlayerReader` and :class:`dedoc.readers.PdfTabbyReader`
to properly process the content of a document containing a GOST frame. See :ref:`gost_frame_handling` for more details.

* - orient_analysis_cells
- True, False
- False
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used for table recognition in PDF documents or images.
It is ignored when ``need_pdf_table_analysis=False``.
@@ -177,11 +176,17 @@ PDF and images handling
- 90, 270
- 90
- * :meth:`dedoc.DedocManager.parse`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used for table recognition in PDF documents or images.
It is ignored when ``need_pdf_table_analysis=False`` or ``orient_analysis_cells=False``.
The option is used to set orientation of cells in table headers:

* **270** -- cells are rotated 90 degrees clockwise;
* **90** -- cells are rotated 90 degrees counterclockwise (or 270 clockwise).
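
The ``pages`` values listed in the table (``:``, ``start:``, ``:end``, ``start:end``) can be validated on the client side before a request is sent. The helper below is a hypothetical sketch and not dedoc's own parser: it only splits the string into optional bounds and makes no assumption about how dedoc interprets them.

```python
from typing import Optional, Tuple


def parse_page_slice(pages: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a ``pages`` value into (start, end); a missing bound becomes None."""
    start_text, separator, end_text = pages.partition(":")
    if separator != ":":
        raise ValueError(f"expected ':', 'start:', ':end' or 'start:end', got {pages!r}")
    # int() raises ValueError for non-numeric bounds such as "a:b" or "1:2:3"
    start = int(start_text) if start_text.strip() else None
    end = int(end_text) if end_text.strip() else None
    return start, end
```

For example, ``parse_page_slice("2:5")`` returns ``(2, 5)`` and ``parse_page_slice(":")`` returns ``(None, None)``, while a value without a colon raises ``ValueError``.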


.. toctree::
:maxdepth: 1

gost_frame_handling