Publication page

PyTabby: A Docreader’s module for extracting text and tables from PDF with a text layer

Publication type: Journal article

Material type: Text

Authors: Mikhailov A.A., Shigarov A., Kozlov I.S.

Journal: CEUR Workshop Proceedings: 4th Scientific-Practical Workshop Information Technologies: Algorithms, Models, Systems (ITAMS 2021, Irkutsk, 14 September 2021)

Publication language: English

Book series: CEUR Workshop Proceedings

Volume: 2984

Page numbers: 120-126

Number of pages: 7

Publication year: 2021

Reporting year: 2021

Abstract: This paper presents a complete solution for extracting textual information and tables from PDF documents with a text layer. The solution consists of two parts: PyTabby, a tool for extracting text and tables from PDF documents with a complex background and layout, and a Python wrapper module for the Docreader tool. PyTabby extracts text and tables from the low-level representation of the PDF format. This makes it possible to use additional information that is not available in scanned documents and improves both quality and performance compared with Optical Character Recognition (OCR) methods. The solution is incorporated into the Docreader tool to parse PDF files with a text layer and is used as part of the TALISMAN technology for social analytics.

Indexed in WOS: No

Indexed in Scopus: No

Indexed in УБС: No

Indexed in RSCI (РИНЦ): Yes

Indexed in VAK (ВАК): No

Indexed in CORE: No