Publication page

TabbyPDF: Web-Based System for PDF Table Extraction

Publication type: Journal article

Material type: Text

Authors: Shigarov A., Altaev A., Mikhailov A., Paramonov V., Cherkashin E.

Journal: Communications in Computer and Information Science

Publication language: English

Volume: 920

Pages: 257-269

Number of pages: 13

Publication year: 2018

Reporting year: 2018

DOI: 10.1007/978-3-319-99972-2_20

Abstract: PDF is one of the most widespread ways to represent non-editable documents. Many PDF documents are machine-readable but remain untagged: they have no tags for identifying layout items such as paragraphs, columns, or tables. One of the important challenges with these documents is how to extract tabular data from them. The paper presents a novel web-based system for extracting tables located in untagged PDF documents with a complex layout, for recovering their cell structures, and for exporting them into a tagged form (e.g., in CSV or HTML format). The system uses a heuristic-based approach to table detection and structure recognition. It mainly relies on recovering a human reading order of text, including document paragraphs and table cells. A prototype of the system was evaluated using the methodology and dataset of the “ICDAR 2013 Table Competition”. The standard F-score metric is 93.64% for the structure recognition phase and 83.18% for table extraction with automatic table detection. The results are comparable with state-of-the-art academic solutions.

Indexed in WOS: Q5

Indexed in Scopus: No

Indexed in УБС: No

Indexed in РИНЦ (Russian Science Citation Index): Yes

Indexed in ВАК (Higher Attestation Commission list): No

Indexed in CORE: No