Страница публикации
TabbyPDF: Web-Based System for PDF Table Extraction
Авторы: Shigarov A., Altaev A., Mikhailov A., Paramonov V., Cherkashin E.
Журнал: Communications in Computer and Information Science
Том: 920
Номер:
Год: 2018
Отчётный год: 2018
Издательство:
Местоположение издательства:
URL:
Проекты:
DOI: 10.1007/978-3-319-99972-2_20
Аннотация: PDF is one of the most widespread ways to represent non-editable documents. Many of PDF documents are machine-readable but remain untagged. They have no tags for identifying layout items such as paragraphs, columns, or tables. One of the important challenges with these documents is how to extract tabular data from them. The paper presents a novel web-based system for extracting tables located in untagged PDF documents with a complex layout, for recovering their cell structures, and for exporting them into a tagged form (e.g. in CSV or HTML format). The system uses a heuristic-based approach to table detection and structure recognition. It mainly relies on recovering a human reading order of text, including document paragraphs and table cells. A prototype of the system was evaluated, using the methodology and dataset of “ICDAR 2013 Table Competition”. The standard metric F-score is 93.64% for the structure recognition phase and 83.18% for the table extraction with automatic table detection. The results are comparable with the state-of-the-art academic solutions.
Индексируется WOS: Q5
Индексируется Scopus: Нет
Индексируется УБС: Нет
Индексируется РИНЦ: Да
Индексируется ВАК: Нет
Индексируется CORE: Нет
Публикация в печати: 0