Publication page

Table Header Correction Algorithm Based on Heuristics for Improving Spreadsheet Data Extraction

Publication type: Journal article

Material type: Text

Authors: Paramonov V., Shigarov A., Vetrova V.

Journal: Communications in Computer and Information Science: 26th International Conference on Information and Software Technologies (ICIST 2020; Kaunas, Lithuania; 15-17 October 2020)

Publication language: English

Book series: Communications in Computer and Information Science

Volume: 1283

Pages: 147-158

Number of pages: 12

Publication year: 2020

Reporting year: 2020

DOI: 10.1007/978-3-030-59506-7_13

Abstract: A spreadsheet is one of the most commonly used forms of representation for datasets of similar type. Spreadsheets provide considerable flexibility for data structure organisation. As a result of this flexibility, tables with very complex data structures can be created. In turn, such complexity makes automatic table processing and data extraction a challenging task. Therefore, a table preprocessing step is often required in the data extraction pipeline. This paper proposes a heuristic algorithm for the correction of a table header in a spreadsheet. The aim of the proposed algorithm is to transform the machine-readable structure of the table header so that it matches its visual representation. The algorithm achieves this aim by iterating through table header cells and merging some of them according to the proposed heuristics. The transformed structure, in turn, improves the quality of spreadsheet understanding and data extraction further in the pipeline. The proposed algorithm was implemented in the TabbyXL toolset.

Indexed in WoS: No

Indexed in Scopus: No

Indexed in УБС: No

Indexed in RSCI (РИНЦ): Yes

Indexed in VAK (ВАК): No

Indexed in CORE: No