Article information

2025, Volume 30, No. 3, p. 127-144

Shigarov A.O.

Extracting data from arbitrary tables presented in electronic documents of editable formats driven by user-defined rules

A vast amount of relational data is currently available only in unstructured sources, in particular as arbitrary tables within electronic documents of various editable formats. Unlike databases, these tables are designed primarily for human comprehension rather than for direct interpretation by computer programs. To be used in applications that require automatic data interpretation, they must first be converted into a structured representation. This work addresses the task of automating the extraction of relational data from arbitrary tables in electronic documents. A novel method is proposed that relies on end-user programming: users create rules for analyzing and interpreting tables that have been converted into spreadsheet formats (Excel/Sheets). Unlike existing rule-based methods for this task, the approach decouples the rules from the underlying representation models and processing algorithms. Additionally, the proposed method accommodates arbitrary table layouts, structured cells, and header hierarchies. The tools implementing the method offer several notable advantages. The developed model of a document table represents these layouts, structured cells, and header hierarchies, which provides flexibility in handling diverse table formats. The problem-oriented language, CRL, offers a simplified syntax designed specifically for end-user programming. Unlike general-purpose rule languages, CRL conceals irrelevant details, allowing users to focus exclusively on the logic of table analysis and interpretation. CRL rules can replace specialized algorithms that are typically implemented using imperative programming or supervised learning. A single set of CRL rules can process a class of tables that share similar layout, formatting, and content characteristics. The complexity of implementing these rules varies with the diversity of these shared properties.
The tools for table analysis and interpretation are integrated into TabbyXL, a software package that specializes in extracting data from tables. TabbyXL supports workbooks in spreadsheet formats (Excel/Sheets). Document tables in other editable formats (Word/Docs, HTML, etc.) can be converted to the supported formats using existing conversion tools. The effectiveness of TabbyXL is substantiated by experimental data, including both qualitative and quantitative comparisons with existing analogues. In addition, the software has been applied successfully in various scientific and industrial projects to address practical challenges such as data integration, the construction of domain-specific ontologies, and the management of template document exchange.
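
The idea of keeping user-defined rules separate from the table model and from the processing loop can be illustrated with a minimal Python sketch. It is not CRL syntax and not the TabbyXL API; the names `Cell`, `Rule`, `RULES`, and `interpret`, as well as the three sample rules, are hypothetical and assume a simple table whose first row and first column hold headers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Cell:
    """Hypothetical table cell: position, text, and facts derived by rules."""
    row: int
    col: int
    text: str
    role: Optional[str] = None          # "label" or "entry", set by rules
    labels: List[str] = field(default_factory=list)


@dataclass
class Rule:
    """A rule is only a condition and an action; it knows nothing about the engine."""
    condition: Callable[[Cell, List[Cell]], bool]
    action: Callable[[Cell, List[Cell]], None]


def set_role(role: str) -> Callable[[Cell, List[Cell]], None]:
    def action(cell: Cell, cells: List[Cell]) -> None:
        cell.role = role
    return action


def attach_labels(cell: Cell, cells: List[Cell]) -> None:
    # Associate an entry with every label found in its row or its column.
    cell.labels = sorted(
        other.text for other in cells
        if other.role == "label" and (other.row == cell.row or other.col == cell.col)
    )


# Assumed layout: headers occupy the first row and the first column.
RULES = [
    Rule(lambda c, _: (c.row == 0 or c.col == 0) and c.text != "", set_role("label")),
    Rule(lambda c, _: c.row > 0 and c.col > 0 and c.text != "", set_role("entry")),
    Rule(lambda c, _: c.role == "entry", attach_labels),
]


def interpret(grid: List[List[str]], rules: List[Rule] = RULES) -> List[Dict[str, object]]:
    """Generic engine: apply each rule to every cell, then emit one record per entry."""
    cells = [Cell(r, c, text.strip())
             for r, row in enumerate(grid)
             for c, text in enumerate(row)]
    for rule in rules:
        for cell in cells:
            if rule.condition(cell, cells):
                rule.action(cell, cells)
    return [{"value": c.text, "labels": c.labels} for c in cells if c.role == "entry"]


records = interpret([
    ["", "2023", "2024"],
    ["Coal", "10", "12"],
    ["Gas", "7", "9"],
])
```

In this sketch, handling a different class of tables means supplying a different rule list to `interpret`; the `Cell` model and the loop that applies the rules stay unchanged.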


Keywords: table understanding, table analysis and interpretation, data extraction, unstructured data, document tables

doi: 10.25743/ICT.2025.30.3.010

Author(s):
Shigarov Alexei Olegovich
PhD.
Position: Leading research officer
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Phone Office: (3952) 45-31-07
E-mail: shigarov@icc.ru
SPIN-code: 5159-9006


Bibliography link:
Shigarov A.O. Extracting data from arbitrary tables presented in electronic documents of editable formats driven by user-defined rules // Computational technologies. 2025. V. 30. No. 3. P. 127-144
ISSN 1560-7534
© 2025 FRC ICT