Article information
2025, Volume 30, № 3, p. 127-144
Shigarov A.O.
Extracting data from arbitrary tables presented in electronic documents of editable formats driven by user-defined rules
Currently, a vast amount of relational data is available in unstructured sources, particularly as arbitrary tables within electronic documents in various editable formats. Unlike databases, these tables are designed primarily for human comprehension rather than for direct interpretation by computer programs. To be used in applications that require automatic data interpretation, they must first be converted into a structured representation. This work addresses the complex task of automating the extraction of relational data from arbitrary tables within electronic documents. A novel method is proposed to tackle this problem. It leverages end-user programming to create rules for analyzing and interpreting tables that have been converted into spreadsheet formats (Excel/Sheets). Unlike the existing rule-based methods devoted to this task, the approach decouples the rules from the underlying representation models and processing algorithms. Additionally, the proposed method accommodates arbitrary table layouts, structured cells, and header hierarchies. The tools implementing this method offer several notable advantages. The developed model of a document table supports arbitrary layouts, structured cells, and header hierarchies, providing flexibility in handling diverse table formats. The problem-oriented language CRL offers a simplified syntax designed specifically for end-user programming. Unlike general-purpose rule languages, CRL conceals irrelevant details, allowing users to focus exclusively on the logic of table analysis and interpretation. CRL rules can effectively replace specialized algorithms that are typically implemented using imperative programming or supervised learning. A single set of CRL rules can process a class of tables that share similar layout, formatting, and content characteristics. The complexity of implementing these rules depends on how diverse these shared properties are within the class.
The tools for table analysis and interpretation are integrated into the TabbyXL software package, which specializes in extracting data from tables. TabbyXL supports workbooks in spreadsheet formats (Excel/Sheets). Document tables in other editable formats (Word/Docs, HTML, etc.) can be converted to the supported formats using existing conversion tools. The effectiveness of TabbyXL is substantiated by experimental data, including both qualitative and quantitative comparisons with existing analogues. Additionally, the software has been successfully used in various scientific and industrial projects, addressing practical challenges such as data integration, construction of domain-specific ontologies, and management of template-based document exchange.
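The idea of keeping user-defined rules separate from the table model and the processing algorithm can be illustrated with a minimal Python sketch. It is not CRL syntax and does not use the TabbyXL API. The names Cell, Rule, mark_label, mark_entry and interpret are hypothetical, and the sketch assumes a simple table with one header row and one stub column.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Cell:
    """Hypothetical table model: a cell with a position, text, and derived facts."""
    row: int
    col: int
    text: str
    role: Optional[str] = None                        # "label" or "entry", set by rules
    labels: List[str] = field(default_factory=list)   # labels attached to an entry


@dataclass
class Rule:
    """Hypothetical rule shape: a condition on a cell and an action to perform."""
    when: Callable[[Cell], bool]
    then: Callable[[Cell, List[Cell]], None]


def mark_label(cell: Cell, table: List[Cell]) -> None:
    cell.role = "label"


def mark_entry(cell: Cell, table: List[Cell]) -> None:
    cell.role = "entry"
    # Attach the label heading this cell's column and the label heading its row.
    for other in table:
        if other.role != "label":
            continue
        heads_column = other.row == 0 and other.col == cell.col
        heads_row = other.col == 0 and other.row == cell.row
        if heads_column or heads_row:
            cell.labels.append(other.text)


# The rules are plain data; the engine below knows nothing about table layouts.
RULES = [
    Rule(when=lambda c: bool(c.text) and (c.row == 0 or c.col == 0), then=mark_label),
    Rule(when=lambda c: bool(c.text) and c.row > 0 and c.col > 0, then=mark_entry),
]


def interpret(rows: List[List[str]], rules: List[Rule]) -> List[Tuple[str, List[str]]]:
    """Generic engine: apply each rule, in order, to every cell that matches it."""
    table = [Cell(r, c, text) for r, line in enumerate(rows) for c, text in enumerate(line)]
    for rule in rules:
        for cell in table:
            if rule.when(cell):
                rule.then(cell, table)
    return [(c.text, c.labels) for c in table if c.role == "entry"]


# Illustrative data only; the values are made up for this sketch.
rows = [
    ["",        "2023", "2024"],
    ["Irkutsk", "10",   "12"],
    ["Omsk",    "7",    "9"],
]
records = interpret(rows, RULES)
for entry, labels in records:
    print(entry, labels)
```

For the sample table, the first record pairs the entry 10 with the labels 2023 and Irkutsk. Supporting a different class of tables in this sketch would mean changing only the RULES list, while interpret and Cell stay as they are.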
Keywords: table understanding, table analysis and interpretation, data extraction, unstructured data, document tables
doi: 10.25743/ICT.2025.30.3.010
Author(s):
Shigarov Alexei Olegovich, PhD
Position: Leading research officer
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Office phone: (3952) 45-31-07
E-mail: shigarov@icc.ru
SPIN-code: 5159-9006

Bibliography link: Shigarov A.O. Extracting data from arbitrary tables presented in electronic documents of editable formats driven by user-defined rules // Computational technologies. 2025. V. 30. № 3. P. 127-144