| Article information 2025, Volume 30, No. 3, p. 127-144
Shigarov A.O. Extracting data from arbitrary tables presented in electronic documents of editable formats driven by user-defined rules
Currently, a vast amount of relational data is available in unstructured sources, particularly as arbitrary tables within electronic documents in various editable formats. Unlike databases, these tables are designed primarily for human comprehension rather than for direct interpretation by computer programs. To be used in applications that require automatic data interpretation, these tables must first be converted into a structured representation. This work addresses the complex task of automating the extraction of relational data from arbitrary tables within electronic documents. A novel method is proposed to tackle this problem. It relies on end-user programming to create rules for analyzing and interpreting tables that have been converted into spreadsheet formats (Excel/Sheets). Unlike existing rule-based methods for this task, the approach decouples the rules from the underlying representation models and processing algorithms. The proposed method also accommodates arbitrary table layouts, structured cells, and header hierarchies. The tools implementing this method offer several notable advantages. The developed model of a document table represents arbitrary layouts, structured cells, and header hierarchies, which provides flexibility in handling diverse table formats. The problem-oriented language, CRL, offers a simplified syntax designed specifically for end-user programming. Unlike general-purpose rule languages, CRL conceals irrelevant details, allowing users to focus exclusively on the logic of table analysis and interpretation. CRL rules can effectively replace specialized algorithms that are typically implemented using imperative programming or supervised learning. A single set of CRL rules can process a class of tables that share similar layout, formatting, and content characteristics. The complexity of implementing these rules depends on how diverse these shared properties are. The tools for table analysis and interpretation are integrated into the TabbyXL software package, which specializes in extracting data from tables. TabbyXL supports workbooks in spreadsheet formats (Excel/Sheets). Document tables in other editable formats (Word/Docs, HTML, etc.) can be converted to the supported formats using existing conversion tools. The effectiveness of TabbyXL is substantiated by experimental data and by both qualitative and quantitative comparisons with existing analogues. Additionally, the software has been successfully applied in various scientific and industrial projects, addressing practical challenges such as data integration, construction of domain-specific ontologies, and management of template-based document exchange.
 Keywords: table understanding, table analysis and interpretation, data extraction, unstructured data, document tables
 
 doi: 10.25743/ICT.2025.30.3.010
 
 Author(s): Shigarov Alexei Olegovich
 PhD.
 Position: Leading research officer
 Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
 Address: 664033, Russia, Irkutsk
 Phone Office: (3952) 45-31-07
 E-mail: shigarov@icc.ru
 SPIN-code: 5159-9006
 Bibliography link:
 Shigarov A.O. Extracting data from arbitrary tables presented in electronic documents of editable formats driven by user-defined rules // Computational technologies. 2025. V. 30. No. 3. P. 127-144
 |