| Article information  2024, Volume 29, No. 6, p. 125-146
Shigarov A.O. Table recognition in untagged PDF documents using PDF-specific features
PDF is one of the most popular formats for distributing print-oriented documents electronically. PDF documents are often untagged: their pages are represented only by low-level instructions for rendering text and graphics, without annotations of their structural components (headings, paragraphs, tables, etc.). Automatically recovering such annotations can make these structural components accessible. Doing so requires solving a number of tasks, one of which is recognizing tables in untagged PDF documents, that is, detecting the boundaries of their rows, columns, and cells. This paper proposes a method for this task. Unlike existing analogues, it relies on PDF-specific features such as text output order and pen movement positions, among others. These features made it possible to adapt several known approaches and methods, originally oriented towards raster images and unformatted text, to the stated task, including “word clustering”, “rows first” detection, whitespace segmentation, and connected component analysis. The presented performance evaluation results demonstrate the effectiveness of the solutions implemented on the basis of the proposed method. Quantitative comparison with analogues indicates that these solutions are in line with the current state of the art in the field. Qualitative comparison reveals the following advantages over analogues: the implementation of the proposed table recognition method requires neither preliminary parameter adjustment nor supervised learning. However, if ready-to-use neural network models are available, they can replace the rule-based table detection algorithms. In addition, the quality of the final results can be improved by filtering candidate cases.
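As a rough illustration of the whitespace segmentation idea named in the abstract, the following minimal Python sketch finds vertical whitespace gaps between word bounding boxes, which could serve as candidate column separators. It is not the algorithm from the paper: the box format `(x0, y0, x1, y1)`, the function name `column_gaps`, and the `min_gap` threshold are all assumptions made for this example, and the sketch does not use PDF-specific features such as text output order.

```python
from typing import List, Tuple

# Hypothetical word bounding box: (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def column_gaps(boxes: List[Box], min_gap: float = 5.0) -> List[Tuple[float, float]]:
    """Return horizontal intervals (x_start, x_end) that no box overlaps.

    Each interval is a vertical strip of whitespace at least `min_gap` wide,
    i.e. a candidate column separator. `min_gap` is an illustrative threshold.
    """
    # Project every box onto the x-axis and sort by left edge.
    intervals = sorted((b[0], b[2]) for b in boxes)
    gaps: List[Tuple[float, float]] = []
    if not intervals:
        return gaps
    covered_end = intervals[0][1]  # right edge of the merged run so far
    for x0, x1 in intervals[1:]:
        if x0 - covered_end >= min_gap:
            gaps.append((covered_end, x0))
        covered_end = max(covered_end, x1)
    return gaps


# Two text rows, each with one word in each of two columns.
words = [
    (10, 100, 40, 110), (60, 100, 90, 110),
    (12, 120, 38, 130), (62, 120, 95, 130),
]
print(column_gaps(words))  # [(40, 60)]
```

The sketch only merges x-axis projections, so a single wide spanning cell would hide the gap beneath it; a practical system would need additional handling for such cases.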
 Keywords: table recognition, table extraction, unstructured data, document tables, document page layout analysis
 
 doi: 10.25743/ICT.2024.29.6.008
 
 Author(s): Shigarov Alexei Olegovich
 PhD.
 Position: Leading research officer
 Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
 Address: 664033, Russia, Irkutsk
 Phone Office: (3952) 45-31-07
 E-mail: shigarov@icc.ru
 SPIN-code: 5159-9006
 Bibliography link:
 Shigarov A.O. Table recognition in untagged PDF documents using PDF-specific features // Computational technologies. 2024. V. 29. No. 6. P. 125-146
 |