Article information

2015, Volume 20, No. 6, p. 87-112

Shigarov A.O., Bychkov I.V., Paramonov V.V., Belykh V.N.

Table analysis and interpretation based on execution of CRL rules

Arbitrary tagged tables in spreadsheets, word documents, and web pages are often a source of important information that needs to be loaded into a relational database. However, many of them have a complex structure that prevents their data from being loaded into a database directly. This paper addresses rule-based data extraction and transformation from arbitrary tagged tables into a structured (canonical) form that can be loaded into a database by standard ETL tools. We propose a novel rule language, CRL, for table analysis and interpretation. It enables the development of simple declarative programs that recover table semantics. Our methodology for rule-based table analysis and interpretation is mainly oriented toward unstructured tabular data integration tasks. We expect it to be useful when a large number of arbitrary tagged tables belonging to a few types need to be transformed into a structured form. The methodology is implemented in our prototype system for extracting and transforming unstructured tabular data from Excel spreadsheets. Our methodology and tools can be used to develop software for populating relational databases with tabular information contained in spreadsheets, word documents, and web pages. The experimental evaluation presented in the paper shows that CRL rules are effective for table analysis and interpretation.

Keywords: unstructured tabular data integration, table analysis and interpretation, information extraction from tables, unstructured ETL, table understanding, spreadsheet data extraction

Author(s):
Shigarov Alexei Olegovich
PhD.
Position: Senior Research Scientist
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk
Phone Office: (3952) 45-31-02
E-mail: shigarov@icc.ru

Bychkov Igor Vyacheslavovich
Dr., Academician of RAS, Professor
Position: Director
Office: Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences
Address: 664033, Russia, Irkutsk, Lermontova st., 134
Phone Office: (3952) 45-30-61
E-mail: idstu@icc.ru
SPIN-code: 5816-7451

Paramonov Vyacheslav Vladimirovich
PhD.
Position: Junior Research Scientist
Office: Institute for System Dynamics and Control Theory, Siberian Branch of RAS
Address: 664033, Russia, Irkutsk, Lermontova st., 134
Phone Office: (3952) 453073
E-mail: slv@icc.ru
SPIN-code: 2364-8270

Belykh V.N.
Dr., Professor
Position: Leading research officer
Address: 630090, Russia, Novosibirsk, Lermontova st., 134
Phone Office: (3832) 333 887
E-mail: belykh@math.nsc.ru

References:
[1] Ferrucci, D., Lally, A. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering. 2004; 10(3-4):327-348.

[2] Feldman, R., Sanger, J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press; 2006:423.

[3] Inmon, W.H., Nesavich, A. Tapping into Unstructured Data: Integrating Unstructured Data and Textual Analytics into Business Intelligence. Prentice Hall PTR. 2007.

[4] Doan, A., Naughton, J.F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F., Chen, T., Chu, E., DeRose, P., Gao, B., Gokhale, C., Huang, J., Shen, W., Vuong, B.-Q. Information Extraction Challenges in Managing Unstructured Data. ACM SIGMOD Record. 2009; 37(4):14-20.

[5] Unstructured Information Management Architecture (UIMA) Version 1.0, OASIS. 2009. Available at: http://docs.oasis-open.org/uima/v1.0/uima-v1.0.html

[6] Hurst, M. The Interpretation of Tables in Texts. PhD thesis. United Kingdom: University of Edinburgh, 2000.

[7] Hurst, M. Layout and language: Challenges for table understanding on the web. Proceedings of the First International Workshop on Web Document Analysis, Seattle, WA, September 2001. USA, WA: Seattle; 2001:27-30.

[8] Embley, D., Hurst, M., Lopresti, D., Nagy, G. Table-processing paradigms: a research survey. International Journal on Document Analysis and Recognition. 2006; 8(2):66-86.

[9] e Silva, A.C., Jorge, A.M., Torgo, L. Design of an end-to-end method to extract information from tables. International Journal on Document Analysis and Recognition. 2006; 8(2):144-171.

[10] Shigarov, A.O. Recovering the logical structure of tables from unstructured texts based on logical inference. Computational Technologies. 2014; 19(1):87-99. (In Russ.)

[11] Shigarov, A. Table understanding using a rule engine. Expert Systems with Applications. 2015; 42(2):929-937.

[12] Drools Expert. Available at: http://www.drools.org

[13] Wang, X. Tabular Abstraction, Editing, and Formatting: PhD thesis. Waterloo, Ontario, Canada: University of Waterloo; 1996.

[14] Nagy, G. Learning the characteristics of critical cells from web tables. Proceedings of the 21st International Conf. on Pattern Recognition. Tsukuba: IEEE Comp. Soc.; 2012:1554-1557.

[15] Apache POI. Available at: https://poi.apache.org

[16] YAML. Available at: http://yaml.org

[17] SnakeYAML. Available at: http://www.snakeyaml.org

[18] JavaBeans Specification 1.01 Final Release. 1997. Available at: http://www.oracle.com/technetwork/java/javase/tech/spec-136004.html

[19] JSR 94: Java Rule Engine API. Available at: https://jcp.org/en/jsr/detail?id=94

[20] JESS. Available at: http://www.jessrules.com

[21] RuleML. Available at: http://ruleml.org

[22] OpenRules. Available at: http://openrules.com/ruleengine.htm

[23] Douglas, S., Hurst, M., Quinn, D. Using Natural Language Processing for Identifying and Interpreting Tables in Plain Text. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval. Las Vegas, NV, US; 1995:535-546.

[24] Tijerino, Y., Embley, D., Lonsdale, D., Ding, Y., Nagy, G. Towards Ontology Generation from Tables. World Wide Web: Internet and Web Information Systems. 2005; 8(3):261-285.

[25] Embley, D., Tao, C., Liddle, S. Automating the Extraction of Data from HTML Tables with Unknown Structure. Data & Knowledge Engineering. 2005; 54(1):3-28.

[26] Wang, J., Wang, H., Wang, Z., Zhu, K.Q. Understanding Tables on the Web. Proceedings of the 31st International Conf. on Conceptual Modeling. Florence, Italy: Springer-Verlag; 2012:141-155.

[27] WordNet. Available at: http://wordnet.princeton.edu

[28] ProBase. Available at: http://research.microsoft.com/en-us/projects/probase

[29] Gatterbauer, W., Bohunsky, P., Herzog, M., Krupl, B., Pollak, B. Towards Domain-Independent Information Extraction from Web Tables. Proceedings of the 16th International Conf. on World Wide Web. New York, US; 2007:71-80.

[30] Pivk, A., Cimiano, P., Sure, Y. From Tables to Frames. Web Semantics: Science, Services and Agents on the World Wide Web. 2005; 3(2-3):132-146.

[31] Pivk, A. Thesis: Automatic Ontology Generation from Web Tabular Structures. AI Communications. 2006; 19(1):83-85.

[32] Pivk, A., Cimiano, P., Sure, Y., Gams, M., Rajkovic, V., Studer, R. Transforming Arbitrary Tables into Logical Form with TARTAR. Data & Knowledge Engineering. 2007; 60(3):567-595.

[33] Kim, Y.-S., Lee, K.-H. Extracting Logical Structures from HTML Tables. Computer Standards & Interfaces. 2008; 30(5):296-308.

[34] Embley, D., Seth, S., Nagy, G. Transforming Web Tables to a Relational Database. Proceedings of the 22nd International Conf. on Pattern Recognition. Washington, DC, USA: IEEE Comp. Soc.; 2014: 2781-2786.

[35] Nagy, G., Embley, D., Seth, S. End-to-End Conversion of HTML Tables for Populating a Relational Database. Proceedings of the 11th IAPR International Workshop on Document Analysis Systems. Tours - Loire Valley, France: IEEE Comp. Soc.; 2014: 222-226.

[36] Chen, Z., Cafarella, M. Automatic Spreadsheet Data Extraction. Third International Workshop on Semantic Search over the Web (SSW), 2013. Riva del Garda, Italy. ACM. 2013; 54(2):72-79.

[37] Chen, Z., Cafarella, M., Chen, J., Prevo, D., Zhuang, J. Senbazuru: A Prototype Spreadsheet Database Management System. Proceedings of the VLDB Endowment. 2013; 6(12):1202-1205.

[38] Chen, Z., Cafarella, M. Integrating Spreadsheet Data via Accurate and Low-effort Extraction. Proceedings of the 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2014: 1126-1135.

[39] Abraham, R., Erwig, M. UCheck: A spreadsheet type checker for end users. Journal of Visual Languages & Computing. 2007; 18(1):71-95.

[40] Chambers, C., Erwig, M. Automatic detection of dimension errors in spreadsheets. Journal of Visual Languages & Computing. 2009; 20(4):269-283.

[41] Cunha, J., Saraiva, J., Visser, J. From Spreadsheets to Relational Databases and Back. Proceedings of the ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation. Savannah, GA, USA: ACM; 2009:179-188.

[42] Hung, V., Benatallah, B., Saint-Paul, R. Spreadsheet-based Complex Data Transformation. Proceedings of the 20th ACM Int. Conf. on Information and Knowledge Management. Glasgow, Scotland, UK: ACM; 2011:1749-1754.

[43] Hung, V. Spreadsheet-Based Complex Data Transformation: PhD thesis. Sydney, Australia, School of Computer Science and Engineering, University of New South Wales, 2011.

Bibliography link:
Shigarov A.O., Bychkov I.V., Paramonov V.V., Belykh V.N. Table analysis and interpretation based on execution of CRL rules // Computational technologies. 2015. V. 20. No. 6. P. 87-112
ISSN 1560-7534
© 2024 FRC ICT