TY - JOUR
TI - LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding
AU - Xu, Yang
AU - Xu, Yiheng
AU - Lv, Tengchao
AU - Cui, Lei
AU - Wei, Furu
AU - Wang, Guoxin
AU - Lu, Yijuan
AU - Florencio, Dinei
AU - Zhang, Cha
AU - Che, Wanxiang
AU - Zhang, Min
AU - Zhou, Lidong
AB - Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image.
N1 - Affiliations: 1 Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology; 2 Microsoft Research Asia; 3 Microsoft Azure AI; 4 Soochow University
N1 - Contact: {yxu,car}@ir.hit.edu.cn; {v-yixu,v-telv,lecu,fuwei,lidongz}@microsoft.com; {guow,yijlu,dinei,chazhang}@microsoft.com; minzhang@suda.edu.cn
JF - Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
DO - 10.18653/v1/2021.acl-long.201
DA - 2021-01-01
UR - https://www.deepdyve.com/lp/unpaywall/layoutlmv2-multi-modal-pre-training-for-visually-rich-document-l0qfevbLy0
DP - DeepDyve
ER -