TY - JOUR
AU - Lin, Dahua
AB - The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used to train InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at this https URL.
TI - WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
JF - Computing Research Repository
DO - 10.48550/arxiv.2308.10755
DA - 2023-08-21
UR - https://www.deepdyve.com/lp/arxiv-cornell-university/wanjuan-a-comprehensive-multimodal-dataset-for-advancing-english-and-PyiX7zJJyV
VL - 2023
IS - 2308
DP - DeepDyve
ER -