TY - JOUR
AU - Lin, Dahua
AB - The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used to train InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at this https URL.
TI - WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
JF - Computing Research Repository
DO - 10.48550/arxiv.2308.10755
DA - 2023-08-21
UR - https://www.deepdyve.com/lp/arxiv-cornell-university/wanjuan-a-comprehensive-multimodal-dataset-for-advancing-english-and-PyiX7zJJyV
VL - 2023
IS - 2308
DP - DeepDyve
ER -