TY  - JOUR
AU  - Zeng, Hui
AB  - The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.
TI  - Measuring Massive Multitask Chinese Understanding
JF  - Computing Research Repository
DO  - 10.48550/arxiv.2304.12986
DA  - 2023-04-25
UR  - https://www.deepdyve.com/lp/arxiv-cornell-university/measuring-massive-multitask-chinese-understanding-3I0XmTDU1w
VL  - 2023
IS  - 2304
DP  - DeepDyve
ER  -