TY - JOUR
AU -
TI - Cross-Domain Detection of GPT-2-Generated Technical Text
AB - Machine-generated text presents a potential threat not only to the public sphere, but also to the scientific enterprise, whereby genuine research is undermined by convincing, synthetic text. In this paper we examine the problem of detecting GPT-2-generated technical research text. We first consider the realistic scenario where the defender does not have full information about the adversary's text generation pipeline, but is able to label small amounts of in-domain genuine and synthetic text in order to adapt to the target distribution. Even in the extreme scenario of adapting a physics-domain detector to a biomedical detector, we find that only a few hundred labels are sufficient for good performance. Finally, we show that paragraph-level detectors can be used to detect the tampering of full-length documents under a variety of threat models.
JF - Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
DO - 10.18653/v1/2022.naacl-main.88
DA - 2022-01-01
UR - https://www.deepdyve.com/lp/unpaywall/cross-domain-detection-of-gpt-2-generated-technical-text-4j3LXZiLZo
DP - DeepDyve
ER -