Journal of Henan Agricultural Sciences ›› 2024, Vol. 53 ›› Issue (2): 152-161. DOI: 10.15933/j.cnki.1004-3268.2024.02.017

• Agricultural Information and Engineering · Agricultural Product Processing •

Recognition of Cotton Pests and Diseases Named Entities Based on RoBERTa Multi-feature Fusion

LI Dongya1, BAI Tao1,2,3, XIANG Huimin4, DAI Shuo1, WANG Zhenlu1, CHEN Zhen1

  1. College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China; 2. Intelligent Agriculture Engineering Research Center of the Ministry of Education, Urumqi 830052, China; 3. Xinjiang Agricultural Informatization Engineering Technology Research Center, Urumqi 830052, China; 4. Xinjiang Science and Technology College, Urumqi 830049, China
  • Received: 2023-06-15 Published: 2024-02-15 Online: 2024-03-20
  • Corresponding author: BAI Tao (born 1979), male, from Yuzhong, Gansu; associate professor, master's degree; mainly engaged in research on agricultural big data and data mining. E-mail: bt@xjau.edu.cn
  • About the author: LI Dongya (born 1997), male, from Zhumadian, Henan; master's student; research interests: natural language processing and knowledge graphs. E-mail: 569774848@qq.com
  • Funding:
    Science and Technology Innovation 2030 Major Project of the Ministry of Science and Technology (2022ZD0115800); Major Science and Technology Special Project of Xinjiang Uygur Autonomous Region (2022A02011-4); Basic Scientific Research Funds Project for Universities of Xinjiang Uygur Autonomous Region (XJEDU2022J009)

Abstract: Text corpora on cotton pests and diseases are scarce, no Chinese named entity recognition corpus exists for the domain, and the entities themselves are complex, varied in type, and unevenly distributed. To address these problems, a Chinese entity recognition corpus, CDIPNER, covering 11 categories of cotton pest and disease entities was constructed, and a named entity recognition model based on RoBERTa multi-feature fusion was proposed. The model uses the RoBERTa pre-trained model, which has a stronger mask learning ability, to convert the text into character-level embedding vectors. BiLSTM and IDCNN then jointly extract feature vectors, capturing the temporal and spatial features of the text, respectively. A multi-head self-attention mechanism fuses the extracted feature vectors, and the CRF algorithm finally generates the predicted sequence. The results showed that the model recognized named entities in cotton pest and disease text with a precision of 96.60%, a recall of 95.76%, and an F1 value of 96.18%; it also performed well on public datasets such as ResumeNER. This indicates that the model can effectively identify cotton pest and disease named entities and has a certain generalisation ability.

Key words: Cotton, Pests and diseases, RoBERTa model, Named entity recognition, Multi-feature fusion, Multi-head attention mechanism
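
As a quick consistency check on the scores reported in the abstract, the F1 value can be recomputed as the harmonic mean of the other two scores. The sketch below is illustrative only: the function name `f1_score` and the variable names are assumptions introduced here, not taken from the paper, and it assumes the first two reported scores are precision and recall on the same percentage scale.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scores as reported in the abstract, in percent.
reported_precision = 96.60
reported_recall = 95.76
reported_f1 = 96.18

# Recompute F1 from the reported precision and recall.
computed_f1 = f1_score(reported_precision, reported_recall)
print(f"{computed_f1:.2f}")
```

Rounded to two decimals, the recomputed value agrees with the reported F1, so the three figures are mutually consistent.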

CLC number: