Journal of Henan Agricultural Sciences (河南农业科学) ›› 2021, Vol. 50 ›› Issue (10): 172-180. DOI: 10.15933/j.cnki.1004-3268.2021.10.022

• Agricultural Information and Engineering · Agricultural Product Processing •

Construction of Statistical Yearbook Data Cleaning Model Based on Workflow

ZHANG Hui1,2, WEI Dong1,2, QIAO Lu1,2, LI Dandan1,2, ZHANG Yuyao1,2, ZHENG Guoqing1,2, FENG Xiao1,2

  1. Institute of Agricultural Economics and Information, Henan Academy of Agricultural Sciences, Zhengzhou 450002, China; 2. Henan Engineering and Technology Research Center for Intelligent Agriculture, Zhengzhou 450002, China
  • Received: 2021-01-01 Published: 2021-10-15 Online: 2021-11-25
  • Corresponding author: FENG Xiao (born 1978), female, from Zhengzhou, Henan; associate researcher, master's degree; works mainly on agricultural data analysis and image processing technology. E-mail: fengxiao@hnagri.org.cn
  • About the author: ZHANG Hui (born 1975), male, from Guangshan, Henan; associate researcher; works mainly on agricultural information system development and on data analysis and mining technology. E-mail: zhanghui@hnagri.org.cn
  • Supported by:
    Henan Province Science and Technology Research Program (212102110213)

Abstract: To enable the integration and fast, comprehensive querying of statistical yearbook data, the China Statistical Yearbook and the statistical yearbooks of 31 provincial-level regions (such as the Henan Statistical Yearbook) from 2000 to 2018 were taken as examples, and their data characteristics were analyzed in depth. Using Alteryx Designer 2019.2 (learning edition) and workflow technology, a statistical yearbook data cleaning model was built in 7 steps: extracting directories and files, extracting the forms from each file, extracting the table contents from each form, data cleaning and normalization, standardizing the labels for the six dimensions of the data, data reorganization, and data output. The results showed that, on a laptop with 16 GB of RAM, the model took 4 to 5 hours to clean the yearbook data (21 GB, comprising more than 330 thousand files and around 1.2 million forms) and integrate them into a single standardized dataset containing more than 60 million indicator data series. The resulting data cleaning modeling method is efficient and traceable.

Key words: Workflow, Statistical yearbook, Data cleaning, Alteryx, Quality control
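
The seven steps named in the abstract can be pictured as a chain of small functions, each consuming the previous step's output. The Python sketch below is only an illustration under stated assumptions: the authors built their model in Alteryx Designer, not Python; the in-memory `ARCHIVE`, the file path layout, the function names, and the sample value are all hypothetical; and the abstract does not list the six dimensions, so `region`, `year`, and `indicator` are placeholders. Each record carries a `source` field to mirror the traceability the abstract mentions.

```python
# Hypothetical in-memory stand-in for a yearbook archive:
# file path -> form (sheet) name -> rows of raw cell text.
# The path layout "region/year/file" and the value are invented for illustration.
ARCHIVE = {
    "henan/2018/agriculture.xls": {
        "grain": [
            ["Indicator", "Value"],
            [" Wheat output ", "1,234.5"],
            ["", ""],
        ],
    },
}


def extract_files(archive):
    # Step 1: enumerate the files found in the directory tree.
    return [{"file": path, "sheets": sheets} for path, sheets in archive.items()]


def extract_sheets(files):
    # Step 2: produce one item per form (sheet) in each file.
    return [
        {"file": f["file"], "sheet": name, "rows": rows}
        for f in files
        for name, rows in f["sheets"].items()
    ]


def extract_cells(sheets):
    # Step 3: turn each data row into a record, using the first row as header.
    out = []
    for s in sheets:
        header, *body = s["rows"]
        for row in body:
            rec = dict(zip(header, row))
            rec["source"] = f'{s["file"]}#{s["sheet"]}'  # provenance for traceability
            out.append(rec)
    return out


def clean(records):
    # Step 4: trim whitespace, drop empty rows, parse numbers.
    out = []
    for r in records:
        name = r["Indicator"].strip()
        value = r["Value"].replace(",", "").strip()
        if name and value:
            out.append({"indicator": name, "value": float(value), "source": r["source"]})
    return out


def label_dimensions(records):
    # Step 5: attach dimension labels parsed from the path (placeholder dimensions).
    out = []
    for r in records:
        region, year, _ = r["source"].split("/", 2)
        out.append({**r, "region": region, "year": int(year)})
    return out


def reorganize(records):
    # Step 6: key each record by its dimension labels.
    return {(r["region"], r["year"], r["indicator"]): r for r in records}


def output(table):
    # Step 7: emit flat rows in a stable order.
    return [(*key, table[key]["value"], table[key]["source"]) for key in sorted(table)]


STEPS = [extract_files, extract_sheets, extract_cells, clean,
         label_dimensions, reorganize, output]


def run_workflow(archive, steps=STEPS):
    # Feed the output of each step into the next, as a workflow tool would.
    data = archive
    for step in steps:
        data = step(data)
    return data


result = run_workflow(ARCHIVE)
```

Keeping every step as a separate function makes it possible to inspect the intermediate output of any stage, which is one plausible reading of why a workflow-based design helps with tracing a cleaned value back to its source file and form.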

CLC Number: