Research Resources

The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification

Published: 2018-09-04 13:32:47

Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, Buzhou Tang

 mcdh.chenjing@gmail.com, qingcai.chen@hit.edu.cn, hit.liuxin@gmail.com, navyyang@webank.com, leslielu@webank.com, tangbuzhoug@gmail.com

Introduction:

As a semantic matching task, sentence semantic equivalence identification (SSEI) is fundamental to natural language processing (NLP) applications such as question answering (QA), automatic customer service, and chatbots. In customer service systems, two questions are defined as semantically equivalent if they convey the same intent or can be answered by the same answer. We introduce the Bank Question (BQ) corpus, a large-scale domain-specific Chinese corpus for SSEI. The BQ corpus contains 120,000 question pairs from online bank customer service logs. It is split into three parts: 100,000 pairs for training, 10,000 pairs for validation, and 10,000 pairs for testing. We report benchmark performance for five SSEI models on our corpus, including state-of-the-art algorithms. As the largest manually annotated public Chinese SSEI corpus in the bank domain, the BQ corpus is useful not only for Chinese question semantic matching research but also as a significant resource for cross-lingual and cross-domain SSEI research.
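As a rough illustration of the task setup described above, the following Python sketch models one labeled question pair and checks the stated split sizes. The class name, field names, and boolean label are assumptions made for this example; they are not the corpus's actual file format or schema, and the two sample pairs are invented placeholders, not corpus data.

```python
from dataclasses import dataclass

# Split sizes as stated in the introduction.
SPLIT_SIZES = {"train": 100_000, "validation": 10_000, "test": 10_000}


@dataclass(frozen=True)
class QuestionPair:
    """One SSEI example. Field names are hypothetical, not the corpus schema."""
    question1: str
    question2: str
    equivalent: bool  # True if both questions convey the same intent


def split_fractions(sizes):
    """Return each split's share of the total number of pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# Invented placeholder pairs for illustration only; not taken from the corpus.
pairs = [
    QuestionPair("How do I reset my password?", "I forgot my password, what now?", True),
    QuestionPair("How do I reset my password?", "What is my credit limit?", False),
]

total_pairs = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)
gold_labels = [pair.equivalent for pair in pairs]
```

Here `total_pairs` comes to 120,000, matching the corpus size given above. A binary label per pair is one natural reading of "semantically equivalent or not"; the released files may encode it differently.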

Download:

To obtain the corpus, please fill out the application form and send it to Qingcai Chen: qingcai.chen@hit.edu.cn or Jing Chen: mcdh.chenjing@gmail.com [Application form]

How to cite the BQ corpus:

This website accompanies our paper:

Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, Buzhou Tang. The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification. EMNLP 2018. [full text] [bib]

Copyright Notice:

  1. Respect the privacy of personal information in the original source.

  2. The original copyright of all data in the Bank Question (BQ) corpus belongs to the original writers. The Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen), collected, organized, filtered, and cleaned the data. The BQ corpus is free to the public for academic research.

  3. If you use the dataset for further research, the data provider, the Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen), should be acknowledged in your results.

  4. The dataset may be used only by the specified applicant or study group, and only for research purposes. Without permission, it may not be used for any commercial purpose.

  5. If these terms change, the latest online version shall prevail.

© 2014 Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School. All rights reserved.