Xin Liu, Qingcai Chen, Chong Deng,
Huajun Zeng, Jing Chen, Dongfang Li,
Buzhou Tang
hit.liuxin@gmail.com, qingcai.chen@gmail.com, dengchong.d@alibaba-inc.com, aaahchi@hotmail.com, mcdh.chenjing@gmail.com, crazyofapple@gmail.com, tangbuzhoug@gmail.com
Introduction:
Question matching is a fundamental task in question answering (QA). It is usually treated as a semantic matching task and sometimes as a paraphrase identification task. The goal is to retrieve, from an existing database, questions that have a similar intent to the input question. We introduce a large-scale Chinese question matching corpus, LCQMC. LCQMC is more general than a paraphrase corpus because it focuses on intent matching rather than paraphrase. The corpus contains 260,068 manually annotated question pairs, which we split into three parts: a training set of 238,766 pairs, a development set of 8,802 pairs, and a test set of 12,500 pairs. We test several well-known sentence matching methods on it. The experimental results not only demonstrate the good quality of LCQMC but also provide solid baseline performance for further research on this corpus.
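As a rough illustration of how the splits might be consumed, the sketch below reads question pairs from a tab-separated stream and checks that the three split sizes stated above add up to the corpus total. The file layout (two questions and a 0/1 label per line, separated by tabs), the function name `read_pairs`, and the two sample lines are assumptions made for illustration only; they are not taken from the corpus or confirmed by this page.

```python
import csv
import io

# Split sizes as stated in the introduction.
SPLIT_SIZES = {"train": 238_766, "dev": 8_802, "test": 12_500}


def read_pairs(handle):
    """Yield (question1, question2, label) tuples from a tab-separated stream.

    Assumed layout (hypothetical, not confirmed by this page): one pair
    per line, with two questions and a 0/1 label separated by tabs.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip lines that do not match the assumed layout
        question1, question2, label = row
        yield question1, question2, int(label)


# Two made-up lines standing in for a real split file.
sample = io.StringIO(
    "今天天气怎么样\t今天天气如何\t1\n"
    "怎么学英语\t手机哪个牌子好\t0\n"
)
pairs = list(read_pairs(sample))

# The three splits should account for every pair in the corpus.
total = sum(SPLIT_SIZES.values())
print(total, len(pairs))
```

With a real split file, the `io.StringIO` sample would be replaced by a handle such as `open(path, encoding="utf-8", newline="")`, where `path` is wherever the applicant stores the data.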
Download:
To obtain the corpus, please fill out the application form and send it to Qingcai Chen (qingcai.chen@hit.edu.cn) or Xin Liu (hit.liuxin@gmail.com). [application form]
How to cite LCQMC:
This website accompanies our paper:
Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, Buzhou Tang. LCQMC: A Large-scale Chinese Question Matching Corpus. COLING 2018.
Copyright Notice:
1. Respect the privacy of personal information in the original source.
2. The original copyright of all data in LCQMC: A Large-scale Chinese Question Matching Corpus belongs to the writers on BaiduKnows. The Intelligent Computing Research Center, Harbin Institute of Technology, Shenzhen, collected, organized, filtered, and cleaned the data. LCQMC is free to the public.
3. If you use the dataset for in-depth study, the data provider (Intelligent Computing Research Center, Harbin Institute of Technology, Shenzhen) should be identified in your results.
4. The dataset is provided only to the specified applicant or study group, for research purposes. Without permission, it may not be used for any commercial purpose.
5. If these terms change, the latest online version shall prevail.