Introduction:
Automatic text summarization is widely regarded as the highly difficult problem, partially because of the lack of large text summarization data set. Due to the great challenge of constructing the large scale summaries for full text, we introduce a Large-scale Chinese Short Text Summarization dataset constructed from the Chinese microblogging website SinaWeibo. This corpus consists of over 2 million real Chinese short texts with short summaries given by the writer of each text. We also manually tagged the relevance of 10,666 short summaries with their corresponding short texts. Based on the corpus, we introduce recurrent neural network for the summary generation and achieve promising results, which not only shows the usefulness of the proposed corpus for short text summarization research, but also provides a baseline for further research on this topic.
Data Property:
The dataset consists of three parts shown as Table 1.
Part I contains the large scale (short text, summary) pairs.
Part II contains the 10,666 human labeled (short text, summary) pairs, the score ranges from 1 to 5 which indicates the relevance between the short text and the corresponding summary. ‘1’ denotes“ the least relevant ” and ‘5’ denotes “the most relevant”. These data is randomly sampled from Part I. Figure 4 illustrates examples of different scores.
Part III contains 1,106 pairs which are scored by 3 persons simultaneously. This part is independent from Part I and Part II. In this work, we use pairs scored by 3, 4 and 5 of this part as the test set for short text summarization generation task. Part II and Part III can also be used as training and testing set to train a model which can be used to select required part of Part I.
Download:
If you want to acquire the corpus. Please fill the application form and send to Qingcai Chen: qingcai.chen@hit.edu.cn or Baotian Hu:baotianchina@gmail.com [application form, Mainland China] [application form, Other]
How to cite LCSTS:
This websit accompanies our paper:
Qingcai Chen, Baotian Hu and Fangze Zhu, LCSTS: A Large Scale Chinese Short Text Summarization Dataset, Emnlp2015. [full text] [bib]
Copyright Notice:
1.Respect the privacy of personal information of the original source.
2.The original copyright of all the data of the Large Scale Chinese Short Text Summarization Dataset belongs to writers of the Weiboes, Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School collects, organizes, filters and purifies them. LCSTS is free to the public.
3.If you want to use the dataset for depth study, data providers (Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School) should be identified in your results.
4.The dataset is only for the specified applicant or study groups for research purposes. Without permission, it may not be used for any commercial purposes.
5.If the terms changed, the latest online version shall prevail.