Text classification corpus TanCorp12 with a specific training/test split
Source: 庞观松, University of Technology Sydney, Australia
2014-07-04

TanCorp is a widely used Chinese text classification corpus. It contains 14,150 documents in total, organized into 12 classes; this version is denoted TanCorp12. Because TanCorp12 is distributed without a fixed training/test split, we split the document collection into training and test sets at a 2:1 ratio to make experimental comparisons on this corpus easier. The resulting split is available below. The original version of TanCorp12 is available at http://www.searchforum.org.cn/tansongbo/corpus.htm.
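For readers who want to see what a 2:1 split looks like in practice, here is a minimal sketch in Python. It is not the procedure used to produce the split distributed here: it assumes a per-class (stratified) random split with a fixed seed, and the function name `split_two_to_one` and the toy labels are hypothetical. To compare results with published numbers, use the provided split rather than regenerating one.

```python
import random
from collections import defaultdict


def split_two_to_one(labels, seed=0):
    """Return (train_ids, test_ids) with about 2/3 of each class in training.

    `labels` is a list of class labels, one per document; document ids are
    the list positions. Stratifying by class is an assumption of this sketch.
    """
    by_class = defaultdict(list)
    for doc_id, label in enumerate(labels):
        by_class[label].append(doc_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_ids, test_ids = [], []
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        cut = (2 * len(ids)) // 3  # 2:1 ratio, rounded down
        train_ids.extend(ids[:cut])
        test_ids.extend(ids[cut:])
    return sorted(train_ids), sorted(test_ids)


# Toy labels standing in for the real corpus (hypothetical class names).
toy_labels = ["A"] * 9 + ["B"] * 6
train, test = split_two_to_one(toy_labels)
print(len(train), len(test))  # prints: 10 5
```

Splitting within each class keeps the class proportions similar in both sets, which matters when class sizes are uneven.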


This data set was used in our recent DMKD paper available at http://link.springer.com/article/10.1007/s10618-014-0358-x

Attachments
