TanCorp is a widely used Chinese text classification corpus. It contains 14,150 documents in total, organized into 12 classes; this version is denoted TanCorp12. Because TanCorp12 is distributed without a standard training/test split, we split the document collection into training and test sets at a 2:1 ratio to facilitate experimental comparisons on this corpus; the resulting split is available below. The original version of TanCorp12 is available at http://www.searchforum.org.cn/tansongbo/corpus.htm.
This data set was used in our recent DMKD paper, available at http://link.springer.com/article/10.1007/s10618-014-0358-x
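
For readers who want to produce a comparable 2:1 split on their own copy of the corpus, the following is a minimal sketch, not the procedure used to build the split distributed here. It assumes a per-class (stratified) random split, which the description above does not specify; the function name `split_two_to_one`, the `seed` parameter, and the toy labels are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def split_two_to_one(labels, seed=0):
    """Return (train_idx, test_idx) with about 2/3 of each class in training.

    Assumption: the split is done per class. The original description
    only states the overall 2:1 ratio.
    """
    # Group document indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = (2 * len(indices)) // 3  # first two thirds go to training
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels standing in for the real class labels (hypothetical data).
labels = ["a"] * 9 + ["b"] * 6
train_idx, test_idx = split_two_to_one(labels)
print(len(train_idx), len(test_idx))  # 10 5
```

To compare results with the DMKD paper, use the split files provided here rather than regenerating one, since a different random seed or splitting rule yields different training and test sets.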