Text classification corpora with a fixed highly skewed ratio and varying class numbers
Source: 庞观松
University of Technology Sydney, Australia
2014-07-04

In many text classification tasks, it is typically a big challenge to perform classification on skewed text corpora, in which positive text instances heavily outnumber negative instances.

 

Most traditional text classifiers, such as SVM, naive Bayes, and neural networks, do not perform well under a skewed class distribution, especially a highly skewed one. Moreover, the influence of skewness on classification performance intensifies as the number of classes increases. To examine this relation, we used the Fudan text classification corpus to create 8 highly skewed corpora. All of them share the same skew ratio, approximately 1:123, but differ in the number of classes. In each corpus, the number of small classes is nearly the same as the number of large classes.
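As a rough illustration of how corpora with a fixed skew ratio and a varying number of classes might be assembled, the following Python sketch subsamples a labeled corpus. It is not the procedure used to build the attached datasets. The function names, the definition of skew ratio as the largest class size divided by the smallest class size, and the per-class sizes are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def skew_ratio(labels):
    """Largest class size divided by smallest class size (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def make_skewed_corpus(docs, labels, small_classes, large_classes,
                       small_size, ratio, seed=0):
    """Subsample a corpus so that every small class keeps `small_size`
    documents and every large class keeps `small_size * ratio` documents.

    Hypothetical helper: the names and the sampling scheme are illustrative.
    """
    rng = random.Random(seed)

    # Group documents by their class label.
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)

    # Target number of documents to keep for each selected class.
    targets = {c: small_size for c in small_classes}
    targets.update({c: small_size * ratio for c in large_classes})

    out_docs, out_labels = [], []
    for label, n in targets.items():
        pool = by_class[label]
        if len(pool) < n:
            raise ValueError(f"class {label!r} has {len(pool)} docs, need {n}")
        # Sample without replacement so no document is duplicated.
        for doc in rng.sample(pool, n):
            out_docs.append(doc)
            out_labels.append(label)
    return out_docs, out_labels
```

Calling `make_skewed_corpus` repeatedly with the same `ratio` (for example 123) but with longer lists of small and large classes would give a family of corpora that differ only in class count, and `skew_ratio` can be used to check the resulting labels.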


The datasets were used in our recent DMKD paper available at http://link.springer.com/article/10.1007/s10618-014-0358-x


If you intend to conduct similar experiments, you can download the attached datasets below.

 

Note: The Fudan University text classification corpus is from the Chinese natural language processing group in the Department of Computer Information and Technology at Fudan University. It was originally available at http://www.nlp.org.cn/docs/download.php?doc_id=294.

 

 

Attachments
