In many text classification tasks, it is a major challenge to classify skewed text corpora, in which the instances of one class heavily outnumber those of another.
Most traditional text classifiers, such as SVM, naive Bayes, and neural networks, do not perform well under a skewed class distribution, especially a highly skewed one. Moreover, the influence of skewness on classification performance intensifies as the number of classes increases. To examine this relation, we used the Fudan text classification corpus to create 8 highly skewed corpora. Every corpus has the same skew ratio, approximately 1:123, but a different number of classes. In each corpus, the number of small classes is nearly the same as the number of large classes.
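As a rough illustration of what a skew ratio of about 1:123 means, the following minimal sketch computes the ratio between the smallest and the largest class in a list of labels. It assumes that the skew ratio is the size of a small class relative to the size of a large class; the function name `skew_ratio` and the toy labels are hypothetical and are not taken from the datasets or the paper.

```python
from collections import Counter


def skew_ratio(labels):
    """Return (smallest_class_size, largest_class_size) for a list of labels."""
    counts = Counter(labels)          # number of documents per class
    sizes = sorted(counts.values())   # ascending class sizes
    return sizes[0], sizes[-1]


# Hypothetical toy labels: one small class with 2 documents,
# one large class with 246 documents (assumed numbers, for illustration only).
labels = ["small"] * 2 + ["large"] * 246

smallest, largest = skew_ratio(labels)
print(f"skew ratio is about 1:{largest / smallest:.0f}")
```

A check of this kind can be run on the class labels of a corpus before training, to confirm that the class distribution is as skewed as expected.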
The datasets were used in our recent DMKD paper available at http://link.springer.com/article/10.1007/s10618-014-0358-x
If you intend to conduct similar experiments, you can download the attached datasets below.
Note: The Fudan University text classification corpus comes from the Chinese natural language processing group in the Department of Computer Information and Technology at Fudan University. It was originally available at http://www.nlp.org.cn/docs/download.php?doc_id=294.