Text classification corpora with a fixed class number and varying imbalance ratios
Source: 庞观松
University of Technology Sydney, Australia
2014-07-04

The DMOZ data is a large-scale dataset based on the ODP (Open Directory Project) web directory. It was constructed and distributed for the first edition of the Large Scale Hierarchical Text Classification (LSHTC) challenge in 2009. The challenge used two datasets, a large one and a small one. From the large one we selected two subsets, denoted DMOZ10-S1 and DMOZ10-S2, to evaluate text classifiers on corpora with a fixed number of classes and varying imbalance ratios. Counting all levels of the hierarchy, this dataset has 12,294 classes, 10 of which are at the first level. We focus on non-hierarchical text classification.


DMOZ10-S1 contains one large class and one small class, with 6,921 training documents and 2,555 test documents in total and a vocabulary of 62,334 terms. DMOZ10-S2 consists of two of the largest classes in DMOZ10, with 24,335 training documents and 9,281 test documents in total and a vocabulary of 176,828 terms.
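The statistics quoted above (documents per class, imbalance ratio, vocabulary size) could be recomputed from the raw files along the following lines. This is only a minimal sketch: it assumes each document is one line of the form "label feature:value feature:value ...", and it takes the imbalance ratio to be the size of the largest class divided by the size of the smallest. Both the file layout and that definition are assumptions made for illustration, and the function name `corpus_stats` and the toy documents are hypothetical.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (class_counts, imbalance_ratio, vocab_size) for a labelled corpus.

    Assumption: each line looks like "label feat:val feat:val ...".
    This layout is not confirmed by the dataset description.
    """
    class_counts = Counter()
    vocabulary = set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        class_counts[parts[0]] += 1  # first field is the class label
        for token in parts[1:]:
            feature_id, _, _ = token.partition(":")
            vocabulary.add(feature_id)
    # Assumed definition: largest class size over smallest class size.
    largest = max(class_counts.values())
    smallest = min(class_counts.values())
    return class_counts, largest / smallest, len(vocabulary)


# Toy documents invented for illustration; not taken from DMOZ.
toy = ["A 1:2 5:1", "A 2:1 5:3", "A 1:1 7:1", "B 3:1 5:1"]
counts, ratio, vocab_size = corpus_stats(toy)
```

Running the same function on the training file of each subset would give a direct way to compare how skewed DMOZ10-S1 is relative to DMOZ10-S2, provided the files follow the assumed layout.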


DMOZ10-S1 and DMOZ10-S2 were used in our recent DMKD paper available at http://link.springer.com/article/10.1007/s10618-014-0358-x 


The complete DMOZ datasets are available at http://lshtc.iit.demokritos.gr/node/3

