ADAPTIVE ENSEMBLING OF SEMI-SUPERVISED CLUSTERING SOLUTIONS

 

ABSTRACT

Conventional semi-supervised clustering approaches have several shortcomings: (1) they do not fully utilize all useful must-link and cannot-link constraints, (2) they do not consider how to deal with high dimensional data with noise, and (3) they do not fully address the need for an adaptive process to further improve the performance of the algorithm. In this paper, we first propose the transitive closure based constraint propagation approach, which makes use of the transitive closure operator and affinity propagation to address the first limitation. Then, the random subspace based semi-supervised clustering ensemble framework with a set of proposed confidence factors is designed to address the second limitation and provide more stable, robust and accurate results. Next, the adaptive semi-supervised clustering ensemble framework is proposed to address the third limitation; it adopts a newly designed adaptive process to search for the optimal subspace set. Finally, we adopt a set of nonparametric tests to compare different semi-supervised clustering ensemble approaches over multiple datasets. The experimental results on 20 real high dimensional cancer datasets with noisy genes and 10 datasets from the UCI and KEEL repositories show that (1) the proposed approaches work well on most of the real-world datasets, and (2) they outperform other state-of-the-art approaches on 12 out of 20 cancer datasets and 8 out of 10 UCI machine learning datasets.

EXISTING SYSTEM:

Semi-supervised clustering is one of the important research directions in the area of data mining. It makes use of prior knowledge, such as pairwise constraints or a small amount of labeled data, to guide the search process and improve the quality of clustering. A number of semi-supervised clustering approaches have been previously proposed, which can be divided into five categories. The approaches belonging to the first category focus on designing new kinds of semi-supervised clustering algorithms, such as semi-supervised hierarchical clustering, semi-supervised kernel mean shift clustering, semi-supervised maximum margin clustering, semi-supervised linear discriminant clustering, semi-supervised subspace clustering, semi-supervised matrix decomposition, semi-supervised information-maximization clustering, active semi-supervised fuzzy clustering, semi-supervised kernel fuzzy c-means, a semi-supervised clustering framework based on discriminative random fields, semi-supervised fuzzy clustering based on competitive agglomeration, semi-supervised clustering corresponding to spherical K-means and feature projection, constrained clustering based on Minkowski weighted K-means, semi-supervised non-negative matrix factorization based on constraint propagation, semi-supervised clustering based on seeding, and so on. In general, most of the new semi-supervised clustering approaches extend traditional clustering algorithms by taking into account label information or pairwise constraints.
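As an illustration of how a traditional algorithm can be extended with label information, the following is a minimal sketch of K-means initialised and constrained by a few labeled seed points. It is not the method of any specific work listed above; the function name seeded_kmeans, its parameters, and the toy points are all assumptions made for this example, and the sketch assumes every cluster label from 0 to k-1 has at least one seed.

```python
def seeded_kmeans(points, seeds, n_iter=20):
    """K-means whose centroids start from labeled seeds (illustrative sketch).

    points: list of equal-length tuples of floats
    seeds:  dict mapping a point index to its known cluster label (0..k-1);
            assumed to contain at least one seed per cluster
    """
    k = max(seeds.values()) + 1
    dim = len(points[0])

    def mean(indices):
        return tuple(sum(points[i][d] for i in indices) / len(indices)
                     for d in range(dim))

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Initial centroids are the means of the seed points of each cluster.
    centroids = [mean([i for i, lab in seeds.items() if lab == c])
                 for c in range(k)]

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: seeds keep their given label,
        # every other point goes to its nearest centroid.
        for i, p in enumerate(points):
            if i in seeds:
                labels[i] = seeds[i]
            else:
                labels[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: each cluster is non-empty because it holds its seeds.
        new_centroids = [mean([i for i, lab in enumerate(labels) if lab == c])
                         for c in range(k)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


# Toy data (made up for the example): two well-separated groups, one seed each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
print(seeded_kmeans(points, seeds={0: 0, 3: 1}))  # [0, 0, 0, 1, 1, 1]
```

The only change relative to plain K-means is that the labeled points fix the initial centroids and never change cluster, which is one simple way prior knowledge can guide the search.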

PROPOSED SYSTEM:

The contribution of the paper is threefold. First, we propose a random subspace based semi-supervised clustering ensemble framework (RSEMICE) to provide more stable, robust and accurate results based on a set of confidence factors. Second, an adaptive process is designed to search for the optimal subspace set and improve the performance of RSEMICE. Third, we adopt a set of nonparametric tests to compare constrained clustering approaches over multiple datasets.

In this paper, we focus on constrained clustering, which belongs to the class of semi-supervised clustering approaches. Constrained clustering integrates a set of must-link constraints and cannot-link constraints into the clustering process. A must-link constraint means that two data samples should belong to the same cluster, while a cannot-link constraint means that two data samples cannot be assigned to the same cluster. Traditional constrained clustering approaches have two limitations: (1) they do not consider how to make full use of must-link constraints and cannot-link constraints; (2) some methods do not take into account how to deal with high dimensional data with noise.
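The definitions above imply further constraints: if A must link with B and B with C, then A must link with C, and a cannot-link between two samples extends to everything must-linked with either of them. The following minimal sketch computes this transitive closure with a union-find structure. It covers only the closure step, not the propagation step of the proposed approach, and the function name constraint_closure and the sample indices are assumptions made for this example.

```python
from itertools import combinations


def constraint_closure(n, must_link, cannot_link):
    """Return the transitive closure of pairwise constraints over n samples.

    Pairs are returned as (i, j) tuples with i < j.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Must-link behaves as an equivalence relation: merge linked samples.
    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Every pair inside a group is an implied must-link.
    ml = {pair for members in groups.values()
          for pair in combinations(members, 2)}

    # A cannot-link between two samples extends to their whole groups.
    cl = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError(f"inconsistent constraints on samples {a} and {b}")
        for x in groups[ra]:
            for y in groups[rb]:
                cl.add((min(x, y), max(x, y)))
    return ml, cl


ml, cl = constraint_closure(5, must_link=[(0, 1), (1, 2)], cannot_link=[(2, 3)])
print(sorted(ml))  # [(0, 1), (0, 2), (1, 2)]
print(sorted(cl))  # [(0, 3), (1, 3), (2, 3)]
```

Two must-link pairs and one cannot-link pair yield three constraints of each kind here, which illustrates how the closure makes fuller use of a small set of supplied constraints.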

CONCLUSION

This paper has proposed an adaptive semi-supervised clustering ensemble framework (A-RSEMICE) for high dimensional data clustering. When compared with traditional semi-supervised clustering approaches, A-RSEMICE is characterized by the following three properties: (1) A newly proposed transitive closure based constraint propagation approach is adopted, which makes use of the transitive closure operator and constraint propagation to fully exploit all useful must-link and cannot-link constraints. (2) A-RSEMICE adopts the random subspace based semi-supervised clustering ensemble framework to integrate the clustering solutions obtained by different transitive closure operators from multiple datasets into a unified clustering solution. (3) A newly designed adaptive process is adopted to search for the optimal subspace set. We performed a thorough analysis of the properties of A-RSEMICE in the experiments, and drew the following conclusions: (1) The transitive closure operator and the confidence factor each play an important role in attaining good performance for A-RSEMICE. (2) The adaptive process is useful for A-RSEMICE to obtain better results. (3) A-RSEMICE outperforms most of the state-of-the-art approaches when dealing with high dimensional cancer datasets. In the future, we will adopt a suitable cluster validity index to determine the number of clusters.

One of the limitations of the proposed algorithm is that it does not consider the effectiveness of the pairwise constraints or how to remove redundant constraints which do not contribute to the final result. We should consider these issues in our future work.
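The ensemble idea in property (2) can be sketched in simplified form: draw random feature subspaces, cluster the data in each one, and merge the resulting labelings into one solution. The sketch below does not reproduce the confidence factors or the adaptive subspace search of A-RSEMICE. As a stand-in it uses a plain co-association vote, a common consensus function, and leaves the per-subspace clusterer out. All function names, the threshold, and the toy labelings are assumptions made for this example.

```python
import random


def random_subspaces(n_features, n_subspaces, ratio=0.5, seed=0):
    """Draw feature-index subsets, each holding about `ratio` of the features."""
    rng = random.Random(seed)
    size = max(1, int(ratio * n_features))
    return [sorted(rng.sample(range(n_features), size))
            for _ in range(n_subspaces)]


def co_association(labelings):
    """Fraction of ensemble members that put each pair of samples together."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(labelings)
    return [[c / m for c in row] for row in counts]


def consensus(labelings, threshold=0.5):
    """Group samples that are co-clustered by more than `threshold` of members.

    Pairs above the threshold are linked; connected components of the
    resulting graph become the consensus clusters.
    """
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Each subspace would be passed to a base (semi-supervised) clusterer.
subspaces = random_subspaces(n_features=6, n_subspaces=3)

# Made-up labelings standing in for the three per-subspace clustering results.
labelings = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
print(consensus(labelings))  # [0, 0, 1, 1]
```

Because cluster labels from different subspaces are not aligned (the second labeling swaps 0 and 1), the vote is taken over pairs of samples rather than over the labels themselves.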

 
