A comparative study of clustering and classification algorithms
Description
Clustering and classification are two of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. Clustering is the process of organizing unlabeled objects into groups whose members are similar in some way. It is a form of unsupervised learning: it does not use category labels when grouping objects. In semi-supervised clustering, some prior knowledge is available, either as labeled data or as pair-wise constraints on some of the objects. Classification is a form of supervised learning that assigns class labels to objects. A classifier is constructed from labeled training data using a chosen classification algorithm and is then used to predict the class labels of the test data. This dissertation presents the results of a comprehensive comparative study of three kinds of clustering algorithms: Co-Clustering, Consensus-based Clustering and Semi-supervised Clustering. Their performance was compared and analyzed through experiments on artificial datasets with different data substructures and on UCI datasets. A method was proposed to combine a Co-Clustering algorithm with a Semi-supervised Clustering algorithm. A comprehensive comparative study was also conducted on three kinds of classification algorithms: Logistic Regression Classifier, Support Vector Machine and Decision Tree. Experiments were carried out on different artificial datasets and UCI datasets to analyze and compare their classification performance. A method based on controlling the False Discovery Rate was proposed for selecting important features in the Logistic Regression Classifier. A detailed proof was developed to show that the False Discovery Rate can be controlled by controlling the related p-value. Experiments were also conducted to compare classification performance when using the proposed feature selection algorithm.
Keywords
Classification, Clustering, Semi-supervised Clustering, Feature Selection
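Illustrative sketch
The description above states that the False Discovery Rate can be controlled through the related p-values, but it does not give the procedure itself. As a rough illustration only, the following Python sketch uses the standard Benjamini-Hochberg step-up procedure, which is an assumption here and may differ from the method developed in the dissertation. The function name benjamini_hochberg, the level q and the example p-values are all hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return sorted indices of features kept by the Benjamini-Hochberg
    step-up procedure at target false discovery rate q.

    Assumption: this is the textbook procedure, not necessarily the
    dissertation's own method.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Rank the p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Keep every feature ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values, one per logistic regression coefficient.
p_values = [0.001, 0.045, 0.025, 0.2, 0.008]
selected = benjamini_hochberg(p_values, q=0.05)
print(selected)
```

In this sketch a feature is kept when its p-value falls at or below a rank-dependent threshold, so the selection rule is expressed entirely in terms of p-values, which is the kind of relationship the description refers to.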