Probability estimation trees: Empirical comparison, algorithm extension and applications
Description
In the framework of statistical supervised learning, posterior probability estimation is of crucial interest. The major difficulty for the learning process is that the true class membership probabilities are unknown in many real-world problems. Among the various learning algorithms, decision tree induction is one of the most popular and effective techniques. However, decision trees have been observed to be poor posterior probability estimators, and this weakness impedes their use in real-world contexts where accurate probability estimates are needed. Although a variety of probability estimation tree (PET) algorithms have been proposed to address this problem, an exhaustive, unbiased empirical study of the efficacy of these algorithms with respect to the characteristics of a given domain, measured by multiple evaluation metrics, is still lacking. A practical guide to help practitioners and researchers choose the most appropriate PET algorithm for the dataset at hand is also missing. Moreover, even though some of the proposed algorithms perform quite satisfactorily, can new algorithms be developed for better estimation? In this spirit, this dissertation systematically addresses the problem of improving probability estimation and ranking through decision tree induction. In addition, the effectiveness of PETs on real-world data mining problems is demonstrated through ozone level detection.
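To illustrate why raw decision trees estimate probabilities poorly, consider the scores they produce at the leaves. A plain tree reports the class frequency of the training samples in a leaf, which for small or pure leaves collapses to a degenerate 0 or 1. One common smoothing technique studied in the PET literature is the Laplace correction. The sketch below is illustrative only (the function name and the two-class counts are hypothetical, not from the dissertation):

```python
def leaf_probability(counts, laplace=True):
    """Posterior estimate for each class at a leaf, given its class counts.

    With laplace=True, applies the Laplace correction (c + 1) / (n + k),
    where n is the number of samples in the leaf and k the number of classes.
    """
    n, k = sum(counts), len(counts)
    if laplace:
        return [(c + 1) / (n + k) for c in counts]  # smoothed estimate
    return [c / n for c in counts]                  # raw leaf frequency

# A pure leaf holding 3 positives and 0 negatives: the raw frequency is a
# degenerate 1.0, while the Laplace-corrected estimate is a softer
# (3 + 1) / (3 + 2) = 0.8, which also breaks ties when ranking examples.
print(leaf_probability([3, 0], laplace=False))  # [1.0, 0.0]
print(leaf_probability([3, 0]))                 # [0.8, 0.2]
```

The correction matters most for ranking: many leaves share the raw score 1.0, whereas smoothed scores differ with leaf size, so larger pure leaves rank above smaller ones.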