PhD Defense - Utility-based Predictive Analytics
In several predictive tasks, the end-user's attention is focused on certain regions of the domain of the target variable.
Unlike standard predictive tasks, where all values of the target variable are equally important, in these tasks the domain has non-uniform importance for the end-user. Many real-world domains, such as finance, meteorology and medicine, contain tasks that fit this non-standard setting. Indeed, in most practical applications there is important domain knowledge that must be accounted for when solving the corresponding predictive task.
In these tasks, the relevance of certain regions of the domain is associated either with high costs or severe consequences, or with significant profits or benefits. Initially, the research community addressed these tasks by developing the theory of cost-sensitive learning, which considers only costs. More recently, this theory evolved into the broader framework of utility-based mining. The utility-based learning setting allows both costs and benefits, which may derive from different domain information, to be considered. Although more complex, the utility-based learning framework is also more intuitive from an end-user perspective and represents domain information more thoroughly.
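The difference between considering only probabilities and considering both costs and benefits can be illustrated with a minimal sketch. The utility matrix, its values, and the function names below are hypothetical and chosen only for illustration; they are not taken from the thesis. Positive entries stand for benefits and negative entries for costs, and the prediction chosen is the one with the highest expected utility.

```python
# Hypothetical utility matrix, indexed as UTILITY[true_class][predicted_class].
# Positive values are benefits, negative values are costs (illustrative only).
UTILITY = [
    [1.0, -2.0],    # true class 0: correct prediction earns 1, false alarm costs 2
    [-10.0, 5.0],   # true class 1: a miss costs 10, a hit earns 5
]


def expected_utilities(probs, utility):
    """Return the expected utility of each possible prediction.

    probs[i] is the estimated probability that the true class is i.
    """
    n_predictions = len(utility[0])
    return [
        sum(p * row[j] for p, row in zip(probs, utility))
        for j in range(n_predictions)
    ]


def best_prediction(probs, utility):
    """Return the index of the prediction with the highest expected utility."""
    scores = expected_utilities(probs, utility)
    return max(range(len(scores)), key=scores.__getitem__)


# Class 0 is the most probable, yet predicting class 1 has higher expected
# utility because missing a true class 1 case is very costly.
choice = best_prediction([0.8, 0.2], UTILITY)
```

With these assumed values, the most probable class and the utility-optimal prediction differ, which is the kind of situation that a purely accuracy-oriented learner cannot capture.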
The first efforts to include costs and benefits in the learning procedure concentrated on tasks with a nominal target variable (classification tasks). Over time, however, it became clear that this learning paradigm is also applicable to regression tasks, where the target variable is continuous. In this thesis we focus on the utility-based learning problem. Because this broader approach is recent, especially with regard to regression tasks, several issues remained open, and we tackled some of them. In particular, we identified and addressed the following main challenges: i) development of a unifying framework for utility-based learning; ii) development of learning methods for optimising utility in regression tasks; and iii) proposal of new pre-processing methods for addressing the problem of learning from imbalanced domains.
The proposed unifying utility-based framework provides a better understanding of the characteristics of these tasks, while integrating both classification and regression problems. Moreover, using this framework we are also able to establish important connections between different predictive problems.
The second challenge is related to the lack of methods for addressing utility-based regression problems. In utility-based learning, if the utility information is not incorporated into the learning procedure, we can only obtain sub-optimal models. To solve this problem, we propose and evaluate new methods that maximise utility in regression. We show that these methods are effective for different utility settings.
The third and final challenge is related to the particular sub-class of utility-based problems known as imbalanced domains. This problem has also been insufficiently studied in the context of regression tasks. We propose and evaluate several approaches to address it.
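As a rough illustration of what a pre-processing approach for an imbalanced regression domain can look like, the sketch below applies random undersampling to the common cases while keeping every rare case. This is a generic technique shown under stated assumptions, not necessarily one of the methods proposed in the thesis: the relevance function, the threshold, the toy data, and all names are hypothetical.

```python
import random


def undersample_common(rows, relevance, threshold=0.5, keep_fraction=0.5, seed=0):
    """Keep every rare case and a random fraction of the common cases.

    rows: list of (features, target) pairs.
    relevance: function mapping a target value to a score in [0, 1];
        cases scoring at or above `threshold` are treated as rare/important.
    """
    rare = [row for row in rows if relevance(row[1]) >= threshold]
    common = [row for row in rows if relevance(row[1]) < threshold]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_keep = round(len(common) * keep_fraction)
    return rare + rng.sample(common, n_keep)


def is_extreme(y):
    """Hypothetical relevance function: targets above 90 are the important ones."""
    return 1.0 if y > 90 else 0.0


# Toy data set with 100 cases whose targets are 0.0, 1.0, ..., 99.0.
rows = [((i,), float(i)) for i in range(100)]
balanced = undersample_common(rows, is_extreme, keep_fraction=0.2)
```

In this toy setting the 9 cases with targets above 90 are all retained, while only a fifth of the remaining 91 cases are kept, so the important region of the target domain is better represented in the training data.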
As a practical outcome of this work, we provide UBL, an R package for utility-based learning that integrates approaches for classification and regression tasks. The UBL package includes the proposals presented in this thesis, as well as many other approaches developed for utility-based classification, providing the research community with a tool for testing and comparing alternative methods for addressing utility-based predictive tasks.