Todeschini, R., Consonni, V., Mauri, A., Pavan, M. (2004). Detecting "bad" regression models: multicriteria fitness functions in regression analysis. Analytica Chimica Acta, 515(1), 199-208 [10.1016/j.aca.2003.12.010].
Detecting "bad" regression models: multicriteria fitness functions in regression analysis
TODESCHINI, ROBERTO; CONSONNI, VIVIANA
2004
Abstract
Regression models with good fit but no predictive ability are sometimes chance correlations and often show pathological features such as multicollinearity, overfitting, and inclusion of noisy or spurious variables. This problem is well known and of the utmost importance. The present paper proposes criteria that must be fulfilled as conditions for model acceptability, the aim being to recognize pathological linear regression models. These criteria were designed to address the following problems: model instability due to outliers and influential objects; predictor multicollinearity; redundancy in explanatory variables; overfitting due to chance factors. A multicriteria fitness function based on the maximization of the Q<sup>2</sup> statistic under a set of tests is proposed here. This new fitness function can also be used in model searching by variable selection approaches in order to obtain a final optimal population of models. Computations on the Selwood data set are reported to illustrate the use of this multicriteria fitness function in model searching. © 2003 Elsevier B.V. All rights reserved.
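To make the idea of "maximizing Q<sup>2</sup> under a set of tests" concrete, the following is a minimal sketch rather than the authors' method. It assumes Q<sup>2</sup> is the usual leave-one-out statistic, 1 - PRESS/TSS, for an ordinary least squares model with intercept. The function names (`q2_loo`, `r2_fit`, `fitness`), the single acceptance test on the gap between R<sup>2</sup> and Q<sup>2</sup>, and the threshold `max_gap` are all hypothetical placeholders; the abstract does not specify the paper's actual tests.

```python
import numpy as np

def _design(X, y):
    """Build a design matrix with an intercept column (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    return A, y

def r2_fit(X, y):
    """Ordinary R^2 of the least squares fit: 1 - RSS/TSS."""
    A, y = _design(X, y)
    fitted = A @ (np.linalg.pinv(A) @ y)
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

def q2_loo(X, y):
    """Leave-one-out Q^2 = 1 - PRESS/TSS, using the hat-matrix shortcut.

    Assumes no object has leverage exactly 1 (otherwise 1 - h is zero).
    """
    A, y = _design(X, y)
    H = A @ np.linalg.pinv(A)              # hat matrix
    h = np.diag(H)                         # leverages
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - h)  # leave-one-out prediction errors
    press = np.sum(loo_residuals ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / tss

def fitness(X, y, max_gap=0.3):
    """Hypothetical fitness: Q^2 if the model passes an acceptance test.

    The test used here (R^2 - Q^2 <= max_gap) is only a stand-in for the
    paper's criteria; rejected models get -inf so a search discards them.
    """
    q2 = q2_loo(X, y)
    if r2_fit(X, y) - q2 > max_gap:
        return -np.inf
    return q2
```

In a variable selection search, a function like `fitness` would be evaluated on each candidate subset of predictor columns, so that only models passing every test compete on Q<sup>2</sup>.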