The classification of cancer patients into risk classes is a very active field of research, with direct clinical applications. We have recently compared several machine learning methods on the well known 70-genes signature dataset. In that study, genetic programming showed promising results, given that it outperformed all the other techniques. Nevertheless, the study was preliminary, mainly because the validation dataset was preprocessed and all its features binarized in order to use logical operators for the genetic programming functional nodes. If this choice allowed simple interpretation of the solutions from the biological viewpoint, on the other hand the binarization of data was limiting, since it amounts to a sizable loss of information. The goal of this paper is to overcome this limitation, using the 70-genes signature dataset with real-valued expression data. The results we present show that genetic programming using the number of incorrectly classified instances as fitness function is not able to outperform the other machine learning methods. However, when a weighted average between false positives and false negatives is used to calculate fitness values, genetic programming obtains performances that are comparable with the other methods in the minimization of incorrectly classified instances and outperforms all the other methods in the minimization of false negatives, which is one of the main goals in breast cancer clinical applications. Also in this case, the solutions returned by genetic programming are simple, easy to understand, and they use a rather limited subset of the available features.
Farinaccio, A., Giacobini, M., Mauri, G., Provero, P., Vanneschi, L. (2010). On the Use of Genetic Programming for the Prediction of Survival in Cancer. In Gecco 2010. Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (pp.163-170). New York : ACM Press [10.1145/1830483.1830514].
On the Use of Genetic Programming for the Prediction of Survival in Cancer
FARINACCIO, ANTONELLA;MAURI, GIANCARLO;VANNESCHI, LEONARDO
2010
Abstract
The classification of cancer patients into risk classes is a very active field of research, with direct clinical applications. We have recently compared several machine learning methods on the well known 70-genes signature dataset. In that study, genetic programming showed promising results, given that it outperformed all the other techniques. Nevertheless, the study was preliminary, mainly because the validation dataset was preprocessed and all its features binarized in order to use logical operators for the genetic programming functional nodes. If this choice allowed simple interpretation of the solutions from the biological viewpoint, on the other hand the binarization of data was limiting, since it amounts to a sizable loss of information. The goal of this paper is to overcome this limitation, using the 70-genes signature dataset with real-valued expression data. The results we present show that genetic programming using the number of incorrectly classified instances as fitness function is not able to outperform the other machine learning methods. However, when a weighted average between false positives and false negatives is used to calculate fitness values, genetic programming obtains performances that are comparable with the other methods in the minimization of incorrectly classified instances and outperforms all the other methods in the minimization of false negatives, which is one of the main goals in breast cancer clinical applications. Also in this case, the solutions returned by genetic programming are simple, easy to understand, and they use a rather limited subset of the available features.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.