Performance Comparison of Machine Learning Classification Methods in a Liver Cancer Cohort
Abstract
The goal of this paper is to compare the performance of statistical and machine learning classification methods in predicting the death of liver cancer patients from demographic characteristics, risk factors, and medical interventions. For this purpose, five supervised methods were selected: random tree, C4.5, random forest, support vector machine (SVM), and logistic regression. The data are real records of 165 patients diagnosed with liver cancer at a hospital in Portugal. The target variable is the patient's death within the one-year period over which the cohort was monitored. The dataset is heterogeneous, with twenty-six qualitative and twenty-three quantitative variables. Overall, 10.22% of the values are missing, and only eight patients (4.85%) have complete information in every field. The classes are also somewhat imbalanced (63 cases labeled "Dead" and 102 labeled "Alive"). The SVM was the most accurate method, classifying 73.33% of cases correctly, followed by random forest with 71.52%. C4.5 had the lowest accuracy, at 58.18%. Based on the area under the ROC curve, however, random forest (AUC = 0.789) performed better than the SVM (AUC = 0.711). Overall, the SVM and random forest methods outperformed the others by a large margin in predicting the death of patients with liver cancer.
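The reported percentages can be put in context with a short arithmetic sketch. It uses only the figures quoted in the abstract. The majority-class baseline and the back-calculated counts of correct classifications are not reported in the abstract; they are derived here under the assumption that each accuracy was computed over all 165 patients. All variable and function names are illustrative.

```python
# Figures quoted in the abstract.
N_PATIENTS = 165
N_DEAD, N_ALIVE = 63, 102
N_COMPLETE = 8  # patients with no missing fields

# Reported accuracies, in percent.
reported_accuracy = {"SVM": 73.33, "random forest": 71.52, "C4.5": 58.18}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of patients with complete records.
complete_pct = pct(N_COMPLETE, N_PATIENTS)

# Accuracy of always predicting the larger class ("Alive").
# Derived here for context; not a figure from the abstract.
majority_baseline = pct(max(N_DEAD, N_ALIVE), N_PATIENTS)

# Approximate number of correctly classified patients per method,
# assuming each accuracy was computed over all 165 patients.
correct_counts = {
    name: round(acc / 100 * N_PATIENTS)
    for name, acc in reported_accuracy.items()
}

print(f"complete records: {complete_pct}%")
print(f"majority-class baseline: {majority_baseline}%")
for name, count in correct_counts.items():
    print(f"{name}: {count} of {N_PATIENTS} correct")
```

Under these assumptions, always predicting "Alive" would be correct for 102 of the 165 patients (61.82%). This puts the C4.5 result slightly below that baseline and the SVM and random forest results above it, which is why the AUC values are a useful complement to accuracy on an imbalanced cohort.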