Breast cancer is a leading cause of death among women globally, with millions of cases diagnosed each year. Anemia is a common blood disorder affecting millions of people worldwide. Early and reliable diagnosis of both conditions can prevent complications and reduce mortality. Machine learning (ML) algorithms have the potential to contribute significantly to early detection. This study investigates the relationship between the types of features in a dataset and diagnostic accuracy by evaluating three widely used ML algorithms on both breast cancer and anemia diagnosis: Decision Tree, Logistic Regression, and K-Nearest Neighbors (KNN). By comparing the performance of these algorithms on the two tasks, the study demonstrates the potential of ML algorithms in the medical field and highlights the importance of considering the type and quality of predictors when evaluating them. These findings have implications for developing improved tools to diagnose breast cancer and anemia, and they suggest directions for future studies of how these algorithms could improve diagnosis and treatment outcomes.
Anemia is a prevalent and potentially serious health condition that affects individuals of all ages and has significant global health impacts [1]. The World Health Organization (WHO) estimates that 42% of children less than 5 years of age and 40% of pregnant women are anemic [1]. The prevalence of anemia is higher in developing countries, particularly in sub-Saharan Africa and South Asia, where it affects over 40% of the population [1]. The WHO defines anemia as a hemoglobin concentration below 130 g/L in adult men, below 120 g/L in non-pregnant adult women, and below 110 g/L in pregnant women [2]. Studies have shown that anemia is more common in women than in men, with a higher prevalence in pregnant women [2]. Anemia is also more common in older adults, with a higher prevalence in those over the age of 65 [2]. It can lead to fatigue, weakness, and impaired cognitive function, and has been linked to a variety of conditions, including cardiovascular disease, cancer, and infections [3]. Breast cancer is the most common cancer in women worldwide, and its incidence varies greatly by region. It is more prevalent in higher-income countries, and the majority of cases occur in women over the age of 50, while only 0.5% of all breast cancer cases occur in men [4]. Accurate and timely diagnosis and treatment of both anemia and breast cancer are important for improving individual and population health outcomes.
There is a growing body of research on the application of machine learning techniques to the diagnosis and classification of anemia and breast cancer. Previous studies have demonstrated the potential of machine learning algorithms to improve diagnostic accuracy and efficiency, as well as to identify patterns and trends that may be missed by traditional methods. A study by Yildiz et al. used four machine learning methods (Artificial Neural Networks, Support Vector Machines, Naïve Bayes, and Ensemble Decision Tree) as classification algorithms to diagnose 12 different types of anemia [5]. The models were trained and evaluated using a dataset of 1663 samples, comprising hemogram data and general information such as age, sex, chronic diseases, and symptoms, collected from patient files at a university hospital in Turkey. The highest accuracy, 85.6%, was achieved using Bagged Decision Trees, and the validity of the models was assessed through accuracy, classification error, area under the curve, precision, recall, and F-score [5]. In another study, by Naji et al., five algorithms (Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN)) were applied to the Breast Cancer Wisconsin Diagnostic dataset, and their performance was compared using the confusion matrix, accuracy, and precision [6]. The results showed that SVM outperformed the other algorithms, achieving the highest accuracy of 97.2% [6]. These works demonstrate the potential of machine learning techniques in the diagnosis of these diseases and highlight the need for further research in this area.
In this study, we first perform a correlation analysis between the physiological attributes and the diagnosis outcome, independently for breast cancer and anemia. Next, we use classification-based machine learning algorithms to develop a predictive model of the diagnosis outcome for each disease. The same three algorithms are used for both datasets: KNN, Decision Tree, and Logistic Regression. The performance of the models is evaluated using accuracy, sensitivity, and specificity. The models are also compared using a confusion matrix, which displays the number of true positive, true negative, false positive, and false negative predictions made by each model. The algorithms are implemented using Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn, and MLxtend in the Python programming language (version 3.9), with the Anaconda environment (version 2.3.1) serving as the platform for development and execution. The novelty of this study is that anemia and breast cancer diagnosis have not previously been compared using machine learning algorithms in order to understand the relationship between accuracy and the types of features in a dataset. The results can inform medical diagnosis and improve understanding of how feature types affect diagnostic performance for anemia and breast cancer.
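The three evaluation metrics named above all follow from the confusion matrix. The following is a minimal sketch in plain Python, not the code used in the study; the helper names (`build_confusion_matrix`, `accuracy`, `sensitivity_specificity`) and the toy labels are hypothetical. For the five-class anemia task it treats each class one-vs-rest, which is an assumption, since the text does not state how sensitivity and specificity were aggregated across classes.

```python
def build_confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def sensitivity_specificity(matrix, cls):
    """One-vs-rest sensitivity and specificity for class `cls` (assumed scheme)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp                 # true cls, predicted as another class
    fp = sum(row[cls] for row in matrix) - tp  # another class, predicted as cls
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy binary example (hypothetical labels): 0 = cancer free, 1 = cancer.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
cm = build_confusion_matrix(y_true, y_pred, n_classes=2)
acc = accuracy(cm)
sens, spec = sensitivity_specificity(cm, cls=1)
print(cm, acc, sens, spec)
```

The same functions apply unchanged to the five anemia classes by passing `n_classes=5` and calling `sensitivity_specificity` once per class.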
The anemia dataset used in this study was obtained from the Faculty of Medicine, Tokat Gaziosmanpasa University, Turkey [7]. The data contain the complete blood count test results of 15,300 patients, of whom 10,379 were female and 4,921 were male, collected over the 5-year interval between 2013 and 2018. The dataset consists of 1,019 (7%) patients with HGB anemia, 4,182 (27%) patients with iron deficiency, 199 (1%) patients with B12 deficiency, 153 (1%) patients with folate deficiency, and 9,747 (64%) patients who had no anemia. Data from pregnant women, children, and patients with cancer are excluded from the study (for more information on the features and their descriptions, refer to Table 1).
To classify breast tumors as malignant or benign, we use the Breast Cancer Wisconsin (Diagnostic) dataset (BCWD) [7] (for more information on the features and their descriptions, refer to Table 2). The samples are labeled as benign (357 samples) or malignant (212 samples). Both datasets were accessed through the Kaggle website [7]. One of the goals of this study is to determine the extent to which different types of features influence the performance of the algorithms. The anemia dataset contains the patient’s blood measurements, such as HGB and B12 levels (Table 1), whereas the breast cancer dataset contains measurements of the tumor itself (Table 2). Breast cancer classification is binary: class 0 corresponds to being cancer free and class 1 to having cancer. Anemia diagnosis is based on five classes: one for each deficiency, namely Hemoglobin (Class 1), Iron (Class 2), Folate (Class 3), and B12 (Class 4), plus a class for patients without any type of anemia (Class 0). For the anemia dataset only, predictor variables whose correlation with the target variable lies between -0.04 and 0.04 are removed to improve the accuracy of the three models, as suggested in previous studies [8]. The resulting increase in accuracy, sensitivity, and specificity indicates that removing these weakly correlated variables improves the models. This step is not applied to the breast cancer dataset, since removing those predictor variables made little to no difference to the models. The anemia dataset has 15,300 rows, while the breast cancer dataset has 569.
The breast cancer dataset also has slightly more columns, which may contribute to the higher accuracy of breast cancer diagnosis. Most likely, however, the types of features recorded in each dataset are the main reason for the differences in accuracy between the algorithms applied to the two datasets.
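The correlation-based filtering described for the anemia dataset can be sketched with Pandas as follows. This is an illustrative sketch rather than the study's code: the function name `drop_weakly_correlated`, the toy column names, and the toy values are hypothetical, the correlation is assumed to be Pearson's, and the 0.04 boundary is treated as exclusive because the text does not say how values exactly at the threshold are handled.

```python
import pandas as pd


def drop_weakly_correlated(df, target, threshold=0.04):
    """Drop predictors whose |Pearson r| with the target is below `threshold`.

    Returns the filtered frame and the list of dropped column names.
    """
    predictors = df.drop(columns=[target])
    correlations = predictors.corrwith(df[target])
    weak = correlations[correlations.abs() < threshold].index.tolist()
    return df.drop(columns=weak), weak


# Toy frame (hypothetical values): one informative and one uninformative column.
toy = pd.DataFrame(
    {
        "feature_strong": [1.0, 3.0, 1.2, 2.8] * 5,  # tracks the target closely
        "feature_weak": [0, 0, 1, 1] * 5,            # uncorrelated with the target
        "diagnosis": [0, 1, 0, 1] * 5,
    }
)
filtered, dropped = drop_weakly_correlated(toy, target="diagnosis")
print(dropped)
print(list(filtered.columns))
```

On real data the same call would be applied to the anemia feature table before the train-test split; whether the correlations should be computed on the full dataset or on the training portion only is a design choice the text does not specify.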
Table 1. Parameters of the anemia dataset

| Parameter | Description | Unit |
| --- | --- | --- |
| B12 | Vitamin B12 | ng/mL |
| BA | Basophils | 10³/μL |
| EO | Eosinophils | 10³/μL |
| FERRITE | Ferritin | ng/mL |
| FOLATE | Folate | ng/mL |
| GENDER | Female/Male | 0–1 |
| HCT | Hematocrit | % |
| HGB | Hemoglobin | g/dL |
| LY | Lymphocytes | 10³/μL |
| MCH | Mean Corpuscular Hemoglobin | pg |
| MCHC | Mean Corpuscular Hemoglobin Concentration | g/dL |
| MCV | Mean Corpuscular Volume | fL |
| MO | Monocytes | 10³/μL |
| MPV | Mean Platelet Volume | fL |
| NE | Neutrophils | 10³/μL |
| PCT | Plateletcrit | K/uL |
| PDW | Platelet Distribution Width | fL |
| PLT | Platelets | K/uL |
| RBC | Red Blood Cells | Million/ |
| RDW | Red Cell Distribution Width | % |
| SD | Serum Iron | μg/dL |
| SDTSD | (SD/TSD) * 100 | μg/dL |
| TSD | Total Serum Iron | μg/dL |
| WBC | White Blood Cells | 10³/mL |
Table 2. Parameters of the breast cancer dataset

| Parameter | Description |
| --- | --- |
| id | Unique ID |
| diagnosis | Target: M - Malignant, B - Benign |
| radius_mean | Radius of Lobes |
| texture_mean | Mean of Surface Texture |
| perimeter_mean | Outer Perimeter of Lobes |
| area_mean | Mean Area of Lobes |
| smoothness_mean | Mean of Smoothness Levels |
| compactness_mean | Mean of Compactness |
| concavity_mean | Mean of Concavity |
| concave points_mean | Mean of Concave Points |
| symmetry_mean | Mean of Symmetry |
| fractal_dimension_mean | Mean of Fractal Dimension |
| radius_se | SE of Radius |
| texture_se | SE of Texture |
| perimeter_se | SE of Perimeter |
| area_se | SE of Area |
| smoothness_se | SE of Smoothness Levels |
| compactness_se | SE of Compactness |
| concavity_se | SE of Concavity |
| concave points_se | SE of Concave Points |
| symmetry_se | SE of Symmetry |
| fractal_dimension_se | SE of Fractal Dimension |
| radius_worst | Worst Radius |
| texture_worst | Worst Texture |
| perimeter_worst | Worst Perimeter |
| area_worst | Worst Area |
| smoothness_worst | Worst Smoothness |
| compactness_worst | Worst Compactness |
| concavity_worst | Worst Concavity |
| concave points_worst | Worst Concave Points |
| symmetry_worst | Worst Symmetry |
| fractal_dimension_worst | Worst Fractal Dimension |
Figure 1. Overview of a correlation table for variables in the Anemia Dataset after preprocessing
Figure 2. Overview of a correlation table for variables in the Breast Cancer Dataset
Figure 3. Accuracy under the gini and entropy criteria as the parameter ‘max_depth’ is increased in the Decision Tree algorithm (Anemia Dataset)
Figure 4. Accuracy under the gini and entropy criteria as the parameter ‘max_depth’ is increased in the Decision Tree algorithm (Breast Cancer Dataset)
Figure 5. Image of the Decision Tree after we applied the model to the Breast Cancer dataset
Figure 6. Image of the Decision Tree after we applied the model to the Breast Cancer dataset and pruned the decision tree
In this study, three algorithms (K-Nearest Neighbors, Decision Tree, and Logistic Regression) are applied to both the anemia and breast cancer datasets. All three are used with their default hyperparameters, with one exception: the Decision Tree is pruned so that the rendered tree is not illegible with excess nodes and branches (Figure 6). To prune the tree, we test different criterion (gini and entropy) and max_depth values until the number of nodes is small enough for the tree to be legible (Figure 3). For all models, we use an 80/20 train-test split. Pruning is conducted for both the anemia and breast cancer datasets, since the unpruned trees had too many nodes and branches to read, so that conclusions can be drawn by inspecting the image of the tree (Figures 4-5). Because KNN and Logistic Regression are not tuned, they yield lower accuracy, especially on the anemia dataset, where the Decision Tree yields the highest accuracy. No additional tuning is applied because this study intends to assess the models without modifying their default settings, in order to test how well untuned machine learning algorithms perform for anemia and breast cancer diagnosis. It is important to note that the performance of each algorithm can vary depending on the specific dataset and the chosen parameters. Further research is needed to determine the optimal parameters for each algorithm and to compare the performance of these algorithms with other commonly used algorithms.
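The pruning procedure described above, sweeping `criterion` and `max_depth` and inspecting accuracy and tree size, can be sketched with Scikit-learn as follows. This is a sketch under stated assumptions, not the study's code: it uses the copy of the Breast Cancer Wisconsin (Diagnostic) data bundled with Scikit-learn rather than the Kaggle file, and the depth range of 1 to 10, the stratified split, and the fixed `random_state` values are all assumptions made for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Scikit-learn's bundled copy codes malignant as 0 and benign as 1,
# the reverse of the class coding described in the text.
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split as in the text; stratification and the seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Map (criterion, max_depth) -> (test accuracy, number of nodes in the tree).
results = {}
for criterion in ("gini", "entropy"):
    for max_depth in range(1, 11):
        tree = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0
        )
        tree.fit(X_train, y_train)
        results[(criterion, max_depth)] = (
            tree.score(X_test, y_test),
            tree.tree_.node_count,
        )

for (criterion, max_depth), (acc, nodes) in results.items():
    print(f"{criterion:8s} depth={max_depth:2d} accuracy={acc:.3f} nodes={nodes}")
```

Reading the printed table alongside the node counts gives one way to pick the shallowest tree whose accuracy is acceptable, which is the trade-off between legibility and accuracy that the text describes.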
Figure 7. Bar graph of the accuracies, sensitivities, and specificities of the three machine learning algorithms applied to each dataset
The results indicate that performance varies with both the dataset and the algorithm. For the anemia dataset, the Decision Tree algorithm has the highest accuracy, 0.926, followed by Logistic Regression with 0.814 and KNN with 0.691. The Decision Tree algorithm also performs best in terms of sensitivity, with 0.913. This suggests that the features in the anemia dataset may be better suited to the Decision Tree algorithm and that this algorithm may be a good choice for anemia diagnosis, especially when using data with the same type of features. Additionally, we observe that the Decision Tree algorithm is less sensitive to the choice of parameters and performs well without any fine-tuning. For the breast cancer dataset, the KNN algorithm has the highest accuracy, 0.947, followed by Logistic Regression with 0.929 and Decision Tree with 0.903. The KNN algorithm also performs best in terms of sensitivity, with 0.931. This suggests that the features in the breast cancer dataset may be better suited to the KNN algorithm and that this algorithm may be a good choice for breast cancer diagnosis. In terms of specificity, the Logistic Regression algorithm performs well for both datasets. It is worth mentioning that this is a small sample, and it is important to consider other factors such as the size and quality of the dataset, the complexity of the model, and the specific characteristics of the target population. Overall, our study suggests that the Decision Tree algorithm performs best on the anemia dataset, while the KNN algorithm performs best on the breast cancer dataset. Furthermore, Logistic Regression performs well on both datasets, with a good balance between sensitivity and specificity. These findings have implications for the selection of algorithms in research studies involving anemia and breast cancer diagnosis.
The three machine learning-based classification algorithms selected for our analysis are widely used and considered reliable [10]: Decision Tree, Logistic Regression, and K-Nearest Neighbors (KNN). Decision Trees are known for their interpretability, as they provide a clear visualization of the relationships between the predictors and the outcome. Logistic Regression is a popular method for binary classification problems and provides interpretable coefficients that can help identify important predictors [9]. KNN is a simple and effective algorithm that is especially useful when the relationship between the predictors and the outcome is non-linear. These three algorithms have been widely used in previous studies [9-10] and are regarded as reliable for several reasons. They have proven effective across a wide range of classification problems, which makes them popular choices for researchers and practitioners. They are relatively simple to understand and implement, which makes them accessible to a wide range of users. They can handle continuous, categorical, and mixed data. Finally, they have been widely tested and validated in a variety of contexts, which has demonstrated their robustness and reliability in practice [10].
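As an illustration of how the three algorithms can be compared with their default settings, the sketch below fits each one on an 80/20 split and reports accuracy, sensitivity, and specificity. It is not the study's code and is not expected to reproduce the reported figures: it uses Scikit-learn's bundled copy of the breast cancer data, and the seed, the stratified split, and the recoding of the labels so that class 1 means malignant are assumptions. Logistic Regression with its default iteration limit may not fully converge on unscaled features, so the corresponding warning is silenced here to keep the model at its defaults.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # bundled data codes malignant as 0; recode so that 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Default hyperparameters; random_state only fixes the tree's tie-breaking.
models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
}

scores = {}
for name, model in models.items():
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp for positive class 1.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

for name, metrics in scores.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because KNN relies on distances, its result on unscaled features depends on the measurement units of each column; standardizing the features would be a natural extension but goes beyond the default-settings comparison sketched here.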
Although all three algorithms are widely used and considered reliable, some of them work better on certain datasets than others, and there are several possible reasons for this. The anemia dataset has 15,300 rows, while the breast cancer dataset has 569, so the anemia dataset contains significantly more data. Despite this, the models yield higher accuracies on the breast cancer dataset, which we attribute to the types of features in that dataset, such as Worst Area or Worst Smoothness, and to the number of features it contains. The correlation between the measurements of a breast tumor and the patient's diagnosis is most likely stronger than the correlation between measurements from a blood sample and the patient’s anemia diagnosis (Figures 1-2). We presume that this difference in correlation strength arises because the two datasets record different types of features, and that tumor measurements are better suited to the classification task than blood measurements, leading to better validity measurements.
While our study provides insight into the performance of the Decision Tree, Logistic Regression, and KNN algorithms for classification, it is important to acknowledge its limitations. First, the patients in each dataset may be drawn from a single nationality; because nationality is not recorded as a feature in either dataset, this cannot be verified. It would be valuable to expand the study to include patients from a more diverse range of backgrounds in future work. Second, the results are based on a specific set of predictors, and it would be interesting to explore how adding predictors affects the performance of the algorithms. Comparing different classification algorithms on larger, more diverse datasets could help identify the strengths and weaknesses of each algorithm and lead to improved algorithms in the future. Investigating the impact of adding predictors, or of incorporating different types of predictors, could help improve predictive power. In addition, combining the strengths of different algorithms into hybrid models (e.g. combining decision trees with neural networks) could lead to improved performance in specific contexts [11]. To learn more about these algorithms and their performance, future studies could employ a variety of methods, including large-scale simulations, controlled experiments, and real-world case studies. Model-interpretation techniques such as partial dependence plots, permutation importance, and visualizations could also help shed light on the relationships between the predictors and the outcome [12].
In this study, we evaluated the performance of three popular classification algorithms (Decision Tree, Logistic Regression, and KNN) in the context of breast cancer and anemia diagnosis. We measured the accuracy, sensitivity, and specificity of each algorithm and found that they performed well, with some variation between algorithms. The study is novel in that it compares the performance of these algorithms across both diagnoses and offers interpretations of why the validity measurements vary. One of the key findings is that the difference in accuracy between the algorithms could be due to the different types of features included in each dataset. This highlights the importance of considering the type and quality of the predictors when evaluating the performance of classification algorithms. Future studies have many opportunities to improve the efficiency and usability of these models. Our results suggest that these algorithms could be useful tools for assisting healthcare professionals in making informed decisions about patient diagnosis and treatment.