Deep Science Innovation

Abstract

Breast cancer is a leading cause of death among women globally, with millions of cases diagnosed each year. Anemia is a common blood disorder affecting millions of people worldwide. Early and reliable diagnosis of both conditions can prevent complications and reduce the number of deaths they cause. Machine learning (ML) algorithms have the potential to contribute significantly to early detection, thus saving lives. This study investigates the relationship between the types of features in a dataset and the accuracy of a diagnosis by evaluating three widely used ML algorithms on both breast cancer and anemia diagnosis: Decision Tree, Logistic Regression, and K-Nearest Neighbors (KNN). By comparing the performance of these algorithms across the two diseases, the study demonstrates the potential of ML algorithms in the medical field and highlights the importance of considering the type and quality of predictors when evaluating them. These findings have implications for developing improved tools to diagnose breast cancer and anemia, and they suggest directions for future studies of how these algorithms can improve diagnosis and treatment outcomes.

Introduction

Anemia is a prevalent and potentially serious health condition that affects individuals of all ages and has significant global health impacts [1]. The World Health Organization (WHO) estimates that 42% of children under 5 years of age and 40% of pregnant women are anemic [1]. Prevalence is higher in developing countries, particularly in sub-Saharan Africa and South Asia, where anemia affects over 40% of the population [1]. The WHO defines anemia as a hemoglobin concentration below 130 g/L in adult men and below 120 g/L in non-pregnant adult women [2]. Studies have shown that anemia is more common in women than in men, with a higher prevalence in pregnant women [2]. Anemia is also more common in older adults, with a higher prevalence in those over the age of 65 [2]. It can lead to fatigue, weakness, and impaired cognitive function, and has been linked to a variety of conditions, including cardiovascular disease, cancer, and infections [3]. Breast cancer is the most common cancer in women worldwide, and its incidence varies greatly by region. It is more prevalent in higher-income countries, and the majority of cases occur in women over the age of 50, while only 0.5% of all breast cancer cases occur in men [4]. Accurate and timely diagnosis and treatment of both anemia and breast cancer are important for improving individual and population health outcomes.

There is a growing body of research on the application of machine learning techniques to the diagnosis and classification of anemia and breast cancer. Previous studies have demonstrated the potential of machine learning algorithms to improve diagnostic accuracy and efficiency, as well as to identify patterns and trends that may be missed by traditional methods. A study by Yildiz et al. used four artificial learning methods (Artificial Neural Networks, Support Vector Machines, Naïve Bayes, and Ensemble Decision Tree) as classification algorithms to diagnose 12 different types of anemia [5]. The models were trained and evaluated on a dataset of 1663 samples, comprising hemogram data and general information such as age, sex, chronic diseases, and symptoms, collected from patient files at a university hospital in Turkey. The highest accuracy, 85.6%, was achieved using Bagged Decision Trees, and the validity of the models was assessed through accuracy, classification error, area under the curve, precision, recall, and F-score [5]. In another study, Naji et al. applied five algorithms, Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN), to the Breast Cancer Wisconsin Diagnostic dataset and compared their performance using the confusion matrix, accuracy, and precision [6]. SVM outperformed the other algorithms, achieving the highest accuracy of 97.2% [6]. These works demonstrate the potential of machine learning techniques in the diagnosis of these diseases and highlight the need for further research in this area.

In this study, we first perform a correlation analysis between physiological attributes and the diagnosis outcome, independently for breast cancer and anemia. Next, we use classification-based machine learning algorithms to develop a predictive model of the diagnosis outcome for each disease. The same three algorithms are used for both datasets: KNN, Decision Tree, and Logistic Regression. The performance of the models is evaluated using accuracy, sensitivity, and specificity. The models are also compared using a confusion matrix, which displays the number of true positive, true negative, false positive, and false negative predictions made by each model. The algorithms are implemented using Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn, and MLxtend in the Python programming language (version 3.9), with the Anaconda environment (version 2.3.1) serving as the platform for development and execution. The novelty of this study is that, to our knowledge, anemia and breast cancer diagnosis have not previously been compared using machine learning algorithms in order to understand the relationship between accuracy and the types of features in a dataset. The results of this study can support medical diagnosis and improve understanding of how the diagnosis outcomes of anemia and breast cancer relate to the features used to predict them.
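The evaluation metrics named above can be derived directly from a confusion matrix. The sketch below is illustrative only: the study does not state how sensitivity and specificity were computed for the five-class anemia task, so a one-vs-rest convention is assumed here, and the function names and the example matrix are hypothetical.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> Dict[int, Dict[str, float]]:
    """One-vs-rest sensitivity and specificity for each class.

    cm[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp   # another class, predicted as k
        tn = total - tp - fn - fp
        metrics[k] = {
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        }
    return metrics


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


# Hypothetical binary confusion matrix (rows: true 0/1, columns: predicted 0/1).
example_cm = [[50, 5],
              [10, 35]]
example_metrics = per_class_metrics(example_cm)
example_accuracy = accuracy(example_cm)
```

For a binary task, the entry for class 1 gives the usual sensitivity and specificity. For a multi-class task, the per-class values could be averaged, although which averaging scheme is appropriate depends on the class balance.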

Results

The anemia dataset used in this study was obtained from the Faculty of Medicine, Tokat Gaziosmanpasa University, Turkey [7]. The data contain the complete blood count test results of 15,300 patients, of whom 10,379 were female and 4,921 were male, collected over the 5-year interval between 2013 and 2018. The dataset consists of 1,019 (7%) patients with HGB anemia, 4,182 (27%) patients with iron deficiency, 199 (1%) patients with B12 deficiency, 153 (1%) patients with folate deficiency, and 9,747 (64%) patients with no anemia. Data from pregnant women, children, and patients with cancer are excluded from the study (for more information on the features and their descriptions, refer to Table 1).

To classify breast tumors as malignant or benign, we use the Breast Cancer Wisconsin (Diagnostic) dataset (BCWD) [7] (for more information on the features and their descriptions, refer to Table 2). The samples are labeled as benign (357 samples) or malignant (212 samples). Both datasets were accessed through the Kaggle website [7]. One of the goals of this study is to determine the extent to which different types of features influence the performance of the algorithms. The anemia dataset contains patients’ blood measurements, such as HGB and B12 levels (Table 1), whereas the breast cancer dataset contains measurements of the tumor itself (Table 2).

Breast cancer classification is binary: class 0 corresponds to being cancer free and class 1 to having cancer. Anemia diagnosis uses five classes: one for each deficiency, namely hemoglobin (Class 1), iron (Class 2), folate (Class 3), and B12 (Class 4), plus a class for patients without any type of anemia (Class 0). For the anemia dataset only, predictor variables whose correlation with the target variable lies between -0.04 and 0.04 are removed, as suggested in previous studies [8]; the resulting increase in accuracy, sensitivity, and specificity indicates that removing these weakly correlated variables improves the three models. This step is not applied to the breast cancer dataset, where removing such predictors made little to no difference.

The anemia dataset has 15,300 rows, while the breast cancer dataset has 569. The breast cancer dataset also has slightly more columns, which may contribute to the higher accuracy of breast cancer diagnosis. Most likely, however, the types of features recorded in each dataset are the main reason for the differences in accuracy between the algorithms applied to the two datasets.
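The correlation-based filtering step can be sketched as follows. This is a minimal illustration rather than the study's code: it assumes Pearson correlation against the numerically encoded class label, treats the 0.04 boundary as exclusive, and uses hypothetical column names and toy values.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def drop_weak_predictors(
    columns: Dict[str, List[float]],
    target: List[float],
    threshold: float = 0.04,
) -> Dict[str, List[float]]:
    """Keep only predictors whose |correlation with target| exceeds threshold."""
    return {
        name: values
        for name, values in columns.items()
        if abs(pearson(values, target)) > threshold
    }


# Toy data (hypothetical): feature_a tracks the label, feature_b does not.
toy_target = [0, 0, 1, 1, 2, 2]
toy_columns = {
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [1, 2, 5, 7, 2, 1],
}
kept = drop_weak_predictors(toy_columns, toy_target)
print(list(kept))
```

In this toy example only `feature_a` survives, because `feature_b` has zero linear correlation with the label. A low linear correlation does not rule out a non-linear relationship, which is one reason the effect of such filtering should be checked empirically, as was done in the study.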

Table 1. Definitions of Parameters for the Anemia Dataset

Parameter	Description	Unit
B12	Vitamin B12	ng/mL
BA	Basophils	10^3/μL
EO	Eosinophils	10^3/μL
FERRITE	Ferritin	ng/mL
FOLATE	Folate	ng/mL
GENDER	Female/Male	0–1
HCT	Hematocrit	%
HGB	Hemoglobin	g/dL
LY	Lymphocytes	10^3/μL
MCH	Mean Corpuscular Hemoglobin	pg
MCHC	Mean Corpuscular Hemoglobin Concentration	g/dL
MCV	Mean Corpuscular Volume	fL
MO	Monocytes	10^3/μL
MPV	Mean Platelet Volume	fL
NE	Neutrophils	10^3/μL
PCT	Plateletcrit	K/μL
PDW	Platelet Distribution Width	fL
PLT	Platelets	K/μL
RBC	Red Blood Cells	Million/
RDW	Red Cell Distribution Width	%
SD	Serum Iron	μg/dL
SDTSD	(SD/TSD) * 100	μg/dL
TSD	Total Serum Iron	μg/dL
WBC	White Blood Cells	10^3/mL

Table 2. Definitions of Parameters for the Breast Cancer Dataset

Parameter	Description
id	Unique ID
diagnosis	Target: M - Malignant B - Benign
radius_mean	Radius of Lobes
texture_mean	Mean of Surface Texture
perimeter_mean	Outer Perimeter of Lobes
area_mean	Mean Area of Lobes
smoothness_mean	Mean of Smoothness Levels
compactness_mean	Mean of Compactness
concavity_mean	Mean of Concavity
concave points_mean	Mean of Concave Points
symmetry_mean	Mean of Symmetry
fractal_dimension_mean	Mean of Fractal Dimension
radius_se	SE of Radius
texture_se	SE of Texture
perimeter_se	SE of Perimeter
area_se	SE of Area
smoothness_se	SE of Smoothness Levels
compactness_se	SE of Compactness
concavity_se	SE of Concavity
concave points_se	SE of Concave Points
symmetry_se	SE of Symmetry
fractal_dimension_se	SE of Fractal Dimension
radius_worst	Worst Radius
texture_worst	Worst Texture
perimeter_worst	Worst Perimeter
area_worst	Worst Area
smoothness_worst	Worst Smoothness
compactness_worst	Worst Compactness
concavity_worst	Worst Concavity
concave points_worst	Worst Concave Points
symmetry_worst	Worst Symmetry
fractal_dimension_worst	Worst Fractal Dimension

Figure 1. Overview of a correlation table for variables in the Anemia Dataset after preprocessing

Figure 2. Overview of a correlation table for variables in the Breast Cancer Dataset

Figure 3. Overview of accuracy and comparison between the gini and entropy criteria as the parameter ‘max_depth’ is increased in the Decision Tree algorithm (Anemia Dataset)

Figure 4. Overview of accuracy and comparison between the gini and entropy criteria as the parameter ‘max_depth’ is increased in the Decision Tree algorithm (Breast Cancer Dataset)

Figure 5. Image of the Decision Tree after we applied the model to the Breast Cancer dataset

Figure 6. Image of the Decision Tree after we applied the model to the Breast Cancer dataset and pruned the decision tree

In this study, three algorithms (K-Nearest Neighbors, Decision Tree, and Logistic Regression) are applied to both the anemia and breast cancer datasets. All three are used with their default settings, with one exception: the Decision Tree is pruned so that its image is not rendered illegible by excess nodes and branches (Figure 6). To prune the tree, we test both split criteria (gini and entropy) across a range of max_depth values until the number of nodes is small enough for the tree to be legible (Figure 3). Pruning is conducted for both the anemia and breast cancer models, since the unpruned trees had too many nodes and branches to draw conclusions from by inspecting the image of the tree (Figures 4-5). For all models, we use an 80/20 train-test split.

Because no parameters are tuned for KNN or Logistic Regression, they yield lower accuracy, especially on the anemia dataset, whereas the Decision Tree yields the highest accuracy for anemia diagnosis. No additional parameters are set because this study intends to assess the validity of the models as they perform out of the box, in order to test how effective untuned machine learning algorithms are for anemia and breast cancer diagnosis. The performance of each algorithm can vary depending on the specific dataset and the chosen parameters. Further research is needed to determine the optimal parameters for each algorithm and to compare the performance of these algorithms with other commonly used algorithms.
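The search over split criterion and max_depth described above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions: scikit-learn's bundled copy of the Wisconsin Diagnostic data stands in for the Kaggle file, and the depth range, random seed, and stratified split are choices made for this illustration, not values reported in the study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the Wisconsin Diagnostic data (assumed stand-in for the Kaggle CSV).
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split; the seed and stratification are assumptions of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in range(1, 11):
        tree = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0
        )
        tree.fit(X_train, y_train)
        results[(criterion, max_depth)] = {
            "accuracy": tree.score(X_test, y_test),
            "n_nodes": tree.tree_.node_count,  # proxy for how legible the drawn tree is
        }

for (criterion, max_depth), row in sorted(results.items()):
    print(f"{criterion:8s} depth={max_depth:2d} "
          f"acc={row['accuracy']:.3f} nodes={row['n_nodes']}")
```

Reading the table for the shallowest depth at which accuracy stops improving gives a tree that is small enough to inspect. Because the depth is chosen using the test split here, the reported accuracy is optimistic; a separate validation split or cross-validation would give a less biased estimate.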

Figure 7. Bar graph representing the accuracies, sensitivities, and specificities of the three machine learning algorithms applied to each dataset

The results indicate that performance varies with both the dataset and the algorithm. For the anemia dataset, the Decision Tree algorithm has the highest accuracy (0.926), followed by Logistic Regression (0.814) and KNN (0.691). The Decision Tree also has the highest sensitivity (0.913). This suggests that the features in the anemia dataset are well suited to the Decision Tree algorithm, which may be a good choice for anemia diagnosis, especially with data that have the same types of features. We also observe that the Decision Tree is less sensitive to the choice of parameters and performs well without fine-tuning. For the breast cancer dataset, KNN has the highest accuracy (0.947), followed by Logistic Regression (0.929) and Decision Tree (0.903). KNN also has the highest sensitivity (0.931), which suggests that the features in the breast cancer dataset are well suited to KNN and that it may be a good choice for breast cancer diagnosis. In terms of specificity, Logistic Regression performs well on both datasets. These results come from a small sample, and other factors should be considered, such as the size and quality of the dataset, the complexity of the model, and the specific characteristics of the target population. Overall, our study suggests that the Decision Tree performs best on the anemia dataset, while KNN performs best on the breast cancer dataset. Logistic Regression performs well on both datasets, with a good balance between sensitivity and specificity. These findings can inform the selection of algorithms in research studies involving anemia and breast cancer diagnosis.
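A comparison of this kind can be reproduced in outline as follows. The sketch is not the study's code and is not expected to reproduce the reported figures: it uses scikit-learn's bundled copy of the Wisconsin Diagnostic data, an assumed random seed and stratified split, and raises `max_iter` for Logistic Regression so the solver converges on unscaled features, which is a departure from pure defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data.data
# scikit-learn encodes malignant as 0; flip so that 1 = cancer, as in the text.
y = 1 - data.target

# 80/20 split; the seed and stratification are assumptions of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    # max_iter is raised only to help the solver converge (assumption, not from the study).
    "Logistic Regression": LogisticRegression(max_iter=10000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    scores[name] = {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

for name, s in scores.items():
    print(name, {k: round(float(v), 3) for k, v in s.items()})
```

Which model ranks first can change with the split, so a single 80/20 partition should be read as one sample of performance rather than a stable ranking.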

Discussion

The three machine learning classification algorithms selected for this study, Decision Tree, Logistic Regression, and K-Nearest Neighbors (KNN), are widely used and considered reliable [10]. Decision Trees are known for their interpretability, as they provide a clear visualization of the relationships between the predictors and the outcome. Logistic Regression is a popular method for binary classification problems and provides interpretable coefficients that can help identify important predictors. KNN is a simple and effective algorithm that is especially useful when the relationship between the predictors and the outcome is non-linear. These algorithms have been applied in previous studies [9-10] and are regarded as reliable for several reasons. They have proven effective on a wide range of classification problems, making them popular choices for researchers and practitioners. They are relatively simple to understand and implement, which makes them accessible to many users. They can handle continuous, categorical, and mixed data. Decision Trees and Logistic Regression in particular yield interpretable models that offer insight into the relationships between the predictors and the outcome [9]. Finally, these algorithms have been tested and validated in a variety of contexts, which has demonstrated their robustness and reliability in practice [10].

Although all three algorithms are widely used and considered reliable, some work better on certain datasets than on others, and there are several possible reasons. The anemia dataset has 15,300 rows while the breast cancer dataset has 569, so the anemia dataset has significantly more data. Even so, the breast cancer models may yield higher accuracies because of the types of features in that dataset, such as Worst Area or Worst Smoothness, and the number of features. The correlation between the measurements of a breast tumor and the patient’s diagnosis is most likely stronger than the correlation between measurements from a blood sample and the patient’s anemia diagnosis (Figures 1-2). We presume that this difference in correlation strength arises because the two datasets record different types of features, and that one type is better suited to predicting its diagnosis than the other, leading to better validity measurements.

There are many possible interpretations of why some models performed better than others and why performance varies by dataset. While our study provides insight into the performance of the Decision Tree, Logistic Regression, and KNN algorithms for classification, it has limitations. First, the patients in each dataset may be limited to a specific nationality, and because nationality is not a feature in either dataset, this cannot be checked; future work should include patients from a more diverse range of backgrounds. Second, the results are based on a specific set of predictors, and it would be interesting to explore how adding predictors affects the performance of the algorithms. Comparing classification algorithms on larger, more diverse datasets could help identify the strengths and weaknesses of each algorithm and lead to improved algorithms. Adding predictors or incorporating different types of predictors could improve predictive power, and combining the strengths of different algorithms into hybrid models (e.g., combining decision trees with neural networks) could improve performance in specific contexts [11]. To learn more about these algorithms and their performance, future studies could employ large-scale simulations, controlled experiments, and real-world case studies. In addition, model interpretation tools such as partial dependence plots, permutation importance, and related visualizations could help shed light on the relationships between the predictors and the outcome [12].
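As an illustration of one interpretation tool mentioned above, permutation importance measures how much a model's test accuracy drops when the values of a single feature are shuffled. The sketch below is a possible starting point for such future work, not part of the study: the dataset copy, the model, the seed, and the number of repeats are all assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0, stratify=data.target
)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature column in turn and record the mean drop in test accuracy.
result = permutation_importance(
    tree, X_test, y_test, n_repeats=10, random_state=0
)

# Rank features from most to least important.
ranking = sorted(
    zip(data.feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking[:5]:
    print(f"{name:25s} {importance:.3f}")
```

Features whose shuffling barely changes accuracy contribute little to this particular model, which offers a model-based complement to the correlation tables in Figures 1-2.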

Conclusion

In this study, we evaluated three popular classification algorithms (Decision Tree, Logistic Regression, and KNN) for breast cancer diagnosis and anemia diagnosis. We measured the accuracy, sensitivity, and specificity of each algorithm and found that they performed well, with some variation between algorithms. The study's novel contribution is that it compares the performance of these algorithms across both diagnoses and offers interpretations of why the validity measurements of the models vary. A key finding is that the difference in accuracy between the algorithms could be due to the different types of features included in each dataset. This highlights the importance of considering the type and quality of the predictors when evaluating classification algorithms. Future studies could improve the efficiency and usability of these models through techniques such as parameter tuning, additional predictors, and hybrid models. Our results suggest that these algorithms could become useful tools for assisting healthcare professionals in making informed decisions about patient diagnosis and treatment.

References

[1] Anaemia. 12 Nov. 2019, www.who.int/health-topics/anaemia.
[2] Anaemia. www.who.int/data/nutrition/nlis/info/anaemia.
[3] Woodman, Richard, et al. “Anemia in Older Adults.” Current Opinion in Internal Medicine, vol. 4, no. 3, Ovid Technologies (Wolters Kluwer Health), June 2005, pp. 261–66. https://doi.org/10.1097/01.moh.0000154030.13020.85.
[4] Breast Cancer. 26 Mar. 2021, www.who.int/news-room/fact-sheets/detail/breast-cancer.
[5] Yildiz, Tuba, et al. “Classifying Anemia Types Using Artificial Learning Methods.” Engineering Science and Technology, an International Journal, vol. 24, no. 1, 2021, pp. 50–70, https://doi.org/10.1016/j.jestch.2020.12.003. Accessed 24 Dec. 2022.
[6] Naji, Mohammed, et al. “Machine Learning Algorithms For Breast Cancer Prediction And Diagnosis.” Procedia Computer Science, vol. 191, 2021, https://doi.org/10.1016/j.procs.2021.07.062. Accessed 24 Dec. 2022.
[7] Kaggle: Your Machine Learning and Data Science Community. www.kaggle.com.
[8] R. Gupta, N. Koli, N. Mahor and N. Tejashri, "Performance Analysis of Machine Learning Classifier for Predicting Chronic Kidney Disease," 2020 International Conference for Emerging Technology (INCET), Belgaum, India, 2020, pp. 1-4, https://doi.org/10.1109/INCET49848.2020.9154147.
[9] Peng, Junfeng, et al. “An Explainable Artificial Intelligence Framework for the Deterioration Risk Prediction of Hepatitis Patients.” Journal of Medical Systems, vol. 45, no. 5, Springer Science and Business Media LLC, Apr. 2021, https://doi.org/10.1007/s10916-021-01736-5.
[10] Zhao, Yue, et al. “Employee Turnover Prediction with Machine Learning: A Reliable Approach.” Advances in Intelligent Systems and Computing, 2018, pp. 737–758., https://doi.org/10.1007/978-3-030-01057-7_56.
[11] Li, Pan, et al. “Combining Decision Trees and Neural Networks for Learning-to-Rank in Personal Search.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, July 2019, https://doi.org/10.1145/3292500.3330676.
[12] Inglis, Alan, et al. “Visualizing Variable Importance and Variable Interaction Effects in Machine Learning Models.” Journal of Computational and Graphical Statistics, vol. 31, no. 3, Informa UK Limited, Jan. 2022, pp. 766–78. https://doi.org/10.1080/10618600.2021.2007935.

Understanding the Correlation Between the Diagnosis Outcome of Breast Cancer and Anemia Using Machine Learning-Based Classification Models