Breast cancer classification using machine learning
techniques: a comparative study
Type of article: Original
Djihane Houfani1, Sihem Slatnia1, Okba Kazar1, Noureddine Zerhouni2, Hamza Saouli1, Ikram Remadna1
1 LINFI Laboratory, Department of Computer
Science, University of Biskra, Algeria.
2 Institut FEMTO-ST, UMR CNRS 6174- UFC /
ENSMM / UTBM, Besanon, France
Abstract:
Background: The second leading deadliest disease
affecting women worldwide, after lung
cancer, is breast cancer. Traditional approaches for breast cancer diagnosis
suffer from time consumption and some human errors in classification. To deal
with this problems, many research works based on machine learning techniques
are proposed. These approaches show their effectiveness in data classification in
many fields, especially in healthcare.
Methods: In this cross sectional study, we conducted a practical comparison
between the most used machine learning algorithms in the literature. We applied
kernel and linear support vector machines, random forest, decision tree,
multi-layer perceptron, logistic regression, and k-nearest neighbors for breast
cancer tumors classification. The used
dataset is Wisconsin diagnosis Breast
Cancer.
Results: After comparing the machine learning algorithms efficiency, we
noticed that multilayer perceptron and logistic regression gave the best results with an accuracy of 98% for
breast cancer classification.
Conclusion: Machine learning approaches are extensively used in medical prediction and decision support
systems. This study showed that multilayer perceptron and logistic
regression algorithms are performant ( good accuracy specificity and sensitivity)
compared to the other evaluated
algorithms.
Keyword: Breast cancer, Classification, Accuracy, Comparative
study, Machine learning.
Corresponding
author: Djihane Houfani, LINFI Laboratory, Department of Computer Science,
University of Biskra, Algeria., email: djihane.houfani@univ-biskra.dz.
Received: July 12 2020. Reviewed: August 17 2020. Accepted: October 3
2020. Published: October 20 2020.
Medical Technologies Journal subscribes to the principles of the
Committee on Publication Ethics (COPE).
Screened by iThenticate..©2017-2019 KNOWLEDGE KINGDOM PUBLISHING.
1. Introduction
Breast cancer (BC) is among the major
deadliest diseases affecting women around the world [1]. It occurs because of
the uncontrolled growth of the cells in breast tissue. BC diagnosis based on histopathological data can provide
inaccurate outcomes. In last decade, machine learning (ML)
techniques are widely used in diagnosis of BC to help pathologists and
physicians in early detection, decision making process and giving a successful
plan for treatment.
In the literature, several algorithms for breast
cancer diagnosis and prognosis are proposed. In this paper we provide a
practical comparison between kernel
and linear support vector machines (K-SVM, L-SVM respectively), random forest(RF), decision tree (DTs),
multi-layer perceptron (MLP),
logistic regression (LR),
and k-nearest neighbors (k-NN) which are the most used algorithms in several researches [2-4]. The goal
of this study is to evaluate the performance of these algorithms in terms of effectiveness,
efficiency and accuracy,. We conduct this comparative study to find out the
best approach to be used in learning models, to apply it on new datasets and
improve its performance by combining it with other technologies such as fuzzy
learning, convolutional neural networks, genetic algorithms …etc.
In the rest of the paper, we will explain our
experiment, the materials and the methods, in Section 2. Then, we will present
the obtained results in Section 3, and finally, we present our conclusions and
future works in Section 4.
2. Materials and Method
a.
Related works
Classification is one of the most crucial
machine learning tasks. It is applied in many research works using several
medical datasets in order to classify breast cancer cells. In this section, we
present some works that apply ML techniques for early BC diagnosis.
Abdel-Zaher and Eldeib [5] proposed a
computer aided diagnosis (CAD) scheme for BC detection. Deep belief network
unsupervised learning and back propagation supervised learning are used in this
system. It is evaluated using the
Wisconsin Breast Cancer Dataset (WBCD) and gave an accuracy of 99.68%
on.
Thein and Khin [6] presented an approach
for BC classification. The proposed
system applied the island-based training method on the Wisconsin Diagnostic and
Prognostic Breast Cancer data sets. This approach gave good accuracy and low
training time by using and analyzing two migration topologies.
Ibrahim and Siti [7] applied MLP neural
network and enhanced non-dominated sorting genetic algorithm (NSGA-II) for BC
automatic classification. Compared to other methods, this work improved
classification accuracy by optimizing the ANN parameters and network structure.
Guan et al. [8] proposed breast tumor
classifier. They used Wisconsin Breast
Cancer Dataset to evaluate their
diagnostic model called self-validation cerebellar model articulation
controller (SVCMAC) neural network. The advantages of this method are simple
computation, fast learning, and good
generalization capability.
Kumar et al. [9] proposed an ensemble voting classifier. It
combines J48, Naïve Bayes, and SVM on WBCD to improve the decision-making approaches
in the prediction of BC survivability. The dataset is preprocessed. Then, it was
trained and tested using 10-fold cross validation. The combined model gave good
accuracy.
Mittal et al. [10] presented a hybrid
classifier for BC diagnosis. The classifier is a combination of self-organizing maps (SOM) and stochastic
gradient descent (SGD) on WBCD. The proposed system improved the accuracy
compared to other works in state of the art ML techniques.
Haifeng et al. [11] proposed an SVM-based ensemble learning
model for BC diagnosis. The proposed model includes C-SVM and ν–SVM structures,
and six types of kernel functions. It was tested using two datasets: the WBCD
(original and diagnosis) datasets, and the Surveillance, Epidemiology, and End
Results (SEER) dataset. The system presented a good accuracy compared to works
based on single SVM.
Emina et al. [12] proposed a BC classifier applying
several machine learning algorithms. Logistic Regression, Decision Trees, RF,
Bayesian Network, MLP, Radial Basis Function Networks (RBFN), SVM, Rotation
Forest and genetic algorithm-based feature selection were compared and the
Rotation Forest model with GA-based 14 features The system gave the best
results (accuracy 99.48%). The system was evaluated using diagnosis and
original WBCD datasets.
Zheng et al. [13] applied K-means and
support vector machine (K-SVM) algorithms to develop a hybrid system for breast
tumors classification. The method is tested on WDBC dataset and gave an
accuracy of 97.38%.
Arpit and Aruna [14] developed a
genetically optimized neural network (GONN) for BC classification. They improved
the neural network architecture by introducing a new crossover and mutation
operators. The proposed approach is evaluated by using WBCD and presented good accuracy.
By this study, we aim to employ MLP, L-SVM, K-SVM, DTs, RF,
KNN, NB, LR and NB on the Wisconsin Breast Cancer (Diagnosis) dataset to compare
their performance (effectiveness and efficiency).
b.
Experiments
In this section, a presentation of the
experiments is given.
A.
Dataset
In the literature, many studies used The Wisconsin
Breast Cancer dataset (diagnosis). It is available in the UCI Machine Learning
Repository. It has 569 instances
(Benign: 357 Malignant: 212), 2 classes (37.3% malignant and 62.7% benign), and
32 attributes which are: ID number, Diagnosis (M = malignant, B =
benign), Radius (mean of distances from center to points on the perimeter), Texture
(standard deviation of gray-scale values), Perimeter (mean size of the core
tumor), Area, Smoothness (local variation in radius lengths), Compactness
(perimeter^2 / area - 1.0), Concavity (severity of concave portions of the
contour), Concave points (number of concave portions of the contour, Symmetry, Fractal
dimension (coastline approximation – 1), the mean, standard error and
"worst" or largest (mean of the three largest values) of these
attributes are computed for each image, resulting in 32 attributes) [15].
In this work, the ML algorithms are evaluated using WBCD.
B.
Data Normalization
The z-score standardization method is used to normalize the dataset. The
used equation for calculating the z-score is given in (1), where, μm is the mean value of the attribute, δm is
the standard deviation, is
the raw data, and is
the normalized data [16].
C.
Methods description
Supervised learning algorithms are algorithms that
learn on a labeled dataset, given training on input
and output parameters [17]. In this case, the goal of computers is to learn a
general formula which maps inputs to outputs. Predictive models developed in
this type of learning are achieved using classification and regression
techniques. Classification methods predict discrete variable however,
regression techniques provide continuous variables.
· Naïve Bayes (NB)
NB is a probabilistic ML method. It
calculates probabilities of different classes given some observed evidence [18].
It uses the maximum likelihood method for parameter estimation and is appropriate for high dimensionality inputs.
Equation (2) gives the probability of a class given predictor, where is posterior
probability, is
the class prior probability, is
the likelihood, and is the
probability of predictor [ 4]:
· Support Vector Machine (SVM)
SVM algorithm consists of finding a
hyperplane which separates classes [17]. It is suited for high dimensional inputs
and memory efficient because it uses support vectors. SVM is a powerful
algorithm; however its storage and computational grow up with the number of
training vectors [18].
· Decision tree (DT)
DT is a diagram having a tree
structure, where each node represents a test on an attribute, each branch denotes
an outcome of the test, and each terminal node detains a class label. It is
used to classify input data points or predict output values given inputs [19].
It is efficient and capable of fitting complex datasets.
· Random forest (RF)
RF is a large number of decision
trees which are ensemble [18]. In this method, each individual tree builds an
output class then the average of predictions is taken. The final result is generated by taking the
mode of classes found separately [11].
· Logistic regression (LR)
LR algorithm can be applied for classification and regression tasks [19].
It is a statistical method for data analyzing. It aims to obtain the best fitting model which
describes the relation between inputs and outputs.
· K-Nearest Neighbor (K-NN)
K-NN is the simplest machine learning
method. It is non-parametric method used for both classification and
regression. It consists of calculating the distance between the test data and
the input and gives the prediction accordingly [18].
· Multilayer Perceptron (MLP)
MLP is a feed forward supervised neural
network for data classification. It is composed of many layers as a directed
graph between the input and output layers. For the training task, MLP uses
backpropagation method [18].
3. Results and discussion
The goal of this work is to compare the
performance of NB, SVM, DT, LR, RF, k-NN and MLP. We used split method to divide
the dataset: a training set (75%) to train the model, and a testing set (25%)
to evaluate it.
A.Efficiency
A confusion matrix is a performance
measurement that provides information about real and predicted values resulting
from a classification system. Table 1 is a description of a confusion matrix
for a two class classifier.
TABLE I. Confusion matrix representation
True value |
Predicted value |
|
Positive |
Negative |
|
Positive |
TP |
FN |
Negative |
FP |
TN |
Where:
- TP: malignant tumors (M) correctly predicted as malignant;
- FP: benign tumors
(B) incorrectly
predicted as malignant tumors (M);
- FN: malignant tumor (M) incorrectly identified as benign
tumor (B);
- TN: benign tumor (B) correctly identified as benign.
To compare true classes and predicted results, we use confusion matrices
shown in figure 1. We note that both of MLP and LR correctly predict 140
instances from 143 instances (87 benign instances correctly predicted benign
and 53 malignant instance effectively malignant), and 3 instances incorrectly
predicted (2 benign instances predicted as malignant and 1 malignant instance
predicted as benign). As a result, MLP and LR give the best accuracies.
Fig. 1. Confusion
matrices
To check how our models are efficient, we build the ROC
(receiver operating characteristic) curve presented in figure 2. ROC curve is used with binary classifiers, it is
used to understand the performance of a ML
algorithm; it plots the TPR against the FPR [17].
The TPR and FPR are given in equation
(3) and (4) [20]:
TPR and FPR
values are given in table 2.
Fig. 2. ROC curve
We can easily observe that MLP and LR are
the best classifiers followed by other algorithms.
A.
Effectiveness
To measure the performance
of used algorithms, we conduct a comparison based on accuracy, correctly and incorrectly
classified instances.
Accuracy is a metric used
for evaluating classification models. It gives the ratio of the total number of the
correct predictions, its equation is given in
(5) [21]:
Table 2 and figure 3 show the obtained
results.
Fig. 3. comparative graphs of used classifiers
We can notice from figure 3, that accuracy
obtained by MLP and LR (98%) is the best compared to accuracy obtained by KNN,
DTs, RF, L-SVM, K-SVM and NB which vary between 95% and 97%. We can also easily
see that both of MLP and LR reach the best value of correctly classified
instances and the lower value of incorrectly classified instances compared to
the other classification methods. It is noted that the DTs performance was the
highest in the training phase, but it wasn’t in the testing phase; this proofs
that DTs can learn accurately in the training phase, however it can be weak in
generalization.
In summary, both of MLP and LR algorithms provide
a good performance (effectiveness and efficiency, accuracy, sensitivity and
specificity) compared to the other algorithms. In this study, we achieve the
highest accuracy (98%) in classifying breast tumors.
4. Conclusion
Machine learning techniques are
revolutionizing the field of bio-medical and healthcare. One of the most
important challenges of ML is to provide computationally efficient and accurate
classifiers for healthcare field. In the last decade, many research works have
been conducted in medical field for this reason. ML techniques have played a
crucial role in improving classification and prediction accuracy. Although several
algorithms have achieved a very good accuracy using WBCD, the development of new
algorithms is still essential.
In this paper, we employed MLP, L-SVM,
K-SVM, DTs, RF, KNN, LR and NB on WBCD dataset. We compared their performance in
terms of effectiveness and efficiency to find the highest classification
accuracy. In this experimental study, we achieved the best accuracy (98%) in classifying BC dataset
using MLP and LR. In conclusion, MLP and LR have shown their efficiency in BC
classification.
For the future work, we plan to apply deep
reinforcement learning and genetic algorithms on new datasets to boost the breast
cancer diagnosis and further improve prognostic accuracy.
5. Conflict of interest statement
We certify that there is no conflict of interest with
any financial organization inthe subject matter or materials discussed in this
manuscript.
6. Authors’ biography
Djihane Houfani received
her Master degree in Computer Science from University of Biskra, Algeria in
2017. She is now a PhD student in artificial intelligence at the University of
Biskra and her current research interest includes medical prediction, deep
learning, multi-agent systems and optimization.
Sihem Slatnia was born in
the city of Biskra, Algeria. She followed her high studies at the university of Biskra, Algeria
at the Computer Science Department and obtained the engineering diploma in 2004
on the work "Diagnostic based model by Black and White analyzing in
Background Petri Nets", After that, she obtained Master diploma in 2007
(option: Artificial intelligence and advanced system's information), on the
work "Evolutionary Cellular Automata Based-Approach for Edge
Detection". She obtained PhD degree from the same university in 2011, on
the work "Evolutionary Algorithms for Image Segmentation based on Cellular
Automata". Presently she is an associate professor at computer science
department of Biskra University. She is interested to the artificial
intelligence, emergent complex systems and optimization.
Okba Kazar professor
in the Computer Science Department of Biskra, he helped to create the
laboratory LINFI at the University of Biskra. He is a member of international
conference program committees and the "editorial board" for various
magazines. His research interests are artificial intelligence, multi-agent
systems, web applications and information systems.
Noureddine Zerhouni holds a
doctorate in Automatic-Productivity from the National Polytechnic Institute of Grenoble
(INPG), France, in 1991. He was a lecturer at the National School of Engineers (ENI,
UTBM) in Belfort. Since 1999, he is Professor of Universities at the National School
of Mechanics and Microtechnics (ENSMM) in Besançon. He is doing his research in
the Automatic department of the FEMTO-ST Institute in Besançon. His areas of
research are related to the monitoring and maintenance of production systems.
Hamza Saouli received the
Master and Doctorate degrees in Computer Science from University of Mohamed
KhiderBiskra (UMKB), the Republic of Algeria in 2010 and 2015, respectively. He
is a university lecturer since 2015 and his research interest includes
artificial intelligence, web services and Cloud Computing.
Ikram Remadna received
her Master degree in Computer Science from University of Biskra, Algeria in
2016. She is now a PhD student in artificial intelligence at the University of
Biskra and her current research interest includes Prognostics and Health
Management and Deep learning.
7. References
[1] U.S. Cancer Statistics
Working Group. United States Cancer Statistics: 19992008 Incidence and
Mortality Web-based Report. Atlanta (GA): Department of Health and Human
Services, Centers for Disease Control and Prevention, and National Cancer
Institute (2012).
[2] BF Cruz, JT de Assis, VV Estrela, A Khelassi, A compact SIFT-based
strategy for visual information retrieval in large image databases, Medical
Technologies Journal 3 (2), 402-412, 2019 https://doi.org/10.26415/2572-004X-vol3iss2p402-412
[3] Q Memon, (2019), On assisted
living of paralyzed persons through real-time eye features tracking and
classification using Support Vector Machines, Medical Technologies Journal 3
(1), 316-333 https://doi.org/10.26415/2572-004X-vol3iss1p316-333
[4] Devi I., Karpagam G.R. and
Vinoth Kumar B (2017), A survey of machine learning techniques. International
Journal of Computational Systems Engineering. 3 (4): 203-212. https://doi.org/10.1504/IJCSYSE.2017.10010099
[5] Abdel-Zaher Ahmed M. and Eldeib Ayman M. (2016), Breast cancer
classification using deep belief networks. Expert Systems with Applications.
ELSEVIER;46:139-144. https://doi.org/10.1016/j.eswa.2015.10.015
[6] Thein HTT. and Khin MMT. (2015), An Approach for Breast Cancer
Diagnosis Classification Using Neural Network. Advanced Computing. An
International Journal (ACIJ). 6 (1): 1-11. https://doi.org/10.5121/acij.2015.6101
[7] Ashraf O. I. and Siti, M. S. (2018), Intelligent breast cancer
diagnosis based on enhanced Pareto optimal and multilayer perceptron neural
network. International Journal of Computer Aided Engineering and Technology.
Inderscience. 10 (5): 543-556. https://doi.org/10.1504/IJCAET.2018.10013710
[8] Guan J., Lin L., Ji G., Lin C., Le T., Imre JR. (2016), Breast Tumor
Computer-aided Diagnosis using Self-Validating Cerebellar Model Neural
Networks. Acta Polytechnica Hungarica. 13 (4): 39-52. https://doi.org/10.12700/APH.13.4.2016.4.3
[9] Karthik Kumar U., Sai Nikhil
M.B. and Sumangali K. (2017), Prediction of Breast Cancer using Voting
Classifier Technique. IEEE International Conference on Smart Technologies and
Management for Computing, Communication, Controls, Energy and Materials (ICSTM);
2017 2 - 4 August; Veltech Dr.RR & Dr.SR University, Chennai, T.N., India.
108-114 https://doi.org/10.1109/ICSTM.2017.8089135
[10] Mittal D., Gaurav D. and Sanjiban SR. (2015), An Effective
Hybridized Classifier for Breast Cancer Diagnosis. IEEE International
Conference on Advanced Intelligent Mechatronics (AIM); 2015 July 7-11. Busan,
Korea. https://doi.org/10.1109/AIM.2015.7222674
[11] Haifeng W., Bichen Z., Sang W.Y., Hoo S. K. (2017), A Support Vector
Machine-Based Ensemble Algorithm for Breast Cancer Diagnosis. European Journal
of Operational Research. Elsevier: 1-33.
[12] Emina A., Abdulhamit S. (2015), Breast cancer diagnosis using GA
feature selection and Rotation Forest. Neural Comput & Applic. Springer.
[13] Zheng B., Sang WY., Sarah SL.
(2013), Breast cancer diagnosis based on feature extraction using a hybrid of
K-means and support vector machine algorithms. Expert Systems with
Applications. Elsevier: 1-7.
[14] Arpit, B., Aruna, T. (2015), Breast Cancer Diagnosis Using
Genetically Optimized Neural Network Model. Expert Systems with Applications.
Elsevier: 1-15.
[15] https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic),
last accessed September 20, 2019.
[16] E. Kreyszig (1979), Advanced Engineering Mathematics (Fourth ed.).
Wiley, ISBN 0-471-02140-7.
[17] Aurélien Géron (2017), Hands-On Machine Learning with Scikit-Learn
and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.
Published by O'Reilly Media.
[18] Amit K. and Bikash KS. (2017), A case study on machine learning and
classification. International Journal Information and Decision Sciences 9 (2):
97-208 https://doi.org/10.1504/IJIDS.2017.084885
[19] Francois Chollet (2018), Deep Learning with Python. Published by
Manning Publications.
[20] Davis, J., & Goadrich, M. (2006). The relationship between
Precision-Recall and ROC curves. Proceedings of the 23rd International
Conference on Machine Learning - ICML '06.
https://doi.org/10.1145/1143844.1143874
[21] Piri, S., Delen, D., & Liu, T. (2018). A synthetic informative
minority over-sampling (SIMO) algorithm leveraging support vector machine to
enhance learning from imbalanced datasets. Decision Support Systems, 106,
15-29. https://doi.org/10.1016/j.dss.2017.11.006
[22] Razmjooy N., Estrela VV., Loschi HJ. (2019), A study on
metaheuristic-based neural networks for image segmentation purposes, Data
Science Theory, Analysis and Applications, Taylor and Francis, Abingdon, UK,
2019. https://doi.org/10.1201/9780429263798-2
[23] Razmjooy N., N, Estrela V.V., Loschi H.J., Farfan W.S. (2019), A
Comprehensive Survey of New Metaheuristic Algorithms, Wiley.
[24] Karim CN., Mohamed O, Ryad T. (2018), A new approach for breast
abnormality detection based on thermography. Medical Technologies Journal,
2(3):245-254.
https://doi.org/10.26415/2572-004X-vol2iss3p245-254
[25] Hemanth J., Estrela V.V. (2017), Deep Learning for Image Processing
Applications. Advances in Parallel Computing, Vol. 31, IOS Press, Amsterdam,
Netherlands. ISSN: 978-1-61499-822-8.
https://www.iospress.nl/book/deep-learning-for-imageprocessing-applications/
[26] Souadih K., Belaid A, Ben Salem D. (2019), Automatic Segmentation of
the Sphenoid Sinus in CT-Scans Volume with Deep Medics 3D CNN Architecture,
Medical Technologies Journal, Vol. 3, no. 1, pp. 334-46,
https://doi.org/10.26415/2572-004X-vol3iss1p334-346
.