Two-stage Gene Selection and Classification for a High-Dimensional Microarray Data

Authors

  • Masithoh Yessi Rochayani, Universitas Brawijaya, Indonesia
  • Umu Sa'adah, Universitas Brawijaya, Indonesia
  • Ani Budi Astuti, Universitas Brawijaya, Indonesia

DOI:

https://doi.org/10.15575/join.v5i1.569

Keywords:

Classification and Regression Tree, Feature selection, Gene expression, High-dimensional, Microarray

Abstract

Microarray technology has benefited cancer diagnosis and classification. However, classifying cancer from microarray data is difficult because the datasets are high-dimensional. One strategy for dealing with the dimensionality problem is to perform feature selection before modeling. Lasso is a common regularization method for reducing the number of features, or predictors. However, Lasso retains too many features at the optimum regularization parameter, so feature selection can be continued in a second stage. We propose the Classification and Regression Tree (CART) for second-stage feature selection, which also produces a classification model. We used a dataset that compares gene expression in breast tumor tissues with that in other tumor tissues. The dataset has 10,936 predictor variables and 1,545 observations. The proposed method selected only a small number of genes while achieving high accuracy. The resulting model is also in line with oncogenomics theory: GATA3, an important marker for breast tumors, was selected to split the root node of the decision tree.
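
The two-stage idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the penalty `lam` is fixed by hand rather than tuned, stage 1 uses plain proximal gradient descent, stage 2 grows a depth-2 tree with Gini impurity, and accuracy is measured on the training data only. All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def lasso_logistic(X, y, lam, lr=0.5, n_iter=200):
    """Stage 1: L1-penalised logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        b -= lr * sum(resid) / n
        for j in range(p):
            g = sum(r * row[j] for r, row in zip(resid, X)) / n
            z = w[j] - lr * g
            # Soft-thresholding sets small coefficients exactly to zero.
            w[j] = math.copysign(max(abs(z) - lr * lam, 0.0), z)
    return w

def gini(labels):
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)

def best_split(X, y, features):
    """Return (weighted Gini, feature, threshold) of the best binary split."""
    best = None
    for j in features:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow_tree(X, y, features, depth=2):
    """Stage 2: a small CART-style tree restricted to the stage-1 features."""
    majority = int(2 * sum(y) >= len(y))
    split = best_split(X, y, features) if depth > 0 and gini(y) > 0 else None
    if split is None:
        return majority  # leaf: predicted class
    _, j, t = split
    left = [(r, yi) for r, yi in zip(X, y) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(X, y) if r[j] > t]
    return (j, t,
            grow_tree([r for r, _ in left], [yi for _, yi in left], features, depth - 1),
            grow_tree([r for r, _ in right], [yi for _, yi in right], features, depth - 1))

def predict(tree, row):
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def used_features(tree):
    if not isinstance(tree, tuple):
        return set()
    j, _, left, right = tree
    return {j} | used_features(left) | used_features(right)

# Synthetic stand-in for expression data: 120 samples, 20 "genes".
random.seed(0)
n, p = 120, 20
y = [i % 2 for i in range(n)]
X = []
for yi in y:
    shift = 2.0 if yi == 1 else -2.0
    row = [random.gauss(0.0, 1.0) for _ in range(p)]
    row[0] += shift        # strongly informative synthetic gene
    row[1] += 0.5 * shift  # weakly informative synthetic gene
    X.append(row)

w = lasso_logistic(X, y, lam=0.05)
lasso_genes = [j for j, wj in enumerate(w) if wj != 0.0]  # stage-1 survivors
tree = grow_tree(X, y, lasso_genes, depth=2)
tree_genes = sorted(used_features(tree))                  # stage-2 survivors
accuracy = sum(predict(tree, r) == yi for r, yi in zip(X, y)) / n
print(len(lasso_genes), tree_genes, accuracy)
```

Because the tree may only split on features that survived the Lasso stage, the second stage can never enlarge the selected set. With a depth-2 tree it keeps at most three features, and the feature at the root split plays the role the abstract attributes to GATA3.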

Author Biographies

Masithoh Yessi Rochayani, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Umu Sa'adah, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Ani Budi Astuti, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Published

2020-07-16

Section

Article
