Two-stage Gene Selection and Classification for a High-Dimensional Microarray Data

Masithoh Yessi Rochayani; Umu Sa'adah; Ani Budi Astuti

doi:10.15575/join.v5i1.569

Authors

Masithoh Yessi Rochayani Universitas Brawijaya, Indonesia
Umu Sa'adah Universitas Brawijaya, Indonesia
Ani Budi Astuti Universitas Brawijaya, Indonesia

DOI:

https://doi.org/10.15575/join.v5i1.569

Keywords:

Classification and Regression Tree, Feature selection, Gene expression, High-dimensional, Microarray

Abstract

Microarray technology has benefited cancer diagnosis and classification. However, classifying cancer from microarray data is difficult because the data are high-dimensional. One strategy for dealing with the dimensionality problem is to perform feature selection before modeling. Lasso is a common regularization method for reducing the number of features, or predictors. However, Lasso still retains too many features at the optimum regularization parameter, so feature selection can be continued in a second stage. We propose Classification and Regression Tree (CART) for the second stage of feature selection, which also produces a classification model. We used a dataset that compares gene expression in breast tumor tissues and other tumor tissues. This dataset has 10,936 predictor variables and 1,545 observations. The results show that the proposed method selects only a small number of genes while still giving high accuracy. The resulting model is also in line with oncogenomics theory: GATA3 was chosen to split the root node of the decision tree, and GATA3 has become an important marker for breast tumors.
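The two-stage idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation and does not use their dataset: the synthetic data, the penalty value `lam`, the learning rate, the tree depth, and all function names are assumptions chosen for illustration. Stage 1 fits an L1-penalized logistic regression by proximal gradient descent and keeps the features with nonzero weights; stage 2 grows a small Gini-based classification tree on those features only, so the genes used in its splits form the final selection.

```python
import math
import random

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink z toward zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def lasso_logistic(X, y, lam, lr=0.5, n_iter=200):
    """Stage 1: L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err / n
            for j, xj in enumerate(row):
                grad_w[j] += err * xj / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)

def best_split(X, y, features):
    """Return (weighted impurity, feature, threshold) of the best binary split."""
    best = None
    for j in features:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[j] <= thr]
            right = [lab for row, lab in zip(X, y) if row[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, thr)
    return best

def grow_tree(X, y, features, max_depth=3):
    """Stage 2: grow an unpruned, depth-limited classification tree."""
    majority = int(2 * sum(y) >= len(y))
    can_split = max_depth > 0 and gini(y) > 0.0
    split = best_split(X, y, features) if can_split else None
    if split is None:
        return majority  # leaf node: predicted class
    _, j, thr = split
    left = [i for i, row in enumerate(X) if row[j] <= thr]
    right = [i for i, row in enumerate(X) if row[j] > thr]
    return (j, thr,
            grow_tree([X[i] for i in left], [y[i] for i in left], features, max_depth - 1),
            grow_tree([X[i] for i in right], [y[i] for i in right], features, max_depth - 1))

def predict(tree, row):
    """Follow the splits down to a leaf and return its class."""
    while isinstance(tree, tuple):
        j, thr, left, right = tree
        tree = left if row[j] <= thr else right
    return tree

def used_genes(tree):
    """Indices of the features that appear in at least one split."""
    if not isinstance(tree, tuple):
        return set()
    j, _, left, right = tree
    return {j} | used_genes(left) | used_genes(right)

# Synthetic stand-in for expression data (assumption): 60 samples, 30 "genes",
# with the class label determined by the sign of gene 0.
random.seed(0)
n, p = 60, 30
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [1 if row[0] > 0 else 0 for row in X]

weights = lasso_logistic(X, y, lam=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]  # stage-1 survivors
tree = grow_tree(X, y, selected)                             # stage-2 model
accuracy = sum(predict(tree, row) == lab for row, lab in zip(X, y)) / n

print("genes kept by Lasso:", selected)
print("genes used by the tree:", sorted(used_genes(tree)))
print("training accuracy:", accuracy)
```

The accuracy printed here is measured on the training data of a toy problem and says nothing about the results reported in the paper. In a real analysis the regularization parameter would typically be chosen by cross-validation and the tree would be pruned and evaluated on held-out data.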

Author Biographies

Masithoh Yessi Rochayani, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Umu Sa'adah, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Ani Budi Astuti, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

