Regularized K-Means Clustering via Fully Corrective Frank-Wolfe Optimization

Ahmed Yacoub Yousif; Basad Al-Sarray

doi:10.31026/j.eng.2025.08.11

download full text (الإنجليزية)

منشور: 2025-08-01

DOI: https://doi.org/10.31026/j.eng.2025.08.11

الكلمات المفتاحية:

Clustering، Regularized k-means، Fully corrective Frank-Wolfe

Ahmed Yacoub Yousif

Department of Mathematics, College of Science, University of Baghdad

Basad Al-Sarray

Department of Computer Science, College of Science, University of Baghdad

الملخص

يُستخدم التعلم غير المُراقب لتحليل هياكل البيانات في عمليات التجميع، إلا أن خوارزمية k-means التقليدية تعاني من الحساسية للضوضاء والقيم الشاذة والبيانات عالية الأبعاد، مما يؤدي إلى أداء غير مستقر. وللتغلب على هذه القيود، تُقدَّم في هذه الدراسة إطار مُحسَّن للتجميع باستخدام k-means المُنظَّم، حيث يتم تحسينه باستخدام خوارزمية فرانك-وولف المصححة بالكامل (FCFW). يساهم التنظيم باستخدام معيار فروبينياس (Frobenius norm) في زيادة استقرار المجموعات، والحد من فرط التكيّف، وتحقيق توزيع أكثر تماسكًا للعناقيد. بالإضافة إلى ذلك، يتيح تعيين المجموعات الاحتمالي مرونة أكبر، حيث يمكن لكل نقطة بيانات الانتماء إلى عدة مجموعات بمستويات عضوية مختلفة. كما يتم تنفيذ اختيار الميزات باستخدام اختبار كروسكال-واليس (Kruskal-Wallis)، وهو اختبار إحصائي غير معلمي يُستخدم لتحديد الميزات المهمة في البيانات عالية الأبعاد، خصوصًا في تحليل التعبير الجيني.

تم تقييم كفاءة الطريقة المقترحة من خلال التجارب على مجموعات بيانات تركيبية وبيانات تعبير جيني لسرطان الثدي عالية الأبعاد، حيث أثبتت التجارب قدرة النموذج على التعامل مع العناقيد غير الخطية والمتداخلة بفعالية. أظهرت النتائج أن خوارزمية k-means المُنظَّمة باستخدام FCFW تتفوق على k-means التقليدية والخوارزميات الأخرى من حيث الدقة وسرعة التقارب. وبالتالي، تؤكد هذه الدراسة أن النهج المقترح يتمتع بالقوة والفعالية، مما يجعله مناسبًا لتطبيقات تحليل البيانات عالية الأبعاد.

##plugins.themes.bootstrap3.displayStats.downloads##

##plugins.themes.bootstrap3.displayStats.noStats##

إصدار

مجلد 31 عدد 8 (2025): Journal of Engineering

القسم

Articles

هذا العمل مرخص بموجب Creative Commons Attribution 4.0 International License.

كيفية الاقتباس

"Regularized K-Means Clustering via Fully Corrective Frank-Wolfe Optimization" (2025) مجلة الهندسة, 31(8), ص 178–199. doi:10.31026/j.eng.2025.08.11.

المراجع

Adams, R.P., 2018. K-means clustering and related algorithms. Princeton University.

Ahmed, V.M. and Al-Haleem, A.A., 2024. Permeability prediction for Ajeel Oilfield/Tertiary Reservoir by integrating rock typing approach with FZI method. Journal of Engineering, 30(12), pp. 96–111. https://doi.org/10.31026/j.eng.2024.12.07.

AL-Kordy, S.U. and Khudair, B.H., 2021. Effluent quality assessment of sewage treatment plant using principal component analysis and cluster analysis. Journal of Engineering, 27(4), pp. 79–95. https://doi.org/10.31026/j.eng.2021.04.07.

Beznosikov, A., Dobre, D. and Gidel, G., 2023. Sarah frank-wolfe: Methods for constrained optimization with best rates and practical features. arXiv preprint arXiv:2304.11737.

Blanza, J., 2021. Wireless propagation multipaths using spectral clustering and three-constraint affinity matrix spectral clustering. Baghdad Science Journal, 18(2), P. 1001. https://doi.org/10.21123/bsj.2021.18.2(Suppl.).1001.

Canon, M.D. and Cullum, C.D., 1968. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control, 6(4), pp. 509–516. https://doi.org/10.1137/0306032.

Cherfaoui, F., Emiya, V., Ralaivola, L. and Anthoine, S., 2018. Frank-Wolfe algorithm for the exact sparse problem. arXiv preprint arXiv:1812.07201.

Daoudi, S., Anouar Zouaoui, C.M., El-Mezouar, M.C. and Taleb, N., 2021. Parallelization of the K-means++ clustering algorithm. Ingénierie des Systèmes d’Information, 26(1).

Deng, Q., Ramsköld, D., Reinius, B. and Sandberg, R., 2014. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science, 343(6167), pp. 193–196. https://doi.org/10.1126/science.1245316.

Feltes, B.C., Chandelier, E.B., Grisci, B.I. and Dorn, M., 2019. Cumida: An extensively curated microarray database for benchmarking and testing of machine learning approaches in cancer research. Journal of Computational Biology, 26(4), pp. 376–386. https://doi.org/DOI:10.1089/cmb.2018.0238.

Frank, M. and Wolfe, P., 1956. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1–2), pp. 95–110. https://doi.org/10.1002/nav.3800030112.

Gao, C.X., Dwyer, D., Zhu, Y., Smith, C.L., Du, L., Filia, K.M., Bayer, J., Menssink, J.M., Wang, T. and Bergmeir, C., 2023. An overview of clustering methods with guidelines for application in mental health research. Psychiatry Research, 327, P. 115265. https://doi.org/10.1016/j.psychres.2023.115265.

Ghathwan, K.I. and Mohammed, A.J., 2022. Intelligent bat algorithm for finding eps parameter of DbScan clustering algorithm. Iraqi Journal of Science, pp. 5572–5580. https://doi.org/10.24996/ijs.2022.63.12.41.

Gondeau, A., Aouabed, Z., Hijri, M., Peres-Neto, P.R. and Makarenkov, V., 2019. Object weighting: A new clustering approach to deal with outliers and cluster overlap in computational biology. IEEE/ACM transactions on computational biology and bioinformatics, 18(2), pp. 633–643. https://doi.org/10.1109/TCBB.2019.2921577.

Goolam, M., Scialdone, A., Graham, S.J.L., Macaulay, I.C., Jedrusik, A., Hupalowska, A., Voet, T., Marioni, J.C. and Zernicka-Goetz, M., 2016. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell, 165(1), pp. 61–74. https://doi.org/10.1097/01.ogx.0000488738.30718.bf.

Grisci, B.I., Feltes, B.C. and Dorn, M., 2019. Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of Biomedical Informatics, 89, pp. 122–133. https://doi.org/10.1016/j.jbi.2018.11.013.

He, Y. and Zheng, Y., 2018. Short-term power load probability density forecasting based on Yeo-Johnson transformation quantile regression and Gaussian kernel function. Energy, 154, pp. 143–156. https://doi.org/10.1016/j.energy.2018.04.072.

Holloway, C.A., 1974. An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6, pp. 14–27. https://doi.org/10.1007/BF01580219.

Jamail, I. and Moussa, A., 2020. Current state-of-the-art of clustering methods for gene expression data with RNA-Seq. In: Applications of Pattern Recognition. IntechOpen. https://doi.org/10.5772/intechopen.94069.

Jiang, P., Cao, J., Yu, W. and Nie, F., 2025. A robust entropy regularized K-means clustering algorithm for processing noise in datasets. Neural Computing and Applications, pp. 1–16. https://doi.org/10.1007/s00521-024-10899-4.

Lacoste-Julien, S. and Jaggi, M., 2015. On the global linear convergence of Frank-Wolfe optimization variants. Advances in neural information processing systems, 28.

Lacoste-Julien, S., Jaggi, M., Schmidt, M. and Pletscher, P., 2013. Block-coordinate Frank-Wolfe optimization for structural SVMs. In: International Conference on Machine Learning. PMLR. pp. 53–61.

Lei, T., Jia, X., Zhang, Y., He, L., Meng, H. and Nandi, A.K., 2018. Significantly fast and robust fuzzy c-means clustering algorithm based on morphological reconstruction and membership filtering. IEEE Transactions on Fuzzy Systems, 26(5), pp. 3027–3041. https://doi.org/10.1109/TFUZZ.2018.2796074.

LI, X., CAI, L., LI, J., YU, C.K.W.A.I. and HU, Y., 2021. A survey of clustering methods via optimization methodology. Journal of Applied & Numerical Optimization, 3(1). https://doi.org/10.23952/jano.3.2021.1.09.

Liu, H., Chen, J., Dy, J. and Fu, Y., 2023. Transforming complex problems into K-means solutions. IEEE transactions on pattern analysis and machine intelligence, 45(7), pp. 9149–9168.

Mahdi, S.S. and Mahmood, R.S., 2014. MR brain image segmentation using spatial fuzzy C-means clustering algorithm. Journal of Engineering, 20(09), pp. 78–89. https://doi.org/10.31026/j.eng.2014.09.06.

Meilă, M., 2007. Comparing clusterings—An information based distance. Journal of multivariate analysis, 98(5), pp. 873–895. https://doi.org/10.1016/j.jmva.2006.11.013.

Meléndez Surmay, R., Giraldo Henao, R. and Rodríguez Cortes, F., 2024. Kruskal-Wallis test for functional data based on random projections generated from a simulation of a Brownian motion. TecnoLógicas, 27(59).

Oyelade, J., Isewon, I., Oladipupo, F., Aromolaran, O., Uwoghiren, E., Ameh, F., Achas, M. and Adebiyi, E., 2016. Clustering algorithms: Their application to gene expression data. Bioinformatics and Biology insights, 10, P. BBI-S38316. https://doi.org/10.4137/BBI.S38316.

Raeisi, M. and Sesay, A.B., 2022. A distance metric for uneven clusters of unsupervised k-means clustering algorithm. IEEE Access, 10, pp. 86286–86297. https://doi.org/10.1109/ACCESS.2022.3198992.

Raymaekers, J. and Zamar, R.H., 2022. Regularized k-means through hard-thresholding. Journal of Machine Learning Research, 23(93), pp. 1–48.

Saha, J., Tanvir, R.H., Hassan Samee, M.A. and Rahman, A., 2023. Probabilistic clustering of cells using single-cell RNA-seq data. bioRxiv, pp. 2012–2023.

Salman, A. and Hussain, B.A., 2023. Gene expression analysis via spatial clustering and evaluation indexing. Iraqi Journal for Computer Science and Mathematics, 4(1), pp. 24–34. https://doi.org/10.52866/ijcsm.2023.01.01.004.

Sarray, B. Al, Chrétien, S., Clarkson, P. and Cottez, G., 2017. Enhancing Prony’s method by nuclear norm penalization and extension to missing data. Signal, Image and Video Processing, 11, pp. 1089–1096.

Shiltagh, N.A. and Hussein, M.A., 2015. Data aggregation in wireless sensor networks using modified Voronoi fuzzy clustering algorithm. Journal of Engineering, 21(4), pp. 42–60. https://doi.org/10.31026/j.eng.2015.04.03.

Strehl, A. and Ghosh, J., 2002. Cluster ensembles---a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec), pp. 583–617. https://doi.org/10.1162/153244303321897735.

Sun, W., Wang, J. and Fang, Y., 2012. Regularized k-means clustering of high-dimensional data and its asymptotic consistency.

Teran Hidalgo, S.J., Zhu, T., Wu, M. and Ma, S., 2018. Overlapping clustering of gene expression data using penalized weighted normalized cut. Genetic epidemiology, 42(8), pp. 796–811.

Wei, X., Wu, J., Li, G., Liu, J., Wu, X. and He, C., 2025. scPEDSSC: Proximity enhanced deep sparse subspace clustering method for scRNA-seq data. PLOS Computational Biology, 21(4), P. e1012924. https://doi.org/10.1371/journal.pcbi.1012924.

Wirth, E., Pena, J. and Pokutta, S., 2024. Fast Convergence of Frank-Wolfe algorithms on polytopes. arXiv preprint arXiv:2406.18789.

Wu, Z. and Wu, Z., 2020. An enhanced regularized k-means type clustering algorithm with adaptive weights. IEEE Access, 8, pp. 31171–31179. https://doi.org/10.1109/ACCESS.2020.2972333.

Yang, X., Zhao, W., Xu, Y., Wang, C.-D., Li, B. and Nie, F., 2024. Sparse K-means clustering algorithm with anchor graph regularization. Information Sciences, 667, P. 120504. https://doi.org/10.1016/j.ins.2024.120504.

Yousif, A.Y. and Sarray, B. Al, 2024. Convex optimization techniques for high-dimensional data clustering analysis: A review. Iraqi Journal for Computer Science and Mathematics, 5(3), P. 29. https://doi.org/10.52866/ijcsm.2024.05.03.022.

Zhang, X., He, Y., Jin, Y., Qin, H., Azhar, M. and Huang, J.Z., 2020. A robust k‐means clustering algorithm based on observation point mechanism. Complexity, 2020(1), P. 3650926. https://doi.org/10.1155/2020/3650926.

الشريط الجانبي للمقالة

محتوى المقالة الرئيسي