The rapid growth of textual data across various domains has increased the need for efficient clustering techniques capable of handling large-scale datasets. Traditional clustering methods often fail to capture semantic relationships and struggle with high-dimensional, sparse data. The present study introduces an improved document clustering technique, WEClustering++, which enhances the existing WEClustering framework by integrating fine-tuned BERT-based word embeddings. The proposed model incorporates advanced dimensionality reduction techniques and optimized clustering algorithms to improve clustering accuracy. In the present work, a BERT-large model fine-tuned on domain-specific datasets is utilized. Seven benchmark datasets spanning various domains and sizes are considered, including collections of research articles, news articles, and other domain-specific texts. Experimental evaluations on these datasets demonstrate significant performance improvements in clustering metrics, including silhouette score, purity, and adjusted Rand index (ARI). Results show increases of 45% and 67% in median silhouette score for the WEClustering_K++ (K-means-based) and WEClustering_A++ (agglomerative-based) models, respectively. Compared with the state-of-the-art model, median purity increases by 0.4% and 0.8%, and median ARI by 7% and 11%, for WEClustering_K++ and WEClustering_A++, respectively. These findings highlight the potential of fine-tuned word embeddings in bridging the gap between statistical clustering robustness and semantic understanding. The proposed approach is expected to contribute to advancements in large-scale text mining applications, including document organization, topic modelling, and information retrieval.
| Published in | Machine Learning Research (Volume 10, Issue 1) |
| DOI | 10.11648/j.mlr.20251001.14 |
| Page(s) | 32-43 |
| Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
| Copyright | Copyright © The Author(s), 2025. Published by Science Publishing Group |
Keywords: Deep Learning, Word Embedding, Large Text Data, Silhouette Score, Clustering Technique
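As a rough illustration of the pipeline the abstract describes (contextual BERT embeddings, dimensionality reduction, then K-means++-seeded or agglomerative clustering), a minimal sketch using Hugging Face Transformers and scikit-learn is given below. The model name, mean-pooling step, PCA dimensionality, and cluster counts are illustrative assumptions, not the authors' exact WEClustering++ configuration.

```python
# Minimal sketch of a BERT-embeddings + clustering pipeline.
# All settings here are assumptions for illustration, not the
# authors' exact WEClustering++ configuration.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased")
model.eval()

def embed(texts):
    """One mean-pooled BERT vector (1024-dim) per document."""
    vecs = []
    for text in texts:
        inputs = tokenizer(text, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, 1024)
        vecs.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.vstack(vecs)

docs = ["deep learning for text clustering ...",
        "word embeddings capture semantics ...",
        "stock markets fell sharply today ..."]
X = embed(docs)

# Reduce dimensionality before clustering; n_components=2 only because
# this toy corpus has three documents.
X_red = PCA(n_components=2).fit_transform(X)

# WEClustering_K++-style: K-means with k-means++ seeding.
labels_k = KMeans(n_clusters=2, init="k-means++", n_init=10,
                  random_state=0).fit_predict(X_red)
# WEClustering_A++-style: agglomerative clustering.
labels_a = AgglomerativeClustering(n_clusters=2).fit_predict(X_red)
```

In practice the reduced dimensionality and the cluster counts would be tuned per dataset, for example to the known class counts of Classic4 or 20NG.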
Table 1. Silhouette score comparison on the seven benchmark datasets.

| Data | Agglomerative [15] | K-means [15] | WEClustering_K [15] | WEClustering_A [15] | WEClustering_K++ | WEClustering_A++ |
|---|---|---|---|---|---|---|
| Articles-253 | 0.11 | 0.12 | 0.45 | 0.43 | 0.65 | 0.56 |
| Classic4 | 0.06 | 0.03 | 0.21 | 0.25 | 0.46 | 0.47 |
| Classic4-long | 0.04 | 0.01 | 0.238 | 0.21 | 0.385 | 0.391 |
| Scopus | 0.03 | 0.03 | 0.212 | 0.191 | 0.3 | 0.32 |
| Scopus-long | 0.025 | 0.032 | 0.234 | 0.212 | 0.32 | 0.28 |
| 20NG | 0.031 | 0.04 | 0.197 | 0.112 | 0.201 | 0.16 |
| 20NG-long | 0.01 | 0.03 | 0.043 | 0.091 | 0.25 | 0.16 |
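Taking the three tables to report silhouette score, purity, and ARI in the order the abstract lists them, the values above are silhouette scores: for each point, how close it is to its own cluster relative to the nearest other cluster, averaged over all points and ranging from -1 to 1. A minimal scikit-learn check on invented toy data:

```python
# Silhouette score on toy 2-D points (data invented for illustration).
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0, 0.1], [0.2, 0.0],    # cluster 0
              [3.0, 3.1], [3.2, 2.9]])   # cluster 1
labels = np.array([0, 0, 1, 1])

# Near +1 for compact, well-separated clusters; near -1 for misassigned points.
print(silhouette_score(X, labels))  # ~0.95 for this toy set
```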
Table 2. Purity comparison on the seven benchmark datasets.

| Data | Agglomerative [15] | K-means [15] | WEClustering_K [15] | WEClustering_A [15] | WEClustering_K++ | WEClustering_A++ |
|---|---|---|---|---|---|---|
| Articles-253 | 0.94 | 0.93 | 0.971 | 0.963 | 0.975 | 0.971 |
| Classic4 | 0.866 | 0.84 | 0.911 | 0.925 | 0.89 | 0.912 |
| Classic4-long | 0.911 | 0.81 | 0.958 | 0.94 | 0.985 | 0.991 |
| Scopus | 0.71 | 0.87 | 0.975 | 0.91 | 0.88 | 0.892 |
| Scopus-long | 0.41 | 0.69 | 0.722 | 0.722 | 0.76 | 0.71 |
| 20NG | 0.12 | 0.574 | 0.534 | 0.622 | 0.701 | 0.66 |
| 20NG-long | 0.94 | 0.93 | 0.971 | 0.963 | 0.975 | 0.971 |
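Purity, reported above, is the fraction of documents that land in their cluster's majority class. scikit-learn has no purity function of its own, but it follows directly from a contingency matrix; the labels below are invented for illustration:

```python
# Purity from a contingency matrix (toy labels invented for illustration).
from sklearn.metrics.cluster import contingency_matrix

def purity(true_labels, cluster_labels):
    """Fraction of documents assigned to their cluster's majority class."""
    m = contingency_matrix(true_labels, cluster_labels)
    return m.max(axis=0).sum() / m.sum()

true = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
print(purity(true, pred))  # 5 of 6 documents match their cluster majority: ~0.83
```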
Table 3. Adjusted Rand index (ARI) comparison on the seven benchmark datasets.

| Data | Agglomerative [15] | K-means [15] | WEClustering_K [15] | WEClustering_A [15] | WEClustering_K++ | WEClustering_A++ |
|---|---|---|---|---|---|---|
| Articles-253 | 0.961 | 0.98 | 0.971 | 0.989 | 0.985 | 0.967 |
| Classic4 | 0.854 | 0.69 | 0.932 | 0.947 | 0.96 | 0.943 |
| Classic4-long | 0.921 | 0.598 | 0.847 | 0.96 | 0.915 | 0.931 |
| Scopus | 0.743 | 0.847 | 0.925 | 0.851 | 0.967 | 0.952 |
| Scopus-long | 0.621 | 0.511 | 0.71 | 0.672 | 0.76 | 0.8 |
| 20NG | 0.162 | 0.164 | 0.434 | 0.302 | 0.601 | 0.56 |
| 20NG-long | 0.09 | 0.062 | 0.202 | 0.191 | 0.521 | 0.6 |
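The adjusted Rand index in the third table measures chance-corrected agreement between predicted clusters and ground-truth classes, and unlike purity it is invariant to how cluster ids are numbered. A quick scikit-learn check on invented labels:

```python
# Adjusted Rand index on toy assignments (labels invented for illustration).
from sklearn.metrics import adjusted_rand_score

true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]  # identical grouping, permuted cluster ids
print(adjusted_rand_score(true, pred))  # 1.0: relabeling does not matter

noisy = [0, 1, 1, 0, 2, 2]  # two points swapped between clusters
print(adjusted_rand_score(true, noisy))  # well below 1.0 (~0.17)
```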
| ARI | Adjusted Rand Index |
| BERT | Bidirectional Encoder Representations from Transformers |
| IDF | Inverse Document Frequency |
| TF | Term Frequency |
| WEClustering++ | Word Embeddings Clustering++ |
| [1] | Novel coronavirus resource directory (2020) (accessed 8 February 2025) |
| [2] | Johnson R, Watkinson A, Mabe M (2018) The STM report: an overview of scientific and scholarly publishing, 5th edn |
| [3] | Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT press, New York |
| [4] | Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for idf. J Doc 60(5): 503–520 |
| [5] | Hotho A, Staab S, Stumme G (2003) Ontologies improve text document clustering. In: Third IEEE international conference on data mining, pp 541–544. IEEE. |
| [6] | Mehta V, Bawa S, Singh J (2021) Stamantic clustering: combining statistical and semantic features for clustering of large text datasets. Expert Syst Appl 174: 114710 |
| [7] | Sedding J, Kazakov D (2004) Wordnet-based text document clustering. In: Proceedings of the 3rd workshop on robust methods in analysis of natural language data. Association for Computational Linguistics, pp 104–113 |
| [8] | Wei T, Lu Y, Chang H, Zhou Q, Bao X (2015) A semantic approach for text clustering using wordnet and lexical chains. Expert Syst Appl 42(4): 2264–2275 |
| [9] | Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv: 1301.3781 |
| [10] | Miller GA (1995) Wordnet: a lexical database for English. Commun ACM 38(11): 39–41 |
| [11] | Chang WC, Yu HF, Zhong K, Yang Y, Dhillon I (2019) X-BERT: extreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv: 1905.02331 |
| [12] | Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543 |
| [13] | Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5: 135–146 |
| [14] | Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv: 1810.04805 |
| [15] | Mehta V, Bawa S, Singh J (2021) WEClustering: word embeddings based text clustering technique for large datasets. Complex Intell Syst 7(6): 3211–3224 |
| [16] | Almeida F, Xexéo G (2019) Word embeddings: a survey. arXiv preprint arXiv: 1901.09069 |
| [17] | Bakarov A (2018) A survey of word embeddings evaluation methods. arXiv preprint arXiv: 1801.09536 |
| [18] | Camacho-Collados J, Pilehvar MT (2018) From word to sense embeddings: a survey on vector representations of meaning. J Artif Intell Res 63: 743–788 |
| [19] | Wang S, Zhou W, Jiang C (2020) A survey of word embeddings based on deep learning. Computing 102(3): 717–740 |
| [20] | Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12): 4743–4759 |
| [21] | Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World wide web, pp 1177–1178 |
| [22] | Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam |
| [23] | Nielsen F (2016) Hierarchical clustering. Introduction to HPC with MPI for data science. Springer, New York, pp 195–211 |
| [24] | Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv: 1810.04805 |
| [25] | Arthur D, Vassilvitskii S (2006) k-means++: the advantages of careful seeding. Technical report, Stanford |
| [26] | Soares VH. Downloads. Available from: https://vhasoares.github.io/downloads.html (accessed 18 November 2024) |
| [27] | Langley J. 20 Newsgroups Dataset. Available from: http://qwone.com/~jason/20Newsgroups/ (accessed 18 November 2024) |
| [28] | Shahapure KR, Nicholas C (2020) Cluster quality analysis using silhouette score. In: 2020 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE |
| [29] | Santos JM, Embrechts M (2009) On the use of the adjusted rand index as a metric for evaluating supervised classification. International conference on artificial neural networks. Springer, New York, pp 175–184 |
| [30] | Manning CD, Schütze H, Raghavan P (2008) Introduction to information retrieval. Cambridge University Press, Cambridge |
APA Style
Sutrakar, V. K., Mogre, N. (2025). An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets. Machine Learning Research, 10(1), 32-43. https://doi.org/10.11648/j.mlr.20251001.14