Research Article | Peer-Reviewed

An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets

Received: 14 March 2025     Accepted: 31 March 2025     Published: 29 April 2025
Abstract

The rapid growth of textual data in various domains has increased the need for efficient clustering techniques capable of handling large-scale datasets. Traditional clustering methods often fail to capture semantic relationships and struggle with high-dimensional, sparse data. The present study presents an improved document clustering technique, i.e., WEClustering++, which enhances the existing WEClustering framework by integrating fine-tuned BERT-based word embeddings. The proposed model incorporates advanced dimensionality reduction techniques and optimized clustering algorithms to improve clustering accuracy. In the present work, the BERT-large model, fine-tuned on domain-specific datasets, is utilized. Seven benchmark datasets spanning various domains and sizes are considered. These datasets include collections of research articles, news articles, and other domain-specific texts. Experimental evaluations on multiple benchmark datasets demonstrate significant performance improvements in clustering metrics, including silhouette score, purity, and ARI. Results show 45% and 67% increases in median silhouette scores for the WEClustering_K++ (K-means-based) and WEClustering_A++ (Agglomerative-based) models, respectively. Results also show increases in median purity of 0.4% and 0.8% for the proposed WEClustering_K++ and WEClustering_A++ compared to the state-of-the-art model. Likewise, increases in median ARI of 7% and 11% are obtained for the proposed WEClustering_K++ and WEClustering_A++ compared to the state-of-the-art model. These findings highlight the potential of fine-tuned word embeddings in bridging the gap between statistical clustering robustness and semantic understanding. The proposed approach is expected to contribute to advancements in large-scale text mining applications, including document organization, topic modelling, and information retrieval.

Published in Machine Learning Research (Volume 10, Issue 1)
DOI 10.11648/j.mlr.20251001.14
Page(s) 32-43
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

Deep Learning, Word Embedding, Large Text Data, Silhouette Score, Clustering Technique

1. Introduction
The recent explosion of textual data in digital form has posed significant challenges for efficient information retrieval and processing. For example, Elsevier's repository contains over 37,000 articles on COVID-19 alone, and the number of articles published in English-language journals continues to grow. Clustering, a basic data grouping approach, is frequently utilized in applications including document organization, social news clustering, and web search result clustering. It is also used as a preliminary step for tasks like sentiment analysis, topic extraction, and multi-document summarization. Traditional document clustering approaches are based on the bag-of-words (BOW) model, which uses scoring schemes such as TF or TF-IDF to represent documents numerically. However, such approaches have major limitations. They fail to capture semantic relationships, struggle with polysemy and synonymy, and suffer from the curse of dimensionality, particularly for large datasets with high sparsity. Ontology-based solutions like WordNet partially address semantic issues; however, they are limited by language coverage and design constraints. To tackle these challenges, word embeddings such as Word2Vec, GloVe, and FastText provide dense, distributed representations of words. However, these embeddings are static and do not account for context-specific meanings. Recently, BERT has emerged as a breakthrough model, generating contextual embeddings that adapt based on input context. Fine-tuning BERT on domain-specific datasets further enhances its ability to capture nuanced semantics, making it a robust tool for clustering and other text mining tasks.
Recently, the WEClustering technique was proposed for document clustering; it integrates word embeddings with clustering methods to improve clustering performance. It begins with pre-processing, where documents are converted to lowercase and split into sentences, followed by tokenization, stop-word removal, and punctuation filtering. BERT is used to extract word embeddings, creating 1024-dimensional vectors for each word. These embeddings are clustered using Mini-Batch K-Means to form word clusters. Next, a concept-document (CD) matrix is generated by mapping the TF-IDF scores of words to their respective cluster centers. The resulting CD matrix is normalized and clustered using algorithms such as Agglomerative Clustering or K-Means to produce document clusters.
In this paper, an improved document clustering technique, i.e., WEClustering++, that leverages word embeddings derived from a fine-tuned BERT model is proposed. The proposed model addresses the issue of dimensionality, incorporates contextual semantics, and demonstrates high accuracy. The effectiveness of the proposed approach is demonstrated using multiple datasets and performance metrics, highlighting its suitability for large-scale text datasets. The details of the proposed model are provided in Section 2. Section 3 presents the results and discussion, followed by concluding remarks in Section 4.
Figure 1. Flow chart of improved deep learning model for word embedding based clustering for large text datasets.
2. Methodology
2.1. Model Details
The proposed model builds upon the WEClustering framework by incorporating fine-tuning of the BERT model, enabling the system to specialize for specific domains or applications. This enhancement allows the model to capture domain-specific nuances in text, making the clustering process more precise and relevant. The foundational steps of WEClustering are otherwise retained. The details of the embedding fine-tuning process are shown in Figure 1.
The first phase, preprocessing, involves preparing the text data for processing by the fine-tuned BERT model. This begins with lowercasing all text to ensure uniformity, followed by sentence splitting, as BERT operates at the sentence level to preserve contextual relationships. In the second phase, the pre-processed text is passed to the fine-tuned BERT model, which generates embeddings that are contextually rich and domain-specific, aligning with the semantics of the targeted application. The third phase, embedding extraction and filtration, involves passing the pre-processed text through the fine-tuned BERT model. Unlike the original WEClustering, which uses a general pre-trained BERT model, the proposed model employs a domain-specific fine-tuned version. This enables better capture of word meanings and relationships relevant to the domain. The filtration step removes noise by excluding embeddings associated with stop words, punctuation, and digits, resulting in a refined set of embeddings that retain both statistical and semantic relevance.
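As a minimal sketch of these three phases (assuming the Hugging Face transformers library, NLTK for sentence splitting and stop words, and a placeholder checkpoint path bert-large-finetuned for the domain-specific fine-tuned model), the snippet below lowercases and sentence-splits a document, extracts 1024-dimensional token embeddings, and filters out stop words, punctuation, and digits; sub-word handling is deliberately simplified.

```python
import string

import torch
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-finetuned")  # placeholder path
model = BertModel.from_pretrained("bert-large-finetuned")
model.eval()

FILTER = set(stopwords.words("english")) | set(string.punctuation)

def extract_word_embeddings(document):
    """Return {word: 1024-dim vector} for the content words of one document."""
    vectors = {}
    for sentence in sent_tokenize(document.lower()):        # phase 1: lowercase + split
        enc = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]       # phases 2-3: (tokens, 1024)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        for tok, vec in zip(tokens, hidden):
            word = tok.lstrip("#")                           # crude sub-word merge
            if tok in ("[CLS]", "[SEP]") or word in FILTER or word.isdigit():
                continue                                     # filtration step
            vectors.setdefault(word, vec.numpy())
    return vectors
```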
The fourth phase focuses on clustering the word embeddings. Mini-Batch K-Means is used to group embeddings into clusters, each representing a concept or theme derived from the text. The Elbow method is used to obtain the optimal number of clusters, ensuring that the generated concepts are meaningful and representative of the corpus. By leveraging fine-tuned embeddings, these concepts become more cohesive and domain-relevant compared to those in the original WEClustering framework. This step also reduces the vocabulary size, transforming it from tens of thousands of words to a concise set of concepts. In the fifth phase, a Concept-Document (CD) matrix is constructed to represent documents in terms of the identified concepts rather than individual words. A scoring mechanism combining TF-IDF with the frequency of words from each concept calculates the relevance of concepts to each document. The fine-tuned embeddings ensure that these concepts are semantically aligned with the documents, creating an accurate and meaningful matrix representation.
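A compact sketch of phases four and five is given below, assuming the per-word vectors from the previous phase have been pooled over the whole corpus into vocab_vectors. Word vectors are grouped into k_voc concepts with Mini-Batch K-Means, and each word's TF-IDF weight is added to the column of its concept; the paper does not spell out the exact scoring rule, so this aggregation is one plausible reading rather than the authors' exact formula.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def build_cd_matrix(documents, vocab_vectors, k_voc, batch_size=1024):
    """vocab_vectors: {word: embedding} pooled over the whole corpus."""
    words = list(vocab_vectors)
    X_words = np.vstack([vocab_vectors[w] for w in words])

    # Phase 4: cluster word embeddings into k_voc "concepts"
    km = MiniBatchKMeans(n_clusters=k_voc, batch_size=batch_size, random_state=0)
    concept_of = dict(zip(words, km.fit_predict(X_words)))

    # Phase 5: fold TF-IDF scores into a (documents x concepts) CD matrix
    tfidf = TfidfVectorizer(vocabulary=words, lowercase=True)
    weights = tfidf.fit_transform(documents)                 # (docs, |vocab|), sparse
    cd = np.zeros((len(documents), k_voc))
    for j, word in enumerate(tfidf.get_feature_names_out()):
        cd[:, concept_of[word]] += weights[:, j].toarray().ravel()
    return cd
```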
In the final step, the documents are clustered using algorithms such as K-Means or Agglomerative Clustering, based on the CD matrix. The dimensionality of the input is substantially reduced compared to traditional word-based approaches, leading to more cohesive and well-separated clusters. Fine-tuned embeddings enhance the cohesion and representational accuracy of the clusters, ultimately resulting in well-defined and meaningful document groupings. The improved models developed in the present work are referred to as WEClustering_K++ (based on K-means) and WEClustering_A++ (based on Agglomerative clustering) in the rest of the document. This comprehensive framework demonstrates the advantages of integrating domain-specific fine-tuning into the existing WEClustering process.
2.2. Model Parameters
In general, two BERT models are available for generating embedding vectors in the embedding extraction phase. The BERT-base model generates embeddings of size 768 and is available in both case-sensitive and case-insensitive versions. The BERT-large model generates embeddings of size 1024 and is used here in its case-sensitive version. In the present work, the BERT-large model, fine-tuned on domain-specific datasets, is utilized. The fine-tuning ensures that the generated embeddings are not only rich in contextual meaning but also more aligned with domain-specific semantics, enhancing the clustering process. The use of higher-dimensional embeddings further improves the representation of word semantics.
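The paper does not detail the fine-tuning procedure itself, so the sketch below shows one common option: continued masked-language-model training of BERT-large on domain text using the Hugging Face transformers and datasets libraries. The hyper-parameters and the output path bert-large-finetuned are illustrative placeholders rather than the authors' actual settings.

```python
from datasets import Dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")
model = BertForMaskedLM.from_pretrained("bert-large-cased")

domain_texts = ["..."]  # domain-specific corpus goes here
ds = Dataset.from_dict({"text": domain_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-large-finetuned",  # placeholder path
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("bert-large-finetuned")  # reused in the embedding extraction phase
```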
In the word-embedding clustering phase, the Mini-Batch K-Means algorithm is employed. The two key parameters in this phase are (i) the number of word clusters (Kvoc) and (ii) the batch size (b). The value of Kvoc is determined using the Elbow method, which ensures optimal clustering. For example, for the Articles-253 dataset, Kvoc was determined as 35 in the original WEClustering work, whereas with the fine-tuned embeddings in the present work a value of 30 is obtained, as shown in Figure 2. The Kvoc values for the remaining datasets are shown in Figure 3. The incorporation of fine-tuned embeddings often leads to smaller, more cohesive clusters, reflecting the improved semantic alignment of the words. The batch-size parameter b is chosen to balance execution efficiency and accuracy, following the original WEClustering work.
Figure 2. An example of the Articles-253 dataset for finding Kvoc using the Elbow method.
Figure 3. Kvoc considered for different datasets.
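A minimal way to reproduce this elbow analysis is to sweep candidate values of Kvoc with Mini-Batch K-Means and record the inertia for each; the elbow is read off where the curve flattens, as in Figure 2. The candidate range and batch size below are illustrative, not the values used in the paper.

```python
from sklearn.cluster import MiniBatchKMeans

def kvoc_elbow_curve(word_vectors, k_values=range(10, 101, 5), batch_size=1024):
    """Return (k, inertia) pairs; the elbow is where the curve flattens,
    e.g. Kvoc = 30 for Articles-253 with the fine-tuned embeddings."""
    curve = []
    for k in k_values:
        km = MiniBatchKMeans(n_clusters=k, batch_size=batch_size, random_state=0)
        curve.append((k, km.fit(word_vectors).inertia_))
    return curve
```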
In the final phase, document clustering is performed using algorithms such as K-means or agglomerative clustering. For agglomerative clustering, Ward linkage is used as the linkage criterion. For K-means clustering, the number of clusters (c) corresponds to the number of categories in the dataset. To improve the initialization of centroids, the K-means++ method is utilized, and the algorithm is executed 20 times to ensure the best clustering results.
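Under these settings, the final phase can be sketched with scikit-learn as follows; the row normalization of the CD matrix and the helper's name are assumptions for illustration, while the parameters (k-means++ initialization, 20 restarts, Ward linkage, c clusters) follow the description above.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import normalize

def cluster_documents(cd_matrix, c, method="kmeans"):
    """c = number of categories in the dataset."""
    X = normalize(cd_matrix)                 # assumed row normalization of the CD matrix
    if method == "kmeans":                   # WEClustering_K++
        model = KMeans(n_clusters=c, init="k-means++", n_init=20, random_state=0)
    else:                                    # WEClustering_A++
        model = AgglomerativeClustering(n_clusters=c, linkage="ward")
    return model.fit_predict(X)
```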
The integration of fine-tuning in WEClustering significantly impacts the value of Kvoc. Fine-tuned embeddings provide better semantic alignment and reduce redundancy, resulting in more meaningful and cohesive word clusters. Consequently, Kvoc decreases compared to the original WEClustering approach, as the finer embeddings better represent the underlying concepts in the text.
2.3. Datasets
The evaluation of the improved deep learning model for word-embedding-based clustering of large text datasets is conducted using seven benchmark datasets spanning various domains and sizes (refer to Figure 4 for further details). These datasets include collections of research articles, news articles, and other domain-specific texts. The details of each dataset are given next.
1. Articles-253: A collection of 253 research articles across five categories, such as "Mobile Computing," "Political Science," and "Weather Review," including titles, abstracts, and references.
2. Scopus: Contains 500 articles, evenly distributed into five categories like "Concrete," "Hyperactivity," and "Tectonic Plates," with titles and abstracts sourced from the Scopus database.
3. 20NG: A subset of the 20 Newsgroups dataset with 700 documents from four categories: "Atheism," "Religion," "Graphics," and "Space."
4. Classic4: Comprises 800 research articles across three main domains—"Aerodynamics," "Medical," and "Algorithms."
5. Scopus-long: A larger version of the Scopus dataset with 2800 articles spread across seven categories, including "Neural Networks," "Protons," and "Photosynthesis."
6. Classic4-long: An expanded version of Classic4 with 3891 documents.
7. 20NG-long: A larger subset of 20 Newsgroups with 8131 documents spanning nine categories like "Motorcycles," "Politics," "Hockey," and "Electronics."
Figure 4. Total documents and categories considered for the different datasets.
2.4. Performance Metrics
Clustering quality can be evaluated using various metrics that are broadly classified into external and internal categories, depending on the availability of ground truth labels. External metrics are used when true labels are available, while internal metrics are applicable when true labels are not known. In the present work, three metrics are used to assess the quality of clustering, i.e. (a) silhouette coefficient, (b) ARI, and (c) purity.
The silhouette coefficient is an internal metric that evaluates how well-separated and compact the clusters are. It is particularly useful when true labels are unavailable. Its values range from -1 to +1, where higher values indicate that clusters are dense and well-separated, while values closer to -1 suggest poor clustering. The ARI is an external metric used to assess clustering quality when true labels are present. ARI measures the similarity between the clustering results and the true labels while adjusting for random chance. A score near +1 indicates strong agreement with the true labels, whereas a score near -1 signifies that the clustering is random or completely dissimilar to the actual labels. Purity, another external metric, measures clustering quality by determining how well each cluster corresponds to a single class. Higher purity values indicate that the clusters align well with the true class labels. Together, these metrics provide a comprehensive evaluation of clustering performance, enabling the assessment of both internal structure and alignment with external benchmarks when true labels are available.
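As a sketch, the three metrics can be computed with scikit-learn as shown below; purity has no built-in implementation there, so it is derived from the contingency matrix in the standard way.

```python
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.metrics.cluster import contingency_matrix

def purity_score(labels_true, labels_pred):
    """Fraction of documents assigned to the majority true class of their cluster."""
    m = contingency_matrix(labels_true, labels_pred)
    return m.max(axis=0).sum() / m.sum()

def evaluate(X, labels_true, labels_pred):
    return {
        "silhouette": silhouette_score(X, labels_pred),        # internal, -1 to +1
        "ARI": adjusted_rand_score(labels_true, labels_pred),  # external, chance-adjusted
        "purity": purity_score(labels_true, labels_pred),      # external, 0 to 1
    }
```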
3. Results and Discussions
The WEClustering_K++ method exhibits notable enhancements in silhouette coefficient over WEClustering_K across all datasets, as shown in Figure 5(a) and Table 1. In the Articles-253 dataset, the proposed model achieves a silhouette score of 0.65, a significant 44.4% increase over WEClustering_K. Classic4 also benefits, with the score rising from 0.21 to 0.46, reflecting a 119% improvement. Classic4-long experiences a boost from 0.238 to 0.385, equating to a 61.7% increase. In the Scopus dataset, the improvement stands at 41.5%, while Scopus-long sees a 36.8% rise. The most striking gain is in the 20NG-long dataset, where the silhouette score surges from 0.043 to 0.25, an impressive 481% increase. Even in the 20NG dataset, a smaller but noticeable improvement of 2% is recorded. Similarly, WEClustering_A++ surpasses WEClustering_A, showing a 53.5% gain (0.43 to 0.56) in Articles-253, an 88% increase in Classic4, and an 86% rise in Classic4-long. Scopus and Scopus-long also benefit, with improvements of 67.5% and 32%, respectively. The 20NG dataset improves by 42.8%, underscoring the superior performance of the proposed approach. Overall, increases in median silhouette score of 45% and 67% are obtained for the proposed WEClustering_K++ and WEClustering_A++ over the state-of-the-art WEClustering_K and WEClustering_A models, respectively. For WEClustering_K++, the minimum increase in silhouette score is +2% (20NG dataset) and the maximum is +481% (20NG-long dataset). For WEClustering_A++, the minimum increase is +32% (Scopus-long dataset) and the maximum is +88% (Classic4 dataset), as shown in Figure 6(a).
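For reference, the percentage gains quoted here are relative changes with respect to the baseline score; for example, for Articles-253 with WEClustering_K++: (0.65 - 0.45) / 0.45 × 100% ≈ 44.4%.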
In terms of purity scores, the proposed models generally outperform WEClustering, as shown in Figure 5(b) and Table 2. WEClustering_K++ achieves a purity score of 0.975 in the Articles-253 dataset, slightly surpassing WEClustering_K's 0.971. However, in Classic4, WEClustering_K++ records a slightly lower score (0.89) than WEClustering_K's 0.911. Classic4-long shows a significant increase, with purity rising from 0.958 to 0.985. Scopus sees a reduction from 0.975 to 0.88, while Scopus-long experiences an increase from 0.722 to 0.76. The 20NG dataset demonstrates a notable improvement, with the purity score increasing from 0.534 to 0.701, while 20NG-long records a drop from 0.397 to 0.321. WEClustering_A++ also shows broadly consistent improvements, with Articles-253 increasing from 0.963 to 0.971, Classic4 showing a slight decrease from 0.925 to 0.912, and Classic4-long showing a remarkable jump from 0.94 to 0.991. Scopus-long remains stable at 0.722, while the 20NG dataset sees a 6.1% rise. The 20NG-long dataset experiences the most substantial gain, improving from 0.291 to 0.46, reinforcing the efficacy of the proposed clustering technique. Overall, increases in median purity of 0.4% and 0.8% are obtained for the proposed WEClustering_K++ and WEClustering_A++ over the state-of-the-art WEClustering_K and WEClustering_A models, respectively. For WEClustering_K++, the minimum change in purity is -19.1% (20NG-long dataset) and the maximum increase is +31.3% (20NG dataset). For WEClustering_A++, the minimum change is -1.41% (Classic4 dataset) and the maximum increase is +58% (20NG-long dataset), as shown in Figure 6(b).
The WEClustering_K++ and WEClustering_A++ models also demonstrate significant advantages in terms of ARI, as shown in Figure 5(c) and Table 3. In the Articles-253 dataset, WEClustering_K++ attains an ARI of 0.985, a slight increase over WEClustering_K's 0.971. The Classic4 dataset records a rise from 0.932 to 0.96, whereas Classic4-long sees a slight drop from 0.96 to 0.915. The Scopus dataset showcases an improvement, with ARI increasing from 0.925 to 0.967, while Scopus-long advances from 0.71 to 0.76. The most prominent improvement is observed in the 20NG dataset, where the ARI score jumps from 0.434 to 0.601. Similarly, the 20NG-long dataset exhibits a sharp increase from 0.202 to 0.521. WEClustering_A++ also delivers strong performance gains overall, although Articles-253 experiences a slight drop from 0.989 to 0.967. The Classic4 dataset shows only a minor decline, while Classic4-long sees an increase from 0.847 to 0.931. Scopus-long progresses from 0.672 to 0.8, while 20NG rises from 0.302 to 0.56. The highest improvement is observed in the 20NG-long dataset, where the ARI score jumps from 0.191 to 0.6, demonstrating the effectiveness of the proposed clustering method. Overall, increases in median ARI of 7% and 11% are obtained for the proposed WEClustering_K++ and WEClustering_A++ over the state-of-the-art WEClustering_K and WEClustering_A models, respectively. For WEClustering_K++, the minimum change in ARI is -4.69% (Classic4-long dataset) and the maximum increase is +157.92% (20NG-long dataset). For WEClustering_A++, the minimum change is -2.23% (Articles-253 dataset) and the maximum increase is +214.14% (20NG-long dataset), as shown in Figure 6(c). Overall, the WEClustering_K++ and WEClustering_A++ models consistently deliver superior results compared to their baseline counterparts across all three evaluation metrics.
Figure 5. Variation of (a) Silhouette coefficient, (b) purity, and (c) ARI for different dataset and for different clustering techniques.
Table 1. Silhouette coefficient of state-of-the-art models versus proposed models for different datasets.

Data           Agglomerative   K-means   WEClustering_K   WEClustering_A   WEClustering_K++   WEClustering_A++
Articles-253   0.11            0.12      0.45             0.43             0.65               0.56
Classic4       0.06            0.03      0.21             0.25             0.46               0.47
Classic4-long  0.04            0.01      0.238            0.21             0.385              0.391
Scopus         0.03            0.03      0.212            0.191            0.3                0.32
Scopus-long    0.025           0.032     0.234            0.212            0.32               0.28
20NG           0.031           0.04      0.197            0.112            0.201              0.16
20NG-long      0.01            0.03      0.043            0.091            0.25               0.16

Table 2. Purity values of state-of-the-art models versus proposed models for different datasets.

Data           Agglomerative   K-means   WEClustering_K   WEClustering_A   WEClustering_K++   WEClustering_A++
Articles-253   0.94            0.93      0.971            0.963            0.975              0.971
Classic4       0.866           0.84      0.911            0.925            0.89               0.912
Classic4-long  0.911           0.81      0.958            0.94             0.985              0.991
Scopus         0.71            0.87      0.975            0.91             0.88               0.892
Scopus-long    0.41            0.69      0.722            0.722            0.76               0.71
20NG           0.12            0.574     0.534            0.622            0.701              0.66
20NG-long      0.94            0.93      0.971            0.963            0.975              0.971

Table 3. ARI values of state-of-the-art models versus proposed models for different datasets.

Data           Agglomerative   K-means   WEClustering_K   WEClustering_A   WEClustering_K++   WEClustering_A++
Articles-253   0.961           0.98      0.971            0.989            0.985              0.967
Classic4       0.854           0.69      0.932            0.947            0.96               0.943
Classic4-long  0.921           0.598     0.847            0.96             0.915              0.931
Scopus         0.743           0.847     0.925            0.851            0.967              0.952
Scopus-long    0.621           0.511     0.71             0.672            0.76               0.8
20NG           0.162           0.164     0.434            0.302            0.601              0.56
20NG-long      0.09            0.062     0.202            0.191            0.521              0.6

Figure 6. Percentage variations in (a) silhouette score, (b) purity, and (c) ARI of WEClustering_K vs WEClustering_K++ and WEClustering_A vs WEClustering_A++.
4. Conclusions
The proposed WEClustering++ models significantly improve document clustering by integrating fine-tuned BERT embeddings, optimized dimensionality reduction, and enhanced clustering techniques. The model effectively captures semantic nuances by leveraging domain-specific contextual embeddings, leading to improved clustering quality compared to traditional and existing state-of-the-art methods. Experimental results across multiple benchmark datasets demonstrate that WEClustering++ achieves higher silhouette scores, purity, and ARI, highlighting its ability to generate well-separated and semantically meaningful clusters. The observed performance gains validate the importance of contextualized word embeddings and optimized clustering techniques in large-scale text mining tasks. The proposed model may be further extended to multilingual datasets, real-time clustering applications, and integration with adaptive learning models to further enhance scalability and efficiency.
Abbreviations

ARI: Adjusted Rand Index
BERT: Bidirectional Encoder Representations from Transformers
IDF: Inverse Document Frequency
TF: Term Frequency
WEClustering++: Word Embeddings Clustering++

Acknowledgments
The authors would like to thank Shri. Y Dilip, Director, Aeronautical Development Establishment, Mr. Manjunath S M, Technology Director, and Mr. Diptiman Biswas, Group Director, for their support during the research work carried out at ADE, DRDO.
Author Contributions
Vijay Kumar Sutrakar: Conceptualization, Methodology, Supervision, Writing – review & editing
Nikhil Mogre: Conceptualization, Formal Analysis, Software, Validation, Visualization, Writing – original draft
Conflicts of Interest
The authors declare no conflicts of interest.
References
[1] Novel coronavirus resource directory (2020) Accessed Feb 08, 2025
[2] Johnson R, Watkinson A, Mabe M (2018) The stm report. An overview of scientific and scholarly publishing, 5th Ed.
[3] Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT press, New York
[4] Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for idf. J Doc 60(5): 503–520
[5] Hotho A, Staab S, Stumme G (2003) Ontologies improve text document clustering. In: Third IEEE international conference on data mining, pp 541–544. IEEE.
[6] Mehta V, Bawa S, Singh J (2021) Stamantic clustering: combining statistical and semantic features for clustering of large text datasets. Expert Syst Appl 174: 114710
[7] Sedding J, Kazakov D (2004) Wordnet-based text document clustering. In: proceedings of the 3rd workshop on robust methods in analysis of natural language data. Association for Computational Linguistics, pp 104–113
[8] Wei T, Lu Y, Chang H, Zhou Q, Bao X (2015) A semantic approach for text clustering using wordnet and lexical chains. Expert Syst Appl 42(4): 2264–2275
[9] Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv: 1301.3781
[10] Miller GA (1995) Wordnet: a lexical database for English. Commun ACM 38(11): 39–41
[11] Chang WC, Yu HF, Zhong K, Yang Y, Dhillon I (2019) Xbert: extreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv: 1905.02331
[12] Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
[13] Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5: 135–146.
[14] Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv: 1810.04805
[15] Mehta, V., Bawa, S., & Singh, J. (2021). WEClustering: Word embeddings based text clustering technique for large datasets. Complex & Intelligent Systems, 7(6), 3211–3224.
[16] Almeida F, Xexéo G (2019) Word embeddings: a survey. arXiv preprint arXiv: 1901.09069
[17] Bakarov A (2018) A survey of word embeddings evaluation methods. arXiv preprint arXiv: 1801.09536
[18] Camacho-Collados J, Pilehvar MT (2018) From word to sense embeddings: a survey on vector representations of meaning. J Artif Intell Res 63: 743–788
[19] Wang S, Zhou W, Jiang C (2020) A survey of word embeddings based on deep learning. Computing 102(3): 717–740
[20] Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12): 4743–4759
[21] Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World wide web, pp 1177–1178
[22] Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
[23] Nielsen F (2016) Hierarchical clustering. Introduction to HPC with MPI for data science. Springer, New York, pp 195–211
[24] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv: 1810.04805.
[25] Arthur D, Vassilvitskii S (2006) k-means++: the advantages of careful seeding. Technical report, Stanford
[26] Soares VH. Downloads. Available from:
[27] Langley J. 20 Newsgroups Dataset. Available from:
[28] Shahapure KRS, Nicholas C (2020) Cluster quality analysis using silhouette score. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA).
[29] Santos JM, Embrechts M (2009) On the use of the adjusted rand index as a metric for evaluating supervised classification. International conference on artificial neural networks. Springer, New York, pp 175–184
[30] Manning CD, Schütze H, Raghavan P (2008) Introduction to information retrieval. Cambridge University Press, Cambridge