The Effect of Fine-tuned Word Embedding Techniques on the Accuracy of Automated Essay Scoring Systems Using Neural Networks


  • Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton
  • Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton
  • Department of Computing Science, University of Alberta, Edmonton
  • Department of Computing Science, University of Alberta, Edmonton
  • Department of Computing Science, University of Alberta, Edmonton


Automated Essay Scoring, GloVe Embeddings, Neural Networks, Word Embeddings, Word2Vec


Automated Essay Scoring (AES) using neural networks has helped increase the accuracy and efficiency of scoring students’ written tasks. Generally, the improved accuracy of neural network approaches has been attributed to the use of modern word embedding techniques. However, it remains unclear which word embedding techniques yield higher accuracy in AES systems built on neural networks. In addition, the effect of fine-tuning word embeddings on the accuracy of AES systems has not yet been established. This study investigates the effect of fine-tuned modern word embedding techniques, namely pre-trained GloVe and Word2Vec, on the accuracy of a deep learning AES model using a Long Short-Term Memory (LSTM) network. The dataset used in this study consisted of 12,978 essays introduced in the 2012 Automated Student Assessment Prize (ASAP) competition. Results show that fine-tuned word embeddings significantly improved the accuracy of AES (QWK = 0.79) compared with a baseline model without pre-trained embeddings (QWK = 0.73). Moreover, when used in AES, the pre-trained GloVe embeddings (QWK = 0.79) outperformed Word2Vec (QWK = 0.77). These results can guide future AES studies in selecting more appropriate word representations and in fine-tuning word embedding techniques for scoring-related tasks.
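The accuracy figures above are reported as Quadratic Weighted Kappa (QWK), the standard agreement metric for comparing machine-assigned and human-assigned essay scores. As an illustration only (not the authors' code), a minimal NumPy implementation of QWK for integer score scales might look like this:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two sets of integer ratings on [min_rating, max_rating]."""
    rater_a = np.asarray(rater_a, dtype=int)
    rater_b = np.asarray(rater_b, dtype=int)
    n = max_rating - min_rating + 1

    # Observed agreement matrix: O[i, j] counts essays scored i by rater A
    # and j by rater B (ratings shifted so the scale starts at 0).
    O = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating, b - min_rating] += 1

    # Expected matrix under chance agreement: outer product of the two
    # marginal histograms, normalized to the same total count as O.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

    # Quadratic disagreement weights: w[i, j] = (i - j)^2 / (n - 1)^2.
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (W * O).sum() / (W * E).sum()
```

Identical ratings give a QWK of 1.0, chance-level agreement gives roughly 0, and quadratic weighting penalizes large score discrepancies more heavily than off-by-one disagreements.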








How to Cite

Firoozi, T., Bulut, O., Epp, C. D., Naeimabadi, A., & Barbosa, D. (2022). The Effect of Fine-tuned Word Embedding Techniques on the Accuracy of Automated Essay Scoring Systems Using Neural Networks. Journal of Applied Testing Technology, 23, 21–29. Retrieved from




