The Effect of Fine-tuned Word Embedding Techniques on the Accuracy of Automated Essay Scoring Systems Using Neural Networks

Authors

  • T. Firoozi, Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton
  • O. Bulut, Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton
  • C. D. Epp, Department of Computing Science, University of Alberta, Edmonton
  • A. Naeimabadi, Department of Computing Science, University of Alberta, Edmonton
  • D. Barbosa, Department of Computing Science, University of Alberta, Edmonton

Keywords:

Automated Essay Scoring, GloVe Embedding, Neural Networks, Word Embeddings, Word2Vec

Abstract

Automated Essay Scoring (AES) using neural networks has helped increase the accuracy and efficiency of scoring students’ written tasks. Generally, the improved accuracy of neural network approaches has been attributed to the use of modern word embedding techniques. However, it remains unclear which word embedding techniques produce higher accuracy in neural network-based AES systems. In addition, the importance of fine-tuning word embeddings for the accuracy of AES systems has not yet been established. This study investigates the effect of fine-tuned modern word embedding techniques, namely pretrained GloVe and Word2Vec, on the accuracy of a deep learning AES model using a Long Short-Term Memory (LSTM) network. The dataset used in this study consisted of 12,978 essays introduced in the 2012 Automated Student Assessment Prize (ASAP) competition. Results show that fine-tuned word embedding techniques significantly improved the accuracy of the AES system (QWK = 0.79) compared with the baseline model without pretrained embeddings (QWK = 0.73). Moreover, when used in AES, the pretrained GloVe word embedding (QWK = 0.79) outperformed Word2Vec (QWK = 0.77). The results of this study can guide future AES studies in selecting more appropriate word representations and in fine-tuning word embedding techniques for scoring-related tasks.
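
To make the setup concrete, the sketch below shows, in Keras-style Python, how an LSTM scoring model can be built on top of an embedding layer that is initialized from pretrained vectors (GloVe or Word2Vec) and left trainable, which is what fine-tuning the embeddings amounts to in this kind of architecture. It is a minimal illustration under stated assumptions, not the authors' implementation: the vocabulary size, embedding dimension, sequence handling, layer sizes, and variable names are all assumptions introduced here.

```python
# Minimal sketch (NOT the authors' implementation): an LSTM essay-scoring model
# whose embedding layer is initialized from pretrained vectors (e.g., GloVe or
# Word2Vec) and kept trainable so the embeddings are fine-tuned on the scoring
# task. All sizes and names below are illustrative assumptions.
import numpy as np
import tensorflow as tf
from sklearn.metrics import cohen_kappa_score

vocab_size = 20000     # assumed vocabulary size
embedding_dim = 300    # e.g., 300-dimensional GloVe vectors
lstm_units = 128       # assumed LSTM hidden size

# In practice this matrix is filled by looking up each vocabulary word in the
# pretrained embedding file; random values stand in here for illustration.
pretrained_matrix = np.random.normal(size=(vocab_size, embedding_dim)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size,
        embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
        trainable=True,  # fine-tune the pretrained embeddings during training
    ),
    tf.keras.layers.LSTM(lstm_units),                # Long Short-Term Memory encoder
    tf.keras.layers.Dense(1, activation="sigmoid"),  # essay score scaled to [0, 1]
])
model.compile(optimizer="rmsprop", loss="mse")

# Quadratic weighted kappa (QWK), the agreement statistic reported in the abstract,
# can be computed with scikit-learn once predictions are mapped back to integer
# scores (human_scores and predicted_integer_scores are hypothetical arrays):
# qwk = cohen_kappa_score(human_scores, predicted_integer_scores, weights="quadratic")
```

The baseline condition reported above corresponds to training the same network with randomly initialized embeddings instead of pretrained vectors; QWK between machine and human scores is the agreement statistic used to compare the conditions.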

Published

2022-12-16

How to Cite

Firoozi, T., Bulut, O., Epp, C. D., Naeimabadi, A., & Barbosa, D. (2022). The Effect of Fine-tuned Word Embedding Techniques on the Accuracy of Automated Essay Scoring Systems Using Neural Networks. Journal of Applied Testing Technology, 23, 21–29. Retrieved from http://jattjournal.net/index.php/atp/article/view/172687

Issue

Section

Articles

