The Effect of Fine-tuned Word Embedding Techniques on the Accuracy of Automated Essay Scoring Systems Using Neural Networks
Keywords: Automated Essay Scoring, GloVe Embedding, Neural Networks, Word Embeddings, Word2Vec
Abstract: Automated Essay Scoring (AES) using neural networks has helped increase the accuracy and efficiency of scoring students' written tasks. The improved accuracy of neural network approaches is generally attributed to the use of modern word embedding techniques. However, it remains unclear which word embedding techniques produce higher accuracy in neural AES systems. In addition, the importance of fine-tuning word embeddings for the accuracy of AES systems has not yet been established. This study investigates the effect of fine-tuned modern word embedding techniques, including pretrained GloVe and Word2Vec, on the accuracy of a deep learning AES model using a Long Short-Term Memory (LSTM) network. The dataset consisted of 12,978 essays introduced in the 2012 Automated Student Assessment Prize (ASAP) competition. Results show that fine-tuned word embedding techniques could significantly improve the accuracy of the AES model (quadratic weighted kappa, QWK = 0.79) compared with the baseline model without pretrained embeddings (QWK = 0.73). Moreover, when used in AES, the pretrained GloVe word embedding (QWK = 0.79) outperformed Word2Vec (QWK = 0.77). The results of this study can guide future AES studies in selecting more appropriate word representations and in fine-tuning word embedding techniques for scoring-related tasks.
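The QWK figures reported above measure agreement between model scores and human scores, penalizing each disagreement by the squared distance between the two scores. The abstract does not describe how the metric was computed, so the following is only a minimal sketch of the standard quadratic weighted kappa definition. The function name, the score range, and the toy score lists are illustrative assumptions, not the study's code or data.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa for two equal-length lists of integer scores.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum distance.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters used one identical score throughout; treating this as
        # perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    human = [1, 2, 3, 4]   # toy human scores (assumed)
    model = [1, 2, 3, 3]   # toy model scores (assumed)
    print(quadratic_weighted_kappa(human, model, 1, 4))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. On this scale, the reported gap between 0.73 and 0.79 reflects fewer or smaller disagreements with human raters.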