Evaluating Coherence in Writing: Comparing the Capacity of Automated Essay Scoring Technologies



  • College of Education, University of Florida
  • Department of Educational Psychology, University of Alberta


Attribute-Specific Scoring, Automated Essay Scoring, Coherence Scoring, Deep-Neural Automated Essay Scoring


Automated Essay Scoring (AES) technologies provide innovative solutions for scoring written essays in much less time and at a fraction of the current cost. Traditionally, AES has emphasized capturing the “coherence” of writing because abundant evidence indicates a connection between coherence and overall writing quality. Yet few studies have investigated the capacity of modern and traditional AES technologies to capture sequential information (i.e., cohesion). In this study, we investigate the performance of traditional and modern AES systems in attribute-specific scoring. Traditional AES focuses on holistic scoring, with limited application to attribute-specific scoring. Hence, the current study examines whether a deep-neural AES system using a convolutional neural network approach can outperform a traditional feature-based AES system in attribute-specific scoring, specifically in capturing coherence scores in essays. Our findings indicate that the deep-neural AES model showed improved accuracy in predicting coherence-related score categories. Implications for the scoring capacity of the two models are also discussed.






How to Cite

Shin, J., & Gierl, M. J. (2022). Evaluating Coherence in Writing: Comparing the Capacity of Automated Essay Scoring Technologies. Journal of Applied Testing Technology, 23, 04–20. Retrieved from http://jattjournal.net/index.php/atp/article/view/170472




