Designing Predictive Models for Early Prediction of Students’ Test-taking Engagement in Computerized Formative Assessments

  • Department of Educational Psychology, University of Alberta, 6-110 Education Centre North, 11210 87 Ave NW, Edmonton, AB T6G 2G5
  • Centre for Research in Applied Measurement and Evaluation, University of Alberta, 6-110 Education Centre North, 11210 87 Ave NW, Edmonton, AB T6G 2G5


Item Response Time, Learning Analytics, Machine Learning, Predictive Models, Test-taking Engagement


The purpose of this study was to develop predictive models of students' test-taking engagement in computerized formative assessments. Using different machine learning algorithms, the models draw on students' item responses and response times to detect aberrant test-taking behaviors such as rapid guessing. The dataset consisted of 7,602 students (grades 1 to 4) who responded to 90 multiple-choice questions in a computerized reading assessment twice (i.e., in the fall and spring) during the 2017-2018 school year. Data analysis proceeded in four phases: (1) a response time method was used to label student engagement in both semesters; (2) training data from the fall semester were used to train the machine learning models; (3) testing data from the fall semester were used to evaluate the models; and (4) data from the spring semester were used for further model evaluation. Among the different algorithms, naive Bayes and support vector machine, which were built on response time data from the fall semester, outperformed the other algorithms in predicting student engagement in the spring semester in terms of accuracy, sensitivity, specificity, area under the curve, kappa, and absolute residual values. The results are promising for the early prediction of students' test-taking engagement, enabling interventions during test administration that help ensure the validity of test scores and of the inferences made from them.
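The first phase of the analysis, labeling engagement from response times, can be illustrated with a minimal sketch. The abstract does not specify the exact response time method, so the sketch below assumes a response-time-effort-style index in the spirit of Wise and Kong's work: a student's engagement score is the proportion of items answered at or above an item-level rapid-guessing threshold, and students below a cutoff are labeled disengaged. The function names, the 3-second thresholds, the 0.9 cutoff, and the response times are all illustrative, not values from the study.

```python
def rte(response_times, thresholds):
    """Response time effort: the share of items answered at or above the
    item's rapid-guessing threshold (i.e., showing solution behavior)."""
    if len(response_times) != len(thresholds):
        raise ValueError("one threshold is required per item")
    solution = sum(1 for rt, th in zip(response_times, thresholds) if rt >= th)
    return solution / len(response_times)

def label_engagement(response_times, thresholds, cutoff=0.9):
    """Label a student as engaged when the RTE index meets the cutoff."""
    return "engaged" if rte(response_times, thresholds) >= cutoff else "disengaged"

# Illustrative example: five items, each with a 3-second threshold.
thresholds = [3.0] * 5
print(label_engagement([12.4, 8.1, 10.5, 7.3, 9.9], thresholds))  # engaged
print(label_engagement([1.2, 0.8, 14.0, 1.1, 0.9], thresholds))   # disengaged
```

In the pipeline described above, labels produced this way for the fall semester would serve as the training outcome for the classifiers, with the spring semester held out for evaluation.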






How to Cite

Yildirim-Erbasli, S. N., & Bulut, O. (2022). Designing Predictive Models for Early Prediction of Students’ Test-taking Engagement in Computerized Formative Assessments. Journal of Applied Testing Technology. Retrieved from




