Evaluating the Quality of AI-Generated Items for a Certification Exam

Authors

  • A. D. Mead, Chief Psychometrician, Certiverse, 4803 N. Milwaukee Avenue, Suite B, Unit 103, Chicago, IL 60630, USA
  • C. Zhou, Psychometrician, Certiverse, 4803 N. Milwaukee Avenue, Suite B, Unit 103, Chicago, IL 60630, USA

Keywords:

Item Writing; Artificial Intelligence; Machine Learning; Prompt Engineering

Abstract

OpenAI’s GPT-3 model can write multiple-choice exam items. This paper reviews the literature on automatic item generation, describes the recent history and operation of OpenAI’s GPT models, and presents a methodology for generating items with these models. We then critically evaluated GPT-3 at the task of writing multiple-choice items for a hypothetical psychometrics exam, comparing two versions of the model (text-davinci-002 and text-davinci-003) on 90 GPT-3-generated items. The vast majority of items (71% and 90%, respectively) were judged useful, but the typical item required revision to address problems with the stem, key, or distractors. The most common error was a violation of the principles of multiple-choice item writing (e.g., having two correct responses).
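The paper’s exact prompts are not reproduced on this page. As a purely illustrative sketch, a request for a multiple-choice item from the legacy OpenAI Completions API (the endpoint used by text-davinci-002 and text-davinci-003, via openai-python versions before 1.0) might look like the following; the prompt wording, function names, and sampling parameters here are the present editors’ assumptions, not the authors’ methodology:

```python
def build_item_prompt(topic: str) -> str:
    """Assemble an instruction prompt asking for one multiple-choice item.

    The wording is a hypothetical example, not the prompt used in the study.
    """
    return (
        f"Write one multiple-choice exam item on the topic of {topic}.\n"
        "Provide a stem, four options labeled A through D, and exactly one "
        "correct answer (the key). Indicate which option is the key."
    )


def generate_item(topic: str, model: str = "text-davinci-003") -> str:
    """Send the prompt to the legacy Completions endpoint (needs an API key)."""
    import openai  # pip install "openai<1.0" for this legacy interface

    response = openai.Completion.create(
        model=model,
        prompt=build_item_prompt(topic),
        max_tokens=256,
        temperature=0.7,  # nonzero so repeated calls yield distinct items
    )
    return response["choices"][0]["text"].strip()
```

Raw output from such a call would still require the kind of human review the abstract describes, since the model may produce items with flawed stems, keys, or distractors.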

Published

2024-05-14

How to Cite

Mead, A. D., & Zhou, C. (2024). Evaluating the Quality of AI-Generated Items for a Certification Exam. Journal of Applied Testing Technology. Retrieved from https://jattjournal.net/index.php/atp/article/view/173204

References

Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. A. (2022). The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5, Article 903077. https://doi.org/10.3389/frai.2022.903077

Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests (pp. 323–359). Hillsdale, NJ: Lawrence Erlbaum.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., …, & Amodei, D. (2020). Language models are few-shot learners. ArXiv. https://doi.org/10.48550/arXiv.2005.14165

dkirmani. (2022). OpenAI’s alignment plans [Blog]. LessWrong. https://www.lesswrong.com/posts/28sEs97ehEo8WZYb8/openai-s-alignment-plans

Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In S. H. Irvine, & P. C. Kyllonen (Eds.), Educational measurement (4th ed., pp. 471-516). Washington, DC: American Council on Education.

Embretson, S. E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380-396.

Embretson, S. E. (2002). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 219-250). Mahwah, NJ: Erlbaum.

Embretson, S. E., & Kingston, N. M. (2018). Automatic item generation: A more efficient process for developing mathematics achievement items? Journal of Educational Measurement, 55, 112–131.

Gao, L. (2021). On the sizes of OpenAI API models [Blog]. EleutherAI. https://blog.eleuther.ai/gpt3-model-sizes/

Gierl, M. J., & Lai, H. (2013). Using automated processes to generate test items. Educational Measurement: Issues and Practice, 32, 36-50. https://doi.org/10.1111/emip.12018

Gierl, M. J., Lai, H., & Tanygin, V. (2021). Advanced methods in automatic item generation. New York, NY: Routledge.

Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality Psychology in Europe, Vol. 7 (pp. 7-28). Tilburg, The Netherlands: Tilburg University Press.

Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26-34.

Grudzien, P. (2022). GPT-3 tokens explained - what they are and how they work [Blog]. QuickChat. https://blog.quickchat.ai/post/tokens-entropy-question/

Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2, 37-50. https://doi.org/10.1207/s15324818ame0201_3

Irvine, S. H., & Kyllonen, P. C. (2002). Item generation for test development. Hillsdale, NJ: Lawrence Erlbaum.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. https://arxiv.org/abs/2001.08361

Kublik, S., & Saboo, S. (2022). GPT-3: Building innovative NLP products using large language models. Sebastopol, CA: O’Reilly.

Lee, P., Fyffe, S., Son, M., Jia, A., & Yao, Z. (2022). A paradigm shift from human writing to machine generation in personality test development: An application of state‑of‑the‑art natural language processing. Journal of Business and Psychology, 38, 163-190. https://doi.org/10.1007/s10869-022-09864-6

Lee, J., & Seneff, S. (2007). Automatic generation of cloze items for prepositions. In 8th Annual Conference of the International Speech Communication Association. http://dx.doi.org/10.21437/Interspeech.2007-592

Liu, C.-L., Wang, C.-H., Gao, Z.-M., & Huang, S.-M. (2005). Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP. http://dx.doi.org/10.3115/1609829.1609830

Lowe, R., & Leike, J. (2022). Aligning language models to follow instructions [Blog]. OpenAI. https://openai.com/research/instruction-following

McCoy, R. T., Smolensky, P., Linzen, T., Gao, J., & Celikyilmaz, A. (2021). How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. https://doi.org/10.48550/arXiv.2111.09509

Mead, A. (2022). Next-Generation JTA [Webinar]. Certiverse. https://certiverse.com

Mead, A. (2014). Automatic generation of verbal analogy items. Unpublished paper.

Mitkov, R., Ha, L. A., & Karamanis, N. (2006). A computer-aided environment for generating multiple-choice test items. Natural Language Engineering, 12, 177-194.

OpenAI. (2023). GPT-4 technical report. OpenAI Technical Report. https://cdn.openai.com/papers/gpt-4.pdf

OpenAI. (n.d.). How do text-davinci-002 and text-davinci-003 differ? [Blog]. OpenAI. https://web.archive.org/web/20230314223727/https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-textdavinci003-differ

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. https://doi.org/10.48550/arXiv.2203.02155

Pino, J., Heilman, M., & Eskenazi, M. (2008). A selection strategy to improve cloze question quality. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems. https://www.philippe-fournier-viger.com/illdefined/w-its08-Pino.pdf

Saravia, E. (2022). Prompt Engineering Guide. Github. https://github.com/dair-ai/Prompt-Engineering-Guide

Sinharay, S., & Johnson, M. (2005). Analysis of data from an admissions test with item models. ETS Research Report RR-05-06. http://dx.doi.org/10.1002/j.2333-8504.2005.tb01983.x

Sumita, E., Sugaya, F., & Yamamoto, S. (2005). Measuring non-native speakers’ proficiency of English by using a test with automatically-generated fill-in-the-blank questions. In Proceedings of the Second Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1609829.1609839

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. ArXiv. https://doi.org/10.48550/arXiv.1706.03762

von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83, 847–857.

von Davier, M. (2019, August 26). Training Optimus Prime, M.D.: Generating Medical Certification Items by Fine-Tuning OpenAI’s gpt2 Transformer Model. https://doi.org/10.48550/arXiv.1908.08594
