Evaluating the Quality of AI-Generated Items for a Certification Exam

Authors

  • A. D. Mead, Chief Psychometrician, Certiverse, 4803 N. Milwaukee Avenue, Suite B, Unit 103, Chicago, IL 60630, USA
  • C. Zhou, Psychometrician, Certiverse, 4803 N. Milwaukee Avenue, Suite B, Unit 103, Chicago, IL 60630, USA

Keywords:

Item Writing; Artificial Intelligence; Machine Learning; Prompt Engineering

Abstract

OpenAI’s GPT-3 model can write multiple-choice exam items. This paper reviews the literature on automatic item generation, describes the recent history and operation of OpenAI’s GPT models, and presents a methodology for generating items with these models. We then critically evaluated GPT-3 at the task of writing multiple-choice items for a hypothetical psychometrics exam, comparing two versions of the model (text-davinci-002 and text-davinci-003) on 90 GPT-3-generated items. The vast majority of items (71% and 90%, respectively) were judged useful, but the typical item required revision to address problems with the stem, key, or distractors. The most common error was a violation of the principles of multiple-choice item writing (e.g., having two correct responses).
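The paper’s exact prompts are not reproduced on this page. As a purely illustrative sketch, a request for a multiple-choice item from the legacy OpenAI Completions API (the endpoint used by text-davinci-002 and text-davinci-003, via openai-python versions before 1.0) might look like the following; the prompt wording, function names, and sampling parameters here are the present editors’ assumptions, not the authors’ methodology:

```python
def build_item_prompt(topic: str) -> str:
    """Assemble an instruction prompt asking for one multiple-choice item.

    The wording is a hypothetical example, not the prompt used in the study.
    """
    return (
        f"Write one multiple-choice exam item on the topic of {topic}.\n"
        "Provide a stem, four options labeled A through D, and exactly one "
        "correct answer (the key). Indicate which option is the key."
    )


def generate_item(topic: str, model: str = "text-davinci-003") -> str:
    """Send the prompt to the legacy Completions endpoint (needs an API key)."""
    import openai  # pip install "openai<1.0" for this legacy interface

    response = openai.Completion.create(
        model=model,
        prompt=build_item_prompt(topic),
        max_tokens=256,
        temperature=0.7,  # nonzero so repeated calls yield distinct items
    )
    return response["choices"][0]["text"].strip()
```

Raw output from such a call would still require the kind of human review the abstract describes, since the model may produce items with flawed stems, keys, or distractors.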

Published

2024-05-14

How to Cite

Mead, A. D., & Zhou, C. (2024). Evaluating the Quality of AI-Generated Items for a Certification Exam. Journal of Applied Testing Technology. Retrieved from https://jattjournal.net/index.php/atp/article/view/173204

References

Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. A. (2022). The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5, Article 903077. https://doi.org/10.3389/frai.2022.903077

Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests (pp. 323–359). Hillsdale, NJ: Lawrence Erlbaum.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., …, & Amodei, D. (2020). Language models are few-shot learners. ArXiv. https://doi.org/10.48550/arXiv.2005.14165

dkirmani. (2022). OpenAI’s alignment plans [Blog]. LessWrong. https://www.lesswrong.com/posts/28sEs97ehEo8WZYb8/openai-s-alignment-plans

Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In S. H. Irvine, & P. C. Kyllonen (Eds.), Educational measurement (4th ed., pp. 471-516). Washington, DC: American Council on Education.

Embretson, S. E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380-396.

Embretson, S. E. (2002). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 219-250). Mahwah, NJ: Erlbaum.

Embretson, S. E., & Kingston, N. M. (2018). Automatic item generation: A more efficient process for developing mathematics achievement items? Journal of Educational Measurement, 55, 112–131.

Gao, L. (2021). On the sizes of OpenAI API models [Blog]. EleutherAI. https://blog.eleuther.ai/gpt3-model-sizes/

Gierl, M. J., & Lai, H. (2013). Using automated processes to generate test items. Educational Measurement: Issues and Practice, 32, 36-50. https://doi.org/10.1111/emip.12018

Gierl, M. J., Lai, H., & Tanygin, V. (2021). Advanced methods in automatic item generation. New York, NY: Routledge.

Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality Psychology in Europe, Vol. 7 (pp. 7-28). Tilburg, The Netherlands: Tilburg University Press.

Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26-34.

Grudzien, P. (2022). GPT-3 tokens explained - what they are and how they work [Blog]. QuickChat. https://blog.quickchat.ai/post/tokens-entropy-question/

Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2, 37-50. https://doi.org/10.1207/s15324818ame0201_3

Irvine, S. H., & Kyllonen, P. C. (2002). Item generation for test development. Hillsdale, NJ: Lawrence Erlbaum.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. https://arxiv.org/abs/2001.08361

Kublik, S., & Saboo, S. (2022). GPT-3: Building innovative NLP products using large language models. Sebastopol, CA: O’Reilly.

Lee, P., Fyffe, S., Son, M., Jia, A., & Yao, Z. (2022). A paradigm shift from human writing to machine generation in personality test development: An application of state‑of‑the‑art natural language processing. Journal of Business and Psychology, 38, 163-190. https://doi.org/10.1007/s10869-022-09864-6

Lee, J., & Seneff, S. (2007). Automatic generation of cloze items for prepositions. In 8th Annual Conference of the International Speech Communication Association. http://dx.doi.org/10.21437/Interspeech.2007-592

Liu, C.-L., Wang, C.-H., Gao, Z.-M., & Huang, S.-M. (2005). Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP. http://dx.doi.org/10.3115/1609829.1609830

Lowe, R., & Leike, J. (2022). Aligning language models to follow instructions [Blog]. OpenAI. https://openai.com/research/instruction-following

McCoy, R. T., Smolensky, P., Linzen, T., Gao, J., & Celikyilmaz, A. (2021). How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. https://doi.org/10.48550/arXiv.2111.09509

Mead, A. (2022). Next-Generation JTA [Webinar]. Certiverse. https://certiverse.com

Mead, A. (2014). Automatic generation of verbal analogy items. Unpublished paper.

Mitkov, R., Ha, L. A., & Karamanis, N. (2006). A computer-aided environment for generating multiple-choice test items. Natural Language Engineering, 12, 177-194.

OpenAI. (2023). GPT-4 technical report. OpenAI Technical Report. https://cdn.openai.com/papers/gpt-4.pdf

OpenAI. (n.d.). How do text-davinci-002 and text-davinci-003 differ? [Blog]. OpenAI. https://web.archive.org/web/20230314223727/https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-textdavinci003-differ

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. https://doi.org/10.48550/arXiv.2203.02155

Pino, J., Heilman, M., & Eskenazi, M. (2008). A selection strategy to improve cloze question quality. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems. https://www.philippe-fournier-viger.com/illdefined/w-its08-Pino.pdf

Saravia, E. (2022). Prompt Engineering Guide. Github. https://github.com/dair-ai/Prompt-Engineering-Guide

Sinharay, S., & Johnson, M. (2005). Analysis of data from an admissions test with item models. ETS Research Report RR-05-06. http://dx.doi.org/10.1002/j.2333-8504.2005.tb01983.x

Sumita, E., Sugaya, F., & Yamamoto, S. (2005). Measuring non-native speakers’ proficiency of English by using a test with automatically-generated fill-in-the-blank questions. In Proceedings of the Second Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1609829.1609839

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. ArXiv. https://doi.org/10.48550/arXiv.1706.03762

von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83, 847–857.

von Davier, M. (2019, August 26). Training Optimus Prime, M.D.: Generating Medical Certification Items by Fine-Tuning OpenAI’s gpt2 Transformer Model. https://doi.org/10.48550/arXiv.1908.08594
