Leveraging Machine Learning Technology to Improve Accuracy and Efficiency of Identification of Enemy Item Pairs



  • Ian Micir, National Board of Medical Examiners, Philadelphia, PA 19104
  • Kimberly Swygert, National Board of Medical Examiners, Philadelphia, PA 19104
  • Jean D’Angelo, American Osteopathic Association, Chicago, IL 60611-2864


Keywords: Item Banks, Item Enemies, Machine Learning, Natural Language Processing, TF-IDF Function


The interpretation of test scores in secure, high-stakes environments depends on several assumptions, one of which is that examinee responses to items are independent and that no enemy items appear on the same form. This paper documents the development and implementation of a C#-based application that uses Natural Language Processing (NLP) and Machine Learning (ML) techniques to produce prioritized predictions of item enemy status within a large item bank; medical editors then review the prioritized predictions as part of an iterative process. The item bank comprised 4130 items from a large-scale healthcare specialist certification exam and was assumed to contain many unidentified enemy pairs. For each pair of items, cosine similarities based on TF-IDF weights were computed separately for the stem and the answer text, and dichotomous classification variables indicating content and existing enemy relationships were added. Each item pair’s existing enemy status (enemy or non-enemy) served as the dependent variable for the supervised ML model, whose coefficients were then used to generate the probability that a given pair of items was an enemy pair. Medical editors iteratively reviewed prioritized lists of actual versus predicted enemy relationships. Of the 700 untagged enemy item pairs reviewed, 666 were confirmed and tagged by editors as enemies (95.1% accuracy). The application thus allowed editors to efficiently identify the most egregious uncoded enemy item pairs in a large item bank. The ultimate goal of this research is to inform discussion of the potential for NLP and ML applications to greatly improve the accuracy and efficiency of human expert work in test construction.
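The pipeline described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the C# of the actual application, and it rests on several assumptions that the abstract does not confirm: whitespace tokenization, the idf variant ln(N/df), logistic regression as the supervised model, and a single same-content indicator as the only dichotomous predictor (the indicator for existing enemy relationships is omitted because its definition is not given). The toy item bank and all function names are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Map each document to a {term: tf * idf} dict, with idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_features(items):
    """Per item pair: stem similarity, answer similarity, same-content flag."""
    stems = tfidf_vectors([item["stem"] for item in items])
    answers = tfidf_vectors([item["answers"] for item in items])
    return {
        (i, j): [cosine(stems[i], stems[j]),
                 cosine(answers[i], answers[j]),
                 1.0 if items[i]["content"] == items[j]["content"] else 0.0]
        for i, j in combinations(range(len(items)), 2)
    }


def predict(weights, x):
    """Logistic probability; weights[0] is the intercept."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic log-loss (assumed model)."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, label in zip(X, y):
            err = predict(weights, x) - label
            grad[0] += err
            for k, xi in enumerate(x):
                grad[k + 1] += err * xi
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def rank_untagged(items, tagged_enemies):
    """Fit on the current tags, then rank untagged pairs by enemy probability."""
    feats = pair_features(items)
    pairs = list(feats)
    X = [feats[p] for p in pairs]
    y = [1.0 if p in tagged_enemies else 0.0 for p in pairs]
    weights = fit_logistic(X, y)
    scored = [(p, predict(weights, feats[p]))
              for p in pairs if p not in tagged_enemies]
    return sorted(scored, key=lambda pair_prob: pair_prob[1], reverse=True)


# Invented toy bank: three look-alike pairs, only two of them tagged so far.
ITEMS = [
    {"stem": "a patient has crushing chest pain what is the diagnosis",
     "answers": "myocardial infarction pericarditis", "content": "cardiology"},
    {"stem": "a patient has crushing chest pain what is the treatment",
     "answers": "myocardial infarction aspirin", "content": "cardiology"},
    {"stem": "a patient has barking cough stridor what is the cause",
     "answers": "viral croup epiglottitis", "content": "pediatrics"},
    {"stem": "a patient has barking cough stridor what is the therapy",
     "answers": "viral croup dexamethasone", "content": "pediatrics"},
    {"stem": "a patient has painful swollen toe what is the complication",
     "answers": "acute gout cellulitis", "content": "rheumatology"},
    {"stem": "a patient has painful swollen toe what is the prevention",
     "answers": "acute gout allopurinol", "content": "rheumatology"},
]
TAGGED = {(0, 1), (2, 3)}  # existing enemy tags (hypothetical)

ranked = rank_untagged(ITEMS, TAGGED)
for pair, prob in ranked[:3]:
    print(pair, round(prob, 3))
```

In this toy bank the untagged pair (4, 5) has the same feature profile as the two tagged pairs, so it lands at the top of the review list even though it was labeled non-enemy during fitting. That mirrors the workflow in the abstract: the model is fit to the existing tags, and editors review the highest-probability pairs that are not yet tagged. With the ln(N/df) weighting, words that occur in every stem receive zero weight, so boilerplate phrasing does not inflate the similarities.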



Author Biographies

Ian Micir, National Board of Medical Examiners, Philadelphia, PA 19104

Designer, Test Development Innovations

Kimberly Swygert, National Board of Medical Examiners, Philadelphia, PA 19104

Director, Test Development Innovations

Jean D’Angelo, American Osteopathic Association, Chicago, IL 60611-2864

Director of Assessment, Certifying Board Services




How to Cite

Micir, I., Swygert, K., & D’Angelo, J. (2022). Leveraging Machine Learning Technology to Improve Accuracy and Efficiency of Identification of Enemy Item Pairs. Journal of Applied Testing Technology, 23, 30–40. Retrieved from http://jattjournal.net/index.php/atp/article/view/167170




