Identifying Enemy Item Pairs using Natural Language Processing

Authors

  • Kirk A. Becker, Senior Research Scientist, Pearson VUE
  • Shu-chuan Kao, Senior Manager, Measurement and Testing, Examinations, National Council of State Boards of Nursing

Keywords:

Cosine Similarity, Enemy Items, Item Banking, Natural Language Processing, Text Indexing

Abstract

Natural Language Processing (NLP) offers methods for understanding and quantifying the similarity between written documents. Within the testing industry, these methods have been used for automatic item generation, automated scoring of text and speech, modeling of item characteristics, automatic question answering, machine translation, and automated referencing. This paper presents research into the use of NLP to identify enemy and duplicate items and thereby improve the maintenance of test item banks. NLP can flag similar pairs of items, limiting the number of items content experts must review to identify enemy and duplicate items. Results from multiple testing programs show that previously unidentified enemy pairs can be discovered with this method.
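The workflow the abstract describes — scoring the similarity of every item pair so that content experts review only the high-scoring candidates — can be sketched with TF-IDF vectors and cosine similarity. This is an illustrative sketch, not the authors' operational pipeline; the sample item texts, the smoothed IDF weighting, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude normalization: lowercase and split on non-alphanumerics.
    # A production pipeline would typically add stemming and stop-word removal.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Build smoothed TF-IDF vectors (term -> weight dicts) for each text."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def candidate_enemy_pairs(items, threshold=0.5):
    # Score every item pair; return those similar enough to merit expert review.
    vecs = tfidf_vectors(items)
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            score = cosine(vecs[i], vecs[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs

# Hypothetical nursing-style items: the first two are near-duplicates.
items = [
    "A nurse is caring for a client with type 2 diabetes. Which snack is most appropriate?",
    "A nurse cares for a client who has type 2 diabetes. Which snack is most appropriate?",
    "Which laboratory value should the nurse report to the provider immediately?",
]
for i, j, score in candidate_enemy_pairs(items):
    print(i, j, round(score, 2))  # flags the near-duplicate pair (0, 1)
```

Only pairs above the threshold reach human reviewers, which is what makes the approach practical for large banks: the number of pairwise comparisons grows quadratically with bank size, but expert review time does not.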


Published

2022-12-07

How to Cite

Becker, K. A., & Kao, S.-C. (2022). Identifying Enemy Item Pairs using Natural Language Processing. Journal of Applied Testing Technology, 23, 41–52. Retrieved from https://jattjournal.net/index.php/atp/article/view/172634

Issue

Section

Articles

