Investigating the Psychometric Impact of Negative Worded Items in Reading Comprehension Passages with a 3PL Cross-Classified Testlet Model

Yong Luo (National Center for Assessment)
Junhui Liu (University of Maryland)

Article ID: 418


Negatively worded (NW) items in psychological instruments have been studied with the bifactor model to determine whether they form a secondary factor, attributable to negative wording, that is orthogonal to the measured latent construct. This validation procedure checks whether NW items are a source of construct-irrelevant variance (CIV) and hence a threat to validity. In educational testing, however, no such validation attempts have been made. In this study, we examined the psychometric impact of NW items in an English proficiency reading comprehension test using a modeling approach similar to the bifactor model, namely the three-parameter logistic cross-classified testlet response theory (3PL CCTRT) model, which accounts for both guessing and possible local item dependence due to passage effects in the data set. The findings indicate that modeling the NW items with a separate factor noticeably improves model fit, and that the variance of this factor is small but nonzero. However, item and ability parameter estimates are highly similar between the 3PL CCTRT model and other models that do not model the NW items. We conclude that the NW items introduce CIV into the data, but that its magnitude is too small to change item and person ability parameter estimates to a practically significant extent.
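To make the model structure concrete, the following is a minimal sketch of how a response probability could be computed under a 3PL model with two cross-classified person-specific effects, one for the passage and one for negative wording. The abstract does not state the model equation, so the parameterization here is an assumption: the sign convention follows a common testlet formulation in which the effects are subtracted inside the logit, and the function name `cctrt_3pl_prob` and all parameter names are hypothetical.

```python
import math


def cctrt_3pl_prob(theta, a, b, c, passage_effect=0.0, nw_effect=0.0, is_nw=False):
    """Probability of a correct response under an ASSUMED 3PL cross-classified
    testlet form (illustrative only; not taken from the article).

    theta          : person ability
    a, b, c        : item discrimination, difficulty, and guessing parameters
    passage_effect : person-specific effect for the passage the item belongs to
    nw_effect      : person-specific effect for negative wording
    is_nw          : True if the item is negatively worded
    """
    # The wording effect enters only for NW items; every item gets a passage effect.
    wording = nw_effect if is_nw else 0.0
    # Assumed sign convention: testlet effects are subtracted, like difficulty.
    logit = a * (theta - b - passage_effect - wording)
    # The guessing parameter c sets the lower asymptote of the curve.
    return c + (1.0 - c) / (1.0 + math.exp(-logit))


# Same person and item parameters, with and without the negative-wording effect.
p_positive = cctrt_3pl_prob(theta=0.5, a=1.2, b=0.0, c=0.2, passage_effect=0.1)
p_negative = cctrt_3pl_prob(theta=0.5, a=1.2, b=0.0, c=0.2, passage_effect=0.1,
                            nw_effect=0.3, is_nw=True)
print(round(p_positive, 3), round(p_negative, 3))
```

In this sketch, a small variance of the wording effect across persons means the two probabilities stay close for most test takers, which is in line with the abstract's finding that parameter estimates change little when the NW factor is modeled.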


Negative wording, Bifactor model, Cross-classified testlet model, Validation







Copyright © 2019 Yong Luo

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.