DIF vs. DTF
ASWB, like most high-stakes exam developers, relies on DIF (differential item functioning) analysis to assess for measurement bias at the item (question) level, as opposed to DTF (differential test functioning) analysis, which assesses for measurement bias at the test level.
Why doesn’t ASWB run DTF analyses?
DTF analysis is the counterpart to DIF analysis. It was initially proposed because item writing is expensive and time-consuming: a test could be evaluated for measurement bias, and that bias remedied, by identifying the smallest number of items whose removal would cause the remaining item-level biases to cancel at the test level (see, for example, Raju et al., 1995). DIF analysis, by contrast, is a much more stringent, or conservative, approach to test development: items are screened individually during pretesting, and items identified as biased are deleted or revised before they are used as operational (i.e., scored) items.
ASWB uses a similarly conservative approach. Any item displaying DIF during pretesting is pulled from the pretest pool and does not make it into ASWB's pool of operational test items. If ASWB were to run DTF analyses and rely on that information to remove problematic items, it is highly likely that the results would lead to the removal of far fewer items than the approach ASWB currently uses. ASWB is, nevertheless, exploring the potential value of using DTF analysis as an additional assurance of fairness.
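The contrast between the two screening strategies can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the signed DIF effect values are invented (they are not ASWB data), the thresholds DIF_FLAG and DTF_TOL are arbitrary, and the greedy removal rule is a simple stand-in rather than the procedure described by Raju et al. (1995) or any analysis ASWB runs.

```python
# Hypothetical signed DIF effects for ten pretest items. Positive values
# favor one examinee group, negative values favor the other. These numbers
# are invented for illustration only.
dif_effects = [0.30, -0.25, 0.02, -0.01, 0.20, -0.22, 0.03, 0.00, -0.02, 0.15]

DIF_FLAG = 0.10  # assumed item-level flagging threshold
DTF_TOL = 0.05   # assumed tolerance for the test-level (summed) effect


def screen_by_dif(effects, flag=DIF_FLAG):
    """Item-level screening: drop every item whose |effect| reaches the flag."""
    return [i for i, e in enumerate(effects) if abs(e) >= flag]


def screen_by_dtf(effects, tol=DTF_TOL):
    """Test-level screening: greedily drop items until |sum of effects| <= tol.

    This greedy rule is a simplification; it is not guaranteed to find the
    smallest possible set of items.
    """
    kept = dict(enumerate(effects))
    removed = []
    while kept and abs(sum(kept.values())) > tol:
        total = sum(kept.values())
        # Drop the item whose removal brings the summed effect closest to zero.
        best = min(kept, key=lambda i: abs(total - kept[i]))
        removed.append(best)
        del kept[best]
    return removed


dif_removed = screen_by_dif(dif_effects)
dtf_removed = screen_by_dtf(dif_effects)
print("Removed by item-level (DIF) screening:", dif_removed)
print("Removed by test-level (DTF) screening:", dtf_removed)
```

With these invented values, item-level screening removes five items (indices 0, 1, 4, 5, and 9), while the test-level rule removes only one (index 4), because the remaining positive and negative effects roughly cancel in the sum. Under these assumptions, that is the sense in which screening every item individually is the more conservative approach.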
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368.
Doesn’t page 52 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education) specify that DTF (differential test functioning) analyses should be used to analyze for measurement bias?
No. The parenthetical note on page 52 of the Standards mentions DTF as part of a larger discussion of different types of bias, specifically that different types of bias should be evaluated independently of one another because they are not necessarily related. The note does not state that it is necessary to run DTF analyses as part of an exam development program, nor does it prescribe the use of any one particular analysis over another (or in tandem with another) when evaluating for measurement bias. Rather, DTF, like DIF, is one of several approaches exam developers may use to evaluate measurement bias.
Also, as discussed in the response to the previous question, discarding all items that exhibit DIF during pretesting is a more stringent approach to mitigating bias concerns than selectively removing DIF items from an operational test until DTF is negligible. Although it is theoretically possible that DIF analyses may fail to identify some problematic items, and that small amounts of bias may accumulate to produce DTF, it is very unlikely that practically important DTF will result, because there is often high power to detect small magnitudes of DIF, and DIF does not typically favor one examinee group consistently (e.g., Nye & Drasgow, 2011; Stark et al., 2004).
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96, 966-980.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89, 497-508.