Blog | Differential Item Functioning in COA research

Background

As interest grows in ensuring that clinical outcome assessments (COAs) perform consistently across diverse populations, Differential Item Functioning (DIF) is becoming increasingly important. Yet its use in COA research remains inconsistent. DIF refers to methods that evaluate whether subgroups (e.g. age, gender, race) respond differently to items despite having the same level of the underlying latent trait¹⁻³. For example, some subgroups may endorse certain items more than others because of socio-demographic factors rather than true differences in health status. As Zumbo (2025) describes, DIF therefore offers a “window into human diversity, offering insights into how people's experiences, histories, and communities shape the ways they approach tests”⁴.

DIF is typically assessed using quantitative methods such as ordinal logistic regression, item response theory, or Mantel-Haenszel⁵. A small number of qualitative approaches have also been used, including a gender-based DIF analysis of the Kansas City Cardiomyopathy Questionnaire (KCCQ)⁶. Initially developed in educational testing in the 1980s⁷, DIF began appearing in the clinical outcome measurement literature in the 1990s and early 2000s, driven by growing recognition that subgroup differences can threaten COA validity⁸. Despite this, the US Food and Drug Administration’s (FDA) 2009 patient-reported outcome guidance did not reference DIF, an important omission given the influence of these guidelines on COA development practices⁹. More recently, DIF has gained attention within the FDA’s Patient‑Focused Drug Development (PFDD) Guidance 3 on selecting, developing, or modifying fit‑for‑purpose COAs¹⁰. At Mapi Research Trust, we also noticed what felt like an increasing number of COA development papers assessing DIF as part of early validation.
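
To make one of the quantitative methods named above more concrete, here is a minimal Python sketch of the Mantel-Haenszel common odds ratio for a single yes/no item. It is an illustration only, not the procedure used in any of the reviewed studies. The function names and all counts are hypothetical, and it assumes respondents have already been grouped into strata with matching total scores.

```python
from math import log


def mantel_haenszel_odds_ratio(strata):
    """Return the Mantel-Haenszel common odds ratio for one dichotomous item.

    Each stratum is a tuple (ref_yes, ref_no, focal_yes, focal_no): counts of
    reference-group and focal-group respondents who did or did not endorse
    the item, among people with the same total score.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_yes, ref_no, focal_yes, focal_no in strata:
        n = ref_yes + ref_no + focal_yes + focal_no
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += ref_yes * focal_no / n
        denominator += ref_no * focal_yes / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_delta(odds_ratio):
    """Convert the odds ratio to the delta scale (-2.35 * ln(odds ratio))."""
    return -2.35 * log(odds_ratio)


# Hypothetical counts for three score strata (low, middle, high).
example_strata = [
    (10, 30, 5, 35),
    (20, 20, 12, 28),
    (30, 10, 22, 18),
]

common_or = mantel_haenszel_odds_ratio(example_strata)
print(f"Common odds ratio: {common_or:.2f}")
print(f"Delta: {mh_delta(common_or):.2f}")
```

A common odds ratio close to 1 suggests that the two groups endorse the item at similar rates once matched on total score. Values further from 1 flag the item for closer review. Ordinal items, which are common in COAs, need extensions of this approach or one of the other methods listed above.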

To our knowledge, no recent review has examined the use of DIF in COA research. In light of this gap, recent regulatory attention, and our own observations, we conducted an exploratory review to understand how DIF has been applied in COA research over the past five years.

What we did

We searched PubMed for English-language publications from the past five years using “Differential Item Functioning” as the search term (n=461 results). To maintain breadth, we included all forms of DIF (uniform and/or non-uniform). We excluded methodological papers and publications that did not apply DIF to a COA, resulting in a final dataset of 384 publications.
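
For readers less familiar with the two forms of DIF mentioned above, the sketch below shows how they differ under the logistic regression framing. It is a minimal illustration with made-up coefficients, not an analysis from our dataset. The function names and parameter values are assumptions chosen for clarity.

```python
from math import exp


def endorsement_probability(trait, group, b0, b1, b2, b3):
    """P(endorse item) where logit = b0 + b1*trait + b2*group + b3*trait*group.

    trait: respondent's level on the measured concept (e.g. a total score)
    group: 0 for the reference group, 1 for the focal group
    b2:    group effect (a non-zero value indicates uniform DIF)
    b3:    trait-by-group interaction (a non-zero value indicates non-uniform DIF)
    """
    logit = b0 + b1 * trait + b2 * group + b3 * trait * group
    return 1.0 / (1.0 + exp(-logit))


def log_odds_gap(trait, b2, b3):
    """Difference in log-odds between the focal and reference group."""
    return b2 + b3 * trait


trait_levels = [-2.0, 0.0, 2.0]

# Uniform DIF: the gap between groups is the same at every trait level.
uniform_gaps = [log_odds_gap(t, b2=0.5, b3=0.0) for t in trait_levels]

# Non-uniform DIF: the gap changes size, and here direction, with the trait.
non_uniform_gaps = [log_odds_gap(t, b2=0.0, b3=0.4) for t in trait_levels]

for t, u, n in zip(trait_levels, uniform_gaps, non_uniform_gaps):
    p_ref = endorsement_probability(t, 0, b0=0.0, b1=1.0, b2=0.0, b3=0.4)
    p_focal = endorsement_probability(t, 1, b0=0.0, b1=1.0, b2=0.0, b3=0.4)
    print(f"trait={t:+.1f} uniform gap={u:+.2f} non-uniform gap={n:+.2f} "
          f"P(ref)={p_ref:.2f} P(focal)={p_focal:.2f}")
```

In an actual analysis the coefficients are estimated from response data, usually by comparing nested models with and without the group and interaction terms. Because our review counted both forms together, it cannot say which is reported more often.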

As this was an exploratory study, we classified a broad range of variables including therapeutic indication and therapeutic area (following the MeSH library¹¹ and Orphanet rare disease classifications¹²), concept of interest (from the abstract, PROQOLID™ database¹³, or full text), DIF approach (quantitative/qualitative), and subgroups examined. Publications were classified using their abstract when possible; otherwise, the full text was reviewed. The resulting dataset offered a useful snapshot of DIF practices across recent COA research.

What we found

Types of papers

Most papers were COA validation studies (n=234) with fewer development studies (n=86). This meant DIF was generally assessed after a COA had already been developed to confirm the absence of subgroup response differences. Notably, original development/validation papers incorporating DIF assessment increased steadily, from seven papers in 2021 to 26 papers in 2025, which aligned with our initial impressions.

This rise may reflect increased emphasis on DIF as one of many indicators that can support the evidence base for whether a COA is fit-for-purpose. Both the FDA’s PFDD Guidance 3¹⁰ and the International Society for Quality of Life Research’s 2019 Psychometrics Special Interest Group papers on PRO measurement property analysis¹⁴⁻¹⁶ underlined the importance of DIF evaluation.

Qualitative vs Quantitative Methods

Given DIF’s roots in psychometrics, it was unsurprising that only one paper employed a qualitative DIF approach. However, with growing emphasis on centering the patient voice at the heart of health outcomes research¹⁷, the qualitative methodology used for the KCCQ⁶ (a PRO qualified through the FDA’s COA Qualification Program) highlights how qualitative methods can complement traditional DIF analyses and offer more patient-centered insights.

COAs

Patient-reported outcome measures (PROMs) were the most commonly cited COA type (n=338/86%; see Figure 1). Common PROs included the Patient Health Questionnaire (n=6), followed by the Catquest-9SF and the Burnout Assessment Tool (n=4 each). Twenty-five publications used a Patient Reported Outcomes Measurement Information System® (PROMIS) measure as a primary COA. This was expected since PROMIS development standards state that DIF “should be assessed”¹⁸.

Clinician-reported outcome measures (ClinROs; n=16/4%) were far less common, consistent with DIF assessment historically being used primarily for self-report measures⁴. Nevertheless, ClinROs rely on human judgement, which may be influenced by clinicians’ socio-demographic characteristics¹⁹ as well as those of the patients²⁰. Moreover, because ClinROs often serve as primary or key secondary trial endpoints, ensuring they are free from DIF is crucial for accurate assessment across heterogeneous populations. Reflecting this, ISPOR’s 2017 COA Emerging Good Practices Task Force recommended DIF assessment for evaluating ClinRO comprehensiveness²¹.

Figure 1: Types of COA that DIF Analysis was Conducted on

Therapeutic Areas and Concepts Measured

Mental health and psychological functioning

COAs were most frequently used in non-disease specific contexts, typically the general population (n=88/23%), followed by mental disorders (n=69/18%) and nervous system disorders (n=46/12%). Psychological functioning was the most common concept assessed (n=175/34%), followed by signs and symptoms (n=94/18%), and physical functioning (n=72/14%); see Figure 2.

A key reason for this pattern may be that some mental health constructs are not directly observable²², meaning that differences in interpretation or symptom expression can be more likely to introduce DIF. Symptoms linked to mental health and nervous system disorders (e.g. Parkinson’s or motor neuron disease) also often manifest differently across demographic groups²³. Our findings echo Teresi et al.’s 2008 review which likewise found DIF in many depression-related PROs⁸.

Rare diseases

Studies involving rare diseases were largely absent (n=10/3.8%), which was expected as quantitative DIF analyses typically require large sample sizes and are often impractical in rare disease contexts²⁴. Given this limitation, qualitative DIF methods, as outlined by Coles et al.⁶ for the KCCQ, may be a promising approach for rare-disease COAs.

Health literacy

A notable and somewhat unexpected finding was the relatively high number of PROs measuring health literacy, which matched the number of COAs measuring activities of daily living (n=16/3% each). It makes sense that measures of health literacy would be particularly concerned with assessing DIF, since health literacy is influenced by education, age, and gender, and these subgroup comparisons were reflected in our results²⁵⁻²⁶. This finding also reflects the growing interest in understanding health literacy and how it interacts with patient outcomes, evidenced in the published literature²⁷⁻²⁸ and in the push from health bodies (namely the World Health Organization²⁶ and the European Union²⁹) to prioritize health literacy.

Figure 2: Concepts of Interest Measured by Identified COAs

DIF Subgroup Analysis

Age, gender, and sex were the most common subgroup comparisons (n=181/47%, n=125/33%, and n=104/27%, respectively). Their prominence was expected since these data are routinely collected and often required by regulators³⁰. These variables are also known modifiers of health status and treatment response, meaning that assessing DIF is important for identifying problematic items and strengthening the validity evidence for specific contexts of use³¹⁻³².

In contrast, relatively few studies examined DIF by race/ethnicity (n=41/11%). Studies that did so followed similar patterns to other subgroups. For example, most were PROs (n=30/73%), measured psychological functioning (n=21/51%), and involved mental health (n=8/20%) and nervous system disease (n=7/17%) populations. Of note, the limited use of race/ethnicity subgroups mirrors findings from Teresi et al., 2008⁸. Given the established importance of examining DIF for race/ethnicity³³⁻³⁴, COA researchers could reflect on including such analysis before using a COA to ensure it is fit-for-purpose in diverse populations.

Limitations

As an exploratory study, our search was restricted to PubMed and we chose to select full-text articles to maximize classification accuracy. Future work could examine whether our findings would be comparable in other databases, grey literature, and non-freely available publications. Additionally, we focused only on the primary COA listed in the study, meaning the first COA listed in the abstract, and did not distinguish between uniform and non-uniform DIF.

Even with these limitations, our findings highlight several areas for greater reflection and action by COA researchers. Our broad inclusion criteria also support insights across a wide range of COA types and applications.

Implications for COA Researchers

Consider incorporating DIF analyses earlier in COA development, rather than relying solely on post‑hoc validation.
Use qualitative DIF approaches when quantitative methods are not feasible, such as in rare‑disease research or small‑sample contexts.
Assess DIF by race/ethnicity where subgroup sizes permit, given its importance for evaluating equity and measurement fairness.
Ensure subgroup data (e.g., age, gender, socio‑demographics) is collected consistently so DIF analyses are possible.
Apply DIF findings to refine or revise poorly performing items, strengthening validity evidence for specific contexts of use.

Conclusions

This review suggests that although DIF is being assessed in some COA research, its application remains limited and further research is required to understand these trends. DIF analyses are concentrated in a small set of therapeutic areas, and the COAs examined largely measure a narrow range of concepts. Broader and more systematic use of DIF analysis could help COA researchers align with FDA recommendations, strengthen validity evidence, and ensure COAs measure concepts as intended across diverse patient groups.

Thank you to Tilly Stott and Nadine Kraft for their contributions to this article.

References

1. Chen, WH., Revicki, D. Differential Item Functioning (DIF). In: Michalos, A.C. (eds) Encyclopedia of Quality of Life and Well-Being Research. 2014. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0753-5_728
2. Scott NW, Fayers PM, Aaronson NK, Bottomley A, de Graeff A, Groenvold M, Gundy C, Koller M, Petersen MA, Sprangers MA; EORTC Quality of Life Group and the Quality of Life Cross-Cultural Meta-Analysis Group. Differential item functioning (DIF) analyses of health-related quality of life instruments using logistic regression. Health Qual Life Outcomes. 2010 Aug 4;8:81
3. Wild D, Eremenco S, Mear I, Martin M, Houchin C, Gawlicki M, Hareendran A, Wiklund I, Chong LY, von Maltzahn R, Cohen L, Molsen E. Multinational trials-recommendations on the translations required, approaches to using the same language in different countries, and the approaches to support pooling the data: the ISPOR Patient-Reported Outcomes Translation and Linguistic Validation Good Research Practices Task Force report. Value Health. 2009 Jun;12(4):430-40
4. Zumbo, B.D. Looking through the lens of Differential Item Functioning (DIF): Embracing the many ways of being human. The Score. 2025. Available at: https://www.apadivisions.org/division-5/publications/score/2025/10/differential-item-functioning [Accessed on 10-03-2026].
5. Stover, A.M., McLeod, L.D., Langer, M.M. et al. State of the psychometric methods: patient-reported outcome measure development and refinement using item response theory. J Patient Rep Outcomes. 2019. 3, 50.
6. Coles TM, Lucas N, McFatrich M, Henke D, Ridgeway JL, Behnken EM, Weinfurt K, Reeve BB, Corneli A, Dunlay SM, Spertus JA, Lin L, Piña IL, Bocell FD, Tarver ME, Dohse H, Saha A, Caldwell B. Investigating gender-based differential item functioning on the Kansas City Cardiomyopathy Questionnaire (KCCQ) using qualitative content analysis. Qual Life Res. 2023 Mar;32(3):841-852.
7. Pagano IS, Gotay CC. Ethnic differential item functioning in the assessment of quality of life in cancer patients. Health Qual Life Outcomes. 2005 Oct 7;3:60
8. Teresi JA, Ramirez M, Lai JS, Silver S. Occurrences and sources of Differential Item Functioning (DIF) in patient-reported outcome measures: Description of DIF methods, and review of measures of depression, quality of life and general health. Psychol Sci Q. 2008;50(4):538.
9. Oehrlein et al. An Interview With the Food and Drug Administration About Draft Patient-Focused Drug Development Guidance 3: Selecting, Developing, or Modifying Fit-for-Purpose Clinical Outcome Assessments. Value in Health, 2023. Volume 26, Issue 6, 791 - 795
10. US Food and Drug Administration (FDA). Patient-Focused Drug Development: Selecting, Developing, or Modifying Fit-for-Purpose Clinical Outcome Assessments. 2025. Available at: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/patient-focused-drug-development-selecting-developing-or-modifying-fit-purpose-clinical-outcome [Accessed on 06-03-2026].
11. National Library of Medicine. Medical Subject Headings 2026 (MeSH). Available at: https://meshb.nlm.nih.gov/search. [Accessed 12-Jan-2026]
12. Orphanet. Search for a rare disease. Available at: https://www.orpha.net/en/disease [Accessed 12-Jan-2026]
13. Mapi Research Trust. PROQOLID™ via ePROVIDE™. Available at: https://eprovide.mapi-trust.org/advanced-search?database=proqolid. [Accessed 12-Jan-2026]
14. Patrick, D.L. Many ways to skin a cat: psychometric methods options illustrated. J Patient Rep Outcomes. 2019. 3, 48
15. Cleanthous, S., Barbic, S., Smith, S. et al. Psychometric performance of the PROMIS® depression item bank: a comparison of the 28- and 51-item versions using Rasch measurement theory. J Patient Rep Outcomes. 2019. 3, 47
16. Nolte, S., Coon, C., Hudgens, S. et al. Psychometric evaluation of the PROMIS® Depression Item Bank: an illustration of classical test theory methods. J Patient Rep Outcomes. 2019 3, 46
17. European Medicines Agency (EMA). Patient experience data (PED) reflection paper. 2026. Available at: https://www.ema.europa.eu/en/patient-experience-data-ped-reflection-paper. [Accessed on 06-03-2026].
18. PROMIS® Instrument Development and Validation Scientific Standards Version 2.0. 2013. Available at: https://www.healthmeasures.net/images/PROMIS/PROMISStandards_Vers2.0_Final.pdf [Accessed on 06-03-2026].
19. Dazzi F, Fonzi L, Pallagrosi M, Duro M, Biondi M, Picardi A. Relationship Between Gender and Clinician's Subjective Experience during the Interaction with Psychiatric Patients. Clin Pract Epidemiol Ment Health. 2021 Dec 22;17:190-197.
20. Markowitz DM. Gender and ethnicity bias in medicine: a text analysis of 1.8 million critical care records. PNAS Nexus. 2022 Aug 18;1(4):pgac157.
21. Powers JH III, Patrick DL, Walton MK, et al. Clinician-reported outcome (ClinRO) assessments of treatment benefit: report of the ISPOR Clinical Outcome Assessment Emerging Good Practices Task Force. Value Health. 2017; 20(1):2-14.
22. Wright AGC. Latent Variable Models in Clinical Psychology. In: Wright AGC, Hallquist MN, eds. The Cambridge Handbook of Research Methods in Clinical Psychology. Cambridge Handbooks in Psychology. Cambridge University Press; 2020:66-79.
23. Bradshaw M, Shiba K, Jang SJ, Kent BV, Bonhag R, Johnson BR, VanderWeele TJ. Demographic variation in symptoms of depression and anxiety across 22 Global Flourishing Study countries. Commun Med (Lond). 2026 Jan 9;6(1):100; Duong T, Krosschell KJ, James MK, Nelson L, Alfano LN, Eichinger K, Mazzone E, Rose K, Lowes LP, Mayhew A, Florence J, King W, Senesac CR and Eagle M (2021) Consensus Guidelines for Improving Quality of Assessment and Training for Neuromuscular Diseases. Front. Genet. 12:735936.
24. Professional society for health economics and outcomes research (ISPOR). Response to FDA about Patient Focused Drug Development: Selecting, Developing, or Modifying Fit-for-Purpose Clinical Outcome Assessments; Draft Guidance for Industry, Food and Drug Administration Staff, and Other Stakeholders. 2022. Available at: https://www.ispor.org/docs/default-source/strategic-initiatives/ispor-response-to-us-fda-pfdd-coa-september-2022.pdf?sfvrsn=55f67124_5 [Accessed on 10-03-2026].
25. Gonçalves-Fernández ML, Pino-Juste M. Health literacy in healthy adults: A systematic review of recent evidence. Aten Primaria. 2025 Nov;57(11):103300. doi: 10.1016/j.aprim.2025
26. World Health Organization (WHO). Health Literacy. 2025. Available at: https://www.who.int/news-room/fact-sheets/detail/health-literacy. [Accessed on 10-03-2026].
27. Wang J and Shahzad F (2022) A Visualized and Scientometric Analysis of Health Literacy Research. Front. Public Health 9:811707
28. Smith C, Behan S, Belton S, Nicholl C, Murray M, Goss H. An update on health literacy dimensions: An umbrella review. PLoS One. 2025;20(6):e0321227.
29. González, P.M., García, F.C., López-Ventoso, M., González P., Hidalgo, L.R.I. (Eds.), & On behalf of the IDEAHL Consortium. (2024). IDEAHL European Digital Health Literacy Strategy. Consejería de Salud del Principado de Asturias. https://doi.org/10.5281/zenodo.11395540 [Accessed on 10-03-2026].
30. Clayton JA, Tannenbaum C. Reporting Sex, Gender, or Both in Clinical Research? JAMA. 2016;316(18):1863–1864.
31. Janet W Rich-Edwards, Ursula B Kaiser, Grace L Chen, JoAnn E Manson, Jill M Goldstein, Sex and Gender Differences Research Design for Basic, Clinical, and Population Studies: Essentials for Investigators, Endocrine Reviews, Volume 39, Issue 4, August 2018, Pages 424–439
32. Penton, Hannah et al. An Investigation of Age-Related Differential Item Functioning in the EQ-5D-5L Using Item Response Theory and Logistic Regression. Value in Health, 2022. Volume 25, Issue 9, 1566 – 1574.
33. Jones, R. N., Tommet, D., Ramirez, M., Jensen, R., & Teresi, J. A. Differential item functioning in Patient Reported Outcomes Measurement Information System® (PROMIS®) Physical Functioning short forms: Analyses across ethnically diverse groups. Psychological Test and Assessment Modeling, 2016. 58(2), 371–402
34. Moteane M. Critically exploring the use of race and ethnicity as grouping variables in studies that use or include differential item functioning analyses [dissertation]. Greensboro (NC): University of North Carolina at Greensboro; 2024.
