Inter-rater Reliability in Qualitative Coding: Considerations for Its Use

by Sean N. Halpin

This week’s blog post is from Dr. Sean Halpin, a Qualitative Analyst with RTI International on the Genomics, Ethics, and Translational Research team. Dr. Halpin has over a decade of experience leading socio-behavioral studies across a wide range of chronic and infectious disease areas and has published numerous journal articles on patient care. His responsibilities at RTI include preparing research proposals, developing and executing research protocols, overseeing data collection and analysis, interpreting research results and supporting sponsors’ strategic goals, managing the operational and financial aspects of research studies, and disseminating results. Dr. Halpin has a Ph.D. in qualitative research and evaluation methodologies from the University of Georgia and an MA in developmental psychology from Teachers College, Columbia University.

As a qualitative researcher in the health sciences, I have sometimes had to use inter-rater reliability as proof of the rigor of my analysis. Inter-rater reliability is a tool for quantifying the consistency with which codes are applied to text, yet it is important to consider its limitations in order to use it well. Below I provide a brief overview of inter-rater reliability, followed by three challenges for consideration and some potential solutions to each.

The Demand for Rigor

The burgeoning demand for qualitative research in the health sciences, and in other research-focused areas, includes calls for evidence of rigorous analysis. Given the historically quantitative landscape of the health sciences, the types of “evidence for rigor” relied on are often given a statistical spin. Mathematical concepts of reliability and validity have been retrofitted to qualitative data in an effort to assure readers, in familiar quantitative terms, that the analysis being presented is indeed valid and reliable. Yet it is important to account for the differences between these two research traditions (qualitative versus quantitative) to ensure the goal of presenting valid and reliable data is met.

Defining Inter-rater Reliability

For our purposes, inter-rater reliability refers to the consistency with which two or more coders apply the same code to discrete segments of text. It is important to recognize that inter-rater reliability is often used interchangeably with similar-sounding terms for assessing the reliability of qualitative coding. Sometimes terms such as “inter-rater agreement” or “inter-coder agreement” are used as stand-ins, but the literature defines differences between them. For example, Gisev and colleagues (2013) state that “interrater agreement indices assess the extent to which the responses of 2 or more independent raters are concordant. Interrater reliability indices assess the extent to which raters consistently distinguish between different responses” (p. 330).
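For concreteness, two indices that often appear in health sciences reporting are simple percent agreement and Cohen’s kappa, a chance-corrected agreement coefficient. The formulas below are shown for orientation only and are not drawn from Gisev and colleagues:

```latex
% Observed (percent) agreement between two coders over N coded segments
p_o = \frac{\text{number of segments assigned the same code by both coders}}{N}

% Cohen's kappa corrects observed agreement for the agreement expected by
% chance (p_e), derived from each coder's marginal code frequencies
\kappa = \frac{p_o - p_e}{1 - p_e}
```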

Strategies for Calculating Inter-rater Reliability

My purpose is not to provide a step-by-step guide to calculating inter-rater reliability (that process would be exceptionally long and has been documented extensively, both in publications and in the ever-evolving manuals for qualitative data analysis software). Rather, I intend to define challenges with using inter-rater reliability in qualitative research and to offer some potential solutions (below). Still, it is important to recognize that inter-rater reliability is often calculated either by hand or with software. The software approaches vary in the statistical techniques they employ and often offer multiple options for calculating inter-rater reliability. An overview of the statistical approaches to inter-rater reliability is provided in the following article:

McAlister, A.M., Lee, D.M., Ehlert, K.M., Kajfez, R.L., Faber, C.J., & Kennedy, M.S. (2017). Qualitative coding: An approach to assess inter-rater reliability. In 2017 ASEE Annual Conference and Exposition. https://peer.asee.org/qualitative-coding-an-approach-to-assess-inter-rater-reliability
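As a minimal, hypothetical illustration of the “by hand” route (the segments and code names below are invented for demonstration and are not drawn from the article above), the following Python sketch computes percent agreement and Cohen’s kappa for two coders who each applied one code per segment:

```python
from collections import Counter

# Hypothetical example: the code each of two coders applied to the same ten segments.
coder_a = ["barrier", "barrier", "facilitator", "emotion", "barrier",
           "facilitator", "emotion", "barrier", "facilitator", "emotion"]
coder_b = ["barrier", "facilitator", "facilitator", "emotion", "barrier",
           "facilitator", "barrier", "barrier", "facilitator", "emotion"]

n = len(coder_a)

# Percent agreement: share of segments given the same code by both coders.
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Chance agreement: probability both coders would pick the same code at random,
# based on how often each coder used each code overall.
counts_a, counts_b = Counter(coder_a), Counter(coder_b)
expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

# Cohen's kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

Where a library is preferred, scikit-learn’s cohen_kappa_score returns the same statistic for this kind of single-label comparison.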

Challenges with Applying the Concept

Janice Morse (1997) concisely argued that it is unreasonable to attempt to apply inter-rater reliability to open-ended qualitative questions. Regardless of whether you accept such a hardline stance, as a qualitative researcher, you may be expected to provide an inter-rater reliability value depending on your audience. Below are three major challenges to using inter-rater reliability along with some potential solutions.

Challenge 1: Who makes the final “Reliability” decision? (i.e., How was “Consensus” reached?)

In published articles, qualitative researchers aiming to show rigor in coding often state that “coding was refined until consensus was reached,” indicating there were some disagreements in the initial coding but that the researchers then met and discussed the differences until everyone agreed to apply codes in a uniform fashion. The lack of detail leaves open the possibility that a single researcher swayed how codes were applied. Such one-sided influence on coding may stem from a power dynamic or from a particularly persuasive or passionate speaker, but it is rarely explicitly evaluated. One must acknowledge the possibility that this single researcher allowed their own biases to shape how the entire team coded. In such a case, the final inter-rater reliability value may be reported as high, but it is flawed.

Possible Solution to Challenge 1: Researchers should systematically document and report differences of opinion in coding and how they are resolved, including who is involved in making those decisions.

Challenge 2: Was coding creep addressed? (i.e., Coders’ changing understanding of concepts over time)

As Jacelon and O’Dell (2005) asked, have “the data within the code changed over time?” (p. 218). If so, the way a coder applied a certain code in an earlier transcript may not be consistent with how they code a transcript later in the project. Inclusion of transcripts in an inter-rater reliability assessment can take several forms, but it is often limited by budget, time, or both. As a result, researchers may perform an inter-rater reliability assessment on only the first transcript, or perhaps on the first three and the last three.

Possible Solution to Challenge 2: Researchers should be sensitized to identify when their coding approach may shift, then document those changes and discuss them with the broader team. If it is deemed that the coding approach should change based on new evidence, the previously coded data should be reviewed and revised to ensure consistency throughout the project.

Double-coded transcript data should also be approached with care. Ideally, all qualitative data will be coded by two randomly assigned qualitative coders, and reconciliation will occur at regular intervals. If double-coding all data is not viable, then efforts should be made to ensure a systematic approach, such as randomly double-coding 10% of the data.
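One hypothetical way to operationalize such a systematic approach (the transcript filenames, coder initials, and seed below are invented) is to draw the double-coded subset and the coder pairings at random up front and record them, for example:

```python
import itertools
import random

# Hypothetical inputs: transcript identifiers and the pool of available coders.
transcripts = [f"transcript_{i:02d}.docx" for i in range(1, 31)]
coders = ["AL", "BK", "CM"]

rng = random.Random(2024)  # fixed seed so the selection can be documented and reproduced

# Randomly select ~10% of transcripts for double-coding (at least one).
n_double = max(1, round(0.10 * len(transcripts)))
double_coded = rng.sample(transcripts, n_double)

# Randomly assign a pair of coders to each double-coded transcript.
pairs = list(itertools.combinations(coders, 2))
assignments = {t: rng.choice(pairs) for t in double_coded}

for transcript, pair in assignments.items():
    print(f"{transcript}: double-coded by {pair[0]} and {pair[1]}")
```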

Challenge 3: Was the coding applied consistently? (i.e., What text is included?)

Qualitative data analysis software often calculates inter-rater reliability based on the exact spans of text selected for a particular code. For example, you may code a sentence and include everything from the first word through the final punctuation, while a fellow coder may code only the portion of the sentence they feel is relevant. Alternatively, one coder may code an even larger chunk of text, including the timestamps and tags of an interview transcript, while the second coder leaves those characters out, leading to misleading inter-rater reliability results.
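To illustrate the problem, here is a hypothetical example (not taken from any particular software package) of how two coders’ differing selections for the same sentence depress a simple span-based agreement measure, even though both intended to apply the same code:

```python
# A single line from a hypothetical interview transcript.
text = "[00:14:32] I just felt like nobody explained the side effects to me."

# Coder A highlighted the entire line, timestamp included; Coder B highlighted
# only the clause they judged relevant to the code "communication gap".
coder_a_span = (0, len(text))
b_snippet = "felt like nobody explained the side effects"
coder_b_span = (text.index(b_snippet), text.index(b_snippet) + len(b_snippet))

# Character-level overlap, used here as a rough stand-in for how some software
# compares selections when computing agreement.
chars_a = set(range(*coder_a_span))
chars_b = set(range(*coder_b_span))
overlap = len(chars_a & chars_b) / len(chars_a | chars_b)

print(f"Characters selected: coder A = {len(chars_a)}, coder B = {len(chars_b)}")
print(f"Span agreement (intersection over union): {overlap:.2f}")
```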

Possible Solution to Challenge 3: Clearly define what text should be included in a coded section.

Conclusion

Inter-rater reliability for qualitative research is an imperfect tool. Nevertheless, in cases where researchers feel compelled to use it, it is critical to carefully consider its potential pitfalls. Moreover, it is important to recognize that multiple non-quantitative methods exist for demonstrating rigor in qualitative research. Various forms of triangulation (Denzin, 1978), member checking (Lincoln & Guba, 1985), and thick description (Geertz, 1973) of the data being reported are just a few powerful and often-used methods for assuring rigor.

References

Denzin, N.K. (1978). The research act: A theoretical introduction to sociological methods (2nd ed.). McGraw Hill.

Geertz, C. (1973). Thick description: Toward an interpretive theory of culture. In The Interpretation of Cultures: Selected Essays (pp. 3-30). Basic Books.

Gisev, N., Bell, J.S., & Chen, T.F. (2013). Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Research in Social and Administrative Pharmacy, 9(3), 330-338.

Jacelon, C.S., & O’Dell, K.K. (2005). Analyzing qualitative data. Urologic Nursing, 25(3), 217-220.

Lincoln, Y.S., & Guba, E.G. (1985). Naturalistic Inquiry. SAGE.

Morse, J.M. (1997). “Perfectly healthy, but dead”: The myth of inter-rater reliability. Qualitative Health Research, 7(4), 445-447.
