**Kappa** is a nonparametric statistic that can be used to measure interobserver agreement on imaging studies. **Cohen's kappa** compares two observers or, in machine learning, a specific algorithm's output against reference labels. **Fleiss' kappa** assesses interobserver agreement between more than two observers.

If comparing two observers, the concept behind the test is similar to the chi-squared test. Two 2 x 2 tables are set up: one with the values expected if agreement were due to chance alone, and one with the actual data. Kappa indicates how far the observed interobserver agreement exceeds the agreement expected by chance.

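To find the expected values, multiply the corresponding row and column marginal totals and divide by the total number of observations. Here $O_{1}$ to $O_{4}$ are the observed counts in the 2 x 2 table read row by row, so that $O_{1}$ is the +/+ cell and $O_{4}$ is the -/- cell.

To find the *expected* value for the +/+ cell: $[(O_{1} + O_{2}) \times (O_{1} + O_{3})] / \text{total observations}$

To find the *expected* value for the -/- cell: $[(O_{3} + O_{4}) \times (O_{2} + O_{4})] / \text{total observations}$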
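The calculation can be sketched in a few lines of Python. This is a minimal illustration, not a validated implementation: the function name `cohen_kappa_2x2` and the example counts are hypothetical, and the cell layout (O1 = +/+, O2 and O3 = the two disagreement cells, O4 = -/-) is assumed from the marginal formulas. It uses the standard definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement).

```python
def cohen_kappa_2x2(o1, o2, o3, o4):
    """Cohen's kappa for a 2 x 2 table of observed counts.

    Assumed layout (rows = observer A, columns = observer B):
        o1 = +/+   o2 = +/-
        o3 = -/+   o4 = -/-
    """
    total = o1 + o2 + o3 + o4
    if total == 0:
        raise ValueError("table contains no observations")

    # Expected counts under chance agreement: product of the marginals / total.
    expected_pos = (o1 + o2) * (o1 + o3) / total
    expected_neg = (o3 + o4) * (o2 + o4) / total

    # Proportions of agreement: observed vs. expected by chance.
    p_observed = (o1 + o4) / total
    p_chance = (expected_pos + expected_neg) / total

    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical counts: 20 +/+, 5 +/-, 10 -/+, 15 -/-
kappa = cohen_kappa_2x2(20, 5, 10, 15)
print(f"kappa = {kappa:.2f}")  # kappa = 0.40
```

In this example the observers agree on 70% of cases, but 50% agreement would be expected by chance, so kappa is 0.40.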

Rating systems for kappa are controversial, as their thresholds cannot be proven, but one system classifies kappa values as follows: ^{2}

- >0.75: excellent
- 0.40-0.75: fair to good
- <0.40: poor
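As a small illustration, the thresholds above can be written as a lookup function. The name `classify_kappa` is hypothetical, and the handling of the exact boundary values (0.40 and 0.75 both fall in "fair to good") is an assumption read from the way the ranges are written.

```python
def classify_kappa(kappa):
    """Map a kappa value to the three-level rating listed above."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        # Assumption: the boundary values 0.40 and 0.75 belong to this band.
        return "fair to good"
    return "poor"


for value in (0.82, 0.55, 0.21):
    print(value, classify_kappa(value))
```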

Kappa can be extended to three or more readers using more elaborate equations (Fleiss' kappa). In that setting, kappa assesses agreement across all of the radiologists involved on a finding, which is a more stringent requirement.
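For three or more readers, a minimal sketch of the standard Fleiss' kappa calculation is shown below. The function name `fleiss_kappa` and the example ratings are hypothetical. The input format is an assumption: one row per case, one column per category, each entry counting how many readers chose that category, with the same number of readers for every case.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a cases x categories table of rating counts."""
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if n_cases == 0 or n_raters < 2:
        raise ValueError("need at least one case and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every case must be rated by the same number of raters")

    # Per-case agreement: proportion of rater pairs that agree on that case.
    per_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Chance agreement from the overall proportion of ratings in each category.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: 4 cases, 3 readers, categories [positive, negative]
ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.2f}")  # Fleiss' kappa = 0.33
```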

Kappa is used for categorical variables (e.g. larger vs. smaller, has the condition vs. does not have the condition). Bland-Altman analysis is used for continuous variables.