Module 3: Item Response Theory and ELO

KT Learning Lab 3: A Conceptual Overview

Item Response Theory

  • A classic approach for assessment, used for decades in tests and some online learning environments

  • In its classical form, has some key limitations that make it less useful for assessment in online learning

    • But variants such as ELO and T-IRT address some of those limitations

Key goal of IRT

  • Measuring how much of some latent trait a person has

  • How intelligent is Bob?

    • How much does Bob know about snorkeling?

Typical use of IRT

  • Assess a student’s current knowledge of topic X

  • Based on a sequence of items that are dichotomously scored

    • E.g. the student can get a score of 0 or 1 on each item

Key assumptions

  • There is only one latent trait or skill being measured per set of items

    • This assumption is relaxed in the extension Cognitive Diagnosis Models (CDM) (Henson, Templin, & Willse, 2009)
  • No learning is occurring in between items

    • E.g. a testing situation with no help or feedback

Key assumptions

  • Each learner has ability θ

  • Each item has difficulty b and discriminability a

  • From these parameters, we can compute the probability P(θ) that the learner will get the item correct

Note

  • The assumption that all items tap the same latent construct, but have different difficulties

  • Is a very different assumption than is seen in PFA or BKT

The Rasch (1PL) model

  • Simplest IRT model, very popular

  • Rasch and 1PL are mathematically the same model (differing only in a scaling coefficient), but some different practices surround their use (out of scope for our discussion)

  • There is an entire special interest group of AERA devoted solely to the Rasch model and modeling related to Rasch (Rasch Measurement)

The Rasch (1PL) model

  • No discriminability parameter

  • Parameters for student ability and item difficulty

The Rasch (1PL) model

  • Each learner has ability θ

  • Each item has difficulty b

\[ P(\theta ) = \frac{1}{1+e^{-1(\theta -b)}} \]
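
The formula above can be sketched in a few lines of Python (the function name is my own):

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty (theta = b), performance is exactly 50%
print(rasch_p(0.0, 0.0))             # 0.5
# A strong student on an easy item is very likely to be correct
print(round(rasch_p(2.0, -2.0), 3))  # 0.982
```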

Item Characteristic Curve

  • A visualization that shows the relationship between student skill and performance

As student skill goes up, correctness goes up

  • This graph represents b=0

  • When θ=b (knowledge=difficulty), performance = 50%

As student skill goes up, correctness goes up

Changing difficulty parameter

  • Green line: b=-2 (easy item)

  • Orange line: b=2 (hard item)

Note

  • The good student finds the easy and medium items almost equally difficult

Note

  • The weak student finds the medium and hard items almost equally hard

Note

  • When b=θ

  • Performance is 50%

The 2PL model

  • Another simple IRT model, very popular

  • Discriminability parameter a added

Formula

\[ Rasch: P(\theta ) = \frac{1}{1+e^{-1(\theta -b)}} \]

\[ 2PL:P(\theta ) = \frac{1}{1+e^{-a(\theta -b)}} \]
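
A minimal sketch of the 2PL formula (function name is my own), showing how a scales the slope:

```python
import math

def p_2pl(theta, a, b):
    """Probability correct under the 2PL model; a controls the slope."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Same ability and difficulty, different discriminability:
print(round(p_2pl(1.0, 2.0, 0.0), 3))  # steep item: 0.881
print(round(p_2pl(1.0, 0.5, 0.0), 3))  # flat item: 0.622
```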

Different values of a

  • Green line: a = 2 (higher discriminability)

  • Blue line: a = 0.5 (lower discriminability)

Extremely high and low discriminability

  • a=0: the curve is flat at 50% – the item tells us nothing about ability

  • As a approaches infinity, the curve becomes a step function at θ=b
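
These two limiting cases can be checked numerically (a sketch; a = 50 stands in for "approaching infinity"):

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# a = 0: flat curve at 0.5 -- the item carries no information about ability
print(p_2pl(-3.0, 0.0, 0.0), p_2pl(3.0, 0.0, 0.0))  # 0.5 0.5

# Very large a: a near-step function at theta = b
print(round(p_2pl(-0.1, 50.0, 0.0), 3))  # 0.007
print(round(p_2pl(0.1, 50.0, 0.0), 3))   # 0.993
```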

Model degeneracy

  • a below 0 would mean that more knowledgeable students are less likely to get the item right – a degenerate model

The 3PL model

  • A more complex model

  • Adds a guessing parameter c

The 3PL model

\[ P(\theta ) = c+ (1-c)\frac{1}{1+e^{-a(\theta -b)}} \]

  • Either you guess (and get it right)

  • Or you don’t guess (and get it right based on knowledge)
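
The two cases above (guess, or answer from knowledge) can be sketched as:

```python
import math

def p_3pl(theta, a, b, c):
    """3PL: guess correctly with probability c, otherwise answer
    from knowledge via the 2PL curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Even a very weak student scores at least the guessing rate c
print(round(p_3pl(-5.0, 1.0, 0.0, 0.25), 3))  # 0.255
```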

Fitting an IRT model

  • Can be done with Expectation Maximization

  • Estimate knowledge and difficulty together

    • Then, given item difficulty estimates, you can assess a student’s knowledge in real time
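
Once item difficulties are known, assessing a student's ability reduces to a one-dimensional maximum-likelihood problem. A rough sketch under the Rasch model (grid search, my own function names; a real fit would use EM or gradient methods):

```python
import math

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties):
    """Grid-search MLE of ability theta, holding item difficulties fixed.
    responses: 0/1 scores; difficulties: matching list of b values."""
    best_theta, best_ll = 0.0, -float("inf")
    for step in range(-40, 41):          # theta grid from -4.0 to 4.0
        theta = step / 10.0
        ll = 0.0
        for c, b in zip(responses, difficulties):
            p = rasch_p(theta, b)
            ll += math.log(p if c == 1 else 1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# A student who passes the easier items but fails the harder ones
# lands in the middle of the ability scale
print(estimate_theta([1, 1, 1, 0, 0], [-2.0, -1.0, 0.0, 1.0, 2.0]))  # 0.6
```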

Uses…

  • IRT is used quite a bit in computer-adaptive testing

  • Not used quite so often in online learning, where student knowledge is changing as we assess it

ELO (Elo & Sloan, 1978; Pelánek, 2016)

  • A variant of the Rasch model which can be used in a running system

  • Continually estimates item difficulty and student ability, updating both every time a student encounters an item

ELO (Elo & Sloan, 1978; Pelánek, 2016)

  • You may know of ELO from Chess or Pokemon rankings!

ELO (Elo & Sloan, 1978; Pelánek, 2016)

\[ \theta_{i+1} = \theta_{i} + K(c-P(c)) \]

\[ b_{i+1} = b_{i} - K(c-P(c)) \]

  • Where K is a parameter for how strongly the model should consider new information
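
One Elo step in Python (K = 0.4 is an arbitrary illustrative value). Note the difficulty estimate moves in the opposite direction from the ability estimate: a correct answer is evidence the student is stronger and the item easier.

```python
import math

def elo_update(theta, b, correct, k=0.4):
    """One Elo step: move both estimates by the prediction error.
    correct is 1 or 0; p is the Rasch-style predicted probability."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    theta_new = theta + k * (correct - p)  # ability rises after a correct answer
    b_new = b - k * (correct - p)          # item looks easier when students succeed
    return theta_new, b_new

# An unexpected correct answer on a hard item shifts both estimates a lot
theta, b = elo_update(0.0, 1.0, 1)
print(round(theta, 3), round(b, 3))  # 0.292 0.708
```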

Multivariate ELO (Abdi et al, 2019)

  • Allows an item to involve multiple skills

  • Averages difficulty across skills
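
A rough sketch of the averaging idea (my paraphrase, not the exact Abdi et al. formulation): predict correctness from the mean of (ability − difficulty) across the item's skills.

```python
import math

def predict_multiskill(thetas, bs):
    """Predict correctness for an item tagged with several skills by
    averaging (ability - difficulty) across those skills."""
    avg = sum(t - b for t, b in zip(thetas, bs)) / len(thetas)
    return 1.0 / (1.0 + math.exp(-avg))

# Item taps two skills the student knows unevenly; the effects average out
print(predict_multiskill([1.0, -1.0], [0.0, 0.0]))  # 0.5
```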

MV-Glicko (Abdi, Khosravi, & Sadiq, 2021)

  • Allows an item to involve multiple skills

  • Averages difficulty across skills

  • Takes time between practices of a skill into account (much like LKT extensions to PFA)

Using ELO in the real world

  • ELO is used in several real-world systems

    • MathGarden

    • Slepemapy.cz

    • Top Parent

    • Shadowspect

  • There are some issues that need to be considered to use it sanely in a real-world system (Ruiperez-Valiente et al., 2023)

Using ELO in real world

  • The biggest is that you don’t actually want to let ELO item values change in real time

  • If you let that happen, then two students in the same classroom, with the exact same history of correct and incorrect answers on the exact same items

  • Will end up with different knowledge estimates

Using ELO in real world

  • The solution: fit item parameters with initial data set

  • Then keep item parameters constant during actual real-world use

  • This is generally a good policy – if you don’t do this, the first several students will have really inaccurate knowledge estimates

When to use ELO

  • Items have very different difficulty within skills

  • Relatively small amounts of data are OK

  • New items can be added without refitting the entire model – but you have to wait a little while to get valid estimates for those new items