Module 3: Item Response Theory and ELO

KT Learning Lab 3: A Conceptual Overview

Item Response Theory

  • A classic approach for assessment, used for decades in tests and some online learning environments

  • In its classical form, has some key limitations that make it less useful for assessment in online learning

    • But variants such as ELO and T-IRT address some of those limitations

Key goal of IRT

  • Measuring how much of some latent trait a person has

  • How intelligent is Bob?

    • How much does Bob know about snorkeling?

Typical use of IRT

  • Assess a student’s current knowledge of topic X

  • Based on a sequence of items that are dichotomously scored

    • E.g. the student can get a score of 0 or 1 on each item

Key assumptions

  • There is only one latent trait or skill being measured per set of items

    • This assumption is relaxed in the extension Cognitive Diagnosis Models (CDM) (Henson, Templin, & Willse, 2009)
  • No learning is occurring in between items

    • E.g. a testing situation with no help or feedback

Key assumptions

  • Each learner has ability θ

  • Each item has difficulty b and discriminability a

  • From these parameters, we can compute the probability P(θ) that the learner will get the item correct

Note

  • The assumption that all items tap the same latent construct, but have different difficulties

  • Is a very different assumption than is seen in PFA or BKT

The Rasch (1PL) model

  • Simplest IRT model, very popular

  • Rasch and 1PL are mathematically the same model (differing only in a scaling coefficient), but some different practices surround their use (out of scope for our discussion)

  • There is an entire special interest group of AERA devoted solely to the Rasch model and modeling related to Rasch (Rasch Measurement)

The Rasch (1PL) model

  • No discriminability parameter

  • Parameters for student ability and item difficulty

The Rasch (1PL) model

  • Each learner has ability θ

  • Each item has difficulty b

\[ P(\theta ) = \frac{1}{1+e^{-1(\theta -b)}} \]
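
The formula above can be sketched in a few lines of Python (the function name is my own):

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty (theta = b), performance is exactly 50%
print(rasch_p(0.0, 0.0))             # 0.5
# A strong student on an easy item is very likely to be correct
print(round(rasch_p(2.0, -2.0), 3))  # 0.982
```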

Item Characteristic Curve

  • A visualization that shows the relationship between student skill and performance

As student skill goes up, correctness goes up

  • This graph represents b=0

  • When θ=b (knowledge=difficulty), performance = 50%

As student skill goes up, correctness goes up

Changing difficulty parameter

  • Green line: b=-2 (easy item)

  • Orange line: b=2 (hard item)

Note

  • The good student finds the easy and medium items almost equally difficult

Note

  • The weak student finds the medium and hard items almost equally hard

Note

  • When b=θ

  • Performance is 50%

The 2PL model

  • Another simple IRT model, very popular

  • Discriminability parameter a added

Formula

\[ Rasch: P(\theta ) = \frac{1}{1+e^{-1(\theta -b)}} \]

\[ 2PL:P(\theta ) = \frac{1}{1+e^{-a(\theta -b)}} \]
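
A minimal sketch of the 2PL formula (function name is my own), showing how a scales the slope:

```python
import math

def p_2pl(theta, a, b):
    """Probability correct under the 2PL model; a controls the slope."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Same ability and difficulty, different discriminability:
print(round(p_2pl(1.0, 2.0, 0.0), 3))  # steep item: 0.881
print(round(p_2pl(1.0, 0.5, 0.0), 3))  # flat item: 0.622
```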

Different values of a

  • Green line: a = 2 (higher discriminability)

  • Blue line: a = 0.5 (lower discriminability)

Extremely high and low discriminability

  • a=0: the curve is flat at 50% – the item tells us nothing about ability

  • As a approaches infinity, the curve becomes a step function at θ=b
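
These two limiting cases can be checked numerically (a sketch; a = 50 stands in for "approaching infinity"):

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# a = 0: flat curve at 0.5 -- the item carries no information about ability
print(p_2pl(-3.0, 0.0, 0.0), p_2pl(3.0, 0.0, 0.0))  # 0.5 0.5

# Very large a: a near-step function at theta = b
print(round(p_2pl(-0.1, 50.0, 0.0), 3))  # 0.007
print(round(p_2pl(0.1, 50.0, 0.0), 3))   # 0.993
```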

Model degeneracy

  • a below 0 would mean that more knowledgeable students are less likely to get the item right – a degenerate model

The 3PL model

  • A more complex model

  • Adds a guessing parameter c

The 3PL model

\[ P(\theta ) = c+ (1-c)\frac{1}{1+e^{-a(\theta -b)}} \]

  • Either you guess (and get it right)

  • Or you don’t guess (and get it right based on knowledge)
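
The two cases above (guess, or answer from knowledge) can be sketched as:

```python
import math

def p_3pl(theta, a, b, c):
    """3PL: guess correctly with probability c, otherwise answer
    from knowledge via the 2PL curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Even a very weak student scores at least the guessing rate c
print(round(p_3pl(-5.0, 1.0, 0.0, 0.25), 3))  # 0.255
```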

Fitting an IRT model

  • Can be done with Expectation Maximization

  • Estimate knowledge and difficulty together

    • Then, given item difficulty estimates, you can assess a student’s knowledge in real time
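
Once item difficulties are known, assessing a student's ability reduces to a one-dimensional maximum-likelihood problem. A rough sketch under the Rasch model (grid search, my own function names; a real fit would use EM or gradient methods):

```python
import math

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties):
    """Grid-search MLE of ability theta, holding item difficulties fixed.
    responses: 0/1 scores; difficulties: matching list of b values."""
    best_theta, best_ll = 0.0, -float("inf")
    for step in range(-40, 41):          # theta grid from -4.0 to 4.0
        theta = step / 10.0
        ll = 0.0
        for c, b in zip(responses, difficulties):
            p = rasch_p(theta, b)
            ll += math.log(p if c == 1 else 1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# A student who passes the easier items but fails the harder ones
# lands in the middle of the ability scale
print(estimate_theta([1, 1, 1, 0, 0], [-2.0, -1.0, 0.0, 1.0, 2.0]))  # 0.6
```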

Uses…

  • IRT is used quite a bit in computer-adaptive testing

  • Not used quite so often in online learning, where student knowledge is changing as we assess it

ELO (Elo & Sloan, 1978; Pelánek, 2016)

  • A variant of the Rasch model which can be used in a running system

  • Continually estimates item difficulty and student ability, updating both every time a student encounters an item

ELO (Elo & Sloan, 1978; Pelánek, 2016)

  • You may know of ELO from Chess or Pokemon rankings!

ELO (Elo & Sloan, 1978; Pelánek, 2016)

\[ \theta_{i+1} = \theta_{i} + K(c-P(c)) \]

\[ b_{i+1} = b_{i} - K(c-P(c)) \]

  • Where K is a parameter for how strongly the model should consider new information
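
One Elo step in Python (K = 0.4 is an arbitrary illustrative value). Note the difficulty estimate moves in the opposite direction from the ability estimate: a correct answer is evidence the student is stronger and the item easier.

```python
import math

def elo_update(theta, b, correct, k=0.4):
    """One Elo step: move both estimates by the prediction error.
    correct is 1 or 0; p is the Rasch-style predicted probability."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    theta_new = theta + k * (correct - p)  # ability rises after a correct answer
    b_new = b - k * (correct - p)          # item looks easier when students succeed
    return theta_new, b_new

# An unexpected correct answer on a hard item shifts both estimates a lot
theta, b = elo_update(0.0, 1.0, 1)
print(round(theta, 3), round(b, 3))  # 0.292 0.708
```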

Multivariate ELO (Abdi et al, 2019)

  • Allows an item to involve multiple skills

  • Averages difficulty across skills
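
A rough sketch of the averaging idea (my paraphrase, not the exact Abdi et al. formulation): predict correctness from the mean of (ability − difficulty) across the item's skills.

```python
import math

def predict_multiskill(thetas, bs):
    """Predict correctness for an item tagged with several skills by
    averaging (ability - difficulty) across those skills."""
    avg = sum(t - b for t, b in zip(thetas, bs)) / len(thetas)
    return 1.0 / (1.0 + math.exp(-avg))

# Item taps two skills the student knows unevenly; the effects average out
print(predict_multiskill([1.0, -1.0], [0.0, 0.0]))  # 0.5
```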

MV-Glicko (Abdi, Khosravi, & Sadiq, 2021)

  • Allows an item to involve multiple skills

  • Averages difficulty across skills

  • Takes time between practices of a skill into account (much like LKT extensions to PFA)

Using ELO in the real world

  • ELO is used in several real-world systems

    • MathGarden

    • Slepemapy.cz

    • Top Parent

    • Shadowspect

  • There are some issues that need to be considered to use it sanely in a real-world system (Ruiperez-Valiente et al., 2023)

Using ELO in real world

  • The biggest is that you don’t actually want to let ELO item values change in real time

  • If you let that happen, then two students in the same classroom, with the exact same history of correct and incorrect answers on the exact same items

  • Will end up with different knowledge estimates

Using ELO in real world

  • The solution: fit item parameters with initial data set

  • Then keep item parameters constant during actual real-world use

  • This is generally a good policy – if you don’t do this, the first several students will have really inaccurate knowledge estimates

When to use ELO

  • Items have very different difficulty within skills

  • Relatively small amounts of data are OK

  • New items can be added without refitting the entire model – but you have to wait a little while to get valid estimates for those new items