Module 3 Item Response Theory and ELO
KT Learning Lab 3: A Conceptual Overview
Item Response Theory
A classic approach for assessment, used for decades in tests and some online learning environments
In its classical form, has some key limitations that make it less useful for assessment in online learning
But variants such as ELO and T-IRT address some of those limitations
Speaker Notes:
A classic approach for assessment has been used for decades in tests and some online learning environments. In its classical form, it has some key limitations that make it less useful for assessment in online learning, but variants like ELO and T-IRT address some of those limitations.
Key goal of IRT
Speaker Notes:
The key goal of IRT:
Typical use of IRT
Speaker Notes:
Typical use of IRT is:
Key assumptions
Each learner has ability θ
Each item has difficulty b and discriminability a
From these parameters, we can compute the probability P(θ) that the learner will get the item correct
Speaker Notes:
Other assumptions:
Each student has an ability called theta, and each item has a difficulty b, and discriminability a
From these parameters, we can compute the probability P of theta that the learner will get the current item correct
Note
The assumption that all items tap the same latent construct, but have different difficulties
Is a very different assumption than is seen in PFA or BKT
Speaker Notes:
Now the assumption is that all items tap the same latent construct, but they have different difficulties. This is a very different assumption than seen in PFA or BKT.
The Rasch (1PL) model
Simplest IRT model, very popular
Mathematically the same model (with a different coefficient), but some different practices surrounding the math (that are out of scope for our discussion)
There is an entire special interest group of AERA devoted solely to the Rasch model and modeling related to Rasch (Rasch Measurement)
Speaker Notes:
Let’s start with the Rasch (1PL) model. It’s the simplest IRT model and is very popular. Rasch is mathematically the same model with a different coefficient as another model called 1PL, but there are some different practices surrounding the math.
The Rasch (1PL) model
Speaker Notes:
It’s got no discriminability parameter. But it does have two sets of parameters, a set of parameters for student ability, and a set of parameters for item difficulty.
The Rasch (1PL) model
\[
P(\theta ) = \frac{1}{1+e^{-1(\theta -b)}}
\]
Speaker Notes:
Each learner has the ability theta, each item has difficulty b, and you can compute the probability that the student will get something right by a function involving an exponential function with theta minus b. So we take the ability, we subtract the difficulty, they’re on the same scale.
Item Characteristic Curve
A visualization that shows the relationship between student skill and performance
Speaker Notes:
In IRT, we can understand an item’s difficulty by looking at:
For that item, the relationship between student skill and performance. As you can see on this graph, the x-axis is the theta, student ability, and the y-axis is correctness
When theta equals zero, for this graph, the probability of correctness equals 0.5. This actually has to mean that b is equal to zero because for P of correct to be 0.5, theta and b have to be exactly balanced
As student skill goes up, correctness goes up
Speaker Notes:
Let’s look at the item characteristic curve. This is a visualization that shows the relationship between student skill and performance. This graph represents b equals zero. When theta equals b, knowledge equals difficulty, performance is 50%.
So in other words, if you look across the x-axis and at the thetas, theta zero is 50%, so this has to be b equals zero as well. And the y-axis has the probability of correctness.
As student skill goes up, correctness goes up
Changing difficulty parameter
Speaker Notes:
If we change the difficulty parameter, we get different lines:
If you look at the green line, B equals negative two, that’s an easy item. In this case, the student with theta of three doesn’t improve much, but the student with theta of zero goes up from 50% to 88%. And the student with theta of negative three still goes up a good bit from like 5% to almost 30%
The orange line is B equals two, that’s a hard item. So in other words, now the really good student, even the really good student, whose theta of three has only a 70% chance of getting it right
The student with a theta of zero, the average student, is barely above 10% chance of getting it right. And the student whose theta of negative three, just no hope
Note
The good student finds the easy and medium items almost equally difficult
Speaker Notes:
You’ll notice, that the good student finds the easy and medium items almost equally difficult. If you look over the top right corner of the graph, you can see that way up at the top, if the item is easy or medium, it doesn’t really matter if the student is really good.
Note
The weak student finds the medium and hard items almost equally hard
Speaker Notes:
Similarly, if you look down at the bottom, the weak student finds the medium and hard items almost equally hard. So IRT is most informational, kind of close to the middle for items and close to the middle for students.
Note
When b=θ
Performance is 50%
Speaker Notes:
Now when b equals theta, performance is 50%. This happens all through the spectrum.
If you’ve got a really easy item and a really weak student, they’ll get performance of 50%
If you’ve got a really strong student and a really hard item, you’ll get a performance of 50%
This model accounts for the fact that the difficulty of items and the skill of students can be in balance
The 2PL model
Speaker Notes:
Another simple IRT model that’s pretty popular is the 2PL model. In this one, there’s a discriminability parameter a added.
Different values of a
Speaker Notes:
Now if we look at different values of a, we can see:
The green line has a equals 2, and that gives it higher discriminability. It’s kind of making a faster transition as students get better between being really hard and really easy
The blue line, a equals 0.5, has lower discriminability. It makes a much slower transition as students get better between correctness and incorrectness
Generally, you want items with relatively high discriminability, although in some cases being too high for discriminability isn’t very useful because it only distinguishes a very small set of the range
Extremely high and low discriminability
a=0
a approaches infinity
Model degeneracy
Speaker Notes:
Model degeneracy, which can’t happen in Rasch but can happen in 2PL, occurs when you get a — discriminability below 0. In this case, the model tells you that the smarter you are, the more likely you are to get things wrong. And the less strong you are as a student, the more likely you are to get things right.
The 3PL model
Speaker Notes:
The 3PL model is a more complex model, which adds a guessing parameter — c.
The 3PL model
\[
P(\theta ) = c+ (1-c)\frac{1}{1+e^{-a(\theta -b)}}
\]
Speaker Notes:
The probability you get things right is the probability you guessed it plus the probability you didn’t guess times the probability you knew it
So either you guess and you just get it right, or you don’t guess, and then you get it right based on knowledge
Uses…
IRT is used quite a bit in computer-adaptive testing
Not used quite so often in online learning, where student knowledge is changing as we assess it
Speaker Notes:
IRT has been used quite a bit in assessment. On the online world, you see it in computer adaptive testing. Testing that tries to infer what a student knows and tries to give items the right level of difficulty and discriminability to better pinpoint what the student knows. It’s not used quite so often in online learning where student knowledge is changing as we assess it. IRT doesn’t have any provision for that. For those situations, BKT and PFA are more popular.
ELO (Elo & Sloan, 1978; Pelánek, 2016)
A variant of the Rasch model which can be used in a running system
Continually estimates item difficulty and student ability, updating both every time a student encounters an item
Speaker Notes:
It’s worth briefly mentioning one extension to IRT, which is ELO:
It’s a variant of a Rasch model that can be used in a running system, which is a big difference than classical IRT. It continually estimates both item difficulty and the student’s ability, updating both every time a student encounters an item
ELO (Elo & Sloan, 1978; Pelánek, 2016)
You may know of ELO from Chess or Pokemon rankings!
Speaker Notes:
ELO has been around for a long time. It was used to rank chess players
ELO (Elo & Sloan, 1978; Pelánek, 2016)
\[
\theta_{i+1} = \theta_{i} + K(c-P(c))
\]
\[
b_{i+1} = b_{i} + K(c-P(c))
\]
Where K is a parameter for how strongly the model should consider new information
Speaker Notes:
ELO does this based on two very simple formulas, where it changes the student parameter and it changes the item parameter every time it encounters a new experience, taking the difference between the student’s actual response and the predicted response, and weighting that by some factor k, a parameter for how strongly the model should consider new information when updating the new estimates of ability and item difficulty.
Multivariate ELO (Abdi et al, 2019)
Speaker Notes:
An important extension to ELO is:
Multivariate ELO (Abdi et al., 2019), which allows an item to involve multiple skills and averages difficulty across skills
MV-Glicko (Abdi, Khosravi, & Sadiq, 2021 )
Allows an item to involve multiple skills
Averages difficulty across skills
Takes time between practices of a skill into account (much like LKT extensions to PFA)
Using ELO in the real world
ELO is used in several real-world systems
MathGarden
Slepempapy.cz
Top Parent
Shadowspect
There are some issues that need to be considered to use it sanely in a real-world system (Ruiperez-Valiente et al., 2023)
Using ELO in real world
The biggest is that you don’t actually want to let ELO item values change in real time
If you let that happen, then two students in the same classroom, with the exact same history of correct and incorrect, on the exact same items
Will end up with different knowledge estimates
Using ELO in real world
The solution: fit item parameters with initial data set
Then keep item parameters constant during actual real-world use
This is generally a good policy – if you don’t do this, the first several students will have really inaccurate knowledge estimates
When to use ELO
Items have very different difficulty within skills
Relatively small amounts of data is OK
New items can be added without refitting the entire model – but you have to wait a little while to get valid estimates for those new items