Introduction
Adaptive learning systems aim to personalize educational content to each learner's needs. Two prominent approaches have emerged from recent research:
ZPDES
Zone of Proximal Development and Empirical Success
- Multi-Armed Bandit algorithm
- Learning Progress Hypothesis
- INRIA Flowers Lab (2013-2024)
- Deployed in French schools
Our DQN Approach
Deep Q-Network with Flow Zone Optimization
- Deep Reinforcement Learning
- Multi-component reward function
- Cog-Ace (2024)
- Optimized for engagement + learning
Reference: ZPDES Thesis
Title: Development and evaluation of AI-based personalization algorithms for attention training
Author: Maxime Adolphe
Institution: Université de Bordeaux, INRIA Flowers Lab
Year: 2024
https://theses.hal.science/tel-04884647
Theoretical Foundations
ZPDES: Learning Progress Hypothesis
The Learning Progress Hypothesis (LPH) posits that humans are intrinsically motivated to engage in activities where they experience measurable improvement. Neural circuits reward situations of progress, directing learning toward maximally satisfying experiences.
Core Principle:
Motivation ∝ Learning Progress
DQN: Flow Theory + Engagement
Flow Theory (Csikszentmihalyi) identifies an optimal zone where challenge matches skill. We extend this with explicit engagement modeling, treating dropout as a catastrophic outcome to be prevented through reward shaping.
Core Principle:
Outcome = f(Flow, Engagement, Retention)
Mathematical Formulations
ZPDES Algorithm
1. Learning Progress Estimation
For activity a with window size L, the learning progress is:
LP(a) = R̄_recent(a) − R̄_older(a)
where:
R̄_recent(a) = (1/⌈L/2⌉) × Σ_{i=⌊L/2⌋+1}^{L} r_i(a)
R̄_older(a) = (1/⌊L/2⌋) × Σ_{i=1}^{⌊L/2⌋} r_i(a)
r_i(a) ∈ {0, 1} is the success/failure outcome of trial i on activity a
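As a sketch, the windowed LP estimate above can be computed directly from an activity's recent outcome history. Function and variable names here are illustrative, not taken from the thesis code:

```python
def learning_progress(outcomes, L=10):
    """Windowed learning-progress estimate for one activity.

    outcomes : list of binary trial results (1 = success, 0 = failure)
    L        : window size; the last L outcomes are split into an
               'older' half (floor(L/2)) and a 'recent' half (ceil(L/2)).
    Returns mean(recent) - mean(older), i.e. LP(a) as defined above.
    """
    window = outcomes[-L:]
    half = len(window) // 2          # floor -> size of the older half
    older, recent = window[:half], window[half:]
    if not older or not recent:      # too few trials to estimate LP
        return 0.0
    return sum(recent) / len(recent) - sum(older) / len(older)
```

A learner who goes from all failures to all successes inside the window gets the maximum LP of 1.0, while a flat success rate yields an LP near zero.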
2. Activity Selection (UCB-style)
Select activity from the Zone of Proximal Development using Upper Confidence Bound:
a* = argmax_{a∈ZPD} [ LP(a) + c × √(ln N / n(a)) ]
where:
• ZPD = {a : all prerequisites of a are mastered}
• N = total trials across all activities
• n(a) = trials on activity a
• c = exploration constant
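The UCB selection rule above can be sketched as follows; the tie-handling for unvisited activities (forcing at least one trial each) is an assumption, not specified in the source:

```python
import math

def select_activity(zpd, lp, n, N, c=1.0):
    """UCB-style activity selection restricted to the ZPD.

    zpd : set of candidate activities (prerequisites mastered)
    lp  : dict activity -> learning-progress estimate LP(a)
    n   : dict activity -> trial count n(a)
    N   : total trials across all activities
    c   : exploration constant
    """
    def ucb_score(a):
        if n.get(a, 0) == 0:
            return float("inf")   # assumption: try unvisited activities first
        return lp[a] + c * math.sqrt(math.log(N) / n[a])
    return max(zpd, key=ucb_score)
```

With equal trial counts the exploration bonus cancels out and the activity with the highest learning progress wins; an untried activity always gets priority.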
3. ZPD Update Rules
# Mastery condition:
mastered(a) = True if R̄_recent(a) > θ_master
# ZPD expansion on mastery:
if mastered(a): ZPD ← ZPD ∪ successors(a)
# ZPD adaptation on plateau:
if LP(a) < θ_plateau: ZPD ← (ZPD ∪ alternatives(a)) \ {a}
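A minimal sketch of these update rules, assuming the prerequisite graph is supplied as plain dicts mapping each activity to its successors and alternatives (the thresholds and data layout are assumptions for illustration):

```python
def update_zpd(zpd, a, recent_mean, lp, successors, alternatives,
               theta_master=0.75, theta_plateau=0.0):
    """Apply the mastery and plateau rules to the ZPD set.

    recent_mean : R̄_recent(a), the recent success rate on activity a
    lp          : LP(a), the learning-progress estimate for a
    successors, alternatives : dict activity -> iterable of activities
    """
    zpd = set(zpd)
    if recent_mean > theta_master:              # mastery -> open successors
        zpd |= set(successors.get(a, ()))
    if lp < theta_plateau:                      # plateau -> swap in alternatives
        zpd |= set(alternatives.get(a, ()))
        zpd.discard(a)
    return zpd
```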
Our DQN Algorithm
1. State Space Definition
9-dimensional continuous state vector:
s = [s0, s1, ..., s8]ᵀ ∈ ℝ⁹
where:
s0 = ability_score / 100 ∈ [0, 1]
s1 = uncertainty / 50 ∈ [0, 1]
s2 = min(session_count / 100, 1) ∈ [0, 1]
s3 = recent_accuracy ∈ [0, 1]
s4 = rt_trend ∈ [-1, 1]
s5 = dprime_trend ∈ [-1, 1]
s6 = current_difficulty ∈ [0, 1]
s7 = min(trials / 100, 1) ∈ [0, 1]
s8 = session_accuracy ∈ [0, 1]
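Assembling this state vector is straightforward; the dict keys below are illustrative names for the quantities defined above:

```python
def build_state(m):
    """Build the 9-dim state vector from raw user metrics.

    m is a dict with the raw (unnormalized) quantities; each entry
    is scaled into its documented range before entering the vector.
    """
    return [
        m["ability_score"] / 100,              # s0 in [0, 1]
        m["uncertainty"] / 50,                 # s1 in [0, 1]
        min(m["session_count"] / 100, 1.0),    # s2, capped at 1
        m["recent_accuracy"],                  # s3 in [0, 1]
        m["rt_trend"],                         # s4 in [-1, 1]
        m["dprime_trend"],                     # s5 in [-1, 1]
        m["current_difficulty"],               # s6 in [0, 1]
        min(m["trials"] / 100, 1.0),           # s7, capped at 1
        m["session_accuracy"],                 # s8 in [0, 1]
    ]
```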
2. Action Space
A = {0, 1, 2, 3}
where:
a = 0: DECREASE → d' = max(0, d - 0.25)
a = 1: MAINTAIN → d' = d
a = 2: INCREASE → d' = min(1, d + 0.25)
a = 3: MICRO_ADJ → d' = d + 0.2 × (0.75 - acc)
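The action-to-difficulty mapping above, written out as a small helper (clipping behavior matches the table; the function name is illustrative):

```python
def apply_action(action, d, acc):
    """Map a discrete action to the new difficulty d'.

    d   : current difficulty in [0, 1]
    acc : recent accuracy, used only by MICRO_ADJ to nudge
          difficulty toward the 75% accuracy target
    """
    if action == 0:                              # DECREASE
        return max(0.0, d - 0.25)
    if action == 1:                              # MAINTAIN
        return d
    if action == 2:                              # INCREASE
        return min(1.0, d + 0.25)
    if action == 3:                              # MICRO_ADJ
        return d + 0.2 * (0.75 - acc)
    raise ValueError(f"unknown action {action}")
```

Note that MICRO_ADJ is proportional: at exactly 75% accuracy it leaves difficulty unchanged, and its step shrinks as accuracy approaches the target.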
3. Multi-Component Reward Function
Total reward is a weighted sum of five components:
R(s, a, s') = w_f·R_flow + w_e·R_engage + w_d·R_dropout + w_i·R_improve + w_t·R_time
Flow Zone Reward (w_f = 0.40):
R_flow = exp(−0.5 × ((acc − 0.75) / σ)²)
σ = 0.12 (widened to σ × 1.5 for struggling students)
Engagement Reward (w_e = 0.20):
R_engage = { +1.0 if session completed; −3.0 × m if dropped }
m = 1.5 for struggling students, 1.0 otherwise
Dropout Penalty (w_d = 0.20):
R_dropout = { −3.0 × 1.5 if early dropout (<15 trials); −3.0 otherwise }
Improvement Reward (w_i = 0.10):
R_improve = clip(Δability / 5, −0.5, +0.5)
Response Time Reward (w_t = 0.10):
R_time = { +0.5 if RT ∈ [400, 4000] ms; penalty otherwise }
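A sketch of the full weighted reward, combining the five components with the weights above. The response-time penalty value (−0.5) and the boolean inputs are assumptions for illustration:

```python
import math

def flow_reward(acc, struggling=False, sigma=0.12):
    """Gaussian flow-zone reward centred on 75% accuracy."""
    if struggling:
        sigma *= 1.5                 # widen the zone for struggling students
    return math.exp(-0.5 * ((acc - 0.75) / sigma) ** 2)

def total_reward(acc, completed, dropped, early, d_ability, rt_ok,
                 struggling=False):
    """Weighted sum of the five reward components.

    early  : True if dropout occurred before 15 trials
    rt_ok  : True if response time fell in [400, 4000] ms
    """
    r_flow = flow_reward(acc, struggling)
    r_engage = 1.0 if completed else -3.0 * (1.5 if struggling else 1.0)
    r_dropout = (-4.5 if early else -3.0) if dropped else 0.0
    r_improve = max(-0.5, min(0.5, d_ability / 5))
    r_time = 0.5 if rt_ok else -0.5  # penalty magnitude is an assumption
    return (0.40 * r_flow + 0.20 * r_engage + 0.20 * r_dropout
            + 0.10 * r_improve + 0.10 * r_time)
```

At the flow-zone centre (75% accuracy, session completed, good response times) the reward peaks at 0.65; an early dropout pulls the total sharply negative, which is what steers the policy away from dropout-inducing difficulty settings.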
4. Q-Network and Bellman Update
Neural Network Architecture:
Q_θ: ℝ⁹ → ℝ⁴
Q_θ(s) = W₄ · ReLU(W₃ · ReLU(W₂ · ReLU(W₁ · s + b₁) + b₂) + b₃) + b₄
Layer sizes: 9 → 128 → 64 → 32 → 4
Action Selection:
a* = argmax_a Q_θ(s, a)   with probability 1 − ε
a* ~ Uniform(A)           with probability ε
Loss Function (TD Error):
L(θ) = 𝔼_{(s,a,r,s')~D} [ (r + γ · max_{a'} Q_θ̄(s', a') − Q_θ(s, a))² ]
γ = 0.95 (discount factor); θ̄ = target network parameters (soft update, τ = 0.005)
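A dependency-free sketch of the forward pass and the Bellman target; a real implementation would use a deep-learning framework, and the weight initialization here is illustrative:

```python
import random

random.seed(0)
SIZES = [9, 128, 64, 32, 4]          # layer widths from the architecture above

def init_layer(n_in, n_out):
    """Small random weight matrix and zero bias (illustrative init)."""
    W = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

params = [init_layer(a, b) for a, b in zip(SIZES[:-1], SIZES[1:])]

def q_forward(s, params):
    """Forward pass through the 9 -> 128 -> 64 -> 32 -> 4 MLP.

    ReLU is applied on hidden layers only; the output layer is linear
    and yields one Q-value per action.
    """
    x = list(s)
    for idx, (W, b) in enumerate(params):
        y = [sum(w * v for w, v in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if idx < len(params) - 1:
            y = [max(0.0, v) for v in y]          # ReLU on hidden layers
        x = y
    return x

def td_target(r, s_next, done, target_params, gamma=0.95):
    """Bellman target r + γ · max_a' Q_θ̄(s', a'), zero future value if done."""
    return r if done else r + gamma * max(q_forward(s_next, target_params))
```

The TD loss then compares `td_target` (computed with the target network θ̄) against the online network's Q-value for the taken action.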
5. IRT-Based Student Simulation
2-Parameter Logistic IRT model for response probability:
P(correct | θ, d) = 1 / (1 + exp(−a(θ_eff − b)))
where:
θ = (ability − 50) / 15            # latent ability
b = 6d − 3                         # item difficulty
a = 0.5 + d × (2.0 − 0.5) × 0.5    # discrimination
θ_eff = θ − trials × 0.001         # fatigue effect
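The 2PL response model above translates directly into code; the function name is illustrative:

```python
import math

def p_correct(ability, d, trials=0):
    """2PL IRT response probability for a simulated student.

    ability : raw ability score (centred at 50, scaled by 15)
    d       : difficulty setting in [0, 1]
    trials  : trial count, driving a small fatigue drift in ability
    """
    theta = (ability - 50) / 15            # latent ability
    b = 6 * d - 3                          # item difficulty
    a = 0.5 + d * (2.0 - 0.5) * 0.5        # discrimination
    theta_eff = theta - trials * 0.001     # fatigue effect
    return 1 / (1 + math.exp(-a * (theta_eff - b)))
```

At average ability (50) and mid difficulty (d = 0.5) the model yields exactly a 50% success probability, since θ = 0 and b = 0 coincide there.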
Detailed Comparison
| Aspect | ZPDES | Our DQN |
|---|---|---|
| Algorithm Class | Multi-Armed Bandit (UCB) | Deep Reinforcement Learning |
| Learning Signal | Learning Progress gradient | Multi-component reward |
| State Representation | Binary mastery beliefs per activity | 9-dim continuous vector |
| Temporal Horizon | Myopic (immediate LP) | Discounted future (γ=0.95) |
| Expert Knowledge | Prerequisite graph required | None required |
| Dropout Modeling | Implicit (low LP → boredom) | Explicit penalty (-3.0 to -4.5) |
| Model Complexity | O(|activities|) parameters | ~15,000 neural network params |
| Inference Cost | O(|ZPD|) comparisons | Single forward pass (~1ms) |
| Exploration Strategy | UCB + ZPD constraints | ε-greedy + cold-start schedule |
| Transfer Learning | Activity-specific | Generalizes across games |
Key Insight from the Thesis
⚠️ ZPDES Motivation Challenge
The thesis by Adolphe (2024) found that while ZPDES improved performance on trained tasks, motivation and engagement were lower in the personalized groups compared to non-personalized conditions.
This was attributed to cognitive load from rapid difficulty changes and the system's focus on learning progress at the expense of user experience.
How Our Approach Addresses This
1. Adjustment Frequency Control
Minimum 5 trials between difficulty adjustments, preventing rapid oscillation that causes cognitive load.
2. MICRO_ADJUST Action
Fine-grained adjustments (±10%) instead of coarse level jumps (±25%), providing smoother difficulty curves.
3. Session Length Bonus
Explicit reward for sustained engagement, not just accuracy optimization. Encourages keeping users in comfortable zones longer.
4. Warmup Period
Very low dropout probability in first 10 trials (0.1% per trial), giving users time to settle in before adaptation begins.
Performance Results
ZPDES (Thesis Results)
- Task Performance: ✓ Superior to baseline
- Motivation: ✗ Lower than baseline
- Engagement: ✗ Lower than baseline
- Cognitive Load: ⚠ Problematic
Study: n=72 young adults, n=50 older adults, 8 hours training
Our DQN (Simulation Results)
- Flow Zone Rate: 89% (target: ≥65%)
- Dropout Rate: 7% (target: ≤20%)
- Mean Accuracy: 75.6%
- Struggling Users: 75% flow rate
Training: 500k steps, IRT-simulated students, 5 archetypes
[Figure: per-archetype flow zone rates for our DQN]
Philosophical Differences
| Dimension | ZPDES Philosophy | Our DQN Philosophy |
|---|---|---|
| Core Belief | Learning progress = motivation | Flow zone = optimal learning + engagement |
| Primary Goal | Maximize learning rate | Maximize time in flow while minimizing dropout |
| Dropout View | Side effect of low progress | Primary outcome to optimize against |
| Expert Knowledge | Leverage curriculum structure | Learn everything from data |
| Complexity Trade-off | Simple but requires domain expertise | Complex but fully automated |
Conclusion
Both ZPDES and our DQN approach represent valid solutions to the adaptive learning challenge, with different trade-offs:
Choose ZPDES when:
- Expert curriculum knowledge is available
- Interpretability is critical
- Training data is limited
- Activities have clear prerequisites
- Simple deployment is needed
Choose DQN when:
- User retention is paramount
- No expert knowledge is available
- Rich user telemetry exists
- Cross-game generalization is needed
- Long-term optimization matters
The thesis's finding that ZPDES reduced motivation while improving performance highlights the importance of explicitly modeling engagement—which our multi-component reward function addresses directly through dropout penalties and session length bonuses.
References
- [1] Adolphe, M. (2024). Development and evaluation of AI-based personalization algorithms for attention training. PhD Thesis, Université de Bordeaux. hal:tel-04884647
- [2] Clément, B., Roy, D., Oudeyer, P.-Y., & Lopes, M. (2015). Multi-Armed Bandits for Intelligent Tutoring Systems. Journal of Educational Data Mining, 7(2), 20-48. arXiv:1310.3174
- [3] Clément, B., et al. (2024). Improved Performances and Motivation in Intelligent Tutoring Systems: Combining Machine Learning and Learner Choice. hal:04433127
- [4] Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
- [5] Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row.
- [6] Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Lawrence Erlbaum Associates.