
ZPDES vs DQN

A technical comparison of Multi-Armed Bandit and Deep Reinforcement Learning approaches for adaptive cognitive training

Introduction

Adaptive learning systems aim to personalize educational content to each learner's needs. Two prominent approaches have emerged from recent research:

ZPDES

Zone of Proximal Development and Empirical Success

  • Multi-Armed Bandit algorithm
  • Learning Progress Hypothesis
  • INRIA Flowers Lab (2013-2024)
  • Deployed in French schools

Our DQN Approach

Deep Q-Network with Flow Zone Optimization

  • Deep Reinforcement Learning
  • Multi-component reward function
  • Cog-Ace (2024)
  • Optimized for engagement + learning

Reference: ZPDES Thesis

Title: Development and evaluation of AI-based personalization algorithms for attention training
Author: Maxime Adolphe
Institution: Université de Bordeaux, INRIA Flowers Lab
Year: 2024
https://theses.hal.science/tel-04884647

Theoretical Foundations

ZPDES: Learning Progress Hypothesis

The Learning Progress Hypothesis (LPH) posits that humans are intrinsically motivated to engage in activities where they experience measurable improvement. Neural circuits reward situations of progress, directing learning toward maximally satisfying experiences.

Core Principle:

Motivation ∝ Learning Progress

DQN: Flow Theory + Engagement

Flow Theory (Csikszentmihalyi) identifies an optimal zone where challenge matches skill. We extend this with explicit engagement modeling, treating dropout as a catastrophic outcome to be prevented through reward shaping.

Core Principle:

Outcome = f(Flow, Engagement, Retention)

Mathematical Formulations

ZPDES Algorithm

1. Learning Progress Estimation

For activity a with window size L, the learning progress is:

LP(a) = R̄_recent(a) - R̄_older(a)

where:

R̄_recent(a) = (1/⌈L/2⌉) × Σ_{i=⌊L/2⌋+1}^{L} r_i(a)

R̄_older(a) = (1/⌊L/2⌋) × Σ_{i=1}^{⌊L/2⌋} r_i(a)

r_i(a) ∈ {0, 1} is the success/failure on trial i for activity a
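The windowed estimate above can be sketched in a few lines (function and variable names are illustrative, not from the thesis):

```python
# Sketch of the learning-progress estimate: mean success over the recent
# half-window minus mean success over the older half-window.
def learning_progress(outcomes, L=10):
    """outcomes: list of 0/1 results for one activity, most recent last."""
    window = outcomes[-L:]
    half = len(window) // 2          # older half gets floor(L/2) trials
    older, recent = window[:half], window[half:]
    if not older or not recent:
        return 0.0                   # too little data to estimate progress
    return sum(recent) / len(recent) - sum(older) / len(older)

# A learner improving from mostly-failure to mostly-success shows positive LP:
print(learning_progress([0, 0, 0, 1, 0, 1, 1, 0, 1, 1]))  # 0.8 - 0.2 = 0.6
```

A flat success rate (all ones or all zeros) yields LP = 0, which is exactly why ZPDES steers away from both mastered and hopeless activities.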

2. Activity Selection (UCB-style)

Select activity from the Zone of Proximal Development using Upper Confidence Bound:

a* = argmax_{a ∈ ZPD} [ LP(a) + c × √(ln(N) / n(a)) ]

where:

• ZPD = {a : all prerequisites of a are mastered}

• N = total trials across all activities

• n(a) = trials on activity a

• c = exploration constant
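The selection rule above, as a hedged sketch (activity names, the zero-trial tie-break, and c = 0.5 are assumptions):

```python
import math

# UCB-style selection over the ZPD: exploit high learning progress,
# but keep an exploration bonus for under-sampled activities.
def select_activity(zpd, lp, n, N, c=0.5):
    """zpd: candidate activities; lp: activity -> LP estimate;
    n: activity -> trial count; N: total trials across activities."""
    def score(a):
        if n.get(a, 0) == 0:
            return float("inf")      # untried activities are sampled first
        return lp[a] + c * math.sqrt(math.log(N) / n[a])
    return max(zpd, key=score)

zpd = ["count_to_10", "count_to_20"]
lp = {"count_to_10": 0.05, "count_to_20": 0.30}
n = {"count_to_10": 40, "count_to_20": 8}
print(select_activity(zpd, lp, n, N=48))  # count_to_20: high LP, under-explored
```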

3. ZPD Update Rules

# Mastery condition:

mastered(a) = True if R̄_recent(a) > θ_master

# ZPD expansion on mastery:

if mastered(a): ZPD ← ZPD ∪ {successors(a)}

# ZPD adaptation on plateau:

if LP(a) < θ_plateau: ZPD ← ZPD ∪ {alternatives(a)} \ {a}
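The three update rules combine into one small function; the graph encoding (successor/alternative dicts) and threshold defaults are illustrative assumptions:

```python
# Sketch of the ZPD update rules: expand on mastery, swap out on plateau.
def update_zpd(zpd, a, recent_mean, lp, successors, alternatives,
               theta_master=0.75, theta_plateau=0.05):
    zpd = set(zpd)
    if recent_mean > theta_master:          # mastery: unlock successor activities
        zpd |= set(successors.get(a, ()))
    if lp < theta_plateau:                  # plateau: offer alternatives, retire a
        zpd |= set(alternatives.get(a, ()))
        zpd.discard(a)
    return zpd

succ = {"count_to_10": ["count_to_20"]}
alt = {"count_to_10": ["count_objects"]}
print(update_zpd({"count_to_10"}, "count_to_10", 0.9, 0.2, succ, alt))
```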

Our DQN Algorithm

1. State Space Definition

9-dimensional continuous state vector:

s = [s_0, s_1, …, s_8]^T ∈ ℝ^9

where:

s_0 = ability_score / 100          ∈ [0, 1]
s_1 = uncertainty / 50             ∈ [0, 1]
s_2 = min(session_count / 100, 1)  ∈ [0, 1]
s_3 = recent_accuracy              ∈ [0, 1]
s_4 = rt_trend                     ∈ [-1, 1]
s_5 = dprime_trend                 ∈ [-1, 1]
s_6 = current_difficulty           ∈ [0, 1]
s_7 = min(trials / 100, 1)         ∈ [0, 1]
s_8 = session_accuracy             ∈ [0, 1]
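One way to assemble this vector from raw telemetry (the input field names and the explicit clipping are assumptions; the text only states the ranges):

```python
# Build the 9-dimensional state from a dict of raw user features.
def build_state(u):
    clip = lambda x, lo, hi: max(lo, min(hi, x))
    return [
        clip(u["ability_score"] / 100, 0.0, 1.0),  # s_0: normalized ability
        clip(u["uncertainty"] / 50, 0.0, 1.0),     # s_1: estimator uncertainty
        min(u["session_count"] / 100, 1.0),        # s_2: lifetime sessions, capped
        u["recent_accuracy"],                      # s_3
        clip(u["rt_trend"], -1.0, 1.0),            # s_4: response-time trend
        clip(u["dprime_trend"], -1.0, 1.0),        # s_5: sensitivity trend
        u["current_difficulty"],                   # s_6
        min(u["trials"] / 100, 1.0),               # s_7: trials, capped
        u["session_accuracy"],                     # s_8
    ]

u = {"ability_score": 62, "uncertainty": 10, "session_count": 3,
     "recent_accuracy": 0.7, "rt_trend": -0.2, "dprime_trend": 0.1,
     "current_difficulty": 0.5, "trials": 25, "session_accuracy": 0.72}
print(build_state(u))  # 9 values, each within its stated range
```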

2. Action Space

A = {0, 1, 2, 3}

where:

a = 0: DECREASE      → d' = max(0, d - 0.25)

a = 1: MAINTAIN      → d' = d

a = 2: INCREASE      → d' = min(1, d + 0.25)

a = 3: MICRO_ADJUST  → d' = d + 0.2 × (0.75 - acc)
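The action table above, written out as a function (a direct transcription; only the error-handling branch is added):

```python
# Map a discrete action to the next difficulty d' given current difficulty d
# and recent accuracy acc.
def apply_action(a, d, acc):
    if a == 0:                              # DECREASE
        return max(0.0, d - 0.25)
    if a == 1:                              # MAINTAIN
        return d
    if a == 2:                              # INCREASE
        return min(1.0, d + 0.25)
    if a == 3:                              # MICRO_ADJUST: nudge toward 75% accuracy
        return d + 0.2 * (0.75 - acc)
    raise ValueError(f"unknown action {a}")

print(apply_action(3, 0.5, 0.55))  # accuracy below target -> small increase
```

Note how MICRO_ADJUST is self-centering: it raises difficulty when accuracy exceeds 75% is not met, and lowers it when the learner overshoots the target.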

3. Multi-Component Reward Function

Total reward is a weighted sum of five components:

R(s, a, s') = w_f·R_flow + w_e·R_engage + w_d·R_dropout + w_i·R_improve + w_t·R_time

Flow Zone Reward (w_f = 0.40):

R_flow = exp(-0.5 × ((acc - 0.75) / σ)^2)

σ = 0.12 (widened to σ × 1.5 for struggling students)

Engagement Reward (w_e = 0.20):

R_engage = { +1.0 if completed, -3.0 × m if dropped }

m = 1.5 for struggling students, 1.0 otherwise

Dropout Penalty (w_d = 0.20):

R_dropout = { -3.0 × 1.5 if early dropout (< 15 trials), -3.0 otherwise }

Improvement Reward (w_i = 0.10):

R_improve = clip(Δability / 5, -0.5, +0.5)

Response Time Reward (w_t = 0.10):

R_time = { +0.5 if RT ∈ [400, 4000] ms, penalty otherwise }
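A minimal sketch of the weighted sum; the response-time penalty value (-0.5) and zeroing the dropout term on completed sessions are assumptions the text leaves unspecified:

```python
import math

# Weighted five-component reward for one session outcome.
def reward(acc, completed, early_dropout, d_ability, rt_ms, struggling=False):
    sigma = 0.12 * (1.5 if struggling else 1.0)       # wider flow zone if struggling
    r_flow = math.exp(-0.5 * ((acc - 0.75) / sigma) ** 2)
    r_engage = 1.0 if completed else -3.0 * (1.5 if struggling else 1.0)
    r_dropout = 0.0 if completed else (-3.0 * 1.5 if early_dropout else -3.0)
    r_improve = max(-0.5, min(0.5, d_ability / 5))    # clipped ability gain
    r_time = 0.5 if 400 <= rt_ms <= 4000 else -0.5    # penalty value assumed
    return (0.40 * r_flow + 0.20 * r_engage + 0.20 * r_dropout
            + 0.10 * r_improve + 0.10 * r_time)

# A completed session at exactly 75% accuracy with plausible RTs:
print(reward(0.75, True, False, 0.0, 1000))  # 0.40 + 0.20 + 0 + 0 + 0.05 = 0.65
```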

4. Q-Network and Bellman Update

Neural Network Architecture:

Q_θ: ℝ^9 → ℝ^4

Q_θ(s) = W_4 · ReLU(W_3 · ReLU(W_2 · ReLU(W_1 · s + b_1) + b_2) + b_3) + b_4

Layer sizes: 9 → 128 → 64 → 32 → 4

Action Selection:

a* = argmax_a Q_θ(s, a)   with probability 1 - ε

a* ~ Uniform(A)           with probability ε

Loss Function (TD Error):

L(θ) = 𝔼_{(s,a,r,s')~D}[(r + γ · max_{a'} Q_θ̄(s', a') - Q_θ(s, a))^2]

γ = 0.95 (discount factor); θ̄ = target-network parameters (soft update, τ = 0.005)
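A numerical sketch of the 9 → 128 → 64 → 32 → 4 network and the TD loss, in NumPy with random weights; a real agent would train θ with an optimizer and a replay buffer, which this sketch omits:

```python
import numpy as np

# Initialize (W, b) pairs for each layer of the Q-network.
def init_params(rng, sizes=(9, 128, 64, 32, 4)):
    return [(rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# Forward pass: ReLU hidden layers, linear head with one Q-value per action.
def q_values(params, s):
    x = np.asarray(s, dtype=float)
    for W, b in params[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = params[-1]
    return W @ x + b

# Squared TD error against a bootstrapped target from the frozen network θ̄.
def td_loss(params, target_params, s, a, r, s_next, gamma=0.95):
    target = r + gamma * np.max(q_values(target_params, s_next))
    return float((target - q_values(params, s)[a]) ** 2)

rng = np.random.default_rng(0)
theta, theta_bar = init_params(rng), init_params(rng)
s = np.full(9, 0.5)
print(q_values(theta, s).shape, td_loss(theta, theta_bar, s, 2, 0.65, s))
```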

5. IRT-Based Student Simulation

2-Parameter Logistic IRT model for response probability:

P(correct | θ, d) = 1 / (1 + exp(-a(θ_eff - b)))

where:

θ     = (ability - 50) / 15            # latent ability
b     = 6d - 3                         # item difficulty
a     = 0.5 + d × (2.0 - 0.5) × 0.5    # discrimination
θ_eff = θ - trials × 0.001             # fatigue effect
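The 2PL simulator above translates directly to code (parameter mappings copied from the text; the function name is illustrative):

```python
import math

# Probability a simulated student answers correctly, given raw ability
# (0-100 scale), normalized difficulty d in [0, 1], and trials so far.
def p_correct(ability, d, trials):
    theta = (ability - 50) / 15              # latent ability
    b = 6 * d - 3                            # item difficulty
    a = 0.5 + d * (2.0 - 0.5) * 0.5          # discrimination
    theta_eff = theta - trials * 0.001       # fatigue effect
    return 1.0 / (1.0 + math.exp(-a * (theta_eff - b)))

# An average learner (ability 50) at mid difficulty (d = 0.5) gives
# theta = 0 and b = 0, so P = 0.5 before any fatigue:
print(p_correct(50, 0.5, 0))  # 0.5
```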

Detailed Comparison

Aspect                  ZPDES                                  Our DQN
Algorithm Class         Multi-Armed Bandit (UCB)               Deep Reinforcement Learning
Learning Signal         Learning Progress gradient             Multi-component reward
State Representation    Binary mastery beliefs per activity    9-dim continuous vector
Temporal Horizon        Myopic (immediate LP)                  Discounted future (γ = 0.95)
Expert Knowledge        Prerequisite graph required            None required
Dropout Modeling        Implicit (low LP → boredom)            Explicit penalty (-3.0 to -4.5)
Model Complexity        O(|activities|) parameters             ~15,000 neural-network parameters
Inference Cost          O(|ZPD|) comparisons                   Single forward pass (~1 ms)
Exploration Strategy    UCB + ZPD constraints                  ε-greedy + cold-start schedule
Transfer Learning       Activity-specific                      Generalizes across games

Key Insight from the Thesis

⚠️ ZPDES Motivation Challenge

The thesis by Adolphe (2024) found that while ZPDES improved performance on trained tasks, motivation and engagement were lower in the personalized groups compared to non-personalized conditions.

This was attributed to cognitive load from rapid difficulty changes and the system's focus on learning progress at the expense of user experience.

How Our Approach Addresses This

1. Adjustment Frequency Control

Minimum 5 trials between difficulty adjustments, preventing rapid oscillation that causes cognitive load.

min_trials_between_actions = 5

2. MICRO_ADJUST Action

Fine-grained adjustments (±10%) instead of coarse level jumps (±25%), providing smoother difficulty curves.

Δd = 0.2 × (0.75 - accuracy)

3. Session Length Bonus

Explicit reward for sustained engagement, not just accuracy optimization. Encourages keeping users in comfortable zones longer.

Rlength = 0.3 × min(trials/50, 1)

4. Warmup Period

Very low dropout probability in first 10 trials (0.1% per trial), giving users time to settle in before adaptation begins.

dropout_prob = 0.001 if trials < 10
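The four safeguards above fit in a handful of small helpers (thresholds are taken from the text; how they are wired into the training loop is an assumption):

```python
# Engagement safeguards: adjustment gating, micro-adjustment, length bonus,
# and warmup dropout probability.
def can_adjust(trials_since_last_action, min_gap=5):
    """Gate difficulty changes: at least min_gap trials between adjustments."""
    return trials_since_last_action >= min_gap

def micro_adjust(d, accuracy):
    """Fine-grained nudge toward the 75% accuracy target."""
    return d + 0.2 * (0.75 - accuracy)

def session_length_bonus(trials):
    """Reward sustained sessions, saturating at 50 trials."""
    return 0.3 * min(trials / 50, 1.0)

def dropout_prob(trials, base_prob):
    """Warmup: hold dropout probability at 0.1% for the first 10 trials."""
    return 0.001 if trials < 10 else base_prob

print(can_adjust(3), session_length_bonus(25), dropout_prob(4, 0.02))
```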

Performance Results

ZPDES (Thesis Results)

  • Task Performance: ✓ Superior to baseline
  • Motivation: ✗ Lower than baseline
  • Engagement: ✗ Lower than baseline
  • Cognitive Load: ⚠ Problematic

Study: n = 72 young adults, n = 50 older adults, 8 hours of training

Our DQN (Simulation Results)

  • Flow Zone Rate: 89% (target: ≥ 65%)
  • Dropout Rate: 7% (target: ≤ 20%)
  • Mean Accuracy: 75.6%
  • Struggling Users: 75% flow rate

Training: 500k steps, IRT-simulated students, 5 archetypes

Per-Archetype Flow Zone Rates (Our DQN)

  • Struggling: 75%
  • Developing: 90%
  • Average: 95%
  • Proficient: 95%
  • Advanced: 90%

Philosophical Differences

Dimension             ZPDES Philosophy                        Our DQN Philosophy
Core Belief           Learning progress = motivation          Flow zone = optimal learning + engagement
Primary Goal          Maximize learning rate                  Maximize time in flow while minimizing dropout
Dropout View          Side effect of low progress             Primary outcome to optimize against
Expert Knowledge      Leverage curriculum structure           Learn everything from data
Complexity Trade-off  Simple but requires domain expertise    Complex but fully automated

Conclusion

Both ZPDES and our DQN approach represent valid solutions to the adaptive learning challenge, with different trade-offs:

Choose ZPDES when:

  • Expert curriculum knowledge is available
  • Interpretability is critical
  • Training data is limited
  • Activities have clear prerequisites
  • Simple deployment is needed

Choose DQN when:

  • User retention is paramount
  • No expert knowledge is available
  • Rich user telemetry exists
  • Cross-game generalization is needed
  • Long-term optimization matters

The thesis's finding that ZPDES reduced motivation while improving performance highlights the importance of explicitly modeling engagement—which our multi-component reward function addresses directly through dropout penalties and session length bonuses.

References

  • [1] Adolphe, M. (2024). Development and evaluation of AI-based personalization algorithms for attention training. PhD Thesis, Université de Bordeaux. hal:tel-04884647
  • [2] Clément, B., Roy, D., Oudeyer, P.-Y., & Lopes, M. (2015). Multi-Armed Bandits for Intelligent Tutoring Systems. Journal of Educational Data Mining, 7(2), 20-48. arXiv:1310.3174
  • [3] Clément, B., et al. (2024). Improved Performances and Motivation in Intelligent Tutoring Systems: Combining Machine Learning and Learner Choice. hal:04433127
  • [4] Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  • [5] Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row.
  • [6] Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Lawrence Erlbaum Associates.