
ZPDES vs DQN

A technical comparison of Multi-Armed Bandit and Deep Reinforcement Learning approaches for adaptive cognitive training

Introduction

Adaptive learning systems aim to personalize educational content to each learner's needs. Two prominent approaches have emerged from recent research:

ZPDES

Zone of Proximal Development and Empirical Success

  • Multi-Armed Bandit algorithm
  • Learning Progress Hypothesis
  • INRIA Flowers Lab (2013-2024)
  • Deployed in French schools

Our DQN Approach

Deep Q-Network with Flow Zone Optimization

  • Deep Reinforcement Learning
  • Multi-component reward function
  • Cog-Ace (2024)
  • Optimized for engagement + learning

Reference: ZPDES Thesis

Title: Development and evaluation of AI-based personalization algorithms for attention training
Author: Maxime Adolphe
Institution: Université de Bordeaux, INRIA Flowers Lab
Year: 2024
https://theses.hal.science/tel-04884647

Theoretical Foundations

ZPDES: Learning Progress Hypothesis

The Learning Progress Hypothesis (LPH) posits that humans are intrinsically motivated to engage in activities where they experience measurable improvement. Neural circuits reward situations of progress, directing learning toward maximally satisfying experiences.

Core Principle:

Motivation ∝ Learning Progress

DQN: Flow Theory + Engagement

Flow Theory (Csikszentmihalyi) identifies an optimal zone where challenge matches skill. We extend this with explicit engagement modeling, treating dropout as a catastrophic outcome to be prevented through reward shaping.

Core Principle:

Outcome = f(Flow, Engagement, Retention)

Mathematical Formulations

ZPDES Algorithm

1. Learning Progress Estimation

For activity a with window size L, the learning progress is:

LP(a) = R̄_recent(a) - R̄_older(a)

where:

R̄_recent(a) = (1/⌈L/2⌉) × Σ_{i=⌊L/2⌋+1}^{L} r_i(a)

R̄_older(a) = (1/⌊L/2⌋) × Σ_{i=1}^{⌊L/2⌋} r_i(a)

r_i(a) ∈ {0, 1} is the success/failure on trial i for activity a
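The windowed estimate above can be sketched in a few lines (function and variable names are illustrative, not from the thesis):

```python
# Sketch of the learning-progress estimate: mean success over the recent
# half-window minus mean success over the older half-window.
def learning_progress(outcomes, L=10):
    """outcomes: list of 0/1 results for one activity, most recent last."""
    window = outcomes[-L:]
    half = len(window) // 2          # older half gets floor(L/2) trials
    older, recent = window[:half], window[half:]
    if not older or not recent:
        return 0.0                   # too little data to estimate progress
    return sum(recent) / len(recent) - sum(older) / len(older)

# A learner improving from mostly-failure to mostly-success shows positive LP:
print(learning_progress([0, 0, 0, 1, 0, 1, 1, 0, 1, 1]))  # 0.8 - 0.2 = 0.6
```

A flat success rate (all ones or all zeros) yields LP = 0, which is exactly why ZPDES steers away from both mastered and hopeless activities.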

2. Activity Selection (UCB-style)

Select activity from the Zone of Proximal Development using Upper Confidence Bound:

a* = argmax_{a ∈ ZPD} [ LP(a) + c × √(ln(N) / n(a)) ]

where:

• ZPD = {a : all prerequisites of a are mastered}

• N = total trials across all activities

• n(a) = trials on activity a

• c = exploration constant
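The selection rule above, as a hedged sketch (activity names, the zero-trial tie-break, and c = 0.5 are assumptions):

```python
import math

# UCB-style selection over the ZPD: exploit high learning progress,
# but keep an exploration bonus for under-sampled activities.
def select_activity(zpd, lp, n, N, c=0.5):
    """zpd: candidate activities; lp: activity -> LP estimate;
    n: activity -> trial count; N: total trials across activities."""
    def score(a):
        if n.get(a, 0) == 0:
            return float("inf")      # untried activities are sampled first
        return lp[a] + c * math.sqrt(math.log(N) / n[a])
    return max(zpd, key=score)

zpd = ["count_to_10", "count_to_20"]
lp = {"count_to_10": 0.05, "count_to_20": 0.30}
n = {"count_to_10": 40, "count_to_20": 8}
print(select_activity(zpd, lp, n, N=48))  # count_to_20: high LP, under-explored
```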

3. ZPD Update Rules

# Mastery condition:

mastered(a) = True if R̄_recent(a) > θ_master

# ZPD expansion on mastery:

if mastered(a): ZPD ← ZPD ∪ {successors(a)}

# ZPD adaptation on plateau:

if LP(a) < θ_plateau: ZPD ← ZPD ∪ {alternatives(a)} \ {a}
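The three update rules combine into one small function; the graph encoding (successor/alternative dicts) and threshold defaults are illustrative assumptions:

```python
# Sketch of the ZPD update rules: expand on mastery, swap out on plateau.
def update_zpd(zpd, a, recent_mean, lp, successors, alternatives,
               theta_master=0.75, theta_plateau=0.05):
    zpd = set(zpd)
    if recent_mean > theta_master:          # mastery: unlock successor activities
        zpd |= set(successors.get(a, ()))
    if lp < theta_plateau:                  # plateau: offer alternatives, retire a
        zpd |= set(alternatives.get(a, ()))
        zpd.discard(a)
    return zpd

succ = {"count_to_10": ["count_to_20"]}
alt = {"count_to_10": ["count_objects"]}
print(update_zpd({"count_to_10"}, "count_to_10", 0.9, 0.2, succ, alt))
```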

Our DQN Algorithm

1. State Space Definition

9-dimensional continuous state vector:

s = [s_0, s_1, …, s_8]^T ∈ ℝ^9

where:

s_0 = ability_score / 100          ∈ [0, 1]
s_1 = uncertainty / 50             ∈ [0, 1]
s_2 = min(session_count / 100, 1)  ∈ [0, 1]
s_3 = recent_accuracy              ∈ [0, 1]
s_4 = rt_trend                     ∈ [-1, 1]
s_5 = dprime_trend                 ∈ [-1, 1]
s_6 = current_difficulty           ∈ [0, 1]
s_7 = min(trials / 100, 1)         ∈ [0, 1]
s_8 = session_accuracy             ∈ [0, 1]
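One way to assemble this vector from raw telemetry (the input field names and the explicit clipping are assumptions; the text only states the ranges):

```python
# Build the 9-dimensional state from a dict of raw user features.
def build_state(u):
    clip = lambda x, lo, hi: max(lo, min(hi, x))
    return [
        clip(u["ability_score"] / 100, 0.0, 1.0),  # s_0: normalized ability
        clip(u["uncertainty"] / 50, 0.0, 1.0),     # s_1: estimator uncertainty
        min(u["session_count"] / 100, 1.0),        # s_2: lifetime sessions, capped
        u["recent_accuracy"],                      # s_3
        clip(u["rt_trend"], -1.0, 1.0),            # s_4: response-time trend
        clip(u["dprime_trend"], -1.0, 1.0),        # s_5: sensitivity trend
        u["current_difficulty"],                   # s_6
        min(u["trials"] / 100, 1.0),               # s_7: trials, capped
        u["session_accuracy"],                     # s_8
    ]

u = {"ability_score": 62, "uncertainty": 10, "session_count": 3,
     "recent_accuracy": 0.7, "rt_trend": -0.2, "dprime_trend": 0.1,
     "current_difficulty": 0.5, "trials": 25, "session_accuracy": 0.72}
print(build_state(u))  # 9 values, each within its stated range
```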

2. Action Space

A = {0, 1, 2, 3}

where:

a = 0: DECREASE      → d' = max(0, d - 0.25)

a = 1: MAINTAIN      → d' = d

a = 2: INCREASE      → d' = min(1, d + 0.25)

a = 3: MICRO_ADJUST  → d' = d + 0.2 × (0.75 - acc)
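The action table above, written out as a function (a direct transcription; only the error-handling branch is added):

```python
# Map a discrete action to the next difficulty d' given current difficulty d
# and recent accuracy acc.
def apply_action(a, d, acc):
    if a == 0:                              # DECREASE
        return max(0.0, d - 0.25)
    if a == 1:                              # MAINTAIN
        return d
    if a == 2:                              # INCREASE
        return min(1.0, d + 0.25)
    if a == 3:                              # MICRO_ADJUST: nudge toward 75% accuracy
        return d + 0.2 * (0.75 - acc)
    raise ValueError(f"unknown action {a}")

print(apply_action(3, 0.5, 0.55))  # accuracy below target -> small increase
```

Note how MICRO_ADJUST is self-centering: it raises difficulty when accuracy exceeds 75% is not met, and lowers it when the learner overshoots the target.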

3. Multi-Component Reward Function

Total reward is a weighted sum of five components:

R(s, a, s') = w_f·R_flow + w_e·R_engage + w_d·R_dropout + w_i·R_improve + w_t·R_time

Flow Zone Reward (w_f = 0.40):

R_flow = exp(-0.5 × ((acc - 0.75) / σ)^2)

σ = 0.12 (widened to σ × 1.5 for struggling students)

Engagement Reward (w_e = 0.20):

R_engage = { +1.0 if completed, -3.0 × m if dropped }

m = 1.5 for struggling students, 1.0 otherwise

Dropout Penalty (w_d = 0.20):

R_dropout = { -3.0 × 1.5 if early dropout (< 15 trials), -3.0 otherwise }

Improvement Reward (w_i = 0.10):

R_improve = clip(Δability / 5, -0.5, +0.5)

Response Time Reward (w_t = 0.10):

R_time = { +0.5 if RT ∈ [400, 4000] ms, penalty otherwise }
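A minimal sketch of the weighted sum; the response-time penalty value (-0.5) and zeroing the dropout term on completed sessions are assumptions the text leaves unspecified:

```python
import math

# Weighted five-component reward for one session outcome.
def reward(acc, completed, early_dropout, d_ability, rt_ms, struggling=False):
    sigma = 0.12 * (1.5 if struggling else 1.0)       # wider flow zone if struggling
    r_flow = math.exp(-0.5 * ((acc - 0.75) / sigma) ** 2)
    r_engage = 1.0 if completed else -3.0 * (1.5 if struggling else 1.0)
    r_dropout = 0.0 if completed else (-3.0 * 1.5 if early_dropout else -3.0)
    r_improve = max(-0.5, min(0.5, d_ability / 5))    # clipped ability gain
    r_time = 0.5 if 400 <= rt_ms <= 4000 else -0.5    # penalty value assumed
    return (0.40 * r_flow + 0.20 * r_engage + 0.20 * r_dropout
            + 0.10 * r_improve + 0.10 * r_time)

# A completed session at exactly 75% accuracy with plausible RTs:
print(reward(0.75, True, False, 0.0, 1000))  # 0.40 + 0.20 + 0 + 0 + 0.05 = 0.65
```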

4. Q-Network and Bellman Update

Neural Network Architecture:

Q_θ: ℝ^9 → ℝ^4

Q_θ(s) = W_4 · ReLU(W_3 · ReLU(W_2 · ReLU(W_1 · s + b_1) + b_2) + b_3) + b_4

Layer sizes: 9 → 128 → 64 → 32 → 4

Action Selection:

a* = argmax_a Q_θ(s, a)   with probability 1 - ε

a* ~ Uniform(A)           with probability ε

Loss Function (TD Error):

L(θ) = 𝔼_{(s,a,r,s')~D}[(r + γ · max_{a'} Q_θ̄(s', a') - Q_θ(s, a))^2]

γ = 0.95 (discount factor); θ̄ = target-network parameters (soft update, τ = 0.005)
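A numerical sketch of the 9 → 128 → 64 → 32 → 4 network and the TD loss, in NumPy with random weights; a real agent would train θ with an optimizer and a replay buffer, which this sketch omits:

```python
import numpy as np

# Initialize (W, b) pairs for each layer of the Q-network.
def init_params(rng, sizes=(9, 128, 64, 32, 4)):
    return [(rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# Forward pass: ReLU hidden layers, linear head with one Q-value per action.
def q_values(params, s):
    x = np.asarray(s, dtype=float)
    for W, b in params[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = params[-1]
    return W @ x + b

# Squared TD error against a bootstrapped target from the frozen network θ̄.
def td_loss(params, target_params, s, a, r, s_next, gamma=0.95):
    target = r + gamma * np.max(q_values(target_params, s_next))
    return float((target - q_values(params, s)[a]) ** 2)

rng = np.random.default_rng(0)
theta, theta_bar = init_params(rng), init_params(rng)
s = np.full(9, 0.5)
print(q_values(theta, s).shape, td_loss(theta, theta_bar, s, 2, 0.65, s))
```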

5. IRT-Based Student Simulation

2-Parameter Logistic IRT model for response probability:

P(correct | θ, d) = 1 / (1 + exp(-a(θ_eff - b)))

where:

θ     = (ability - 50) / 15            # latent ability
b     = 6d - 3                         # item difficulty
a     = 0.5 + d × (2.0 - 0.5) × 0.5    # discrimination
θ_eff = θ - trials × 0.001             # fatigue effect
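The 2PL simulator above translates directly to code (parameter mappings copied from the text; the function name is illustrative):

```python
import math

# Probability a simulated student answers correctly, given raw ability
# (0-100 scale), normalized difficulty d in [0, 1], and trials so far.
def p_correct(ability, d, trials):
    theta = (ability - 50) / 15              # latent ability
    b = 6 * d - 3                            # item difficulty
    a = 0.5 + d * (2.0 - 0.5) * 0.5          # discrimination
    theta_eff = theta - trials * 0.001       # fatigue effect
    return 1.0 / (1.0 + math.exp(-a * (theta_eff - b)))

# An average learner (ability 50) at mid difficulty (d = 0.5) gives
# theta = 0 and b = 0, so P = 0.5 before any fatigue:
print(p_correct(50, 0.5, 0))  # 0.5
```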

Detailed Comparison

Aspect                  ZPDES                                  Our DQN
Algorithm Class         Multi-Armed Bandit (UCB)               Deep Reinforcement Learning
Learning Signal         Learning Progress gradient             Multi-component reward
State Representation    Binary mastery beliefs per activity    9-dim continuous vector
Temporal Horizon        Myopic (immediate LP)                  Discounted future (γ = 0.95)
Expert Knowledge        Prerequisite graph required            None required
Dropout Modeling        Implicit (low LP → boredom)            Explicit penalty (-3.0 to -4.5)
Model Complexity        O(|activities|) parameters             ~15,000 neural-network parameters
Inference Cost          O(|ZPD|) comparisons                   Single forward pass (~1 ms)
Exploration Strategy    UCB + ZPD constraints                  ε-greedy + cold-start schedule
Transfer Learning       Activity-specific                      Generalizes across games

Key Insight from the Thesis

⚠️ ZPDES Motivation Challenge

The thesis by Adolphe (2024) found that while ZPDES improved performance on trained tasks, motivation and engagement were lower in the personalized groups compared to non-personalized conditions.

This was attributed to cognitive load from rapid difficulty changes and the system's focus on learning progress at the expense of user experience.

How Our Approach Addresses This

1. Adjustment Frequency Control

Minimum 5 trials between difficulty adjustments, preventing rapid oscillation that causes cognitive load.

min_trials_between_actions = 5

2. MICRO_ADJUST Action

Fine-grained adjustments (±10%) instead of coarse level jumps (±25%), providing smoother difficulty curves.

Δd = 0.2 × (0.75 - accuracy)

3. Session Length Bonus

Explicit reward for sustained engagement, not just accuracy optimization. Encourages keeping users in comfortable zones longer.

Rlength = 0.3 × min(trials/50, 1)

4. Warmup Period

Very low dropout probability in first 10 trials (0.1% per trial), giving users time to settle in before adaptation begins.

dropout_prob = 0.001 if trials < 10
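The four safeguards above fit in a handful of small helpers (thresholds are taken from the text; how they are wired into the training loop is an assumption):

```python
# Engagement safeguards: adjustment gating, micro-adjustment, length bonus,
# and warmup dropout probability.
def can_adjust(trials_since_last_action, min_gap=5):
    """Gate difficulty changes: at least min_gap trials between adjustments."""
    return trials_since_last_action >= min_gap

def micro_adjust(d, accuracy):
    """Fine-grained nudge toward the 75% accuracy target."""
    return d + 0.2 * (0.75 - accuracy)

def session_length_bonus(trials):
    """Reward sustained sessions, saturating at 50 trials."""
    return 0.3 * min(trials / 50, 1.0)

def dropout_prob(trials, base_prob):
    """Warmup: hold dropout probability at 0.1% for the first 10 trials."""
    return 0.001 if trials < 10 else base_prob

print(can_adjust(3), session_length_bonus(25), dropout_prob(4, 0.02))
```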

Performance Results

ZPDES (Thesis Results)

  • Task Performance: ✓ Superior to baseline
  • Motivation: ✗ Lower than baseline
  • Engagement: ✗ Lower than baseline
  • Cognitive Load: ⚠ Problematic

Study: n = 72 young adults, n = 50 older adults, 8 hours of training

Our DQN (Simulation Results)

  • Flow Zone Rate: 89% (target: ≥ 65%)
  • Dropout Rate: 7% (target: ≤ 20%)
  • Mean Accuracy: 75.6%
  • Struggling Users: 75% flow rate

Training: 500k steps, IRT-simulated students, 5 archetypes

Per-Archetype Flow Zone Rates (Our DQN)

  • Struggling: 75%
  • Developing: 90%
  • Average: 95%
  • Proficient: 95%
  • Advanced: 90%

Philosophical Differences

Dimension             ZPDES Philosophy                        Our DQN Philosophy
Core Belief           Learning progress = motivation          Flow zone = optimal learning + engagement
Primary Goal          Maximize learning rate                  Maximize time in flow while minimizing dropout
Dropout View          Side effect of low progress             Primary outcome to optimize against
Expert Knowledge      Leverage curriculum structure           Learn everything from data
Complexity Trade-off  Simple but requires domain expertise    Complex but fully automated

Conclusion

Both ZPDES and our DQN approach represent valid solutions to the adaptive learning challenge, with different trade-offs:

Choose ZPDES when:

  • Expert curriculum knowledge is available
  • Interpretability is critical
  • Training data is limited
  • Activities have clear prerequisites
  • Simple deployment is needed

Choose DQN when:

  • User retention is paramount
  • No expert knowledge is available
  • Rich user telemetry exists
  • Cross-game generalization is needed
  • Long-term optimization matters

The thesis's finding that ZPDES reduced motivation while improving performance highlights the importance of explicitly modeling engagement—which our multi-component reward function addresses directly through dropout penalties and session length bonuses.

References

  • [1] Adolphe, M. (2024). Development and evaluation of AI-based personalization algorithms for attention training. PhD Thesis, Université de Bordeaux. hal:tel-04884647
  • [2] Clément, B., Roy, D., Oudeyer, P.-Y., & Lopes, M. (2015). Multi-Armed Bandits for Intelligent Tutoring Systems. Journal of Educational Data Mining, 7(2), 20-48. arXiv:1310.3174
  • [3] Clément, B., et al. (2024). Improved Performances and Motivation in Intelligent Tutoring Systems: Combining Machine Learning and Learner Choice. hal:04433127
  • [4] Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  • [5] Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row.
  • [6] Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Lawrence Erlbaum Associates.