Q-learning, a form of off-policy reinforcement learning (RL), is currently not scalable for long-horizon problems that require many decision steps.
Most real-world successes in RL have come from on-policy algorithms, which need fresh data from the current policy and cannot efficiently reuse old data.
Off-policy RL, such as Q-learning, can in principle be far more sample efficient because it can learn from any previously collected data.
However, the accumulation of bias in Q-learning's bootstrapped value predictions is a fundamental obstacle to scaling it, particularly on complex, long-horizon tasks.
Empirical studies show that current Q-learning algorithms perform poorly on difficult tasks even with large datasets, because this bias compounds over longer decision horizons.
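For intuition on where the bias enters, here is a minimal sketch of a standard tabular Q-learning update (an illustrative example, not code from the post): the target bootstraps off the current Q estimates, so any error in Q(s', a') is copied into Q(s, a) and can compound backup after backup over a long horizon.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One-step tabular Q-learning update (illustrative sketch only).

    The target bootstraps off Q itself, so any error in Q[s_next]
    leaks into Q[s, a] and can compound over many decision steps.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```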
Horizon reduction techniques like n-step returns and hierarchical RL help improve Q-learning's scalability but don't fully solve the underlying problem.
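As a generic illustration of horizon reduction (a sketch, not the post's implementation), an n-step target sums n observed rewards before bootstrapping once, so bootstrap bias accumulates over roughly horizon/n backups instead of at every step:

```python
import numpy as np

def n_step_target(rewards, Q, s_n, done, gamma=0.99):
    """n-step return target (illustrative sketch only).

    rewards: the n rewards observed after taking (s, a)
    s_n:     the state reached after those n steps
    Compared with the one-step target above, Q is queried only once per
    n environment steps, so bootstrap bias accumulates over roughly
    horizon / n backups rather than every single step.
    """
    n = len(rewards)
    G = sum(gamma**k * r_k for k, r_k in enumerate(rewards))
    if not done:
        G += gamma**n * np.max(Q[s_n])
    return G
```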
The post calls for research to find a scalable off-policy RL algorithm that can efficiently handle complex, long-horizon problems.