Solving a Real-World Optimization Problem Using Proximal Policy Optimization with Curriculum Learning and Reward Engineering
arXiv (2024)
Abstract
We present a proximal policy optimization (PPO) agent trained through
curriculum learning (CL) principles and meticulous reward engineering to
optimize a real-world high-throughput waste sorting facility. Our work
addresses the challenge of balancing three competing objectives: operational
safety, volume optimization, and minimal resource usage. A vanilla agent
trained from scratch on these criteria fails to solve the problem. The task is
particularly difficult because rewards are extremely delayed over long time
horizons and because actions are imbalanced: important actions are infrequent
in the optimal policy. The agent must therefore anticipate long-term action
consequences and prioritize rare but rewarding behaviours, which makes this a
non-trivial reinforcement learning task. Our five-stage CL approach tackles
these challenges by gradually increasing the complexity of the environmental
dynamics during policy transfer while simultaneously refining the reward
mechanism. This iterative and adaptable process enables the agent to learn a
desired optimal policy. Results demonstrate that our approach significantly
improves inference-time safety, achieving near-zero safety violations in
addition to enhancing waste sorting plant efficiency.
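
As a rough illustration of the staged procedure described above, the following minimal sketch shows one way a five-stage curriculum with policy transfer and per-stage reward weights could be organised. It is not the authors' implementation: the stage names, the `difficulty` knob, the reward signal names and weights, and the `dummy_train_stage` placeholder (which stands in for an actual PPO training run) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One curriculum stage (all fields are hypothetical)."""
    name: str
    difficulty: float                 # assumed knob for environment complexity
    reward_weights: Dict[str, float]  # assumed per-signal reward weights


def shaped_reward(signals: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward signals; signals without a weight contribute 0."""
    return sum(weights.get(key, 0.0) * value for key, value in signals.items())


def run_curriculum(
    stages: List[Stage],
    train_stage: Callable[[Any, Stage], Any],
    policy: Any = None,
) -> Tuple[Any, List[str]]:
    """Train stage by stage, carrying the policy forward (policy transfer)."""
    completed = []
    for stage in stages:
        # The policy from the previous stage initialises the next one.
        policy = train_stage(policy, stage)
        completed.append(stage.name)
    return policy, completed


# Five hypothetical stages: complexity rises while the reward is refined.
STAGES = [
    Stage("stage_1", 0.2, {"throughput": 1.0}),
    Stage("stage_2", 0.4, {"throughput": 1.0, "safety_violation": -1.0}),
    Stage("stage_3", 0.6, {"throughput": 1.0, "safety_violation": -5.0}),
    Stage("stage_4", 0.8, {"throughput": 1.0, "safety_violation": -10.0,
                           "resource_use": -0.1}),
    Stage("stage_5", 1.0, {"throughput": 1.0, "safety_violation": -10.0,
                           "resource_use": -0.5}),
]


def dummy_train_stage(policy: Any, stage: Stage) -> List[str]:
    """Placeholder for a PPO training run; it only records the stages seen."""
    return (policy or []) + [stage.name]


final_policy, completed_stages = run_curriculum(STAGES, dummy_train_stage)
```

In a real setup, `dummy_train_stage` would be replaced by a routine that builds the environment at the given difficulty, applies the stage's reward weights, and continues PPO training from the transferred policy parameters.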