Q-learning as a feedback control mechanism in deterministic systems
This paper studies tabular Q-learning implemented as a feedback controller in deterministic discrete-time systems, where the learning process interacts with the system dynamics to generate closed-loop trajectories. Classical convergence results assume infinite visitation of all state-action pairs as an external sampling condition. Here, that property instead arises from the interaction between recurrence of the induced state dynamics and persistent exploration under ϵ-greedy policies that are greedy in the limit with infinite exploration (GLIE). In particular, we prove that exploration remains sufficiently persistent along recurrent states, which implies infinite visitation of all state-action pairs within the recurrent region. The results provide a dynamical-systems characterization of the sampling condition underlying classical Q-learning convergence and establish a formal connection between reinforcement learning and recurrence analysis in discrete-time systems. Numerical experiments on nonlinear epidemic dynamics and linear quadratic control benchmarks illustrate the resulting closed-loop behavior.
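The closed-loop setting described above can be illustrated with a minimal Python sketch. It is not the paper's implementation, and everything specific in it is an assumption chosen for illustration: the five-state ring dynamics, the reward, the exploration schedule ϵ = 1/sqrt(n(s)) (one schedule that vanishes while its sum diverges), the step size 1/n(s, a), and the function name run_q_learning. The sketch runs tabular Q-learning in feedback with a deterministic system and counts how often each state-action pair is visited.

```python
import random


def run_q_learning(n_states=5, n_actions=2, steps=50000, gamma=0.9, seed=0):
    """Tabular Q-learning in closed loop with a toy deterministic system.

    All modelling choices here are illustrative assumptions, not taken
    from the paper.
    """
    rng = random.Random(seed)

    def step(s, a):
        # Hypothetical deterministic dynamics on a ring:
        # action 0 advances by one state, action 1 advances by two.
        return (s + a + 1) % n_states

    def reward(s, a):
        # Hypothetical bounded reward: 1 when the successor is state 0.
        return 1.0 if step(s, a) == 0 else 0.0

    Q = [[0.0] * n_actions for _ in range(n_states)]
    state_visits = [0] * n_states
    pair_visits = [[0] * n_actions for _ in range(n_states)]

    s = 0
    for _ in range(steps):
        state_visits[s] += 1
        # Assumed exploration schedule: decays to zero, but slowly
        # enough that exploration keeps occurring at recurrent states.
        eps = 1.0 / state_visits[s] ** 0.5
        if rng.random() < eps:
            a = rng.randrange(n_actions)  # explore
        else:
            a = max(range(n_actions), key=lambda b: Q[s][b])  # exploit

        pair_visits[s][a] += 1
        s_next = step(s, a)

        # Standard Q-learning update with step size 1 / n(s, a).
        alpha = 1.0 / pair_visits[s][a]
        target = reward(s, a) + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])

        # The controller's action drives the system to its next state.
        s = s_next

    return Q, pair_visits


if __name__ == "__main__":
    Q, pair_visits = run_q_learning()
    for s, row in enumerate(pair_visits):
        print(f"state {s}: visits per action = {row}")
```

In this toy system every action moves the state around the ring, so the closed-loop trajectory keeps returning to its states, and the slowly decaying ϵ keeps selecting non-greedy actions there. The visit counts printed at the end give an empirical view of the state-action visitation that the abstract refers to.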
