Q-learning as a feedback control mechanism in deterministic systems

© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/).

Abstract

This paper studies tabular Q-learning implemented as a feedback controller in deterministic discrete-time systems, where the learning process interacts with the system dynamics to generate closed-loop trajectories. Classical convergence results assume infinite visitation as an external sampling condition. Here, by contrast, this property arises from the interaction between recurrence of the induced state dynamics and persistent exploration under GLIE-compliant ϵ-greedy policies. In particular, we prove that exploration remains sufficiently persistent along recurrent states, implying infinite visitation of all state-action pairs within the recurrent region. The results provide a dynamical-systems characterization of the sampling condition underlying classical Q-learning convergence and establish a formal connection between reinforcement learning and recurrence analysis in discrete-time systems. Numerical experiments on nonlinear epidemic dynamics and linear quadratic control benchmarks illustrate the resulting closed-loop behavior.
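The closed-loop setting described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy ring dynamics `(s + a + 1) % n_states`, the reward for reaching state 0, the exploration schedule ϵ = 1/√n(s) (one possible GLIE-style choice, with n(s) the visit count of state s), and the step size 1/n(s, a). The sketch only counts how often each state-action pair is visited along a single trajectory.

```python
import random


def run_q_learning(n_states=5, n_actions=2, steps=20000, gamma=0.9, seed=0):
    """Tabular Q-learning acting as the controller of a toy deterministic system.

    Hypothetical setup (not from the paper): states lie on a ring, action a
    moves the state forward by a + 1, and reaching state 0 yields reward 1.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    state_visits = [0] * n_states
    pair_visits = [[0] * n_actions for _ in range(n_states)]

    s = 0
    for _ in range(steps):
        state_visits[s] += 1
        # Assumed GLIE-style schedule: epsilon decays to zero with the number
        # of visits to the current state, slowly enough to keep exploring.
        eps = 1.0 / state_visits[s] ** 0.5
        if rng.random() < eps:
            a = rng.randrange(n_actions)  # explore uniformly
        else:
            a = max(range(n_actions), key=lambda u: q[s][u])  # exploit

        # Deterministic closed-loop transition and reward.
        s_next = (s + a + 1) % n_states
        r = 1.0 if s_next == 0 else 0.0

        # Standard Q-learning update with an assumed 1/n(s, a) step size.
        pair_visits[s][a] += 1
        alpha = 1.0 / pair_visits[s][a]
        q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
        s = s_next

    return q, pair_visits


if __name__ == "__main__":
    q_table, visits = run_q_learning()
    for state, row in enumerate(visits):
        print(state, row)
```

In this toy system every state is revisited under any policy, so the visit counts of all state-action pairs keep growing as the run is extended. This mirrors, but does not prove, the infinite-visitation property on the recurrent region.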

Keywords

Q-learning

Feedback control

Deterministic discrete-time systems

Dynamical systems

Infinite visitation

State recurrence

Funding

None.

Conflict of interest

The authors declare that they have no conflict of interest.

An International Journal of Optimization and Control: Theories & Applications, Electronic ISSN: 2146-5703 Print ISSN: 2146-0957, Published by AccScience Publishing