RESEARCH ARTICLE

Q-learning as a feedback control mechanism in deterministic systems

Utti M. Rifanti1,2, Lina Aryati1, Nanang Susyanto1*, Hadi Susanto3
1 Department of Mathematics, Universitas Gadjah Mada, Yogyakarta, Indonesia
2 Telecommunications Engineering Study Program, Universitas Telkom, Purwokerto Campus, Purwokerto, Indonesia
3 Department of Mathematics, Khalifa University, Abu Dhabi, United Arab Emirates
IJOCTA 2026, 16(3), 026190075 https://doi.org/10.36922/IJOCTA026190075
Received: 4 May 2026 | Revised: 27 May 2026 | Accepted: 29 May 2026 | Published online: 9 June 2026
© 2026 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/).
Abstract

This paper studies tabular Q-learning implemented as a feedback controller in deterministic discrete-time systems, where the learning process interacts with the system dynamics to generate closed-loop trajectories. Classical convergence results assume infinite visitation of state-action pairs as an external sampling condition. Here, that property instead arises from the interaction between recurrence of the induced state dynamics and persistent exploration under ϵ-greedy policies that are greedy in the limit with infinite exploration (GLIE). In particular, we prove that exploration remains sufficiently persistent along recurrent states, implying infinite visitation of all state-action pairs within the recurrent region. The results provide a dynamical-systems characterization of the sampling condition underlying classical Q-learning convergence and establish a formal connection between reinforcement learning and recurrence analysis in discrete-time systems. Numerical experiments on nonlinear epidemic dynamics and linear quadratic control benchmarks illustrate the resulting closed-loop behavior.
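The closed-loop setting described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ring-shaped dynamics in `step`, the reward, the schedule ϵ = 1/n(s) (one common GLIE-style choice, with n(s) the visit count of state s), the step size 1/N(s,a), and all function and variable names are assumptions made purely for illustration. The sketch runs tabular Q-learning as the controller of a small deterministic system and records how often each state-action pair is visited along the single closed-loop trajectory.

```python
import random


def step(state, action, n_states):
    """Hypothetical deterministic dynamics: advance 1 or 2 positions on a ring.

    The reward of 1.0 for landing on state 0 is an illustrative assumption.
    """
    next_state = (state + action + 1) % n_states
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward


def run_q_learning(n_states=5, n_actions=2, n_steps=20000, gamma=0.9, seed=0):
    """Run tabular Q-learning in closed loop and count state-action visits."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    state_visits = [0] * n_states
    pair_visits = [[0] * n_actions for _ in range(n_states)]

    state = 0
    for _ in range(n_steps):
        # Assumed GLIE-style schedule: epsilon decays with visits to this state.
        state_visits[state] += 1
        epsilon = 1.0 / state_visits[state]

        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore uniformly
        else:
            # Greedy action; ties go to the lowest action index.
            action = max(range(n_actions), key=lambda a: q[state][a])

        pair_visits[state][action] += 1
        next_state, reward = step(state, action, n_states)

        # Standard Q-learning update with assumed step size 1 / N(s, a).
        alpha = 1.0 / pair_visits[state][action]
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])

        # The chosen action feeds back into the system: no resets, one trajectory.
        state = next_state

    return q, pair_visits


if __name__ == "__main__":
    q_table, visits = run_q_learning()
    for s, row in enumerate(visits):
        print(f"state {s}: visits per action = {row}")
```

In this toy system, which states are visited depends on the actions the learner selects, so the visit counts are a property of the closed loop rather than of an external sampler. Printing `visits` for increasing `n_steps` gives an informal view of which states keep recurring and whether both actions continue to be tried there.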

Keywords
Q-learning
Feedback control
Deterministic discrete-time systems
Dynamical systems
Infinite visitation
State recurrence
Funding
None.
Conflict of interest
The authors declare that they have no conflict of interest.

An International Journal of Optimization and Control: Theories & Applications, Electronic ISSN: 2146-5703 Print ISSN: 2146-0957, Published by AccScience Publishing