Comparison of synthetic data generation techniques for obesity level prediction based on dietary habits and physical status

Amid the obesity epidemic and its associated comorbidities, early detection of at-risk individuals is critical. Artificial intelligence and machine learning techniques can automate obesity risk assessment, enabling earlier diagnosis and intervention. However, the development of robust predictive models is often hampered by small or imbalanced datasets. Synthetic data generation has emerged as a key solution, allowing datasets to be expanded and balanced while preserving privacy. Recent surveys identify the synthetic minority oversampling technique (SMOTE) as a leading method for data generation in obesity detection. In line with this, our study analyzed the Estimation of Obesity Levels dataset from the University of California, Irvine repository, which describes dietary habits and physical condition and suffers from class imbalance. We compared three synthetic data generation approaches: SMOTE for nominal and continuous features (SMOTE-NC), variational autoencoders, and the conditional tabular generative adversarial network (CTGAN). We trained multiple classifiers on the generated datasets and evaluated their performance. Classifiers trained on data that included height and weight, the two inputs to body mass index (BMI), achieved F1-scores of up to 98.16%, as expected given the direct role of BMI in obesity classification. Crucially, models trained without height and weight still achieved an F1-score of 74.48% when synthetic augmentation was used, demonstrating that useful obesity prediction models can be developed even without explicit anthropometric measures. These results indicate that synthetic data can enable reasonably accurate classification when key features are missing or when data are scarce.
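The oversampling idea behind SMOTE for mixed nominal and continuous features can be illustrated with a minimal, standard-library-only sketch. It is not the study's pipeline: the function names (`smote_nc_sample`, `oversample`), the toy rows, and the three illustrative features (age, weekly activity hours, snacking frequency) are assumptions made for this example. Continuous features are interpolated between a seed row and one of its nearest same-class neighbours, and categorical features take the most frequent value among those neighbours.

```python
import random
import statistics
from collections import Counter


def smote_nc_sample(minority, cont_idx, cat_idx, k=3, rng=None):
    """Create one synthetic row from the rows of a single class (needs >= 2 rows)."""
    rng = rng or random.Random(0)
    # Penalty for a categorical mismatch: median of the standard deviations
    # of the continuous features within this class.
    stds = [statistics.pstdev([row[i] for row in minority]) for i in cont_idx]
    penalty = statistics.median(stds)

    def distance(a, b):
        cont = sum((a[i] - b[i]) ** 2 for i in cont_idx)
        cat = sum(penalty ** 2 for i in cat_idx if a[i] != b[i])
        return (cont + cat) ** 0.5

    seed = rng.choice(minority)
    others = [row for row in minority if row is not seed]
    neighbours = sorted(others, key=lambda row: distance(seed, row))[:k]
    partner = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)

    synthetic = list(seed)
    for i in cont_idx:
        # Continuous feature: a point on the segment between seed and partner.
        synthetic[i] = seed[i] + gap * (partner[i] - seed[i])
    for i in cat_idx:
        # Categorical feature: most frequent value among the k neighbours.
        counts = Counter(row[i] for row in neighbours)
        synthetic[i] = counts.most_common(1)[0][0]
    return synthetic


def oversample(rows, labels, cont_idx, cat_idx, k=3, seed=0):
    """Add synthetic rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, group in by_class.items():
        for _ in range(target - len(group)):
            out_rows.append(smote_nc_sample(group, cont_idx, cat_idx, k, rng))
            out_labels.append(label)
    return out_rows, out_labels


# Toy, made-up rows: [age, weekly activity hours, snacking frequency].
rows = [
    [21.0, 0.0, "sometimes"], [23.0, 1.0, "sometimes"],
    [22.0, 0.5, "frequently"], [24.0, 1.5, "sometimes"],
    [35.0, 3.0, "no"], [40.0, 2.0, "no"],
]
labels = ["obese"] * 4 + ["normal"] * 2

balanced_rows, balanced_labels = oversample(rows, labels, cont_idx=[0, 1], cat_idx=[2])
print(Counter(balanced_labels))
```

In practice, a maintained library implementation would normally be preferred over a hand-written one, and synthetic rows should be generated from the training split only so that evaluation data remain untouched.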