Review of methods and systems for generation of synthetic training data
- Authors: Rabchevsky A.N. (1, 2)
- Affiliations:
  - (1) Perm State University
  - (2) LLC “SEUSLAB”
- Issue: No 4 (2023)
- Pages: 6-45
- Section: ARTICLES
- URL: https://ered.pstu.ru/index.php/amcs/article/view/4052
- DOI: https://doi.org/10.15593/2499-9873/2023.4.01
Abstract
Modern artificial intelligence systems are hard to imagine without neural network technologies. When designing neural network models, researchers often find that there are not enough data to train them, or that the available data are unbalanced or highly sparse. Often real data simply do not exist because the research field is still emerging. Another pressing problem is protecting the confidentiality of real personal data or patient medical data that are exchanged between researchers or used to test neural network systems. In many subject areas, the cost of collecting and labeling real data can be very high. Synthetic data are increasingly used to solve these problems. The purpose of this publication is to introduce readers to advances in the generation and use of synthetic data. The paper describes various methods, systems, and software tools used to generate synthetic data that can help improve neural network models. Because an entire industry for synthetic data production has already formed, the leading data synthesis technology platforms are also presented. The paper is a review, so it contains an extensive bibliography. Its value is that it will help readers broaden their understanding of how synthetic data are used in a wide range of neural network problems and become more familiar with the methods and tools for generating them.
Full Text
When designing modern neural network models, researchers often face the problem that not enough data are available to train them, or that the available data are unbalanced or sparse. Often real data simply do not exist because the field of research is still taking shape. There is also the problem of confidentiality of real personal data or patient medical data that are exchanged between researchers or used in testing various neural network systems. In all these cases synthetic data come to the rescue.

As the name suggests, synthetic data are data created artificially rather than produced by real events. They are often created by algorithms and are used for a wide range of activities. One of the earliest mentions of the use of synthetic data is connected with the development and testing of an intrusion detection system by the DARPA agency in 1998 and 1999 [1]. The test data contained network traffic and system call log files from a simulated large computer network. The attack data were generated synthetically from scenarios of possible attacks, and the background data were produced by software automata that imitated the use of various services. Using synthetic data allowed the developers to model and test various intrusion scenarios that had not been encountered before.

When new services are still being tested before deployment, data for training neural networks may simply be absent. In this case, synthesized data are needed for testing. For example, the authors of [2] applied synthetic data as early as 2002 when building a fraud detection system.
About the authors
A. N. Rabchevsky
Perm State University; LLC “SEUSLAB”
References
- Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation / R.P. Lippmann, D.J. Fried, I. Graf, J.W. Haines, K.R. Kendall, D. McClung, D. Weber, S.E. Webster, D. Wyschogrod, R.K. Cunningham, M.A. Zissman // In: Proceedings DARPA Information Survivability Conference and Exposition. DISCEX’00. IEEE Comput. Soc. – 2000. – P. 12–26. doi: 10.1109/DISCEX.2000.821506
- Lundin E., Kvarnström H., Jonsson E. A Synthetic Fraud Data Generation Methodology // Deng R., Bao F., Zhou J., Qing S. (eds.) Information and Communications Security. Springer Berlin Heidelberg, Berlin, Heidelberg. – 2002. – P. 265–277. doi: 10.1007/3-540-36159-6_23
- Learning Deep Models from Synthetic Data for Extracting Dolphin Whistle Contours / P. Li, X. Liu, K.J. Palmer, E. Fleishman, D. Gillespie, E.-M. Nosal, Y. Shiu, H. Klinck, D. Cholewiak, T. Helble, M.A. Roch. – 2020. doi: 10.48550/ARXIV.2005.08894
- Lombardo, J. Method for Generation and Distribution of Synthetic Medical Record Data for Evaluation of Disease-Monitoring Systems / J. Lombardo, L.A. Moniz // Johns Hopkins APL Technical Digest (Applied Physics Laboratory). – 2008. – Vol. 27.
- Construction and Validation of Synthetic Electronic Medical Records / L. Moniz, A.L. Buczak, L. Hung, S. Babin, M. Dorko, J. Lombardo // Online J Public Health Inform. – 2009. – Vol. 1. doi: 10.5210/ojphi.v1i1.2720
- Buczak A.L., Babin S., Moniz L. Data-driven approach for creating synthetic electronic medical records // BMC Med Inform Decis Mak. – 2010. – Vol. 10, iss. 1. doi: 10.1186/1472-6947-10-59
- Jin C., Rinard M.C. Learning From Context-Agnostic Synthetic Data // CoRR. – 2020. – abs/2005.14707
- McKenna R., Miklau G., Sheldon D. Winning the NIST Contest: A scalable and general approach to differentially private synthetic data // CoRR. – 2021. – abs/2108.04978
- Noise-Aware Statistical Inference with Differentially Private Synthetic Data / O. Räisä, J. Jälkö, S. Kaski, A. Honkela // arXiv. – 2022. – abs/2205.14485. doi: 10.48550/ARXIV.2205.14485
- Awan J., Cai Z. One Step to Efficient Synthetic Data // arXiv. – 2020. – abs/2006.02397. doi: 10.48550/ARXIV.2006.02397
- Goetz J., Tewari A. Federated Learning via Synthetic Data // CoRR. – 2020. – abs/2008.04489
- FedSynth: Gradient Compression via Synthetic Data in Federated Learning / S. Hu, J. Goetz, K. Malik, H. Zhan, Z. Liu, Y. Liu // arXiv. – 2022. – abs/2204.01273. doi: 10.48550/ARXIV.2204.01273
- SMOTE: Synthetic Minority Over-sampling Technique / N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer // Journal of Artificial Intelligence Research. – 2002. – Vol. 16. – P. 321–357. doi: 10.1613/jair.953
- Mukherjee M., Khushi M. SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for nominal and continuous features // Applied System Innovation. – 2021. – Vol. 4, iss. 1. – Art. 18. doi: 10.3390/asi4010018
- A Method for Handling Multi-class Imbalanced Data by Geometry based Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) / A. Majumder, S. Dutta, S. Kumar, L. Behera // CoRR. – 2020. – abs/2010.05155
- Gonsior J., Thiele M., Lehner W. ImitAL: Learning Active Learning Strategies from Synthetic Data // CoRR. – 2021. – abs/2108.07670
- Dataset Condensation via Efficient Synthetic-Data Parameterization / J.-H. Kim, J. Kim, S.J. Oh, S. Yun, H. Song, J. Jeong, J.-W. Ha, H.O. Song // arXiv. – 2022. – abs/2205.14959. doi: 10.48550/ARXIV.2205.14959
- Effective Use of Synthetic Data for Urban Scene Semantic Segmentation / F.S. Saleh, M.S. Aliakbarian, M. Salzmann, L. Petersson, J.M. Alvarez // CoRR. – 2018. – abs/1807.06132
- Mason K., Vejdan S., Grijalva S. An “On The Fly” Framework for Efficiently Generating Synthetic Big Data Sets // CoRR. – 2019. – abs/1903.06798
- Condrea F., Ivan V.-A., Leordeanu M. In Search of Life: Learning from Synthetic Data to Detect Vital Signs in Videos // CoRR. – 2020. – abs/2004.07691
- Conditional Synthetic Data Generation for Robust Machine Learning Applications with Limited Pandemic Data / H.P. Das, R. Tran, J. Singh, X. Yue, G. Tison, A.L. Sangiovanni-Vincentelli, C.J. Spanos // CoRR. – 2021. – abs/2109.06486
- Rabchevsky A.N., Ashikhmin E.G., Rabchevsky E.A. Modeling the structure of protest movement propaganda in social networks using graph analysis and neural network technologies. – Text: print // Mathematical and Computer Modeling: proceedings of the IX International Scientific Conference dedicated to the 85th anniversary of Professor V.I. Potapov. – Omsk, 2021. – P. 273–276. (In Russian)
- Rabchevsky A., Yasnitsky L., Zayakin V. Comparison of methods for identifying user roles in online social networks // Applied Mathematics and Control Sciences. – 2021. – Vol. 2. – P. 93–111. doi: 10.15593/2499-9873/2021.2.06
- Rabchevskiy A.N., Yasnitskiy L.N. Creating and Using Synthetic Data for Neural Network Training, Using the Creation of a Neural Network Classifier of Online Social Network User Roles as an Example // Digital Science. DSIC 2021. Lecture Notes in Networks and Systems, Springer, Cham. – 2022. – Vol. 381. – P. 412–421. doi: 10.1007/978-3-030-93677-8_36
- Generation of synthetic training data for object detection in piles / E. Buls, R. Kadikis, R. Cacurs, J. Ārents // D.P. Nikolaev, P. Radeva, A. Verikas, J. Zhou (eds.) Eleventh International Conference on Machine Vision (ICMV 2018), SPIE. – 2019. – P. 105. doi: 10.1117/12.2523203
- An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection / S. Hinterstoisser, O. Pauly, H. Heibel, M. Marek, M. Bokeloh // CoRR. – 2019. – abs/1902.09967
- Dina A.S., Siddique A.B., Manivannan D. Effect of Balancing Data Using Synthetic Data on the Performance of Machine Learning Classifiers for Intrusion Detection in Computer Networks // arXiv. – 2022. – abs/2204.00144. doi: 10.48550/ARXIV.2204.00144
- Charitou C., Dragicevic S., d’Avila Garcez A. Synthetic Data Generation for Fraud Detection using GANs // CoRR. – 2021. – abs/2109.12546
- Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study / C. Little, M. Elliot, R. Allmendinger, S.S. Samani // arXiv. – 2021. – abs/2112.01925. doi: 10.48550/ARXIV.2112.01925
- GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification / M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan // Neurocomputing. – 2018. – Vol. 321. – P. 321–331. doi: 10.1016/j.neucom.2018.09.013
- GAN-based synthetic brain MR image generation / C. Han, H. Hayashi, L. Rundo, R. Araki, W. Shimoda, S. Muramatsu, Y. Furukawa, G. Mauri, H. Nakayama // 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). – 2018. – P. 734–738.
- Constrained Generative Adversarial Network Ensembles for Sharable Synthetic Data Generation / E. Dikici, L.M. Prevedello, M. Bigelow, R.D. White, B.S. Erdal // arXiv. – 2020. – abs/2003.00086. doi: 10.48550/ARXIV.2003.00086
- Using Synthetic Data to Enhance the Accuracy of Fingerprint-Based Localization: A Deep Learning Approach / M. Nabati, H. Navidan, R. Shahbazian, S.A. Ghorashi, D. Windridge // IEEE Sens Lett. – 2020. – Vol. 4. – P. 1–4. doi: 10.1109/lsens.2020.2971555
- Synthetic Data Generation and Adaption for Object Detection in Smart Vending Machines / K. Wang, F. Shi, W. Wang, Y. Nan, S. Lian // CoRR. – 2019. – abs/1904.12294
- Synthetic Data and Hierarchical Object Detection in Overhead Imagery / N. Clement, A. Schoen, A.P. Boedihardjo, A. Jenkins // CoRR. – 2021. – abs/2102.00103
- Jordon J., Yoon J., van der Schaar M. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees // In: ICLR. – 2019.
- Arnold C. Releasing differentially private synthetic micro-data with bayesian gans. – 2018.
- Li M., Zhuang D., Chang J.M. MC-GEN: Multi-level Clustering for Private Synthetic Data Generation // arXiv. – 2022. – abs/2205.14298. doi: 10.48550/ARXIV.2205.14298
- Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards / S.S. Tay, X. Xu, C.S. Foo, B.K.H. Low // CoRR. – 2021. – abs/2112.09327
- Liu T., Vietri G., Wu Z.S. Iterative Methods for Private Synthetic Data: Unifying Framework and New Methods // CoRR. – 2021. – abs/2106.07153
- Low Light Video Enhancement using Synthetic Data Produced with an Intermediate Domain Mapping / D. Triantafyllidou, S. Moran, S. McDonagh, S. Parisot, G. Slabaugh // arXiv. – 2020. – abs/2007.09187. doi: 10.48550/ARXIV.2007.09187
- Adapting deep generative approaches for getting synthetic data with realistic marginal distributions / K. Farhadyar, F. Bonofiglio, D. Zoeller, H. Binder // arXiv. – 2021. – abs/2105.06907. doi: 10.48550/ARXIV.2105.06907
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks / B. van Breugel, T. Kyono, J. Berrevoets, M. van der Schaar // CoRR. – 2021. – abs/2110.12884
- Graham P., Penny R. Multiply Imputed Synthetic Data Files // Official Statistics Research Series, Statistics New Zealand. – 2007. – Vol. 1.
- Boedihardjo M., Strohmer T., Vershynin R. Private sampling: a noiseless approach for generating differentially private synthetic data // CoRR. – 2021. – abs/2109.14839
- Boedihardjo M., Strohmer T., Vershynin R. Covariance’s Loss is Privacy’s Gain: Computationally Efficient, Private and Accurate Synthetic Data // CoRR. – 2021. – abs/2107.05824
- Kamthe S., Assefa S., Deisenroth M. Copula Flows for Synthetic Data Generation // arXiv. – 2021. – abs/2101.00598. doi: 10.48550/ARXIV.2101.00598
- Li Z., Zhao Y., Fu J. SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources // arXiv. – 2020. – abs/2009.09471. doi: 10.48550/ARXIV.2009.09471
- End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems / S. Shakeri, C.N. dos Santos, H. Zhu, P. Ng, F. Nan, Z. Wang, R. Nallapati, B. Xiang // CoRR. – 2020. – abs/2010.06028
- Bousquet O., Livni R., Moran S. Synthetic Data Generators: Sequential and Private // arXiv. – 2019. – abs/1902.03468. doi: 10.48550/ARXIV.1902.03468
- Exploring Invariances in Deep Convolutional Neural Networks Using Synthetic Images / X. Peng, B. Sun, K. Ali, K. Saenko // CoRR. – 2014. – abs/1412.7122
- Transfer Learning from Synthetic to Real Images Using Variational Autoencoders for Precise Position Detection / T. Inoue, S. Choudhury, G. de Magistris, S. Dasgupta // 2018 25th IEEE International Conference on Image Processing (ICIP). – 2018. – P. 2725–2729. doi: 10.1109/ICIP.2018.8451064
- Learning to Augment Synthetic Images for Sim2Real Policy Transfer / A. Pashevich, R. Strudel, I. Kalevatykh, I. Laptev, C. Schmid // CoRR. – 2019. – abs/1903.07740. doi: 10.48550/arXiv.1903.07740
- AutoSimulate: (Quickly) Learning Synthetic Data Generation / H.S. Behl, A.G. Baydin, R. Gal, P.H.S. Torr, V. Vineet // CoRR. – 2020. – abs/2008.08424
- ProcSy: Procedural Synthetic Dataset Generation Towards Influence Factor Studies Of Semantic Segmentation Networks / S. Khan, B. Phan, R. Salay, K. Czarnecki // CVPR Workshops. – 2019.
- Illumination Invariant Camera Localization Using Synthetic Images / S. Shoman, T. Mashita, A. Plopski, P. Ratsamee, Y. Uranishi, H. Takemura // 2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). – IEEE. – 2018. – P. 143–144. doi: 10.1109/ISMAR-Adjunct.2018.00053
- Rozantsev A., Lepetit V., Fua P. On rendering synthetic images for training an object detector // Computer Vision and Image Understanding. – 2015. – Vol. 137. – P. 24–37. doi: 10.1016/j.cviu.2014.12.006
- Synthetic Data Are as Good as the Real for Association Knowledge Learning in Multi-object Tracking / Y. Liu, Z. Wang, X. Zhou, L. Zheng // CoRR. – 2021. – abs/2106.16100
- Automatic Generation of Synthetic LiDAR Point Clouds for 3-D Data Analysis / F. Wang, Y. Zhuang, H. Gu, H. Hu // IEEE Trans Instrum Meas. – 2019. – Vol. 68. – P. 2671–2673. doi: 10.1109/TIM.2019.2906416
- Learning how to analyse crowd behaviour using synthetic data / A.R. Khadka, M.M. Oghaz, W. Matta, M. Cosentino, P. Remagnino, V. Argyriou // Proceedings of the 32nd International Conference on Computer Animation and Social Agents. – ACM. – New York. – NY. – USA. – 2019. – P. 11–14. doi: 10.1145/3328756.3328773
- Synthetic Data Generation for Deep Learning of Underwater Disparity Estimation / E.A. Olson, C. Barbalata, J. Zhang, K.A. Skinner, M. Johnson-Roberson // OCEANS 2018 MTS/IEEE Charleston. – 2018. – P. 1–6.
- Sun S., Shi H., Wu Y. A survey of multi-source domain adaptation // Information Fusion. – 2015. – Vol. 24. – P. 84–92. doi: 10.1016/j.inffus.2014.12.003
- Ren Z., Lee Y.J. Cross-Domain Self-Supervised Multi-task Feature Learning Using Synthetic Imagery // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2018. – P. 762–771. doi: 10.1109/CVPR.2018.00086
- Effect of Kinematics and Fluency in Adversarial Synthetic Data Generation for ASL Recognition with RF Sensors / M.M. Rahman, E. Malaia, A.C. Gurbuz, D.J. Griffin, C. Crawford, S. Gurbuz // IEEE Trans Aerosp Electron Syst. – 2022. – Vol. 1. doi: 10.1109/taes.2021.3139848
- Alkhalifah T., Wang H., Ovcharenko O. MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning // arXiv. – 2021. – abs/2109.05294. doi: 10.48550/ARXIV.2109.05294
- Learning from Synthetic Data for Opinion-free Blind Image Quality Assessment in the Wild / Z. Wang, Z.-R. Tang, Z. Yu, J. Zhang, Y. Fang // CoRR. – 2021. – abs/2106.14076
- Meta-Sim: Learning to Generate Synthetic Datasets / A. Kar, A. Prakash, M.-Y. Liu, E. Cameracci, J. Yuan, M. Rusiniak, D. Acuna, A. Torralba, S. Fidler // CoRR. – 2019. – abs/1904.11621
- Stein G.J., Roy N. GeneSIS-RT: Generating Synthetic Images for training Secondary Real-world Tasks // CoRR. – 2017. – abs/1710.04280. doi: 10.48550/arXiv.1710.04280
- S^3Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data / B. Cheng, I.S. Saggu, R. Shah, G. Bansal, D. Bharadia // CoRR. – 2020. – abs/2007.14511
- PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision / S.E. Ebadi, Y.-C. Jhang, A. Zook, S. Dhakad, A. Crespi, P. Parisi, S. Borkman, J. Hogins, S. Ganguly // CoRR. – 2021. – abs/2112.09290
- Hart K.M., Goodman A.B., O’Shea R.P. Automatic Generation of Machine Learning Synthetic Data Using ROS // CoRR. – 2021. – abs/2106.04547
- UnrealROX+: An Improved Tool for Acquiring Synthetic Data from Virtual 3D Environments / P. Martinez-Gonzalez, S. Oprea, J.A. Castro-Vargas, A. Garcia-Garcia, S. Orts-Escolano, J.G. Rodríguez, M. Vincze // CoRR. – 2021. – abs/2104.11776
- Jin C., Rinard M.C. Learning From Context-Agnostic Synthetic Data // CoRR. – 2020. – abs/2005.14707
- Baek K., Shim H. Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data // arXiv. – 2022. – abs/2204.04950. doi: 10.48550/ARXIV.2204.04950
- Deep Learning based Food Instance Segmentation using Synthetic Data / D. Park, J. Lee, J. Lee, K. Lee // CoRR. – 2021. – abs/2107.07191
- Raab G.M., Nowok B., Dibben C. Assessing, visualizing and improving the utility of synthetic data // arXiv. – 2021. – abs/2109.12717. doi: 10.48550/ARXIV.2109.12717
- Unity Perception: Generate Synthetic Data for Computer Vision / S. Borkman, A. Crespi, S. Dhakad, S. Ganguly, J. Hogins, Y.-C. Jhang, M. Kamalzadeh, B. Li, S. Leal, P. Parisi, C. Romero, W. Smith, A. Thaman, S. Warren, N. Yadav // CoRR. – 2021. – abs/2107.04259
- ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications / H. Hwang, C. Jang, G. Park, J. Cho, I.-J. Kim // arXiv. – 2020. – abs/2010.14742. doi: 10.48550/ARXIV.2010.14742
- Dilmegani G. Top 20 Synthetic Data Use Cases and Applications in 2023: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data-use-cases (accessed: 13.09.2023).
- Dilmegani G. The Ultimate Guide to Synthetic Data: Uses, Benefits and Tools: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data-tools (accessed: 13.09.2023).
- Dilmegani G. The Ultimate Guide to Synthetic Data in 2023: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data/ (accessed: 13.09.2023).
- Dilmegani G. Synthetic Data Generation: Techniques, Best Practices and Tools: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data-generation/ (accessed: 13.09.2023).
- Cherepanov F.M., Yasnitsky L.N. Laboratory practicum on neural network technologies. – Text: print // Advanced Technologies of Artificial Intelligence: proceedings of the international scientific and practical conference (Penza, Penza Univ., Scientific Council of the Russian Academy of Sciences on the Methodology of Artificial Intelligence, 1–6 July 2008) / Penza Univ. – Penza. – 2008. – P. 128–130. (In Russian)
- Cherepanov F.M., Yasnitsky L.N. Laboratory practicum on neural network technologies: certificate of state registration of a computer program No. 2009611544. Application No. 2009610226. Registered in the Register of Computer Programs on 12 March 2009. (In Russian)