Review of methods and systems for generation of synthetic training data

Abstract


It is impossible to imagine the advancement of modern artificial intelligence systems without neural network technologies. During the design process, researchers often find that there is not enough data to train modern neural network models, or that the available data are unbalanced or highly sparse. Often real data simply do not exist because the research field is still emerging. A related problem is ensuring the confidentiality of real personal data or patient medical data that are exchanged between researchers or used to test various neural network systems. In many subject areas, the cost of collecting and labeling real data can be very high. Synthetic data are increasingly being used to solve these problems. The purpose of this publication is to introduce readers to advances in the generation and use of synthetic data. The paper describes various methods, systems and software tools used to generate synthetic data, which can help to improve neural network models. Since an entire industry for synthetic data production has already formed, the leading data synthesis technology platforms are presented. The paper is a review, so it contains an extensive bibliography. Its value is that it will help readers broaden their understanding of how synthetic data are used to solve a wide range of neural network problems and become more familiar with the methods and tools for generating such data.

Full Text

When designing modern neural network models, researchers often face the problem that not enough data are available for training, or that the available data are unbalanced or sparse. Often real data simply do not exist because the research field is still emerging. There is also the problem of the confidentiality of real personal data or patient medical data that are exchanged between researchers or used to test various neural network systems. In all these cases, synthetic data come to the rescue. As the name suggests, synthetic data are data created artificially rather than produced by real events. They are often created with the help of algorithms and are used for a wide range of purposes. One of the first mentions of the use of synthetic data is associated with the development and testing of an intrusion detection system by DARPA in 1998 and 1999 [1]. The test data contained network traffic and system call log files from a simulated large computer network. The attack data were generated synthetically from scenarios of possible attacks, and the background data were produced by software automata that simulated the use of various services. The use of synthetic data allowed the developers to model and test various intrusion scenarios that had not been encountered before. When new services are still being tested before being put into operation, data for training neural networks may simply be absent. In that case, synthesized data are needed for testing. For example, the authors of [2] used synthetic data as early as 2002 when building a fraud detection system.

About the authors

A. N. Rabchevsky

Perm State University; LLC “SEUSLAB”

References

  1. Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation / R.P. Lippmann, D.J. Fried, I. Graf, J.W. Haines, K.R. Kendall, D. McClung, D. Weber, S.E. Webster, D. Wyschogrod, R.K. Cunningham, M.A. Zissman // In: Proceedings DARPA Information Survivability Conference and Exposition. DISCEX’00. IEEE Comput. Soc. – 2000. – P. 12–26. doi: 10.1109/DISCEX.2000.821506
  2. Lundin E., Kvarnström H., Jonsson E. A Synthetic Fraud Data Generation Methodology // Deng R., Bao F., Zhou J., Qing S. (eds.) Information and Communications Security. Springer Berlin Heidelberg, Berlin, Heidelberg. – 2002. – P. 265–277. doi: 10.1007/3-540-36159-6_23
  3. Learning Deep Models from Synthetic Data for Extracting Dolphin Whistle Contours / P. Li, X. Liu, K.J. Palmer, E. Fleishman, D. Gillespie, E.-M. Nosal, Y. Shiu, H. Klinck, D. Cholewiak, T. Helble, M.A. Roch. – 2020. doi: 10.48550/ARXIV.2005.08894
  4. Lombardo J., Moniz L.A. Method for Generation and Distribution of Synthetic Medical Record Data for Evaluation of Disease-Monitoring Systems // Johns Hopkins APL Technical Digest (Applied Physics Laboratory). – 2008. – Vol. 27.
  5. Construction and Validation of Synthetic Electronic Medical Records / L. Moniz, A.L. Buczak, L. Hung, S. Babin, M. Dorko, J. Lombardo // Online J Public Health Inform. – 2009. – Vol. 1. doi: 10.5210/ojphi.v1i1.2720
  6. Buczak A.L., Babin S., Moniz L. Data-driven approach for creating synthetic electronic medical records // BMC Med Inform Decis Mak. – 2010. – Vol. 10, iss. 1. doi: 10.1186/1472-6947-10-59
  7. Jin C., Rinard M.C. Learning From Context-Agnostic Synthetic Data // CoRR. – 2020. – abs/2005.14707
  8. McKenna R., Miklau G., Sheldon D. Winning the NIST Contest: A scalable and general approach to differentially private synthetic data // CoRR. – 2021. – abs/2108.04978
  9. Noise-Aware Statistical Inference with Differentially Private Synthetic Data / O. Räisä, J. Jälkö, S. Kaski, A. Honkela // arXiv. – 2022. – abs/2205.14485. doi: 10.48550/ARXIV.2205.14485
  10. Awan J., Cai Z. One Step to Efficient Synthetic Data // arXiv. – 2020. – abs/2006.02397. doi: 10.48550/ARXIV.2006.02397
  11. Goetz J., Tewari A. Federated Learning via Synthetic Data // CoRR. – 2020. – abs/2008.04489
  12. FedSynth: Gradient Compression via Synthetic Data in Federated Learning / S. Hu, J. Goetz, K. Malik, H. Zhan, Z. Liu, Y. Liu // arXiv. – 2022. – abs/2204.01273. doi: 10.48550/ARXIV.2204.01273
  13. SMOTE: Synthetic Minority Over-sampling Technique / N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer // Journal of Artificial Intelligence Research. – 2002. – Vol. 16. – P. 321–357. doi: 10.1613/jair.953
  14. Mukherjee M., Khushi M. SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for nominal and continuous features // Applied System Innovation. – 2021. – Vol. 4, iss. 1. – Art. 18. doi: 10.3390/asi4010018
  15. A Method for Handling Multi-class Imbalanced Data by Geometry based Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS) / A. Majumder, S. Dutta, S. Kumar, L. Behera // CoRR. – 2020. – abs/2010.05155
  16. Gonsior J., Thiele M., Lehner W. ImitAL: Learning Active Learning Strategies from Synthetic Data // CoRR. – 2021. – abs/2108.07670
  17. Dataset Condensation via Efficient Synthetic-Data Parameterization / J.-H. Kim, J. Kim, S.J. Oh, S. Yun, H. Song, J. Jeong, J.-W. Ha, H.O. Song // arXiv. – 2022. – abs/2205.14959. doi: 10.48550/ARXIV.2205.14959
  18. Effective Use of Synthetic Data for Urban Scene Semantic Segmentation / F.S. Saleh, M.S. Aliakbarian, M. Salzmann, L. Petersson, J.M. Alvarez // CoRR. – 2018. – abs/1807.06132
  19. Mason K., Vejdan S., Grijalva S. An “On The Fly” Framework for Efficiently Generating Synthetic Big Data Sets // CoRR. – 2019. – abs/1903.06798
  20. Condrea F., Ivan V.-A., Leordeanu M. In Search of Life: Learning from Synthetic Data to Detect Vital Signs in Videos // CoRR. – 2020. – abs/2004.07691
  21. Conditional Synthetic Data Generation for Robust Machine Learning Applications with Limited Pandemic Data / H.P. Das, R. Tran, J. Singh, X. Yue, G. Tison, A.L. Sangiovanni-Vincentelli, C.J. Spanos // CoRR. – 2021. – abs/2109.06486
  22. Rabchevsky A.N., Ashikhmin E.G., Rabchevsky E.A. Modeling the propaganda structure of a protest movement in social networks using graph analysis and neural network technologies // Mathematical and Computer Modeling: proceedings of the IX International Scientific Conference dedicated to the 85th anniversary of Professor V.I. Potapov. – Omsk, 2021. – P. 273–276. (In Russian, print)
  23. Rabchevsky A., Yasnitsky L., Zayakin V. Comparison of methods for identifying user roles in online social networks // Applied Mathematics and Control Sciences. – 2021. – Vol. 2. – P. 93–111. doi: 10.15593/2499-9873/2021.2.06
  24. Rabchevskiy A.N., Yasnitskiy L.N. Creating and Using Synthetic Data for Neural Network Training, Using the Creation of a Neural Network Classifier of Online Social Network User Roles as an Example // Digital Science. DSIC 2021. Lecture Notes in Networks and Systems, Springer, Cham. – 2022. – Vol. 381. – P. 412–421. doi: 10.1007/978-3-030-93677-8_36
  25. Generation of synthetic training data for object detection in piles / E. Buls, R. Kadikis, R. Cacurs, J. Ārents // D.P. Nikolaev, P. Radeva, A. Verikas, J. Zhou (eds.) Eleventh International Conference on Machine Vision (ICMV 2018), SPIE. – 2019. – P. 105. doi: 10.1117/12.2523203
  26. An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection / S. Hinterstoisser, O. Pauly, H. Heibel, M. Marek, M. Bokeloh // CoRR. – 2019. – abs/1902.09967
  27. Dina A.S., Siddique A.B., Manivannan D. Effect of Balancing Data Using Synthetic Data on the Performance of Machine Learning Classifiers for Intrusion Detection in Computer Networks // arXiv. – 2022. – abs/2204.00144. doi: 10.48550/ARXIV.2204.00144
  28. Charitou C., Dragicevic S., d’Avila Garcez A. Synthetic Data Generation for Fraud Detection using GANs // CoRR. – 2021. – abs/2109.12546
  29. Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study / C. Little, M. Elliot, R. Allmendinger, S.S. Samani // arXiv. – 2021. – abs/2112.01925. doi: 10.48550/ARXIV.2112.01925
  30. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification / M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan // Neurocomputing. – 2018. – Vol. 321. – P. 321–331. doi: 10.1016/j.neucom.2018.09.013
  31. GAN-based synthetic brain MR image generation / C. Han, H. Hayashi, L. Rundo, R. Araki, W. Shimoda, S. Muramatsu, Y. Furukawa, G. Mauri, H. Nakayama // 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). – 2018. – P. 734–738.
  32. Constrained Generative Adversarial Network Ensembles for Sharable Synthetic Data Generation / E. Dikici, L.M. Prevedello, M. Bigelow, R.D. White, B.S. Erdal // arXiv. – 2020. – abs/2003.00086. doi: 10.48550/ARXIV.2003.00086
  33. Using Synthetic Data to Enhance the Accuracy of Fingerprint-Based Localization: A Deep Learning Approach / M. Nabati, H. Navidan, R. Shahbazian, S.A. Ghorashi, D. Windridge // IEEE Sens Lett. – 2020. – Vol. 4. – P. 1–4. doi: 10.1109/lsens.2020.2971555
  34. Synthetic Data Generation and Adaption for Object Detection in Smart Vending Machines / K. Wang, F. Shi, W. Wang, Y. Nan, S. Lian // CoRR. – 2019. – abs/1904.12294
  35. Synthetic Data and Hierarchical Object Detection in Overhead Imagery / N. Clement, A. Schoen, A.P. Boedihardjo, A. Jenkins // CoRR. – 2021. – abs/2102.00103
  36. Jordon J., Yoon J., van der Schaar M. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees // In: ICLR. – 2019.
  37. Arnold C. Releasing differentially private synthetic micro-data with bayesian gans. – 2018.
  38. Li M., Zhuang D., Chang J.M. MC-GEN: Multi-level Clustering for Private Synthetic Data Generation // arXiv. – 2022. – abs/2205.14298. doi: 10.48550/ARXIV.2205.14298
  39. Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards / S.S. Tay, X. Xu, C.S. Foo, B.K.H. Low // CoRR. – 2021. – abs/2112.09327
  40. Liu T., Vietri G., Wu Z.S. Iterative Methods for Private Synthetic Data: Unifying Framework and New Methods // CoRR. – 2021. – abs/2106.07153
  41. Low Light Video Enhancement using Synthetic Data Produced with an Intermediate Domain Mapping / D. Triantafyllidou, S. Moran, S. McDonagh, S. Parisot, G. Slabaugh // arXiv. – 2020. – abs/2007.09187. doi: 10.48550/ARXIV.2007.09187
  42. Adapting deep generative approaches for getting synthetic data with realistic marginal distributions / K. Farhadyar, F. Bonofiglio, D. Zoeller, H. Binder // arXiv. – 2021. – abs/2105.06907. doi: 10.48550/ARXIV.2105.06907
  43. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks / B. van Breugel, T. Kyono, J. Berrevoets, M. van der Schaar // CoRR. – 2021. – abs/2110.12884
  44. Graham P., Penny R. Multiply Imputed Synthetic Data Files // Official Statistics Research Series, Statistics New Zealand. – 2007. – Vol. 1.
  45. Boedihardjo M., Strohmer T., Vershynin R. Private sampling: a noiseless approach for generating differentially private synthetic data // CoRR. – 2021. – abs/2109.14839
  46. Boedihardjo M., Strohmer T., Vershynin R. Covariance’s Loss is Privacy’s Gain: Computationally Efficient, Private and Accurate Synthetic Data // CoRR. – 2021. – abs/2107.05824
  47. Kamthe S., Assefa S., Deisenroth M. Copula Flows for Synthetic Data Generation // arXiv. – 2021. – abs/2101.00598. doi: 10.48550/ARXIV.2101.00598
  48. Li Z., Zhao Y., Fu J. SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources // arXiv. – 2020. – abs/2009.09471. doi: 10.48550/ARXIV.2009.09471
  49. End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems / S. Shakeri, C.N. dos Santos, H. Zhu, P. Ng, F. Nan, Z. Wang, R. Nallapati, B. Xiang // CoRR. – 2020. – abs/2010.06028
  50. Bousquet O., Livni R., Moran S. Synthetic Data Generators: Sequential and Private // arXiv. – 2019. – abs/1902.03468. doi: 10.48550/ARXIV.1902.03468
  51. Exploring Invariances in Deep Convolutional Neural Networks Using Synthetic Images / X. Peng, B. Sun, K. Ali, K. Saenko // CoRR. – 2014. – abs/1412.7122
  52. Transfer Learning from Synthetic to Real Images Using Variational Autoencoders for Precise Position Detection / T. Inoue, S. Choudhury, G. de Magistris, S. Dasgupta // 2018 25th IEEE International Conference on Image Processing (ICIP). – 2018. – P. 2725–2729. doi: 10.1109/ICIP.2018.8451064
  53. Learning to Augment Synthetic Images for Sim2Real Policy Transfer / A. Pashevich, R. Strudel, I. Kalevatykh, I. Laptev, C. Schmid // CoRR. – 2019. – abs/1903.07740. doi: 10.48550/arXiv.1903.07740
  54. AutoSimulate: (Quickly) Learning Synthetic Data Generation / H.S. Behl, A.G. Baydin, R. Gal, P.H.S. Torr, V. Vineet // CoRR. – 2020. – abs/2008.08424
  55. ProcSy: Procedural Synthetic Dataset Generation Towards Influence Factor Studies Of Semantic Segmentation Networks / S. Khan, B. Phan, R. Salay, K. Czarnecki // CVPR Workshops. – 2019.
  56. Illumination Invariant Camera Localization Using Synthetic Images / S. Shoman, T. Mashita, A. Plopski, P. Ratsamee, Y. Uranishi, H. Takemura // 2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). – IEEE. – 2018. – P. 143–144. doi: 10.1109/ISMAR-Adjunct.2018.00053
  57. Rozantsev A., Lepetit V., Fua P. On rendering synthetic images for training an object detector // Computer Vision and Image Understanding. – 2015. – Vol. 137. – P. 24–37. doi: 10.1016/j.cviu.2014.12.006
  58. Synthetic Data Are as Good as the Real for Association Knowledge Learning in Multi-object Tracking / Y. Liu, Z. Wang, X. Zhou, L. Zheng // CoRR. – 2021. – abs/2106.16100
  59. Automatic Generation of Synthetic LiDAR Point Clouds for 3-D Data Analysis / F. Wang, Y. Zhuang, H. Gu, H. Hu // IEEE Trans Instrum Meas. – 2019. – Vol. 68. – P. 2671–2673. doi: 10.1109/TIM.2019.2906416
  60. Learning how to analyse crowd behaviour using synthetic data / A.R. Khadka, M.M. Oghaz, W. Matta, M. Cosentino, P. Remagnino, V. Argyriou // Proceedings of the 32nd International Conference on Computer Animation and Social Agents. – ACM. – New York. – NY. – USA. – 2019. – P. 11–14. doi: 10.1145/3328756.3328773
  61. Synthetic Data Generation for Deep Learning of Underwater Disparity Estimation / E.A. Olson, C. Barbalata, J. Zhang, K.A. Skinner, M. Johnson-Roberson // OCEANS 2018 MTS/IEEE Charleston. – 2018. – P. 1–6.
  62. Sun S., Shi H., Wu Y. A survey of multi-source domain adaptation // Information Fusion. – 2015. – Vol. 24. – P. 84–92. doi: 10.1016/j.inffus.2014.12.003
  63. Ren Z., Lee Y.J. Cross-Domain Self-Supervised Multi-task Feature Learning Using Synthetic Imagery // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2018. – P. 762–771. doi: 10.1109/CVPR.2018.00086
  64. Effect of Kinematics and Fluency in Adversarial Synthetic Data Generation for ASL Recognition with RF Sensors / M.M. Rahman, E. Malaia, A.C. Gurbuz, D.J. Griffin, C. Crawford, S. Gurbuz // IEEE Trans Aerosp Electron Syst. – 2022. – Vol. 1. doi: 10.1109/taes.2021.3139848
  65. Alkhalifah T., Wang H., Ovcharenko O. MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning // arXiv. – 2021. – abs/2109.05294. doi: 10.48550/ARXIV.2109.05294
  66. Learning from Synthetic Data for Opinion-free Blind Image Quality Assessment in the Wild / Z. Wang, Z.-R. Tang, Z. Yu, J. Zhang, Y. Fang // CoRR. – 2021. – abs/2106.14076
  67. Meta-Sim: Learning to Generate Synthetic Datasets / A. Kar, A. Prakash, M.-Y. Liu, E. Cameracci, J. Yuan, M. Rusiniak, D. Acuna, A. Torralba, S. Fidler // CoRR. – 2019. – abs/1904.11621
  68. Stein G.J., Roy N. GeneSIS-RT: Generating Synthetic Images for training Secondary Real-world Tasks // CoRR. – 2017. – abs/1710.04280. doi: 10.48550/arXiv.1710.04280
  69. S^3Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data / B. Cheng, I.S. Saggu, R. Shah, G. Bansal, D. Bharadia // CoRR. – 2020. – abs/2007.14511
  70. PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision / S.E. Ebadi, Y.-C. Jhang, A. Zook, S. Dhakad, A. Crespi, P. Parisi, S. Borkman, J. Hogins, S. Ganguly // CoRR. – 2021. – abs/2112.09290
  71. Hart K.M., Goodman A.B., O’Shea R.P. Automatic Generation of Machine Learning Synthetic Data Using ROS // CoRR. – 2021. – abs/2106.04547
  72. UnrealROX+: An Improved Tool for Acquiring Synthetic Data from Virtual 3D Environments / P. Martinez-Gonzalez, S. Oprea, J.A. Castro-Vargas, A. Garcia-Garcia, S. Orts-Escolano, J.G. Rodríguez, M. Vincze // CoRR. – 2021. – abs/2104.11776
  73. Jin C., Rinard M.C. Learning From Context-Agnostic Synthetic Data // CoRR. – 2020. – abs/2005.14707
  74. Baek K., Shim H. Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data // arXiv. – 2022. – abs/2204.04950. doi: 10.48550/ARXIV.2204.04950
  75. Deep Learning based Food Instance Segmentation using Synthetic Data / D. Park, J. Lee, J. Lee, K. Lee // CoRR. – 2021. – abs/2107.07191
  76. Raab G.M., Nowok B., Dibben C. Assessing, visualizing and improving the utility of synthetic data // arXiv. – 2021. – abs/2109.12717. doi: 10.48550/ARXIV.2109.12717
  77. Unity Perception: Generate Synthetic Data for Computer Vision / S. Borkman, A. Crespi, S. Dhakad, S. Ganguly, J. Hogins, Y.-C. Jhang, M. Kamalzadeh, B. Li, S. Leal, P. Parisi, C. Romero, W. Smith, A. Thaman, S. Warren, N. Yadav // CoRR. – 2021. – abs/2107.04259
  78. ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications / H. Hwang, C. Jang, G. Park, J. Cho, I.-J. Kim // arXiv. – 2020. – abs/2010.14742. doi: 10.48550/ARXIV.2010.14742
  79. Dilmegani G. Top 20 Synthetic Data Use Cases and Applications in 2023: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data-use-cases (accessed: 13.09.2023).
  80. Dilmegani G. The Ultimate Guide to Synthetic Data: Uses, Benefits and Tools: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data-tools (accessed: 13.09.2023).
  81. Dilmegani G. The Ultimate Guide to Synthetic Data in 2023: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data/ (accessed: 13.09.2023).
  82. Dilmegani G. Synthetic Data Generation: Techniques, Best Practices and Tools: website [Electronic resource]. – URL: https://research.aimultiple.com/synthetic-data-generation/ (accessed: 13.09.2023).
  83. Cherepanov F.M., Yasnitsky L.N. Laboratory practicum on neural network technologies // Promising Technologies of Artificial Intelligence: proceedings of the international scientific and practical conference (Penza, Penza University, Scientific Council of the Russian Academy of Sciences on the Methodology of Artificial Intelligence, July 1–6, 2008) / Penza University. – Penza. – 2008. – P. 128–130. (In Russian, print)
  84. Cherepanov F.M., Yasnitsky L.N. Laboratory practicum on neural network technologies: certificate of state registration of a computer program No. 2009611544. Application No. 2009610226. Registered in the Register of Computer Programs on March 12, 2009. (In Russian)
