Difference between revisions of "Vol-3194/paper41"
Jump to navigation
Jump to search
(edited by wikiedit) |
(edited by wikiedit) |
||
(One intermediate revision by the same user not shown) | |||
Line 1: | Line 1: | ||
− | + | =Paper= | |
{{Paper | {{Paper | ||
+ | |id=Vol-3194/paper41 | ||
+ | |storemode=property | ||
+ | |title=Generating Synthetic Discrete Datasets with Machine Learning | ||
+ | |pdfUrl=https://ceur-ws.org/Vol-3194/paper41.pdf | ||
+ | |volume=Vol-3194 | ||
+ | |authors=Giuseppe Manco,Ettore Ritacco,Antonino Rullo,Domenico Saccà,Edoardo Serra | ||
+ | |dblpUrl=https://dblp.org/rec/conf/sebd/0001RRSS22 | ||
|wikidataid=Q117344879 | |wikidataid=Q117344879 | ||
}} | }} | ||
+ | ==Generating Synthetic Discrete Datasets with Machine Learning== | ||
+ | <pdf width="1500px">https://ceur-ws.org/Vol-3194/paper41.pdf</pdf> | ||
+ | <pre> | ||
+ | Generating Synthetic Discrete Datasets with Machine | ||
+ | Learning | ||
+ | (Discussion Paper) | ||
+ | |||
+ | Giuseppe Manco1 , Ettore Ritacco1 , Antonino Rullo2 , Domenico Saccà2 and | ||
+ | Edoardo Serra3 | ||
+ | 1 | ||
+ | Institute for High Performance Computing and Networking (ICAR) of the Italian National Research Council (CNR), v. P. | ||
+ | Bucci 8/9C, 87036 Rende (CS), Italy | ||
+ | 2 | ||
+ | DIMES Department, University of Calabria, Rende, CS, 87036, Italy | ||
+ | 3 | ||
+ | Computer Science Department, Boise State University, Boise, ID 83725, USA | ||
+ | |||
+ | |||
+ | Abstract | ||
+ | The real data are not always available/accessible/sufficient or in many cases they are incomplete and | ||
+ | lacking in semantic content necessary to the definition of optimization processes. In this paper we | ||
+ | discuss about the synthetic data generation under two different perspectives. The core common idea is | ||
+ | to analyze a limited set of real data to learn the main patterns that characterize them and exploit this | ||
+ | knowledge to generate brand new data. The first perspective is constraint-based generation and consists | ||
+ | in generating a synthetic dataset satisfying given support constraints on the real frequent patterns. The | ||
+ | second one is based on probabilistic generative modeling and considers the synthetic generation as a | ||
+ | sampling process from a parametric distribution learned on the real data, typically encoded as a neural | ||
+ | network (e.g. Variational Autoencoders, Generative Adversarial Networks). | ||
+ | |||
+ | Keywords | ||
+ | Synthetic dataset, Data generation, Inverse Frequent Itemset Mining, Constraints-based models, Varia- | ||
+ | tional Autoencoder, Generative Adversarial Networks, Generative models | ||
+ | |||
+ | |||
+ | This paper is an extended abstract of [1]. | ||
+ | |||
+ | |||
+ | 1. Introduction | ||
+ | Emerging “Big Data” platforms and applications call for the invention of novel data analysis | ||
+ | techniques that are capable to effectively and efficiently handle large amount of data [2]. There | ||
+ | is therefore an increasing need to use real-life datasets for data-driven experiments but the | ||
+ | scarcity of significant datasets is a critical issue for research papers [3]. Synthetic data generation | ||
+ | can help in this, by reproducing the internal mechanisms and dependencies that justify the | ||
+ | occurrence of some specific pieces of information, and hence being able to replicate them | ||
+ | stochastically. In this paper we focus on the problem of generating high-dimensional discrete | ||
+ | data. Generating such data is a challenge because of the combination of both high dimensionality | ||
+ | and discrete components, which may result in a complex structural domain with lots of variety | ||
+ | and irregularities, not necessarily smooth. Throughout the paper, we shall study approaches to | ||
+ | data generation which rely on the idea that each point in the real domain can be mapped into a | ||
+ | |||
+ | SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy | ||
+ | © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). | ||
+ | CEUR | ||
+ | Workshop | ||
+ | Proceedings | ||
+ | http://ceur-ws.org | ||
+ | ISSN 1613-0073 | ||
+ | CEUR Workshop Proceedings (CEUR-WS.org) | ||
+ | �Table 1 | ||
+ | Transactional Dataset end Frequent Itemsets. | ||
+ | |||
+ | 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 | ||
+ | 1 1 1 0 40 𝑆𝑖 𝑎 𝑏 𝑐 𝑑 𝜎 𝑆𝑖 | ||
+ | 0 1 1 1 40 𝑆1 1 1 0 0 100 | ||
+ | 1 1 0 0 60 𝑆2 0 1 1 0 100 | ||
+ | 0 1 1 0 20 𝑆3 0 0 1 1 50 | ||
+ | 1 0 1 1 10 | ||
+ | (𝑏) Frequent Itemsets with threshold 50 | ||
+ | (𝑎) Dataset 𝒟 | ||
+ | |||
+ | |||
+ | |||
+ | suitable latent space and vice versa. These mappings guarantee a consistency with regards to | ||
+ | the original dataset. At the same time, the manifold in the latent space summarizes the main | ||
+ | characteristics of the data, that can hence be injected into the synthesized data in a controlled | ||
+ | way. We first discuss two main approaches, that are the Inverse Frequent Itemset Mining (IFM) | ||
+ | and probabilistic generative modeling (PGM); finally, a comparison of the results obtained with | ||
+ | some of the presented algorithms are provided. | ||
+ | |||
+ | |||
+ | 2. Inverse Frequent Itemset Mining-based Generative Models | ||
+ | Let ℐ be a finite domain of 𝑛 elements, also called items. Any subset 𝐼 ⊆ ℐ is an itemset over | ||
+ | ℐ, also called a transaction. Let 𝒰ℐ denote the set of all itemsets on ℐ; then, |𝒰ℐ | = 2𝑛 . A | ||
+ | (transactional) database 𝒟 is a set of tuples [𝑘, 𝐼], where 𝑘 is the key and 𝐼 is an itemset. The size | ||
+ | |𝒟| of 𝒟 is the total number of its itemsets, i.e., transactions. A transactional database 𝒟 is very | ||
+ | often represented as a bag of itemsets, i.e, the keys are omitted so that tuples are simply itemsets | ||
+ | and may therefore occur duplicated – in this case 𝒟 is also called a transactional dataset. In | ||
+ | the paper we shall also represent an itemset 𝐼 ∈ 𝒟 by its one-hot encoding x , that is a binary | ||
+ | vector of size 𝑛 (the number of items in ℐ) such that its 𝑖-th position 𝑥𝑖 = 1 if the 𝑖-th item in ℐ | ||
+ | is in 𝐼, 0 otherwise. Consequently, 𝒳 = {x 1 , . . . , x 𝜂 }, where 𝜂 = |𝒟|, is the one-hot encoding | ||
+ | of the whole dataset 𝒟. For each itemset 𝐼 ∈ 𝒟, there exist two important measures: (i) the | ||
+ | number of duplicates of 𝐼, denoted as 𝛿 𝒟 (𝐼), that is the number of occurrences of 𝐼 in 𝒟, and (ii) | ||
+ | the support of 𝐼, denoted as 𝜎 𝒟 (𝐼), that | ||
+ | ∑︀ is the sum of the number of duplicates of each itemset | ||
+ | 𝐽 in 𝒟 containing 𝐼, i.e., 𝜎 𝒟 (𝐼) = 𝐽∈𝒟∧𝐼⊆𝐽 𝛿 𝒟 (𝐽) . A dataset 𝒟 can be represented in a | ||
+ | succinct format as a set of pairs (𝐼, 𝜎 𝒟 (𝐼)). Given ℐ = {𝑎, 𝑏, 𝑐, 𝑑}, an example of dataset in the | ||
+ | succinct, one-hot format is shown in Table 1-𝑎. We say that 𝐼 is a frequent (resp., infrequent) | ||
+ | itemset in 𝒟 if its support is greater than or equal to (resp., less than) a given threshold. A | ||
+ | classical data mining task over transaction datasets is to detect the set of the frequent/infrequent | ||
+ | itemsets, and a rich literature deals with this topic [4, 5, 6] Given the threshold 50, the frequent | ||
+ | itemsets for the dataset of Table 1-𝑎 are listed in Table 1-𝑏. The perspective of the frequent | ||
+ | itemset mining problem has been later inverted as follows: given a set of itemsets together with | ||
+ | their frequency constraints the goal is to compute, if any, a transaction dataset satisfying the | ||
+ | above constraints. This new problem is called the inverse frequent itemset mining problem (IFM) | ||
+ | algorithms [7], and has been later investigated also in privacy preserving contexts [8]. Given | ||
+ | a set ℐ of items, the IFM problem consists in finding a dataset 𝒟 that satisfies given support | ||
+ | �Table 2 | ||
+ | Examples of Feasible Datasets with size 170 for an IFM instance. | ||
+ | |||
+ | 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 | ||
+ | 1 1 1 0 70 1 1 1 0 40 1 1 1 1 30 | ||
+ | 0 1 1 1 10 0 1 1 1 40 1 1 1 0 30 | ||
+ | 1 1 0 0 30 1 1 0 0 60 0 1 1 1 20 | ||
+ | 0 1 1 0 20 0 1 1 0 20 1 1 0 0 40 | ||
+ | 0 0 1 1 40 0 0 1 1 10 0 1 1 0 50 | ||
+ | (𝑎) Dataset 𝒟1 (𝑏) Dataset 𝒟2 (𝑐) Dataset 𝒟3 | ||
+ | |||
+ | |||
+ | constraints on some itemsets 𝑆𝑖 on ℐ — the set of such itemsets is denoted by 𝑆. The support | ||
+ | constraints are represented as follows: ∀𝑆𝑖 ∈ 𝑆 : 𝜎𝑚𝑖𝑛 𝑖 ≤ 𝜎 𝒟 (𝑆𝑖 ) ≤ 𝜎𝑚𝑎𝑥𝑖 , where 𝜎 𝒟 (𝑆𝑖 ) is | ||
+ | the sum of all number of duplicates of itemsets in 𝒟 containing 𝐼. As an example, consider | ||
+ | ℐ = {𝑎, 𝑏, 𝑐, 𝑑}, 𝑆 = {{𝑎, 𝑏}, {𝑏, 𝑐}, {𝑐, 𝑑}} and the support constraints represented in Table | ||
+ | 1-𝑏 – in this example minimal and maximal supports coincide. The itemsets 𝑆1 = {𝑎, 𝑏} and | ||
+ | 𝑆2 = {𝑏, 𝑐} must occur in exactly 100 transactions (possibly as their sub-transactions) whereas | ||
+ | the itemset 𝑆3 = {𝑐, 𝑑} must occur in exactly 50 transactions. It is also required that the dataset | ||
+ | size (i.e., the total number of transactions) be 170. The dataset 𝐷1 shown in Table 2-𝑎 is feasible | ||
+ | as it satisfies all constraints: 𝑆1 is satisfied by the transactions {𝑎, 𝑏, 𝑐} and {𝑎, 𝑏}, 𝑆2 by the | ||
+ | transactions {𝑎, 𝑏, 𝑐}, {𝑏, 𝑐, 𝑑} and {𝑏, 𝑐} and 𝑆3 by the transactions {𝑏, 𝑐, 𝑑} and {𝑐, 𝑑}. | ||
+ | Let 𝑆 ′ be the set of all itemsets that are neither in 𝑆 nor subsets of some itemset in 𝑆. In | ||
+ | the example, 𝑆 ′ consists of {𝑎, 𝑏, 𝑐, 𝑑}, {𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑑}, {𝑎, 𝑐, 𝑑}, {𝑏, 𝑐, 𝑑}, {𝑎, 𝑐}, {𝑎, 𝑑} and | ||
+ | {𝑏, 𝑑}. IFM does not enforce any constraint on the itemsets in 𝑆 ′ and, therefore, it may happen | ||
+ | that 𝒟 contains additional (and, in some cases, unsuspected or even undesired) frequent itemsets. | ||
+ | In the dataset 𝒟1 of Table 2-𝑎, the itemset {𝑎, 𝑏, 𝑐} is in 𝑆 ′ but it turns out to be frequent with a | ||
+ | support of 70. To remove the anomaly, Guzzo et al. [9] have proposed an alternative formulation, | ||
+ | called IFM𝑆 , that requires that only itemsets in 𝑆 can be included as transactions in 𝒟 and, | ||
+ | therefore, no unexpected frequent itemsets may eventually occur. Obviously, the decision | ||
+ | complexity of this problem is lower as it is NP-complete. Despite the complexity improvement, | ||
+ | the IFM𝑆 formulation has a severe drawback: it is too restrictive in excluding any transaction | ||
+ | besides the ones in 𝑆 as confirmed by the fact that no feasible dataset exists for our running | ||
+ | example. To weaken the tight restrictions of IFM𝑆 , Guzzo et al. [10] proposed a new formulation | ||
+ | of the problem, called IFM with infrequency support constraints (IFMI for short), which admits | ||
+ | transactions in 𝑆 ′ to be in a feasible dataset if their supports are below a given threshold 𝜎 ′ . By | ||
+ | the anti-monotonicity property, the number of infrequency support constraints can be reduced | ||
+ | by applying them only to a subset of 𝑆 ′ consisting of its minimal (inclusion-wise) elements. This | ||
+ | subset, denoted by 𝐵𝑆 ′ , is called the negative border and coincides with the set of all minimal | ||
+ | transversals of the hypergraph 𝐸 ¯ = {ℐ ∖ 𝐼 : 𝐼 ∈ 𝑆} (as defined in [11]). In the example, | ||
+ | 𝐵𝑆 ′ = {{𝑎, 𝑐}, {𝑎, 𝑑}, {𝑏, 𝑑}} and the dataset 𝒟2 in Table 2-𝑏 is a feasible dataset for IFMI for | ||
+ | 𝜎 ′ = 40. In fact, all infrequency support constraints on 𝐵𝑆 ′ are satisfied as the supports of | ||
+ | {𝑎, 𝑐}, {𝑎, 𝑑}, {𝑏, 𝑑} are respectively 40, 0 and 40. Another possibility to enforce infrequency | ||
+ | constraints is to fix a duplicate threshold 𝛿 ′ so that an itemset in 𝑆 ′ is admitted as transaction | ||
+ | �in a feasible dataset if its number of occurrences is at most 𝛿 ′ . This formulation has been given | ||
+ | in [12] with the name of IFM with infrequency duplicate constraints (IFMD for short). Observe | ||
+ | that duplicate constraints are less restrictive than infrequency constraints in the sense that | ||
+ | some itemset 𝐼 in 𝐵𝑆 ′ may happen to be eventually frequent as it may inherit the supports of | ||
+ | several itemsets in 𝑆 ′ with duplicates below the threshold. For instance, given the threshold | ||
+ | 𝛿 ′ = 30, the dataset 𝒟3 in Table 2-𝑐 is a feasible dataset for IFMD . However, the supports of | ||
+ | {𝑎, 𝑐}, {𝑎, 𝑑}, {𝑏, 𝑑} are respectively 60, 30 and 50, thus {𝑎, 𝑐} and {𝑏, 𝑑} are frequent. | ||
+ | |||
+ | |||
+ | 3. Machine Learning-based Generative Models | ||
+ | The probabilistic approach to model transactional data (PGM - probabilistic generative modeling) | ||
+ | assumes that in a database 𝒟 the itemsets are modeled as stochastic events: that is, they are | ||
+ | sampled from an unknown true distribution P𝑟 . The analysis of the statistical distribution of the | ||
+ | stochastic events provides insights on the mathematical rules governing the generation process. | ||
+ | The problem hence becomes how to obtain a smooth and reliable estimate of P𝑟 . | ||
+ | In general, it is convenient to use a parametric model to estimate P𝑟 when the constraints on | ||
+ | the shape of the distribution are known. By associating each observation x with a probability | ||
+ | measure 𝑃 (x |𝜃) ≡ 𝑃𝜃 (x ) where 𝜃 is the set of the distribution parameters, our problem | ||
+ | hence becomes to devise the optimal parameter set 𝜃 that guarantees a reliable approximation | ||
+ | P𝜃 ≈ P𝑟 that can emulate the sampling process x ∼ P𝑟 in a tractable and reliable way. A | ||
+ | clear advantage of a parametric approaches to data generation lies in the insights that it can | ||
+ | provide within the data generation process. They allow to detect the factors governing the | ||
+ | data, providing a meaningful explanation of complex phenomena. In this work, we are focusing | ||
+ | on two state-of-the-art probabilistic approaches: Variational Autoencoders and Generative | ||
+ | Adversarial Networks. | ||
+ | A Variational Autoencoder (VAE) is an artificial neural architecture that combines tra- | ||
+ | ditional autoencoder architectures [13] with the concept of latent variable modeling [14]. | ||
+ | Essentially, we can assume the existence of a a 𝐾-dimensional latent space 𝒵, that can be the | ||
+ | generation engine of the samples in 𝒳 . The transactions 𝒳 = {x 1 , . . . , x 𝜂 } can be modeled | ||
+ | through a chain dependency: (i) given a distribution 𝑃𝜃 (over the parameter set 𝜃) we can | ||
+ | sample z ∈ 𝒵, and (ii) given a z and another distribution 𝑃𝜑 (over the parameter set 𝜑) we | ||
+ | can sample x . According to the maximum likelihood principle, optimal values of 𝜃 and 𝜑 can | ||
+ | be found by trying to maximizing 𝑃 (𝒳 ), but, unfortunately, this optimization is typically an | ||
+ | intractable problem that requires the exploitation of heuristics. | ||
+ | Variational inference [15] introduces a proposal normal distribution 𝑄𝜆 (z |x ) (over the | ||
+ | parameter set 𝜆), whose purpose is to approximate the true posterior 𝑃 (z |x ). Hence, a VAE | ||
+ | is devised by concatenating two neural networks: an “Encoder”, that maps an input x into | ||
+ | a latent variable z , exploiting 𝑄𝜆 (z |x ), and a “Decoder” that reconstructs x by applying | ||
+ | 𝑃𝜑 (x |z ) to z . The gain function, to be optimized to learn the network parameters 𝜆 and | ||
+ | 𝜑, is obtained by marginalizing the log-likelihood of 𝑃 (𝒳 ) over z and applying Jensen’s | ||
+ | inequality: log 𝑃 (x ) ≥ Ez ∼𝑄 [log 𝑃 (x |z )] − KL [𝑄(z |x )‖𝑃 (z )], where KL is the Kullback- | ||
+ | Leibler divergence. | ||
+ | The Decoder can be opportunely exploit to solve the discrete data generation problem. In fact, | ||
+ | � Embedding plot Embedding plot Embedding plot | ||
+ | 4 4 4 | ||
+ | 3 3 3 | ||
+ | 2 2 2 | ||
+ | 1 1 1 | ||
+ | 0 0 0 | ||
+ | 1 1 1 | ||
+ | 2 2 2 | ||
+ | 3 3 3 | ||
+ | 4 4 4 | ||
+ | 5 4 3 2 1 0 1 2 3 5 4 3 2 1 0 1 2 3 5 4 3 2 1 0 1 2 3 | ||
+ | |||
+ | (a) Real data mapping in 𝒵 (b) Synthetic data generation (c) Synthetic data generation | ||
+ | using MVAE using Clustering-variant of | ||
+ | MVAE | ||
+ | |||
+ | Figure 1: Illustration of the data generation process within VAE. | ||
+ | |||
+ | |||
+ | we can sample as many z as we want, feed them to the Decoder and obtain brand new synthetic | ||
+ | data that is similar to the real one, if the VAE was properly trained. A simple generation approach, | ||
+ | called Multinomial VAE (MVAE), was proposed in [16], where x ∼ Multinomial (𝜋(z )), with | ||
+ | 𝜋(z ) = softmax {exp [𝑃𝜑 (z )]} and z ∼ 𝒩 (0 , I 𝐾 ). A more sophisticated approach, that can | ||
+ | better observe the latent space of z and overcome the strong bias of considering only one | ||
+ | standard normal distribution, is implemented by clustering all the z ∼ 𝑄𝜆 (𝒳 ). The generation | ||
+ | of a synthetic x starts by choosing a cluster 𝑐, by exploiting a multinomial sampling according | ||
+ | to the clusters’ densities. Then, we can apply the MVAE sampling to pick up z ∼ 𝒩 (𝜇𝑐 , 𝜎 𝑐 ), | ||
+ | where 𝜇𝑐 and 𝜎 𝑐 are the mean and standard deviation within 𝑐. A comparison of these two | ||
+ | approaches is shown in Figure 1. Unfortunately, the approaches based on maximum likelihood | ||
+ | have been shown to suffer from over generalization [17], especially when data from P𝑟 is limited. | ||
+ | Generative Adversarial Networks (GANs) [18] propose an alternative modeling which | ||
+ | departs from the maximum likelihood and instead focuses on an alternative optimization strategy. | ||
+ | In order to optimize the weights 𝜃 of a neural network, called Generator G, able to learn the | ||
+ | probability space P𝜃 , Adversarial Networks rely on an auxiliary classifier 𝐷, with weights 𝜑, | ||
+ | trained to discriminate between real and generated data. In practice, optimality can be achieved | ||
+ | when x ∼ P𝜃 is indistinguishable from x ∼ P𝑟 . The training process can be hence devised as a | ||
+ | competitive game, with the generator trying to produce realistic samples, starting from random | ||
+ | samples in the latent space 𝒵, and the classifier focusing on the detection of generated data, | ||
+ | where the objective function is min𝜃 max𝜑 Ex ∼P𝑟 [log 𝐷𝜑 (x )] + Ex ∼P𝜃 [log(1 − 𝐷𝜑 (x ))]. | ||
+ | It can be shown [18] that the adoption of this alternate optimization is equivalent to train 𝜃 | ||
+ | to minimize the Jensen-Shannon divergence, that, differently from the approaches based on | ||
+ | maximum likelihood, has the objective of a complete adherence of P𝑟 and P𝜃 . In our context, | ||
+ | transactions are discrete vectors, with x ∈ {0, 1}𝑛 , thus backpropagation does not directly | ||
+ | apply to G and a workaround is needed. The simplest one is to admit a continuous relaxation of | ||
+ | the output of the generator. Just like with the variational autoencoder, the output of 𝐺 can be | ||
+ | modeled a multinomial probability (with no direct sampling channel), rather than a binary vector. | ||
+ | However, there is a major problem with this: the input of the discriminator would be a softmax | ||
+ | distribution from the generated transactions, and a binary vector for the real transactions. As | ||
+ | a result, the discriminator could easily tell them apart, with the result that the GAN would | ||
+ | get stuck in an equilibrium that is not good for the Generator. Different formulations of the | ||
+ | �Table 3 | ||
+ | Originary experimental dataset. | ||
+ | Tuple ID 𝑖1 𝑖2 𝑖3 𝑖4 nd itemset frequent support | ||
+ | 𝑡1 1 0 1 1 210 𝑖1 y 640 | ||
+ | 𝑡2 0 1 0 1 140 𝑖2 y 550 | ||
+ | 𝑡3 1 1 1 0 110 𝑖3 y 450 | ||
+ | 𝑡4 1 0 0 0 100 𝑖4 y 500 | ||
+ | 𝑡5 1 1 0 0 90 𝑖1 , 𝑖2 y 270 | ||
+ | 𝑡6 0 0 0 1 80 𝑖1 , 𝑖3 y 380 | ||
+ | 𝑡7 0 1 1 0 70 𝑖1 , 𝑖4 y 280 | ||
+ | 𝑡8 1 1 0 1 70 𝑖3 , 𝑖4 n 210 | ||
+ | 𝑡9 0 1 0 0 70 𝑖2 , 𝑖3 n 180 | ||
+ | 𝑡10 1 0 1 0 60 𝑖2 , 𝑖4 n 210 | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | adversarial training (based on Wasserstein distance [19], namely Wasserstein GANs or WGANs) | ||
+ | can partially mitigate this issue. | ||
+ | |||
+ | |||
+ | 4. Comparative Analysis | ||
+ | In this section we compare the two approaches discussed in sections 2 and 3 by analyzing their | ||
+ | behavior on a set of controlled experiments. For these, we use a toy dataset, shown in Table 3, | ||
+ | comprising 4 items upon which 10 patterns are selected. The dataset used in the experiments is | ||
+ | hence built by replicating such patterns with some fixed frequencies. We use the following | ||
+ | approaches to generate the synthetic datasets: | ||
+ | |||
+ | - IFM: IFM formulation with support ≥ 250. | ||
+ | - IFM𝐼 : IFM formulation with infrequent itemsets’ support ≤ 210. | ||
+ | - IFM𝐷 : IFM formulation with transaction duplicates ≤ 100. | ||
+ | - IFM𝐷𝐼 : IFM𝐷 merged with IFM𝐼 . | ||
+ | - IFM𝐷𝐼−5% : IFM𝐼 formulation imposing that each transaction has a number of duplicates | ||
+ | that differ less than 5% from the number of duplicates in the original dataset. | ||
+ | - VAE: Variational Auto Encoder generator. | ||
+ | - VAE−{𝑡4 , 𝑡6 }: VAE without the generation of 𝑡4 and 𝑡6 transactions (as visible in Table 3) | ||
+ | by means of sampling with rejection. | ||
+ | - IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 }: IFM𝐷𝐼−5% formulation imposing the number of duplicates for 𝑡4 | ||
+ | and 𝑡6 to be zero. | ||
+ | |||
+ | The purpose of the experiments is to observe the reconstruction process in both methods | ||
+ | and compare the resulting reconstructed datasets. The comparison relies on the transactions | ||
+ | (𝑡1 . . . 𝑡10 in the table) as well as both simple items and item pairs. We evaluate the faithfulness | ||
+ | of the reconstruction in two respects: (1) whether the patterns are reproduced, and (2) whether | ||
+ | their frequencies are faithful. A simple∑︀ metric to measure the reconstruction accuracy is the | ||
+ | |𝐷𝑖 −𝑂𝑖 | | ||
+ | discrepancy 𝒮, computed as 𝒮 = 𝑖∑︀ | ||
+ | , where, for a pattern 𝑖 (either a transaction or an | ||
+ | 𝑖 𝑂𝑖 | ||
+ | item pair), 𝐷𝑖 and 𝑂𝑖 represent the frequency of the pattern in the reconstructed and original | ||
+ | �Table 4 | ||
+ | Discrepancy. | ||
+ | Patterns IFM IFM𝐼 IFM𝐷 IFM𝐷𝐼 IFM𝐷𝐼−5% VAE VAE−{𝑡4 , 𝑡6 } IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 } | ||
+ | Transactions 33.00% 12.00% 48.00% 50.00% 4.80% 7.00% 36.00% 23.10% | ||
+ | Items & item pairs 3.68% 0.82% 1.91% 0.82% 0.11% 3.02% 16.68% 4.71% | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | Figure 2: Two-dimensional representation of the latent space | ||
+ | |||
+ | |||
+ | dataset, respectively. Table 4 reports the values of discrepancy, which are further detailed in | ||
+ | Figures 3-5. | ||
+ | We first analyze the results of the reconstruction for a VAE-based generative model. Figure | ||
+ | 3 shows a comparison with IFM𝐷𝐼−5% , the IFM-based formulation which provides the best | ||
+ | performances. The reconstruction provided by IFM𝐷𝐼−5% is extremely faithful, both on the | ||
+ | itemsets and the transactions. This because it enforces that the number of duplicates of each | ||
+ | transaction differs less than 5% from the number of duplicates in the original dataset. Figure 2 | ||
+ | shows the details of how the original patterns are mapped into the latent generative space: | ||
+ | the leftmost picture shows the mapping of the original data, and the rightmost shows how | ||
+ | the generation results from a larger region of the latent space. The main advantage of the | ||
+ | approach based on generative modeling through latent variables is that the latter allows to | ||
+ | control the reconstruction process. By acting on the latter we can modify the characteristics of | ||
+ | the reconstructed space. For example, we see that transaction 𝑡11 (the only spurious transaction | ||
+ | generated by VAE) is placed in a specific region, denoted by a red star in the figure. Sampling | ||
+ | repeatedly from that region would allow us to change the overall distribution of the transactions | ||
+ | while still maintaining the itemset distribution. By contrast, the IFM-based approaches are in | ||
+ | general successful in maintaining the itemset distributions. However, they tend to produce a | ||
+ | higher noise with transactions unless not explicitly constrained by the IFM𝐷𝐼−5% formulation | ||
+ | (see Figure 4). This noise can in principle be considered an advantage in specific contexts where | ||
+ | a differentiation from the original dataset is required (e.g., due to privacy concerns). | ||
+ | In principle, the adoption of IFM allows to implement a reconstruction "by design", by | ||
+ | choosing which itemsets to maintain or suppress. As an evidence, we report the cases of | ||
+ | IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 } and VAE−{𝑡4 , 𝑡6 }, where transaction 𝑡4 and 𝑡6 are removed from the | ||
+ | generation phases. In fact, Figure 5 shows that IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 } keeps the number of | ||
+ | � duplicates comparison IFM-VAE support comparison IFM-VAE | ||
+ | 250 700 | ||
+ | 600 | ||
+ | 200 | ||
+ | 500 | ||
+ | 150 400 | ||
+ | 100 300 | ||
+ | 200 | ||
+ | 50 | ||
+ | 100 | ||
+ | 0 0 | ||
+ | t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 1001 1011 0010 i1 i2 i3 i4 i1,i2 i1,i3 i1,i4 i3,i4 i2,i3 i2,i4 | ||
+ | |||
+ | original VAE IFM_DI 5% original VAE IFM_DI 5% | ||
+ | |||
+ | |||
+ | Figure 3: Details of the discrepancies on the number of duplicates (left) and the itemsets support (right) | ||
+ | between VAE and IFM-based techniques. | ||
+ | |||
+ | duplicates IFM-based techniques | ||
+ | 250 | ||
+ | 200 | ||
+ | 150 | ||
+ | 100 | ||
+ | 50 | ||
+ | 0 | ||
+ | t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 1001 0111 1011 0010 | ||
+ | original IFM IFM_I IFM_D IFM_DI IFM_DI 5% | ||
+ | |||
+ | support IFM-based techniques | ||
+ | 700 | ||
+ | |||
+ | 600 | ||
+ | |||
+ | 500 | ||
+ | |||
+ | 400 | ||
+ | |||
+ | 300 | ||
+ | |||
+ | 200 | ||
+ | |||
+ | 100 | ||
+ | |||
+ | 0 | ||
+ | i1 i2 i3 i4 i1,i2 i1,i3 i1,i4 i3,i4 i2,i3 i2,i4 | ||
+ | original IFM IFM_I IFM_D IFM_DI IFM_DI 5% | ||
+ | |||
+ | |||
+ | Figure 4: Details of the discrepancies on the number of duplicates (top) and the itemsets support | ||
+ | (bottom) between the original dataset and IFM-based techniques. | ||
+ | |||
+ | |||
+ | duplicates and the supports almost similar to the ones of the original dataset, while VAE−{𝑡4 , 𝑡6 } | ||
+ | changes many of them to remove the two transactions. The approach based on generative | ||
+ | modeling is in general more efficient. However, constraint-based generation is sensitive to the | ||
+ | frequency threshold and a suitable tuning can make these approaches comparable. | ||
+ | To summarize, these experiments support an underlying intuition: Constraint-based genera- | ||
+ | tion allows more control on the expected outcome at the expense of a higher computational | ||
+ | cost, whereas probabilistic generative models provide more faithful reconstructions but are less | ||
+ | controllable. This essentially means that, without any further modeling artifact (that we do | ||
+ | not consider here), generative models are prone to fail in providing tailored reconstructions | ||
+ | where some patterns can be suppressed and new ones introduced. By contrast, contraint-based | ||
+ | generation is more suitable for reconstructions "by design". | ||
+ | � duplicates comparison IFM-VAE w/o { t4,t6} support comparison IFM-VAE w/o { t4,t6} | ||
+ | 250 700 | ||
+ | 200 | ||
+ | 500 | ||
+ | 150 | ||
+ | 100 300 | ||
+ | |||
+ | 50 | ||
+ | 100 | ||
+ | 0 | ||
+ | t1 t2 t3 t5 t7 t8 t9 t10 1001 0011 -100 i1 i2 i3 i4 i1,i2 i1,i3 i1,i4 i3,i4 i2,i3 i2,i4 | ||
+ | original VAE w/o { t4,t6} IFM_DI w/o { t4,t6} original VAE w/o { t4,t6} IFM_DI 5% w/o { t4,t6} | ||
+ | |||
+ | |||
+ | |||
+ | Figure 5: Details of the discrepancies on the number of duplicates (left) and the itemsets support (right) | ||
+ | between VAE and IFM-based techniques, without transactions 𝑡4 and 𝑡6 . | ||
+ | |||
+ | |||
+ | 5. Conclusions | ||
+ | This paper has provided an overview about state-of-the-art approaches for synthetic transac- | ||
+ | tional data generation. A transaction has been modeled as a high-dimension sparse itemset, | ||
+ | that can be mapped into a binary vector, defined over the item space. The investigated algo- | ||
+ | rithms are (variants of the) Inverse Frequent Itemset Mining (IFM), and Probabilistic Generative | ||
+ | Models (PGMs). According to our analysis, the IFM approaches result to be extremely flexible | ||
+ | and understandable; they enable the control of the data generation procedure by the direct | ||
+ | identification of the discovery patterns to preserve. However, they proved to have extremely | ||
+ | onerous computational costs, making them not feasible in high-dimensional contexts. An | ||
+ | opposite conclusion has been obtained by analyzing PGMs: they are extremely fast and accurate, | ||
+ | but strongly lacking in control, flexibility and understandability. As future work, an interesting | ||
+ | research line is trying to investigate novel methodologies and techniques that are able to take | ||
+ | advantage of IFM and PGMs, by combining their strong points and mitigating their weakness. | ||
+ | Another promising research line is to apply the combination of the two approaches to NoSQL | ||
+ | applications by considering the extension of IFM that has been recently proposed in [12]. | ||
+ | |||
+ | |||
+ | References | ||
+ | [1] G. Manco, E. Ritacco, A. Rullo, D. Saccà, E. Serra, Machine learning methods for generating | ||
+ | high dimensional discrete datasets, WIREs Data Mining Knowl. Discov. 12 (2022). URL: | ||
+ | https://doi.org/10.1002/widm.1450. doi:10.1002/widm.1450. | ||
+ | [2] C. P. Chen, C.-Y. Zhang, Data-intensive applications, challenges, techniques and tech- | ||
+ | nologies: A survey on big data, Information Sciences 275 (2014) 314 – 347. doi:https: | ||
+ | //doi.org/10.1016/j.ins.2014.01.015. | ||
+ | [3] G. Weikum, Where’s the data in the big data wave?, 2013. ACM Sigmod Blog: | ||
+ | http://wp.sigmod.org/?p=786. | ||
+ | [4] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in | ||
+ | large databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on | ||
+ | Management of Data, SIGMOD ’93, ACM, New York, NY, USA, 1993, pp. 207–216. | ||
+ | � [5] J. Han, H. Cheng, D. Xin, X. Yan, Frequent pattern mining: current status and fu- | ||
+ | ture directions, Data Mining and Knowledge Discovery 15 (2007) 55–86. doi:10.1007/ | ||
+ | s10618-006-0059-1. | ||
+ | [6] L. Cagliero, P. Garza, Itemset generalization with cardinality-based constraints, Informa- | ||
+ | tion Sciences 244 (2013) 161 – 174. doi:https://doi.org/10.1016/j.ins.2013.05. | ||
+ | 008. | ||
+ | [7] T. Mielikainen, On inverse frequent set mining, in: Proceedings of 2nd Workshop on | ||
+ | Privacy Preserving Data Mining, PPDM ’03, IEEE Computer Society, Washington, DC, | ||
+ | USA, 2003, pp. 18–23. | ||
+ | [8] R. Agrawal, R. Srikant, Privacy-preserving data mining, in: Proceedings of the 2000 ACM | ||
+ | SIGMOD International Conference on Management of Data, SIGMOD ’00, ACM, New | ||
+ | York, NY, USA, 2000, pp. 439–450. doi:10.1145/342009.335438. | ||
+ | [9] A. Guzzo, D. Saccà, E. Serra, An effective approach to inverse frequent set mining, in: | ||
+ | Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, | ||
+ | IEEE Computer Society, Washington, DC, USA, 2009, pp. 806–811. doi:10.1109/ICDM. | ||
+ | 2009.123. | ||
+ | [10] A. Guzzo, L. Moccia, D. Saccà, E. Serra, Solving inverse frequent itemset mining with | ||
+ | infrequency constraints via large-scale linear programs, ACM Transactions on Knowledge | ||
+ | Discovery from Data 7 (2013) 18:1–18:39. doi:10.1145/2541268.2541271. | ||
+ | [11] D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen, Data mining, hypergraph transversals, | ||
+ | and machine learning, in: Proceedings of the 16-th ACM SIGMOD-SIGACT-SIGART | ||
+ | Symposium on Principles of Database Systems, PODS ’97, 1997, pp. 209–216. doi:10. | ||
+ | 1145/263661.263684. | ||
+ | [12] D. Saccà, E. Serra, A. Rullo, Extending inverse frequent itemsets mining to generate realistic | ||
+ | datasets: complexity, accuracy and emerging applications, Data Mining and Knowledge | ||
+ | Discovery 33 (2019) 1736–1774. doi:10.1007/s10618-019-00643-1. | ||
+ | [13] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: In Proceedings | ||
+ | of the International Conference on Unsupervised and Transfer Learning workshop (UTLW), | ||
+ | volume 27, PMLR, 2011, pp. 37–49. | ||
+ | [14] K. P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012. | ||
+ | [15] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisti- | ||
+ | cians, Journal of the American Statistical Association 112 (2017) 859–877. doi:10.1080/ | ||
+ | 01621459.2017.1285773. | ||
+ | [16] D. Liang, R. G. Krishnan, M. Hoffman, T. Jebara, Variational autoencoders for collaborative | ||
+ | filtering, in: Proceedings of the 2018 World Wide Web Conference, WWW ’18, 2018, pp. | ||
+ | 689–698. | ||
+ | [17] L. Theis, A. van den Oord, M. Bethge, A note on the evaluation of generative models, | ||
+ | in: International Conference on Learning Representations, 2016. doi:10.48550/arXiv. | ||
+ | 1511.01844. | ||
+ | [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, | ||
+ | Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing | ||
+ | Systems, volume 27, 2014. | ||
+ | [19] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: | ||
+ | Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 214–223. | ||
+ | � | ||
+ | </pre> |
Latest revision as of 17:52, 30 March 2023
Paper
Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper41 |
wikidataid | Q117344879→Q117344879 |
title | Generating Synthetic Discrete Datasets with Machine Learning |
pdfUrl | https://ceur-ws.org/Vol-3194/paper41.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/0001RRSS22 |
volume | Vol-3194→Vol-3194 |
session | → |
Generating Synthetic Discrete Datasets with Machine Learning
Generating Synthetic Discrete Datasets with Machine Learning (Discussion Paper) Giuseppe Manco1 , Ettore Ritacco1 , Antonino Rullo2 , Domenico Saccà2 and Edoardo Serra3 1 Institute for High Performance Computing and Networking (ICAR) of the Italian National Research Council (CNR), v. P. Bucci 8/9C, 87036 Rende (CS), Italy 2 DIMES Department, University of Calabria, Rende, CS, 87036, Italy 3 Computer Science Department, Boise State University, Boise, ID 83725, USA Abstract The real data are not always available/accessible/sufficient or in many cases they are incomplete and lacking in semantic content necessary to the definition of optimization processes. In this paper we discuss about the synthetic data generation under two different perspectives. The core common idea is to analyze a limited set of real data to learn the main patterns that characterize them and exploit this knowledge to generate brand new data. The first perspective is constraint-based generation and consists in generating a synthetic dataset satisfying given support constraints on the real frequent patterns. The second one is based on probabilistic generative modeling and considers the synthetic generation as a sampling process from a parametric distribution learned on the real data, typically encoded as a neural network (e.g. Variational Autoencoders, Generative Adversarial Networks). Keywords Synthetic dataset, Data generation, Inverse Frequent Itemset Mining, Constraints-based models, Varia- tional Autoencoder, Generative Adversarial Networks, Generative models This paper is an extended abstract of [1]. 1. Introduction Emerging “Big Data” platforms and applications call for the invention of novel data analysis techniques that are capable to effectively and efficiently handle large amount of data [2]. There is therefore an increasing need to use real-life datasets for data-driven experiments but the scarcity of significant datasets is a critical issue for research papers [3]. Synthetic data generation can help in this, by reproducing the internal mechanisms and dependencies that justify the occurrence of some specific pieces of information, and hence being able to replicate them stochastically. In this paper we focus on the problem of generating high-dimensional discrete data. Generating such data is a challenge because of the combination of both high dimensionality and discrete components, which may result in a complex structural domain with lots of variety and irregularities, not necessarily smooth. Throughout the paper, we shall study approaches to data generation which rely on the idea that each point in the real domain can be mapped into a SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) �Table 1 Transactional Dataset end Frequent Itemsets. 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 1 1 1 0 40 𝑆𝑖 𝑎 𝑏 𝑐 𝑑 𝜎 𝑆𝑖 0 1 1 1 40 𝑆1 1 1 0 0 100 1 1 0 0 60 𝑆2 0 1 1 0 100 0 1 1 0 20 𝑆3 0 0 1 1 50 1 0 1 1 10 (𝑏) Frequent Itemsets with threshold 50 (𝑎) Dataset 𝒟 suitable latent space and vice versa. These mappings guarantee a consistency with regards to the original dataset. At the same time, the manifold in the latent space summarizes the main characteristics of the data, that can hence be injected into the synthesized data in a controlled way. We first discuss two main approaches, that are the Inverse Frequent Itemset Mining (IFM) and probabilistic generative modeling (PGM); finally, a comparison of the results obtained with some of the presented algorithms are provided. 2. Inverse Frequent Itemset Mining-based Generative Models Let ℐ be a finite domain of 𝑛 elements, also called items. Any subset 𝐼 ⊆ ℐ is an itemset over ℐ, also called a transaction. Let 𝒰ℐ denote the set of all itemsets on ℐ; then, |𝒰ℐ | = 2𝑛 . A (transactional) database 𝒟 is a set of tuples [𝑘, 𝐼], where 𝑘 is the key and 𝐼 is an itemset. The size |𝒟| of 𝒟 is the total number of its itemsets, i.e., transactions. A transactional database 𝒟 is very often represented as a bag of itemsets, i.e, the keys are omitted so that tuples are simply itemsets and may therefore occur duplicated – in this case 𝒟 is also called a transactional dataset. In the paper we shall also represent an itemset 𝐼 ∈ 𝒟 by its one-hot encoding x , that is a binary vector of size 𝑛 (the number of items in ℐ) such that its 𝑖-th position 𝑥𝑖 = 1 if the 𝑖-th item in ℐ is in 𝐼, 0 otherwise. Consequently, 𝒳 = {x 1 , . . . , x 𝜂 }, where 𝜂 = |𝒟|, is the one-hot encoding of the whole dataset 𝒟. For each itemset 𝐼 ∈ 𝒟, there exist two important measures: (i) the number of duplicates of 𝐼, denoted as 𝛿 𝒟 (𝐼), that is the number of occurrences of 𝐼 in 𝒟, and (ii) the support of 𝐼, denoted as 𝜎 𝒟 (𝐼), that ∑︀ is the sum of the number of duplicates of each itemset 𝐽 in 𝒟 containing 𝐼, i.e., 𝜎 𝒟 (𝐼) = 𝐽∈𝒟∧𝐼⊆𝐽 𝛿 𝒟 (𝐽) . A dataset 𝒟 can be represented in a succinct format as a set of pairs (𝐼, 𝜎 𝒟 (𝐼)). Given ℐ = {𝑎, 𝑏, 𝑐, 𝑑}, an example of dataset in the succinct, one-hot format is shown in Table 1-𝑎. We say that 𝐼 is a frequent (resp., infrequent) itemset in 𝒟 if its support is greater than or equal to (resp., less than) a given threshold. A classical data mining task over transaction datasets is to detect the set of the frequent/infrequent itemsets, and a rich literature deals with this topic [4, 5, 6] Given the threshold 50, the frequent itemsets for the dataset of Table 1-𝑎 are listed in Table 1-𝑏. The perspective of the frequent itemset mining problem has been later inverted as follows: given a set of itemsets together with their frequency constraints the goal is to compute, if any, a transaction dataset satisfying the above constraints. This new problem is called the inverse frequent itemset mining problem (IFM) algorithms [7], and has been later investigated also in privacy preserving contexts [8]. Given a set ℐ of items, the IFM problem consists in finding a dataset 𝒟 that satisfies given support �Table 2 Examples of Feasible Datasets with size 170 for an IFM instance. 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 𝑎 𝑏 𝑐 𝑑 𝜎𝒟 1 1 1 0 70 1 1 1 0 40 1 1 1 1 30 0 1 1 1 10 0 1 1 1 40 1 1 1 0 30 1 1 0 0 30 1 1 0 0 60 0 1 1 1 20 0 1 1 0 20 0 1 1 0 20 1 1 0 0 40 0 0 1 1 40 0 0 1 1 10 0 1 1 0 50 (𝑎) Dataset 𝒟1 (𝑏) Dataset 𝒟2 (𝑐) Dataset 𝒟3 constraints on some itemsets 𝑆𝑖 on ℐ — the set of such itemsets is denoted by 𝑆. The support constraints are represented as follows: ∀𝑆𝑖 ∈ 𝑆 : 𝜎𝑚𝑖𝑛 𝑖 ≤ 𝜎 𝒟 (𝑆𝑖 ) ≤ 𝜎𝑚𝑎𝑥𝑖 , where 𝜎 𝒟 (𝑆𝑖 ) is the sum of all number of duplicates of itemsets in 𝒟 containing 𝐼. As an example, consider ℐ = {𝑎, 𝑏, 𝑐, 𝑑}, 𝑆 = {{𝑎, 𝑏}, {𝑏, 𝑐}, {𝑐, 𝑑}} and the support constraints represented in Table 1-𝑏 – in this example minimal and maximal supports coincide. The itemsets 𝑆1 = {𝑎, 𝑏} and 𝑆2 = {𝑏, 𝑐} must occur in exactly 100 transactions (possibly as their sub-transactions) whereas the itemset 𝑆3 = {𝑐, 𝑑} must occur in exactly 50 transactions. It is also required that the dataset size (i.e., the total number of transactions) be 170. The dataset 𝐷1 shown in Table 2-𝑎 is feasible as it satisfies all constraints: 𝑆1 is satisfied by the transactions {𝑎, 𝑏, 𝑐} and {𝑎, 𝑏}, 𝑆2 by the transactions {𝑎, 𝑏, 𝑐}, {𝑏, 𝑐, 𝑑} and {𝑏, 𝑐} and 𝑆3 by the transactions {𝑏, 𝑐, 𝑑} and {𝑐, 𝑑}. Let 𝑆 ′ be the set of all itemsets that are neither in 𝑆 nor subsets of some itemset in 𝑆. In the example, 𝑆 ′ consists of {𝑎, 𝑏, 𝑐, 𝑑}, {𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑑}, {𝑎, 𝑐, 𝑑}, {𝑏, 𝑐, 𝑑}, {𝑎, 𝑐}, {𝑎, 𝑑} and {𝑏, 𝑑}. IFM does not enforce any constraint on the itemsets in 𝑆 ′ and, therefore, it may happen that 𝒟 contains additional (and, in some cases, unsuspected or even undesired) frequent itemsets. In the dataset 𝒟1 of Table 2-𝑎, the itemset {𝑎, 𝑏, 𝑐} is in 𝑆 ′ but it turns out to be frequent with a support of 70. To remove the anomaly, Guzzo et al. [9] have proposed an alternative formulation, called IFM𝑆 , that requires that only itemsets in 𝑆 can be included as transactions in 𝒟 and, therefore, no unexpected frequent itemsets may eventually occur. Obviously, the decision complexity of this problem is lower as it is NP-complete. Despite the complexity improvement, the IFM𝑆 formulation has a severe drawback: it is too restrictive in excluding any transaction besides the ones in 𝑆 as confirmed by the fact that no feasible dataset exists for our running example. To weaken the tight restrictions of IFM𝑆 , Guzzo et al. [10] proposed a new formulation of the problem, called IFM with infrequency support constraints (IFMI for short), which admits transactions in 𝑆 ′ to be in a feasible dataset if their supports are below a given threshold 𝜎 ′ . By the anti-monotonicity property, the number of infrequency support constraints can be reduced by applying them only to a subset of 𝑆 ′ consisting of its minimal (inclusion-wise) elements. This subset, denoted by 𝐵𝑆 ′ , is called the negative border and coincides with the set of all minimal transversals of the hypergraph 𝐸 ¯ = {ℐ ∖ 𝐼 : 𝐼 ∈ 𝑆} (as defined in [11]). In the example, 𝐵𝑆 ′ = {{𝑎, 𝑐}, {𝑎, 𝑑}, {𝑏, 𝑑}} and the dataset 𝒟2 in Table 2-𝑏 is a feasible dataset for IFMI for 𝜎 ′ = 40. In fact, all infrequency support constraints on 𝐵𝑆 ′ are satisfied as the supports of {𝑎, 𝑐}, {𝑎, 𝑑}, {𝑏, 𝑑} are respectively 40, 0 and 40. Another possibility to enforce infrequency constraints is to fix a duplicate threshold 𝛿 ′ so that an itemset in 𝑆 ′ is admitted as transaction �in a feasible dataset if its number of occurrences is at most 𝛿 ′ . This formulation has been given in [12] with the name of IFM with infrequency duplicate constraints (IFMD for short). Observe that duplicate constraints are less restrictive than infrequency constraints in the sense that some itemset 𝐼 in 𝐵𝑆 ′ may happen to be eventually frequent as it may inherit the supports of several itemsets in 𝑆 ′ with duplicates below the threshold. For instance, given the threshold 𝛿 ′ = 30, the dataset 𝒟3 in Table 2-𝑐 is a feasible dataset for IFMD . However, the supports of {𝑎, 𝑐}, {𝑎, 𝑑}, {𝑏, 𝑑} are respectively 60, 30 and 50, thus {𝑎, 𝑐} and {𝑏, 𝑑} are frequent. 3. Machine Learning-based Generative Models The probabilistic approach to model transactional data (PGM - probabilistic generative modeling) assumes that in a database 𝒟 the itemsets are modeled as stochastic events: that is, they are sampled from an unknown true distribution P𝑟 . The analysis of the statistical distribution of the stochastic events provides insights on the mathematical rules governing the generation process. The problem hence becomes how to obtain a smooth and reliable estimate of P𝑟 . In general, it is convenient to use a parametric model to estimate P𝑟 when the constraints on the shape of the distribution are known. By associating each observation x with a probability measure 𝑃 (x |𝜃) ≡ 𝑃𝜃 (x ) where 𝜃 is the set of the distribution parameters, our problem hence becomes to devise the optimal parameter set 𝜃 that guarantees a reliable approximation P𝜃 ≈ P𝑟 that can emulate the sampling process x ∼ P𝑟 in a tractable and reliable way. A clear advantage of a parametric approaches to data generation lies in the insights that it can provide within the data generation process. They allow to detect the factors governing the data, providing a meaningful explanation of complex phenomena. In this work, we are focusing on two state-of-the-art probabilistic approaches: Variational Autoencoders and Generative Adversarial Networks. A Variational Autoencoder (VAE) is an artificial neural architecture that combines tra- ditional autoencoder architectures [13] with the concept of latent variable modeling [14]. Essentially, we can assume the existence of a a 𝐾-dimensional latent space 𝒵, that can be the generation engine of the samples in 𝒳 . The transactions 𝒳 = {x 1 , . . . , x 𝜂 } can be modeled through a chain dependency: (i) given a distribution 𝑃𝜃 (over the parameter set 𝜃) we can sample z ∈ 𝒵, and (ii) given a z and another distribution 𝑃𝜑 (over the parameter set 𝜑) we can sample x . According to the maximum likelihood principle, optimal values of 𝜃 and 𝜑 can be found by trying to maximizing 𝑃 (𝒳 ), but, unfortunately, this optimization is typically an intractable problem that requires the exploitation of heuristics. Variational inference [15] introduces a proposal normal distribution 𝑄𝜆 (z |x ) (over the parameter set 𝜆), whose purpose is to approximate the true posterior 𝑃 (z |x ). Hence, a VAE is devised by concatenating two neural networks: an “Encoder”, that maps an input x into a latent variable z , exploiting 𝑄𝜆 (z |x ), and a “Decoder” that reconstructs x by applying 𝑃𝜑 (x |z ) to z . The gain function, to be optimized to learn the network parameters 𝜆 and 𝜑, is obtained by marginalizing the log-likelihood of 𝑃 (𝒳 ) over z and applying Jensen’s inequality: log 𝑃 (x ) ≥ Ez ∼𝑄 [log 𝑃 (x |z )] − KL [𝑄(z |x )‖𝑃 (z )], where KL is the Kullback- Leibler divergence. The Decoder can be opportunely exploit to solve the discrete data generation problem. In fact, � Embedding plot Embedding plot Embedding plot 4 4 4 3 3 3 2 2 2 1 1 1 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 4 3 2 1 0 1 2 3 5 4 3 2 1 0 1 2 3 5 4 3 2 1 0 1 2 3 (a) Real data mapping in 𝒵 (b) Synthetic data generation (c) Synthetic data generation using MVAE using Clustering-variant of MVAE Figure 1: Illustration of the data generation process within VAE. we can sample as many z as we want, feed them to the Decoder and obtain brand new synthetic data that is similar to the real one, if the VAE was properly trained. A simple generation approach, called Multinomial VAE (MVAE), was proposed in [16], where x ∼ Multinomial (𝜋(z )), with 𝜋(z ) = softmax {exp [𝑃𝜑 (z )]} and z ∼ 𝒩 (0 , I 𝐾 ). A more sophisticated approach, that can better observe the latent space of z and overcome the strong bias of considering only one standard normal distribution, is implemented by clustering all the z ∼ 𝑄𝜆 (𝒳 ). The generation of a synthetic x starts by choosing a cluster 𝑐, by exploiting a multinomial sampling according to the clusters’ densities. Then, we can apply the MVAE sampling to pick up z ∼ 𝒩 (𝜇𝑐 , 𝜎 𝑐 ), where 𝜇𝑐 and 𝜎 𝑐 are the mean and standard deviation within 𝑐. A comparison of these two approaches is shown in Figure 1. Unfortunately, the approaches based on maximum likelihood have been shown to suffer from over generalization [17], especially when data from P𝑟 is limited. Generative Adversarial Networks (GANs) [18] propose an alternative modeling which departs from the maximum likelihood and instead focuses on an alternative optimization strategy. In order to optimize the weights 𝜃 of a neural network, called Generator G, able to learn the probability space P𝜃 , Adversarial Networks rely on an auxiliary classifier 𝐷, with weights 𝜑, trained to discriminate between real and generated data. In practice, optimality can be achieved when x ∼ P𝜃 is indistinguishable from x ∼ P𝑟 . The training process can be hence devised as a competitive game, with the generator trying to produce realistic samples, starting from random samples in the latent space 𝒵, and the classifier focusing on the detection of generated data, where the objective function is min𝜃 max𝜑 Ex ∼P𝑟 [log 𝐷𝜑 (x )] + Ex ∼P𝜃 [log(1 − 𝐷𝜑 (x ))]. It can be shown [18] that the adoption of this alternate optimization is equivalent to train 𝜃 to minimize the Jensen-Shannon divergence, that, differently from the approaches based on maximum likelihood, has the objective of a complete adherence of P𝑟 and P𝜃 . In our context, transactions are discrete vectors, with x ∈ {0, 1}𝑛 , thus backpropagation does not directly apply to G and a workaround is needed. The simplest one is to admit a continuous relaxation of the output of the generator. Just like with the variational autoencoder, the output of 𝐺 can be modeled a multinomial probability (with no direct sampling channel), rather than a binary vector. However, there is a major problem with this: the input of the discriminator would be a softmax distribution from the generated transactions, and a binary vector for the real transactions. As a result, the discriminator could easily tell them apart, with the result that the GAN would get stuck in an equilibrium that is not good for the Generator. Different formulations of the �Table 3 Originary experimental dataset. Tuple ID 𝑖1 𝑖2 𝑖3 𝑖4 nd itemset frequent support 𝑡1 1 0 1 1 210 𝑖1 y 640 𝑡2 0 1 0 1 140 𝑖2 y 550 𝑡3 1 1 1 0 110 𝑖3 y 450 𝑡4 1 0 0 0 100 𝑖4 y 500 𝑡5 1 1 0 0 90 𝑖1 , 𝑖2 y 270 𝑡6 0 0 0 1 80 𝑖1 , 𝑖3 y 380 𝑡7 0 1 1 0 70 𝑖1 , 𝑖4 y 280 𝑡8 1 1 0 1 70 𝑖3 , 𝑖4 n 210 𝑡9 0 1 0 0 70 𝑖2 , 𝑖3 n 180 𝑡10 1 0 1 0 60 𝑖2 , 𝑖4 n 210 adversarial training (based on Wasserstein distance [19], namely Wasserstein GANs or WGANs) can partially mitigate this issue. 4. Comparative Analysis In this section we compare the two approaches discussed in sections 2 and 3 by analyzing their behavior on a set of controlled experiments. For these, we use a toy dataset, shown in Table 3, comprising 4 items upon which 10 patterns are selected. The dataset used in the experiments is hence built by replicating such patterns with some fixed frequencies. We use the following approaches to generate the synthetic datasets: - IFM: IFM formulation with support ≥ 250. - IFM𝐼 : IFM formulation with infrequent itemsets’ support ≤ 210. - IFM𝐷 : IFM formulation with transaction duplicates ≤ 100. - IFM𝐷𝐼 : IFM𝐷 merged with IFM𝐼 . - IFM𝐷𝐼−5% : IFM𝐼 formulation imposing that each transaction has a number of duplicates that differ less than 5% from the number of duplicates in the original dataset. - VAE: Variational Auto Encoder generator. - VAE−{𝑡4 , 𝑡6 }: VAE without the generation of 𝑡4 and 𝑡6 transactions (as visible in Table 3) by means of sampling with rejection. - IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 }: IFM𝐷𝐼−5% formulation imposing the number of duplicates for 𝑡4 and 𝑡6 to be zero. The purpose of the experiments is to observe the reconstruction process in both methods and compare the resulting reconstructed datasets. The comparison relies on the transactions (𝑡1 . . . 𝑡10 in the table) as well as both simple items and item pairs. We evaluate the faithfulness of the reconstruction in two respects: (1) whether the patterns are reproduced, and (2) whether their frequencies are faithful. A simple∑︀ metric to measure the reconstruction accuracy is the |𝐷𝑖 −𝑂𝑖 | discrepancy 𝒮, computed as 𝒮 = 𝑖∑︀ , where, for a pattern 𝑖 (either a transaction or an 𝑖 𝑂𝑖 item pair), 𝐷𝑖 and 𝑂𝑖 represent the frequency of the pattern in the reconstructed and original �Table 4 Discrepancy. Patterns IFM IFM𝐼 IFM𝐷 IFM𝐷𝐼 IFM𝐷𝐼−5% VAE VAE−{𝑡4 , 𝑡6 } IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 } Transactions 33.00% 12.00% 48.00% 50.00% 4.80% 7.00% 36.00% 23.10% Items & item pairs 3.68% 0.82% 1.91% 0.82% 0.11% 3.02% 16.68% 4.71% Figure 2: Two-dimensional representation of the latent space dataset, respectively. Table 4 reports the values of discrepancy, which are further detailed in Figures 3-5. We first analyze the results of the reconstruction for a VAE-based generative model. Figure 3 shows a comparison with IFM𝐷𝐼−5% , the IFM-based formulation which provides the best performances. The reconstruction provided by IFM𝐷𝐼−5% is extremely faithful, both on the itemsets and the transactions. This because it enforces that the number of duplicates of each transaction differs less than 5% from the number of duplicates in the original dataset. Figure 2 shows the details of how the original patterns are mapped into the latent generative space: the leftmost picture shows the mapping of the original data, and the rightmost shows how the generation results from a larger region of the latent space. The main advantage of the approach based on generative modeling through latent variables is that the latter allows to control the reconstruction process. By acting on the latter we can modify the characteristics of the reconstructed space. For example, we see that transaction 𝑡11 (the only spurious transaction generated by VAE) is placed in a specific region, denoted by a red star in the figure. Sampling repeatedly from that region would allow us to change the overall distribution of the transactions while still maintaining the itemset distribution. By contrast, the IFM-based approaches are in general successful in maintaining the itemset distributions. However, they tend to produce a higher noise with transactions unless not explicitly constrained by the IFM𝐷𝐼−5% formulation (see Figure 4). This noise can in principle be considered an advantage in specific contexts where a differentiation from the original dataset is required (e.g., due to privacy concerns). In principle, the adoption of IFM allows to implement a reconstruction "by design", by choosing which itemsets to maintain or suppress. As an evidence, we report the cases of IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 } and VAE−{𝑡4 , 𝑡6 }, where transaction 𝑡4 and 𝑡6 are removed from the generation phases. In fact, Figure 5 shows that IFM𝐷𝐼−5% − {𝑡4 , 𝑡6 } keeps the number of � duplicates comparison IFM-VAE support comparison IFM-VAE 250 700 600 200 500 150 400 100 300 200 50 100 0 0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 1001 1011 0010 i1 i2 i3 i4 i1,i2 i1,i3 i1,i4 i3,i4 i2,i3 i2,i4 original VAE IFM_DI 5% original VAE IFM_DI 5% Figure 3: Details of the discrepancies on the number of duplicates (left) and the itemsets support (right) between VAE and IFM-based techniques. duplicates IFM-based techniques 250 200 150 100 50 0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 1001 0111 1011 0010 original IFM IFM_I IFM_D IFM_DI IFM_DI 5% support IFM-based techniques 700 600 500 400 300 200 100 0 i1 i2 i3 i4 i1,i2 i1,i3 i1,i4 i3,i4 i2,i3 i2,i4 original IFM IFM_I IFM_D IFM_DI IFM_DI 5% Figure 4: Details of the discrepancies on the number of duplicates (top) and the itemsets support (bottom) between the original dataset and IFM-based techniques. duplicates and the supports almost similar to the ones of the original dataset, while VAE−{𝑡4 , 𝑡6 } changes many of them to remove the two transactions. The approach based on generative modeling is in general more efficient. However, constraint-based generation is sensitive to the frequency threshold and a suitable tuning can make these approaches comparable. To summarize, these experiments support an underlying intuition: Constraint-based genera- tion allows more control on the expected outcome at the expense of a higher computational cost, whereas probabilistic generative models provide more faithful reconstructions but are less controllable. This essentially means that, without any further modeling artifact (that we do not consider here), generative models are prone to fail in providing tailored reconstructions where some patterns can be suppressed and new ones introduced. By contrast, contraint-based generation is more suitable for reconstructions "by design". � duplicates comparison IFM-VAE w/o { t4,t6} support comparison IFM-VAE w/o { t4,t6} 250 700 200 500 150 100 300 50 100 0 t1 t2 t3 t5 t7 t8 t9 t10 1001 0011 -100 i1 i2 i3 i4 i1,i2 i1,i3 i1,i4 i3,i4 i2,i3 i2,i4 original VAE w/o { t4,t6} IFM_DI w/o { t4,t6} original VAE w/o { t4,t6} IFM_DI 5% w/o { t4,t6} Figure 5: Details of the discrepancies on the number of duplicates (left) and the itemsets support (right) between VAE and IFM-based techniques, without transactions 𝑡4 and 𝑡6 . 5. Conclusions This paper has provided an overview about state-of-the-art approaches for synthetic transac- tional data generation. A transaction has been modeled as a high-dimension sparse itemset, that can be mapped into a binary vector, defined over the item space. The investigated algo- rithms are (variants of the) Inverse Frequent Itemset Mining (IFM), and Probabilistic Generative Models (PGMs). According to our analysis, the IFM approaches result to be extremely flexible and understandable; they enable the control of the data generation procedure by the direct identification of the discovery patterns to preserve. However, they proved to have extremely onerous computational costs, making them not feasible in high-dimensional contexts. An opposite conclusion has been obtained by analyzing PGMs: they are extremely fast and accurate, but strongly lacking in control, flexibility and understandability. As future work, an interesting research line is trying to investigate novel methodologies and techniques that are able to take advantage of IFM and PGMs, by combining their strong points and mitigating their weakness. Another promising research line is to apply the combination of the two approaches to NoSQL applications by considering the extension of IFM that has been recently proposed in [12]. References [1] G. Manco, E. Ritacco, A. Rullo, D. Saccà, E. Serra, Machine learning methods for generating high dimensional discrete datasets, WIREs Data Mining Knowl. Discov. 12 (2022). URL: https://doi.org/10.1002/widm.1450. doi:10.1002/widm.1450. [2] C. P. Chen, C.-Y. Zhang, Data-intensive applications, challenges, techniques and tech- nologies: A survey on big data, Information Sciences 275 (2014) 314 – 347. doi:https: //doi.org/10.1016/j.ins.2014.01.015. [3] G. Weikum, Where’s the data in the big data wave?, 2013. ACM Sigmod Blog: http://wp.sigmod.org/?p=786. [4] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD ’93, ACM, New York, NY, USA, 1993, pp. 207–216. � [5] J. Han, H. Cheng, D. Xin, X. Yan, Frequent pattern mining: current status and fu- ture directions, Data Mining and Knowledge Discovery 15 (2007) 55–86. doi:10.1007/ s10618-006-0059-1. [6] L. Cagliero, P. Garza, Itemset generalization with cardinality-based constraints, Informa- tion Sciences 244 (2013) 161 – 174. doi:https://doi.org/10.1016/j.ins.2013.05. 008. [7] T. Mielikainen, On inverse frequent set mining, in: Proceedings of 2nd Workshop on Privacy Preserving Data Mining, PPDM ’03, IEEE Computer Society, Washington, DC, USA, 2003, pp. 18–23. [8] R. Agrawal, R. Srikant, Privacy-preserving data mining, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, ACM, New York, NY, USA, 2000, pp. 439–450. doi:10.1145/342009.335438. [9] A. Guzzo, D. Saccà, E. Serra, An effective approach to inverse frequent set mining, in: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 806–811. doi:10.1109/ICDM. 2009.123. [10] A. Guzzo, L. Moccia, D. Saccà, E. Serra, Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs, ACM Transactions on Knowledge Discovery from Data 7 (2013) 18:1–18:39. doi:10.1145/2541268.2541271. [11] D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen, Data mining, hypergraph transversals, and machine learning, in: Proceedings of the 16-th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’97, 1997, pp. 209–216. doi:10. 1145/263661.263684. [12] D. Saccà, E. Serra, A. Rullo, Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications, Data Mining and Knowledge Discovery 33 (2019) 1736–1774. doi:10.1007/s10618-019-00643-1. [13] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: In Proceedings of the International Conference on Unsupervised and Transfer Learning workshop (UTLW), volume 27, PMLR, 2011, pp. 37–49. [14] K. P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012. [15] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisti- cians, Journal of the American Statistical Association 112 (2017) 859–877. doi:10.1080/ 01621459.2017.1285773. [16] D. Liang, R. G. Krishnan, M. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, WWW ’18, 2018, pp. 689–698. [17] L. Theis, A. van den Oord, M. Bethge, A note on the evaluation of generative models, in: International Conference on Learning Representations, 2016. doi:10.48550/arXiv. 1511.01844. [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, volume 27, 2014. [19] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 214–223. �