The simple solution of left-joining the customer table with the order table, consolidating all the information and then synthesising this big table will not work, for two reasons:
1. Each synthesised row will be a new customer and order. But in reality we have to fix the customer and only then generate details at the order level; in other words, fix the parent and generate the child. Synthesising the joined table row by row would leave us unable to constrain the number of products in any order, which is a key feature of the data that needs to be preserved.
2. The assumption of independence of rows is no longer valid. It still holds for the customer table, where each customer is independent of the others. However, the order details (the products bought) are not independent: a given Product X depends on the presence of other products in the basket. Some of these fields may even depend on time and should be treated as a sequence.
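The first point can be illustrated with a minimal sketch. The toy rows and the `naive_row_synthesis` helper below are hypothetical: the helper is only a stand-in for any synthesiser that treats rows of the joined table as independent, not a real library call.

```python
import random
from collections import Counter

# Hypothetical toy data: the customer table already left-joined with the
# order table, one row per product line.
flat_rows = [
    {"customer": "C1", "order": "O1", "product": "bread"},
    {"customer": "C1", "order": "O1", "product": "butter"},
    {"customer": "C1", "order": "O1", "product": "jam"},
    {"customer": "C2", "order": "O2", "product": "coffee"},
    {"customer": "C2", "order": "O2", "product": "milk"},
]


def basket_sizes(rows):
    """Return the sorted number of product lines per order."""
    return sorted(Counter(r["order"] for r in rows).values())


def naive_row_synthesis(rows, n, seed=0):
    """Stand-in for a row-independent synthesiser (an assumption for
    illustration): every output row is a brand-new customer and order."""
    rng = random.Random(seed)
    return [
        {
            "customer": f"SC{i}",
            "order": f"SO{i}",
            "product": rng.choice(rows)["product"],
        }
        for i in range(n)
    ]


real_sizes = basket_sizes(flat_rows)
synthetic_sizes = basket_sizes(naive_row_synthesis(flat_rows, 5))

print("real basket sizes:     ", real_sizes)
print("synthetic basket sizes:", synthetic_sizes)
```

In the real data the orders hold two and three products, while every synthetic order holds exactly one, so the number of products per order is lost.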
To address this particular case, let's consider two levels:
1. Customer level (name, address, etc.), which we call the “parent level”
2. Order details level (products, suppliers, etc.), which we call the “child level”

It is important to preserve this structure in the synthetic version because otherwise there could be misalignments and information leaks, such as orders without customers or customers with unrealistic orders. This can be seen as a particular case of sequential data.
Some other examples where sequential data is common include:
- electronic health records (EHR) data, such as diagnostics and exams
- messages sent and received between two or more agents
- measurements of physical systems taken over time
- credit card transactions
Note that the finest grain may not be a sequence, but the key insight is that the data has a structure that has to be preserved: rows are not independent.
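A minimal sketch of the "fix the parent, then generate the child" idea is shown below. It uses simple resampling as a stand-in for a real generative model, and the tables, field names and the `synthesise` function are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy tables: parent rows keyed by customer id, and the
# child rows (products bought) grouped under their parent.
customers = {"C1": {"address": "Address A"}, "C2": {"address": "Address B"}}
orders = {"C1": ["bread", "butter", "jam"], "C2": ["coffee", "milk"]}


def synthesise(customers, orders, n_customers, seed=0):
    """Generate parents first, then children conditioned on each parent.

    Resampling stands in for a trained model here (an assumption made
    to keep the sketch short).
    """
    rng = random.Random(seed)
    ids = sorted(customers)
    child_counts = [len(orders[c]) for c in ids]  # empirical basket sizes

    synth_customers, synth_orders = {}, []
    for i in range(n_customers):
        # Step 1: fix the parent.
        source = rng.choice(ids)
        new_id = f"S{i}"
        synth_customers[new_id] = dict(customers[source])

        # Step 2: generate the children for that parent, with a number of
        # children drawn from the counts observed in the real data.
        for _ in range(rng.choice(child_counts)):
            synth_orders.append(
                {"customer": new_id, "product": rng.choice(orders[source])}
            )
    return synth_customers, synth_orders


synth_customers, synth_orders = synthesise(customers, orders, n_customers=4)

# Structural checks: no order without a customer, realistic basket sizes.
per_customer = Counter(o["customer"] for o in synth_orders)
print(dict(per_customer))
```

Because every child row is created under an existing parent, there are no orders without customers, and the number of products per customer stays within the range seen in the real data.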