softmax, and the output of the softmax layer at any given time step t is a k-tuple giving the probability distribution across the k neurons of the output layer. We set the number of neurons k in the softmax layer to reflect the transaction counts observed across all individuals in the training data: as with any "forward-looking" approach, the model can only learn from events that are observed at some point during estimation. If, in the calibration period, individuals make only between zero and three transactions during any of the discrete time periods, then a softmax layer with four neurons is sufficient: the neurons' respective outputs represent the inferred probabilities of zero, one, two and three transactions.

With each vector read as input, the model's training objective is to predict the target variable, which in this self-supervised training setup is simply the input variable shifted by a single time step. Using the example from Table 2, given the sequence of input vectors starting with the first week of January, i.e. [1,January,1,F,0], [0,January,2,F,0], [1,January,3,F,1], ..., we train the model to output the target sequence 0,1,1,..., equal to the rightmost column in Table 2. With each input vector processed by the network, the internal memory component is trained to update a real-valued cell state vector that reflects the sequence of events thus far. We estimate the model parameters by minimizing the error between the predicted output and the actual target values via stochastic mini-batch gradient descent. At prediction time, we fix the model parameters (the weights and biases between the individual neurons of the deep neural network), but the cell state vector built into the LSTM "memory" component is still updated at each step with parts of the latest input, which helps the model capture very long-term transaction patterns.
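The sizing of the softmax layer and the shift-by-one construction of the training targets can be sketched as follows. This is a minimal illustration in plain NumPy; the toy count sequence and variable names are our own and are not taken from the paper:

```python
import numpy as np

# Illustrative weekly transaction counts for one customer.
counts = [1, 0, 1, 2, 0, 3, 1]

# Softmax layer size: one neuron per observed count level (0 .. max).
# If customers make at most three transactions per period, four neurons suffice.
k = max(counts) + 1  # -> 4: probabilities for 0, 1, 2 and 3 transactions

# Self-supervised targets: the input sequence shifted by a single time step.
inputs = counts[:-1]   # weeks 1 .. T-1 (fed into the network)
targets = counts[1:]   # weeks 2 .. T   (what the network must predict)

# One-hot encode the targets for a cross-entropy loss against the softmax output.
one_hot_targets = np.eye(k)[targets]
```

In an actual training run, `inputs` would be one component of the full input vectors (alongside calendar and static variables), and the loss would be minimized over mini-batches of many such customer sequences.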
Each prediction is generated by drawing a sample from the multinomial output distribution calculated by the bottom network layer; our model therefore does not produce point or interval estimates: each output is a simulated draw. Each time a draw from this multinomial distribution is made, the observation is fed back into the model as the new transaction-variable input in order to generate the prediction for the following time step, and so on, until we have created a predicted sequence of the desired length. This so-called autoregressive mechanism, in which an output value always becomes the new input, is illustrated in Fig. 2 by the dotted arrow bending from the output layer back to the input. Fig. 2 also shows that we first feed each input into a dedicated embedding layer (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). Using embeddings is not critical to our approach, but by creating efficient, dense (real-valued) vector representations of all variables they help separate useful signals from noise and condense the information before it reaches the memory component (see Chamberlain et al. (2017) for a similar approach). It should be highlighted that this setup of inputs with associated embeddings is completely flexible: it allows for the inclusion of any time-varying context or customer-specific static variables by simply adding more inputs together with their respective embedding layers.
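The autoregressive sample-and-feed-back loop can be sketched as follows. The `step` function below is a hypothetical stand-in for a trained LSTM cell (its toy dynamics are ours, not the paper's model); what the sketch does reproduce is the structure of the mechanism: draw a count from the multinomial output, feed it back as the next input, and carry a cell state forward:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of softmax neurons, i.e. count levels 0..3

def step(prev_count, state):
    """Stand-in for one forward pass of a trained LSTM: returns a
    probability distribution over K transaction counts and an updated
    cell state. The dynamics here are illustrative only."""
    logits = state + np.eye(K)[prev_count]
    probs = np.exp(logits) / np.exp(logits).sum()      # softmax
    new_state = 0.9 * state + 0.1 * np.eye(K)[prev_count]
    return probs, new_state

def simulate(first_count, horizon):
    """Autoregressive generation: each sampled count becomes the next
    input, as the dotted arrow in Fig. 2 indicates."""
    state = np.zeros(K)
    count, path = first_count, []
    for _ in range(horizon):
        probs, state = step(count, state)
        count = int(rng.choice(K, p=probs))  # draw from the multinomial output
        path.append(count)
    return path

path = simulate(first_count=1, horizon=12)
```

Because each step is a random draw rather than a point estimate, repeating `simulate` yields a distribution of plausible transaction paths per customer, which can then be summarized as needed.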