Edited, memorised or added to reading queue

on 12-Jul-2022 (Tue)


#abm #agent-based #machine-learning #model #priority
This makes the search for valid rules for agent behaviour one of the biggest challenges in agent-based modelling. To solve this problem of rules that are unknown to us, it is intuitive to use a form of machine learning to find them. This idea was explored in [24], where a framework for agent-based modelling was presented and used to replicate Schelling’s prominent segregation model [25]. The main idea of the framework is closely related to reinforcement learning [26], in the sense that agents learn how to behave to optimize their score or utility function. However, the goal is completely different. While reinforcement learning tries to find optimal solutions and provides the Neural Network with as much information as possible, the presented framework limits the available information to things the agents can actually perceive and also allows for non-optimal decisions. The goal is to emulate a realistic decision process, not to find an optimal solution.




#DataScience #nvidia-synthetic-data-report #synthetic
Synthetic data is divided into two types, based on whether it is generated from actual datasets or not




Flashcard 7103868177676

Tags
#DAG #causal #edx
Question
So all these methods for confounding adjustment -- stratification, matching, inverse probability weighting, G-formula, G-estimation -- have two things in common. First, they require data on the confounders that block the backdoor path. If those data are available, then the choice of one of these methods over the others is often a matter of personal taste. Unless the treatment is time-varying -- then we have to go to [...]
Answer
G-methods









Flashcard 7103870799116

Tags
#abm #agent-based #machine-learning #model #priority
Question
[...] To solve this problem of rules that are unknown to us, it is intuitive to use a form of machine learning to find them. This idea was explored in [24], where a framework for agent-based modelling was presented and used to replicate Schelling’s prominent segregation model [25]. The main idea of the framework is closely related to reinforcement learning [26], in the sense that agents learn how to behave to optimize their score or utility function. However, the goal is completely different. While reinforcement learning tries to find optimal solutions and provides the Neural Network with as much information as possible, the presented framework limits the available information to things the agents can actually perceive and also allows for non-optimal decisions. The goal is to emulate a realistic decision process, not to find an optimal solution.
Answer
This makes the search for valid rules for agent behaviour one of the biggest challenges in agent-based modelling.









#abm #agent-based #machine-learning #model #priority
Compared to the conventional approach to agent-based modelling, using this framework has various advantages. First and foremost, the most difficult task in developing an agent-based model, namely the definition of the rules and equations governing agent behaviour, is translated into the definition of the goals of each agent and of which parts of the system they can observe.






Flashcard 7103874993420

Tags
#abm #agent-based #machine-learning #model #priority #synergistic-integration
Question
The key feature of ML is its ability to automatically learn and improve based on data and empirical information without being [...], which is known as “self-learning.”
Answer
explicitly programmed









Flashcard 7103877614860

Tags
#DataScience #nvidia-synthetic-data-report #synthetic
Question
Synthetic data is divided into two types, based on whether it is generated [...] or not
Answer
from actual datasets









#abm #agent-based #machine-learning #model #priority
This makes the search for valid rules for agent behaviour one of the biggest challenges in agent-based modelling.






#feature-engineering #lstm #recurrent-neural-networks #rnn
Response models in direct marketing predict customer responses from past customer behavior and marketing activity. These models often summarize past events using features such as recency or frequency (e.g., Blattberg, Kim, & Neslin, 2008; Malthouse, 1999; Van Diepen, Donkers, & Franses, 2009), and the process of feature engineering has received significant attention




#feature-engineering #lstm #recurrent-neural-networks #rnn
In machine learning, a feature refers to a variable that describes some aspect of individual data objects (Dong & Liu, 2018). Feature engineering has been used broadly to refer to multiple aspects of feature creation, extraction, and transformation. Essentially, it refers to the process of using domain knowledge to create useful features that can be fed as predictors into a model.




[unknown IMAGE 7103892294924] #feature-engineering #has-images #lstm #recurrent-neural-networks #rnn
Fig. 1. Four customers with markedly different purchase patterns but identical features in terms of recency (last purchase), frequency (number of purchases), and seniority (first purchase).




#feature-engineering #lstm #recurrent-neural-networks #rnn
All four customers in the figure have the same seniority (date of first purchase), recency (date of last purchase), and frequency (number of purchases). However, each of them has a visibly different transaction pattern. A response model relying exclusively on seniority, recency, and frequency would not be able to distinguish between customers who have similar features but different behavioral sequences.




#feature-engineering #lstm #recurrent-neural-networks #rnn
In a complex environment where there are multiple streams of data, such as a data-rich environment where the analyst has access to historical marketing activity of various sorts (e.g., multiple types of solicitations sent through various marketing channels) and diverse customer behaviors (e.g., purchase histories across various product categories and sales channels) observed across different contexts (e.g., multiple business units or websites, see Park & Fader, 2004), the vast number and exponential complexity of inter-sequence and inter-temporal interactions (e.g., sequences of marketing actions, such as email–phone–catalog vs. catalog–email–phone) will make the data analyst's job arduous.




#feature-engineering #lstm #recurrent-neural-networks #rnn
When an analyst uses feature engineering to predict behavior, the performance of the model will depend greatly on the analyst's domain knowledge, and in particular, her ability to translate that domain knowledge into relevant features




#feature-engineering #lstm #recurrent-neural-networks #rnn
While LSTM models take raw behavioral data as input and therefore do not rely on feature engineering or domain knowledge, our experience taught us that some fine-tuning is required to achieve optimal LSTM performance.




[unknown IMAGE 7103902780684] #feature-engineering #has-images #lstm #recurrent-neural-networks #rnn
Fig. 2. Classic feedforward neural network (A), recurrent neural network (B), and “unrolled” graphical representation of a recurrent neural network (C), where we use sequence data (x1, x2, x3) to make sequence predictions (y1, y2, y3) while preserving information through the hidden states h1, h2, h3.




#feature-engineering #lstm #recurrent-neural-networks #rnn
Each module in the sequence is sometimes referred to as a timestep based on its position in the sequence. The RNN processes a sequence of input vectors (x1, x2, x3, …, xT), with each vector being input into the RNN model at its corresponding timestep, or position in the sequence. The RNN has a multidimensional hidden state, which summarizes task-relevant information from the entire history and is updated at each timestep as well.
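
To make the update rule concrete, here is a minimal sketch of the vanilla RNN recurrence in NumPy (an illustration added here, not code from the paper; all weights are random placeholders):

import numpy as np

T, d_in, d_h = 5, 3, 4                      # timesteps, input size, hidden size
W_x = np.random.randn(d_h, d_in)            # input-to-hidden weights
W_h = np.random.randn(d_h, d_h)             # hidden-to-hidden (recurrent) weights
b = np.zeros(d_h)

h = np.zeros(d_h)                           # initial hidden state
for x_t in np.random.randn(T, d_in):        # x1, x2, ..., xT
    h = np.tanh(W_x @ x_t + W_h @ h + b)    # hidden state updated at each timestep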




#feature-engineering #lstm #recurrent-neural-networks #rnn
Because of their typically high dimensionality, the hidden states of RNN models are usually more potent than those of hidden Markov models (e.g., Netzer, Lattin, & Srinivasan, 2008), which are commonly used in marketing to capture customer dynamics. An HMM has N discrete hidden states (where N is typically small) and therefore has only log2(N) bits of information available to capture the sequence history (Brown & Hinton, 2001). On the other hand, the RNN has distributed hidden states, which means that each input generally results in changes across all the hidden units of the RNN (Ming et al., 2017). RNNs combine a large number of distributed hidden states with nonlinear dynamics to update these hidden states, allowing them to have a more substantial representational capacity than an HMM.




#feature-engineering #lstm #recurrent-neural-networks #rnn

The learning mechanism of the recurrent neural network thus involves:

(1) the forward propagation step, where the cross-entropy loss is calculated;

(2) the backpropagation step, where the gradient of the parameters with respect to the loss is calculated; and finally,

(3) the optimization algorithm, which changes the parameters of the RNN based on the gradient (see the sketch below).
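
A minimal sketch of these three steps, assuming PyTorch (a library chosen here purely for illustration; the paper provides no code). The layer sizes and random data are placeholders:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)                       # two-class output per sequence
loss_fn = nn.CrossEntropyLoss()
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

x = torch.randn(32, 10, 8)                    # 32 sequences, 10 timesteps, 8 features
y = torch.randint(0, 2, (32,))                # one class label per sequence

out, h_n = rnn(x)                             # (1) forward propagation...
loss = loss_fn(head(out[:, -1, :]), y)        #     ...and cross-entropy loss
loss.backward()                               # (2) backpropagation: gradients w.r.t. the loss
optimizer.step()                              # (3) the optimizer updates the parameters
optimizer.zero_grad()                         # clear gradients for the next iteration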





#feature-engineering #lstm #recurrent-neural-networks #rnn
The RNN processes the entire sequence of available data without having to summarize it into features. Since customer transactions occur sequentially, they can be modeled as a sequence prediction task using an RNN as well, where all firm actions and customer responses are represented by elements in a vector. For instance, suppose a firm solicits customers either through phone, mail, or email (three channels), and customers may purchase across 17 product categories. All the analyst has to do is to encode each observation period (e.g., a day, a week, a month) as a vector of size 20, where all the values are equal to 0, except when a solicitation is sent, or a purchase is observed. If purchase seasonality is significant, e.g., if peaks in sales occur around Christmas, the analyst can also encode the current month using a one-hot vector of size 12, for a total vector length of 32 raw inputs.
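
A minimal sketch of that encoding (the channel and category indices are illustrative assumptions):

import numpy as np

N_CHANNELS, N_CATEGORIES, N_MONTHS = 3, 17, 12    # 3 + 17 + 12 = 32 raw inputs

def encode_period(solicited_channels, purchased_categories, month):
    """Encode one observation period (e.g., a week) as a vector of size 32."""
    v = np.zeros(N_CHANNELS + N_CATEGORIES + N_MONTHS)
    for c in solicited_channels:                  # e.g., 0=phone, 1=mail, 2=email
        v[c] = 1.0
    for p in purchased_categories:                # category indices 0..16
        v[N_CHANNELS + p] = 1.0
    v[N_CHANNELS + N_CATEGORIES + (month - 1)] = 1.0   # one-hot month for seasonality
    return v

# Example: an email solicitation and a purchase in category 5, in December
x_t = encode_period([2], [5], month=12)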




#feature-engineering #lstm #recurrent-neural-networks #rnn
For natural language processing, an RNN would encode the sentence “A black cat jumped on the table” as a sequence of seven vectors (x1, x2, …, x7), where each word would be represented as a single non-zero value in a sparse vector (Goodfellow et al., 2016). For instance, if we train a model with a vocabulary of 100,000 words, the first word “A” in the sentence would be encoded as a sparse vector of 100,000 numerical values, all equal to 0, except the first (corresponding to the word “A”), which would be equal to 1. The word “black” would be encoded as a sparse vector of 100,000 zeros, except the 12,853rd element (corresponding to the word “black”), which would be equal to 1, and so on.
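
A small sketch of this sparse one-hot encoding (the vocabulary indices follow the example above):

import numpy as np

VOCAB_SIZE = 100_000
vocab = {'a': 0, 'black': 12852}       # 0-based index of the 12,853rd element

def one_hot(word):
    v = np.zeros(VOCAB_SIZE, dtype=np.float32)
    v[vocab[word.lower()]] = 1.0       # a single non-zero value per word
    return v

x1, x2 = one_hot('A'), one_hot('black')   # first two vectors of the sentence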




#feature-engineering #lstm #recurrent-neural-networks #rnn
The dimensionality of the vector is often reduced through word embedding, a technique used in natural language processing that has little applicability to panel data analysis. We skip this discussion in the interest of space.




#feature-engineering #lstm #recurrent-neural-networks #rnn
While an RNN can carry forward useful information from one timestep to the next, it is much less effective at capturing long-term dependencies (Bengio, Simard, & Frasconi, 1994; Pascanu, Mikolov, & Bengio, 2013). This limitation turns out to be a crucial problem in marketing analytics.




#feature-engineering #lstm #recurrent-neural-networks #rnn
The effect of a direct mailing does not end after the campaign is over and the customer has made her decision to respond or not. An advertising campaign or customer retention program can impact customers' behaviors for several weeks, even months. Customers tend to remember past events, at least partially. Hence, the effects of marketing actions tend to carry over into numerous subsequent periods (Lilien, Rangaswamy, & De Bruyn, 2013; Schweidel & Knox, 2013; Van Diepen et al., 2009). The LSTM neural network, which we introduce next, is a kind of RNN that has been modified to effectively capture long-term dependencies in the data.




#feature-engineering #lstm #recurrent-neural-networks #rnn
The LSTM network forms a chain of repeating modules, like any RNN, but the modules, apart from the external recurrent function of the RNN, possess an internal recurrence (or self-loop), which lets the gradients flow for long durations without exploding or vanishing




#feature-engineering #lstm #recurrent-neural-networks #rnn
It is worth noting that though our study focuses on LSTM neural networks, there are other variants of the RNN as well, such as the Gated Recurrent Unit (GRU), which uses internal recurrence and a gating mechanism along with the external recurrence of the RNN (Cho et al., 2014; Chung, Gulcehre, Cho, & Bengio, 2014). However, research seems to suggest that none of the existing variants of the LSTM significantly improves on the vanilla LSTM neural network.




#feature-engineering #lstm #recurrent-neural-networks #rnn
At each timestep, we submit relevant variables x, such as marketing actions (e.g., solicitations), customer behaviour (e.g., purchase occurrences), and seasonality indicators (e.g., month), in the form of a vector of dummy variables. In our illustration, the y variable is a vector of size one that indicates whether the customer has purchased during the following period. However, the dependent variable can easily include multiple indicators.




#feature-engineering #lstm #recurrent-neural-networks #rnn
While training a model, the analyst aims at setting the parameters and hyperparameters such that the model reaches optimal capacity (Goodfellow et al., 2016) and therefore maximizes the chances that the model will generalize well to unseen data. Models with low capacity would underfit the training set and hence have a high bias. However, models with high capacity may overfit the training set and exhibit high variance. Representational capacity is the ability of the model to fit a wide range of functions. However, the effective capacity of a model might be lower than its representational capacity because of limitations and shortcomings, such as imperfect optimization or suboptimal hyperparameters (Goodfellow et al., 2016). To match the model's effective capacity to the complexity of the task at hand, the analyst needs to tune both the parameters and the hyperparameters of the model. Given how sensitive LSTM models are to hyperparameter tuning, this area requires particular attention.




#feature-engineering #lstm #recurrent-neural-networks #rnn
As the number of hyperparameters and their ranges grow, the search space becomes exponentially complex, and tuning the models manually or by grid search becomes impractical. Bayesian optimization for hyperparameter tuning proposes hyperparameters iteratively based on previous performance (Shahriari, Swersky, Wang, Adams, & De Freitas, 2015). We use Bayesian optimization to search the hyperparameter space for our model extensively.
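
A minimal sketch of such a tuning loop, assuming scikit-optimize's gp_minimize (the paper does not name its tooling); the objective below is a synthetic stand-in for training the LSTM and returning its validation loss:

from skopt import gp_minimize
from skopt.space import Integer, Real

def objective(params):
    hidden_units, learning_rate = params
    # placeholder for: train the model with these hyperparameters and
    # return the validation loss; a synthetic function keeps this runnable
    return (hidden_units - 64) ** 2 * 1e-4 + (learning_rate - 0.01) ** 2

search_space = [Integer(16, 256),                        # hidden units
                Real(1e-4, 1e-1, prior='log-uniform')]   # learning rate
result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print(result.x)   # best hyperparameters found so far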




#feature-engineering #lstm #recurrent-neural-networks #rnn
The logit model performs remarkably well at high lift values (i.e., 20%), whereas the random forest model shines at lower lift values (lift at 1%). This result might suggest that the best traditional model to deploy depends on the degree of targeting the analyst seeks. Random forest models are particularly good at identifying tiny niches of super-responsive donors, and therefore are well suited for ultra-precise targeting. Logit models generalize well. Hence, they appear more appropriate to target wider portions of the donor base.




#feature-engineering #lstm #recurrent-neural-networks #rnn
Interestingly, the LSTM model beats both benchmark models across the board and performs well at all lift levels.




#feature-engineering #lstm #recurrent-neural-networks #rnn
While an LSTM model does not depend on the analyst's ability to craft meaningful model features, traditional benchmarks do heavily rely on human expertise. Consequently, when an LSTM model shows superior results over a traditional response model—as we have shown in the previous illustration—we cannot ascertain whether it is due to the superiority of the LSTM model or to the poor performance of the analyst who designed the benchmark model. To alleviate that concern, we asked 297 graduate students in data science and business analytics from one of the top-ranked specialized masters in the world to compete in a marketing analytics prediction contest. Each author participated and submitted multiple models as well, for a total of 816 submissions. With the LSTM model competing against such a wide variety of human expertise and modelling approaches, it becomes easier to disentangle the model performance from its human component.




#feature-engineering #lstm #recurrent-neural-networks #rnn
Several students (and the instructor) included a relative measure of the time gap between donations compared to the contact's recency. For instance, an average time gap between donations of 300 days and a recency of 150 days gives a ratio of 0.5. A ratio close to 1 indicates perfect timing for a solicitation; a value above 1 indicates the donor might have churned. Variants included the introduction of standard deviations in the computations (confidence intervals) and mathematical transformations (log-transform, square root).
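
A sketch of that timing ratio (the function name is illustrative):

def timing_ratio(recency_days, avg_gap_days):
    """Close to 1: well-timed solicitation; above 1: the donor may have churned."""
    return recency_days / avg_gap_days

timing_ratio(150, 300)   # 0.5, as in the example above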




#feature-engineering #lstm #recurrent-neural-networks #rnn
Some contestants linked the donors' ZIP codes to publicly available Census Bureau data to infer educational attainment, income, number of children, age, etc., or linked donors' first names to the average age pyramid of those first names in the population to infer donors' ages.




#feature-engineering #lstm #recurrent-neural-networks #rnn
For this exercise, the authors developed two separate LSTM models. The first one predicted the likelihood that each donor was going to respond favorably to the solicitation (0/1), and we calibrated it on the entire calibration data (N = 61,928). The second LSTM model predicted the donation amount in case of donation, and we calibrated it on the individuals who donated in the calibration data (N = 6,456).




#feature-engineering #lstm #recurrent-neural-networks #rnn

For the purpose of sequence generation, we grouped the data in bimonthly increments, for a total of 24 steps per calendar year (e.g., January 1–15 is period 1, January 16–31 is period 2).

Both LSTM models used the same sequences of raw indicators as inputs, namely:

1. Online solicitation (0/1)
2. Offline solicitation (0/1)
3. Online, one-off donation (0/1)
4. Online, automatic deduction (0/1)
5. Offline, one-off donation (0/1)
6. Offline, automatic deduction (0/1)
7. One-off donation amount (0/log(amount))
8. Automatic deduction amount (0/log(amount))

For instance, if a contact donates 50 € by check on February 4, and is solicited by email on February 11, the sequence data for that period (3rd period of the year) indicates “online solicitation = 1,” “offline, one-off donation = 1,” and “one-off donation amount = log(50),” with all the other indicators equal to 0.
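
A minimal sketch of this encoding (field names are illustrative; the index layout follows the numbered list above):

import numpy as np

FIELDS = ['online_solicitation', 'offline_solicitation',
          'online_oneoff', 'online_autodeduct',
          'offline_oneoff', 'offline_autodeduct',
          'oneoff_amount', 'autodeduct_amount']

def encode(events):
    """Encode one bimonthly period as a vector of 8 raw indicators."""
    v = np.zeros(len(FIELDS))
    for name, value in events.items():
        v[FIELDS.index(name)] = value
    return v

# 3rd period of the year: 50 EUR one-off donation by check, email solicitation
period_3 = encode({'online_solicitation': 1,
                   'offline_oneoff': 1,
                   'oneoff_amount': np.log(50)})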

The only differences between the two independent LSTM models are: (a) the data we use to train the models and (b) the output functions. For the response model, the output is processed through a sigmoid function to ensure a probability between 0 and 1; for the amount model, the output is exponentiated to guarantee an amount prediction in the positive domain.





#feature-engineering #lstm #recurrent-neural-networks #rnn

Brand Choice and Market Share Forecasting Using Scanner Data

Demand forecasting for products within a category is a critical task for retailers and brand managers alike. The multinomial logit model (MNL) is commonly used to predict brand choice and market share using marketing-mix and loyalty variables (Guadagni & Little, 1983). Artificial feedforward neural networks (ANN) have also been shown to effectively predict household brand choices, as well as brand market shares (Agrawal & Schorling, 1996). Since brand choices can be modeled as sequential choices, and data complexity increases exponentially with the number of brands (with interaction effects), LSTM neural networks offer suitable alternatives. Similar to our studies, we could encode brand choices and the decision environment as we encoded solicitations and donations: as a multidimensional vector. We conjecture that testing the performance of LSTM neural networks in the context of brand choices would constitute an exciting replication area.





#feature-engineering #lstm #recurrent-neural-networks #rnn
The LSTM neural network topology is well-suited for modeling churn, especially in time-series format. However, its performance against standard churn prediction models remains an avenue for further research.




#feature-engineering #lstm #recurrent-neural-networks #rnn

Clickstream Data

Online retailers routinely use clickstream data to predict online customer behavior. These retailers observe the clickstream data from a panel of customers and use the history of customers' browsing behavior to make predictions about browsing behaviors, purchasing propensities, or consumer interests. Marketing academics have leveraged the clickstream data of a single website to model the evolution of website-visit behavior (Moe & Fader, 2004a) and purchase-conversion behavior (Moe & Fader, 2004b).





#feature-engineering #lstm #recurrent-neural-networks #rnn
Park and Fader (2004) leveraged internet clickstream data from multiple websites, such that relevant information from one website could be used to explain behavior on the other. The LSTM neural network would be well-suited for modeling online customer behavior across multiple websites since it can naturally capture inter-sequence and inter-temporal interactions from multiple streams of clickstream data without growing exponentially in complexity.




#feature-engineering #lstm #recurrent-neural-networks #rnn
Third, while LSTM models offer a markedly improved solution to the problem of exploding gradients (over vanilla RNN models), they are not guaranteed to be shielded from it entirely. Facing such an issue, the analyst might need to rely on computational tricks, such as gradient clipping (Bengio, 2012), gradient scaling, or batch normalization (Bjorck, Gomes, Selman, & Weinberger, 2018)
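
For instance, gradient clipping is a one-line change in Keras (a sketch; the model below is a placeholder added for illustration):

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Adam

model = Sequential([LSTM(10, input_shape=(20, 3)), Dense(1, activation='sigmoid')])
# clipnorm rescales any gradient whose L2 norm exceeds 1.0 before the update
model.compile(loss='binary_crossentropy', optimizer=Adam(clipnorm=1.0))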




#feature-engineering #lstm #recurrent-neural-networks #rnn
Finally, the field of deep learning in general, and recurrent neural networks in particular, is evolving rapidly. Many alternative model specifications and network architectures offer the promise of improvements over vanilla LSTM models, and they have already proven superior in some domains. Such alternative specifications include Gated Recurrent Units, BiLSTM (Siami-Namini, Tavakoli, & Namin, 2019), Multi-Dimensional LSTM (Graves & Schmidhuber, 2009), Neural Turing Machines (Graves, Wayne, & Danihelka, 2014), Attention-Based RNN and its various implementations (e.g., Bahdanau, Cho, & Bengio, 2014; Luong, Pham, & Manning, 2015), or Transformers (Vaswani et al., 2017). It is not clear that one architecture will systematically lead to the best possible performance. Lacking benchmarking studies, the analyst may be required to experiment with several models (although, as demonstrated in this paper, a simple LSTM model already provides excellent performance).




#feature-engineering #lstm #recurrent-neural-networks #rnn
In this paper, we have shown that recent neural network architectures, traditionally used in natural language processing and machine translation, could effectively do away with the complicated and time-consuming step of feature engineering, even when applied to highly structured problems such as predicting the future behaviors of a panel of customers. We apply the LSTM neural networks to predict customer responses in direct marketing and discuss its possible application in other contexts within marketing, such as market-share forecasting using scanner data, churn prediction, or predictions using clickstream data




#feature-engineering #lstm #recurrent-neural-networks #rnn
There are examples of authors who used 274 features to predict customer behaviors in a non-contractual setting. One of the authors, who has extensive industry experience, has built predictive models with 600 features and more. Feature engineering is not only a time-consuming process, it is also error-prone, complex, and highly dependent on the analyst's domain knowledge (or, sometimes, lack thereof). On the other hand, LSTM neural networks rely on raw, unsummarized data to predict customer behaviors and can be scaled easily to very complex settings involving multiple streams of data.




Flashcard 7103952588044

Tags
#DAG #causal #edx #has-images #inference
[unknown IMAGE 7096178707724]
Question
As you may have already noticed, the case-control design selects individuals based on their [...]. Women who did develop cancer are much more likely to be included in the study than women who did not develop cancer. Therefore, our causal graph will include a node for selection -- C -- an arrow from the outcome Y to C, and a box around C to indicate that the analysis is conditional on having been selected into the study, which means that we are only one arrow away from selection bias.
Answer
outcome









Flashcard 7103954423052

Tags
#DAG #causal #edx #has-images
[unknown IMAGE 7093205732620]
Question
For example, suppose that the causal DAG includes an unmeasured common cause of A and Y, U and also a [...] variable L that is an effect of U.
Answer
measured









Flashcard 7103956258060

Tags
#DAG #causal #edx
Question

Two sources of bias:

- common cause (confounding)

- [...] (selection bias)

Answer
conditioning on common effect









Flashcard 7103958093068

Tags
#DAG #causal #edx
Question

What is the backdoor path criterion?

This is a graphical rule that tells us whether we can identify the causal effect of interest if we know the causal DAG. And the rule is the following: we can identify the causal effect of A on Y if we have sufficient data to block all backdoor paths between A and Y. We sometimes refer to these variables that we use to eliminate the backdoor path as [...].

Answer
confounders









#DAG #causal #edx

What is the backdoor path criterion?

This is a graphical rule that tells us whether we can identify the causal effect of interest if we know the causal DAG.

And the rule is the following:

we can identify the causal effect of A on Y if we have sufficient data to block all backdoor paths between A and Y











[unknown IMAGE 7093205732620] #DAG #causal #edx #has-images
In those cases, it is generally better to adjust for L, because even though adjusting for L will not eliminate all confounding by U, it will typically eliminate some of the confounding by U




#RNN #ariadne #behaviour #consumer #deep-learning #priority #retail #simulation #synthetic-data
A past study [5] has shown that retailers use conventional techniques with available data to model consumer purchases. While these help in estimating purchase patterns for loyal consumers and high-selling items with reasonable accuracy, they do not perform well for the long tail. Since multiple parameters interact non-linearly to define consumer purchase patterns, traditional models are not sufficient to achieve high accuracy across thousands to millions of consumers.




#RNN #ariadne #behaviour #consumer #deep-learning #priority #retail #simulation #synthetic-data
Most retail/e-retail brands plan their short-term inventory (2-4 weeks ahead) based on consumer purchase patterns. Also, certain sales and marketing strategies, like Offer Personalization and personalized item recommendations, are driven by the results of consumer purchase predictions for the near future. Given that every demand planner works on a narrow segment of the item portfolio, there is high variability in the choices that different planners recommend. Additionally, the demand planners might not get enough opportunities to discuss their views and insights over their recommendations. Hence, subtle effects like cannibalization [21] and item affinity remain unaccounted for. Such inefficiencies lead to a gap between consumer needs and item availability, resulting in the loss of business opportunities in terms of consumer churn, out-of-stock items, and excess inventory.




#RNN #ariadne #behaviour #consumer #deep-learning #priority #retail #simulation #synthetic-data
In one such study [15], the authors develop a model for predicting whether a consumer performs a purchase in a prescribed future time frame based on historical purchase information, such as the number of transactions, the time of the last transaction, and the relative change in total spending of a consumer. They found gradient boosting to perform best over test data. We propose neural network architectures with entity embeddings [9] which outperform gradient boosting models like XGBoost [4].




#RNN #ariadne #behaviour #consumer #deep-learning #priority #retail #simulation #synthetic-data
From a neural network architectures perspective, close to our work is Deep Neural Network Ensembles for Time Series Classification [8]. In this paper, the authors show how an ensemble of multiple Convolutional Neural Networks can improve upon the state-of-the-art performance of individual neural networks. They use 6 deep learning classifiers, including Multi-Layer Perceptron, Fully Convolutional Neural Network, Residual Network, Encoder [20], Multi-Channels Deep Convolutional Neural Networks [29], and Time Convolutional Neural Network [28]. The first three were originally proposed in [24]. We propose the application of such architectures in the consumer choice world and apply the concept of entity embeddings [9] along with neural network architectures like Multi-Layer Perceptron, Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN) [13], and TCN-LSTM.




#deep-learning #keras #lstm #python #sequence

2.6 Keras Implementation of TBPTT

The Keras deep learning library provides an implementation of TBPTT for training recurrent neural networks. The implementation is more restricted than the general version listed above. Specifically, the k1 and k2 values are equal to each other and fixed. TBPTT(k1, k2), where k1=k2=k. This is realized by the fixed-sized three-dimensional input required to train recurrent neural networks like the LSTM.

The LSTM expects input data to have the dimensions: samples, time steps, and features. It is the second dimension of this input format, the time steps, that defines the number of time steps used for forward and backward passes on your sequence prediction problem
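
A minimal sketch of this in Keras (the shapes are illustrative): the second dimension of the input array fixes k for both the forward and backward passes.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

X = np.random.rand(100, 20, 3)            # samples, time steps, features -> k = 20
y = np.random.rand(100, 1)

model = Sequential([LSTM(10, input_shape=(20, 3)), Dense(1)])
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, batch_size=32, epochs=2, verbose=0)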





#deep-learning #keras #lstm #python #sequence

Careful choice must be given to the number of time steps specified when preparing your input data for sequence prediction problems in Keras.

The choice of time steps will influence both:

- The internal state accumulated during the forward pass.

- The gradient estimate used to update weights on the backward pass.

Note that by default, the internal state of the network is reset after each batch, but more explicit control over when the internal state is reset can be achieved by using a so-called stateful LSTM and calling the reset operation manually
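
A sketch of that explicit control, following the classic Keras stateful-LSTM idiom: with stateful=True the batch size is fixed, and the internal state persists across batches until reset_states() is called manually.

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(10, batch_input_shape=(32, 20, 3), stateful=True))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
# ...fit on successive chunks of a long sequence (shuffle=False), then:
model.reset_states()   # manually clear the internal state between sequences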





#deep-learning #keras #lstm #python #sequence
Truncated Backpropagation Through Time, or TBPTT, is a modified version of the BPTT training algorithm for recurrent neural networks where the sequence is processed one time step at a time and periodically an update is performed back for a fixed number of time steps




#deep-learning #keras #lstm #python #sequence

... a naive method that splits the 1,000-long sequence into 50 sequences (say) each of length 20 and treats each sequence of length 20 as a separate training case. This is a sensible approach that can work well in practice, but it is blind to temporal dependencies that span more than 20 time steps.

— Training Recurrent Neural Networks, 2013

This means that, as part of framing your problem, you must split long sequences into subsequences that are both long enough to capture relevant context for making predictions and short enough to train the network efficiently.
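
A sketch of the naive split described in the quote: a 1,000-step sequence becomes 50 training cases of 20 timesteps each.

import numpy as np

sequence = np.arange(1000)                  # one long sequence of 1,000 steps
subsequences = sequence.reshape(50, 20)     # 50 separate training cases, length 20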





#deep-learning #keras #lstm #python #sequence

Chapter 3 How to Prepare Data for LSTMs

3.0.1 Lesson Goal
The goal of this lesson is to teach you how to prepare sequence prediction data for use with LSTM models. After completing this lesson, you will know:

- How to scale numeric data and how to transform categorical data.
- How to pad and truncate input sequences with varied lengths.
- How to transform input sequences into a supervised learning problem.





#deep-learning #keras #lstm #python #sequence

3.1 Prepare Numeric Data

The data for your sequence prediction problem probably needs to be scaled when training a neural network, such as a Long Short-Term Memory recurrent neural network. When a network is fit on unscaled data that has a range of values (e.g. quantities in the 10s to 100s) it is possible for large inputs to slow down the learning and convergence of your network, and in some cases prevent the network from effectively learning your problem. There are two types of scaling of your series that you may want to consider: normalization and standardization. These can both be achieved using the scikit-learn machine learning library in Python





#deep-learning #keras #lstm #python #sequence

3.1.1 Normalize Series Data

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1. Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data. If your series is trending up or down, estimating these expected values may be difficult and normalization may not be the best method to use on your problem. If a value to be scaled is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1. You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values. You can normalize your dataset using the scikit-learn object MinMaxScaler





#deep-learning #keras #lstm #python #sequence
Good practice usage with the MinMaxScaler and other scaling techniques is as follows:

- Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.
- Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.
- Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions.

If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting. This can be done by calling the inverse_transform() function. Below is an example of normalizing a contrived sequence of 10 quantities. The scaler object requires data to be provided as a matrix of rows and columns. The time series data is loaded as a Pandas Series.




#deep-learning #keras #lstm #python #sequence
from pandas import Series
from sklearn.preprocessing import MinMaxScaler

# define contrived series
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
series = Series(data)
print(series)

# prepare data for normalization
values = series.values
values = values.reshape((len(values), 1))

# train the normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(values)
print('Min: %f, Max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))

# normalize the dataset and print
normalized = scaler.transform(values)
print(normalized)

# inverse transform and print
inversed = scaler.inverse_transform(normalized)
print(inversed)








#deep-learning #keras #lstm #python #sequence

3.1.2 Standardize Series Data

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. This can be thought of as subtracting the mean value or centering the data. Like normalization, standardization can be useful, and even required in some machine learning algorithms when your data has input values with differing scales. Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well behaved mean and standard deviation. You can still standardize your time series data if this expectation is not met, but you may not get reliable results.
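
A sketch of standardization with scikit-learn's StandardScaler, mirroring the MinMaxScaler example above (the contrived values are placeholders):

from math import sqrt
from numpy import array
from sklearn.preprocessing import StandardScaler

values = array([1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3]).reshape(-1, 1)
scaler = StandardScaler().fit(values)            # estimates mean and standard deviation
print('Mean: %f, StandardDeviation: %f' % (scaler.mean_[0], sqrt(scaler.var_[0])))
standardized = scaler.transform(values)          # rescaled to mean 0, stdev 1
inversed = scaler.inverse_transform(standardized)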





#deep-learning #keras #lstm #python #sequence

3.1.3 Practical Considerations When Scaling

There are some practical considerations when scaling sequence data.

Estimate Coefficients
You can estimate coefficients (min and max values for normalization or mean and standard deviation for standardization) from the training data. Inspect these first-cut estimates and use domain knowledge or domain experts to help improve these estimates so that they will be usefully correct on all data in the future.

Save Coefficients
You will need to scale new data in the future in exactly the same way as the data used to train your model. Save the coefficients used to file and load them later when you need to scale new data when making predictions.

Data Analysis
Use data analysis to help you better understand your data. For example, a simple histogram can help you quickly get a feeling for the distribution of quantities to see if standardization would make sense.

Scale Each Series
If your problem has multiple series, treat each as a separate variable and in turn scale them separately. Here, scale refers to a choice of scaling procedure such as normalization or standardization.

Scale At The Right Time
It is important to apply any scaling transforms at the right time. For example, if you have a series of quantities that is non-stationary, it may be appropriate to scale after first making your data stationary. It would not be appropriate to scale the series after it has been transformed into a supervised learning problem as each column would be handled differently, which would be incorrect.

Scale if in Doubt
You probably do need to rescale your input and output variables. If in doubt, at least normalize your data.





#deep-learning #keras #lstm #python #sequence
3.2.1 How to Convert Categorical Data to Numerical Data

This involves two steps:

1. Integer Encoding.
2. One Hot Encoding.

Integer Encoding

As a first step, each unique category value is assigned an integer value. For example, red is 1, green is 2, and blue is 3. This is called a label encoding or an integer encoding and is easily reversible. For some variables, this may be enough. The integer values have a natural ordered relationship between each other, and machine learning algorithms may be able to understand and harness this relationship. For example, ordinal variables like the place example above would be a good example where a label encoding would be sufficient.




#deep-learning #keras #lstm #python #sequence

One Hot Encoding

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
In this case, a one hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value. In the color variable example, there are 3 categories and therefore 3 binary variables are needed. A 1 value is placed in the binary variable for the color and 0 values for the other colors.
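
A sketch of both steps with scikit-learn (note that LabelEncoder assigns integers alphabetically, so the exact integer values differ from the red=1, green=2, blue=3 example above):

from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = array(['red', 'red', 'green', 'blue', 'green'])
integer_encoded = LabelEncoder().fit_transform(colors)    # blue=0, green=1, red=2
onehot_encoded = OneHotEncoder().fit_transform(
    integer_encoded.reshape(-1, 1)).toarray()             # one binary column per category
print(integer_encoded)
print(onehot_encoded)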





#deep-learning #keras #lstm #python #sequence
Sequence prediction problems must be re-framed as supervised learning problems. That is, the data must be transformed from a sequence into pairs of input and output values.
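
A minimal sketch of this re-framing using the Pandas shift() function (one common approach; the sequence is contrived):

from pandas import DataFrame

df = DataFrame({'X': [10, 20, 30, 40, 50]})
df['y'] = df['X'].shift(-1)     # the output is the next value in the sequence
df = df.dropna()                # the final step has no known output
print(df)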