Edited, memorised or added to reading queue

Annotation 7096105569548

#abm #agent-based #machine-learning #model #priority

This makes the search for valid rules for agent behaviour one of the biggest challenges in agent-based modelling. To solve this problem of rules that are unknown to us, it is intuitive to use a form of machine learning to find them. This idea was explored in [24], where a framework for agent-based modelling was presented and used to replicate Schelling’s prominent segregation model [25]. The main idea of the framework is closely related to reinforcement learning [26], in the sense that agents learn how to behave to optimize their score or utility function. However, the goal is completely different. While reinforcement learning tries to find optimal solutions and provides the neural network with as much information as possible, the presented framework limits the available information to things the agents can actually perceive and also allows for non-optimal decisions. The goal is to emulate a realistic decision process, not to find an optimal solution.

Annotation 7096175561996

#DataScience #nvidia-synthetic-data-report #synthetic

Synthetic data is divided into two types, based on whether it is generated from actual datasets or not

Flashcard 7103868177676

Tags

#DAG #causal #edx

Question

So all these methods for confounding adjustment -- stratification, matching, inverse probability weighting, G-formula, G-estimation -- have two things in common. First, they require data on the confounders that block the backdoor path. If those data are available, then the choice of one of these methods over the others is often a matter of personal taste. Unless the treatment is time-varying -- then we have to go to [...]

Answer

G-methods

Flashcard 7103870799116

Tags

#abm #agent-based #machine-learning #model #priority

Question

[...] To solve this problem of rules that are unknown to us, using a form of machine learning to find them is intuitive. This idea was explored in [24], where a framework for agent-based modelling was presented and used to replicate Schelling’s prominent segregation model [25]. The main idea of the framework is closely related to reinforcement learning [26], in the sense that agents learn how to behave to optimize their score or utility function. However, the goal is completely different. While reinforcement learning tries to find optimal solutions and provides the Neural Network with as much information as possible, the presented framework limits the available information to things the agents can actually perceive and also allows for non-optimal decisions. The goal is to emulate a realistic decision process, not find an optimal solution

Answer

This makes the search for valid rules for agent behaviour one of the biggest challenges in agent-based modelling.

Annotation 7103872896268

#abm #agent-based #machine-learning #model #priority

Compared to the conventional approach to agent-based modelling, using this framework has various advantages. First and foremost, the most difficult task in developing an agent-based model, namely the definition of the rules and equations governing agent behaviour, is translated into the definition of the goals of each agent and of which parts of the system they can observe.

Parent (intermediate) annotation

Open it
Compared to the conventional approach to agent-based modelling, using this framework has various advantages. First and foremost, the most difficult task in developing an agent-based model, namely the definition of the rules and equations governing agent behaviour is translated to the definition of the goals of each agent and which parts of the system they can observe. The connection between input and decision is then handled objectively by an Artificial Neural Network. This also means that the model is highly adaptive. If the goals of the agents, the

Flashcard 7103874993420

Tags

#abm #agent-based #machine-learning #model #priority #synergistic-integration

Question

The key feature of ML is its ability to automatically learn and improve based on data and empirical information without being [...], which is known as “self-learning.”

Answer

explicitly programmed

Flashcard 7103877614860

Tags

#DataScience #nvidia-synthetic-data-report #synthetic

Question

Synthetic data is divided into two types, based on whether it is generated [...] or not

Answer

from actual datasets

Annotation 7103880236300

#abm #agent-based #machine-learning #model #priority

This makes the search for valid rules for agent behaviour one of the biggest challenges in agent-based modelling.

Fig. 1. Four customers with markedly different purchase patterns but identical features in terms of recency (last purchase), frequency (number of purchases), and seniority (first purchase).

Annotation 7103894916364

#feature-engineering #lstm #recurrent-neural-networks #rnn

All four customers in the figure have the same seniority (date of first purchase), recency (date of last purchase), and frequency (number of purchases). However, each of them has a visibly different transaction pattern. A response model relying exclusively on seniority, recency, and frequency would not be able to distinguish between customers who have similar features but different behavioral sequences.

Annotation 7103896489228

#feature-engineering #lstm #recurrent-neural-networks #rnn

In a complex environment where there are multiple streams of data, such as a data-rich environment where the analyst has access to historical marketing activity of various sorts (e.g., multiple types of solicitations sent through various marketing channels) and diverse customer behaviors (e.g., purchase histories across various product categories and sales channels) observed across different contexts (e.g., multiple business units or websites, see Park & Fader, 2004), the vast number and exponential complexity of inter-sequence and inter-temporal interactions (e.g., sequences of marketing actions, such as email–phone–catalog vs. catalog–email–phone) will make the data analyst's job arduous.

Annotation 7103898062092

#feature-engineering #lstm #recurrent-neural-networks #rnn

When an analyst uses feature engineering to predict behavior, the performance of the model will depend greatly on the analyst's domain knowledge, and in particular, her ability to translate that domain knowledge into relevant features

Annotation 7103899634956

#feature-engineering #lstm #recurrent-neural-networks #rnn

While LSTM models take raw behavioral data as input and therefore do not rely on feature engineering or domain knowledge, our experience taught us that some fine-tuning is required to achieve optimal LSTM performance.

Annotation 7103901207820

#feature-engineering #has-images #lstm #recurrent-neural-networks #rnn

Fig. 2. Classic feedforward neural network (A), recurrent neural network (B), and “unrolled” graphical representation of a recurrent neural network (C) where we use sequence data (x_1, x_2, x_3) to make sequence predictions (y_1, y_2, y_3) while preserving information through the hidden states h_1, h_2, h_3.

Annotation 7103905402124

#feature-engineering #lstm #recurrent-neural-networks #rnn

The modules in the sequence are sometimes referred to as timesteps, based on their position in the sequence. The RNN processes a sequence of input vectors (x_1, x_2, x_3, …, x_T), with each vector being input into the RNN model at its corresponding timestep or position in the sequence. The RNN has a multidimensional hidden state, which summarizes task-relevant information from the entire history and is updated at each timestep as well.
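
As a rough illustration of the hidden-state update described above, the sketch below runs a deliberately tiny RNN with a one-dimensional input and a one-dimensional hidden state. The tanh update is the usual textbook form; the weights `w_x`, `w_h`, and `b` are arbitrary assumptions rather than values from the source, and a real model would use a multidimensional hidden state.

```python
import math

def rnn_forward(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Return the hidden state after each timestep of a scalar RNN.

    The weights are illustrative assumptions, not fitted values.
    """
    h = 0.0            # initial hidden state
    states = []
    for x in xs:       # one input per timestep
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# A single event at timestep 1, followed by two empty timesteps.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

The hidden state stays non-zero at timesteps 2 and 3 even though the inputs are zero there, which is the sense in which the state carries information about the history forward.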

Annotation 7103906974988

#feature-engineering #lstm #recurrent-neural-networks #rnn

Because of their typically high dimensionality, the hidden states of RNN models are usually more potent than those of hidden Markov models (e.g., Netzer, Lattin, & Srinivasan, 2008), which are commonly used in marketing to capture customer dynamics. The HMM has N discrete hidden states (where N is typically small) and, therefore, has only log_2(N) bits of information available to capture the sequence history (Brown & Hinton, 2001). On the other hand, the RNN has distributed hidden states, which means that each input generally results in changes across all the hidden units of the RNN (Ming et al., 2017). RNNs combine a large number of distributed hidden states with nonlinear dynamics to update these hidden states, which gives them a more substantial representational capacity than an HMM.
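
To make the log_2(N) figure concrete, the short sketch below computes the number of bits available to an HMM for a few state counts. The function name `hmm_history_bits` and the chosen values of N are illustrative assumptions.

```python
import math

def hmm_history_bits(n_states):
    """Bits needed to identify one of n_states discrete hidden states."""
    return math.log2(n_states)

for n in (2, 4, 16):
    print(n, "states ->", hmm_history_bits(n), "bits")
```

Doubling the number of discrete states adds only one bit, which is why a small N leaves little room to summarize a long sequence history.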

Annotation 7103908547852

#feature-engineering #lstm #recurrent-neural-networks #rnn

The learning mechanism of the recurrent neural network thus involves:

(1) the forward propagation step, where the cross-entropy loss is calculated;

(2) the backpropagation step, where the gradient of the loss with respect to the parameters is calculated; and finally,

(3) the optimization algorithm, which changes the parameters of the RNN based on the gradient.
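
The three steps can be sketched on a much simpler model than an RNN. The example below is an assumption-laden stand-in: it trains a single sigmoid unit (not a recurrent network) with plain gradient descent, only to show where the forward pass, the gradient computation, and the parameter update sit in a training loop. The learning rate and the single training example are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y, lr=0.1):
    """One training iteration for a single sigmoid unit."""
    # (1) Forward propagation: prediction and cross-entropy loss.
    p = sigmoid(w * x + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # (2) Backpropagation: gradient of the loss w.r.t. each parameter.
    grad_w = (p - y) * x
    grad_b = p - y
    # (3) Optimization: plain gradient descent update.
    return w - lr * grad_w, b - lr * grad_b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=1.0, y=1)
    losses.append(loss)

print(losses[0], losses[-1])
```

In an actual RNN, step (2) is backpropagation through time across all timesteps, and step (3) is usually a more elaborate optimizer, but the division of labour is the same.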

Annotation 7103910120716

#feature-engineering #lstm #recurrent-neural-networks #rnn

The RNN processes the entire sequence of available data without having to summarize it into features. Since customer transactions occur sequentially, they can be modeled as a sequence prediction task using an RNN as well, where all firm actions and customer responses are represented by elements in a vector. For instance, suppose a firm solicits customers either through phone, mail, or email (three channels), and customers may purchase across 17 product categories. All the analyst has to do is to encode each observation period (e.g., a day, a week, a month) as a vector of size 20, where all the values are equal to 0, except when a solicitation is sent, or a purchase is observed. If purchase seasonality is significant, e.g., if peaks in sales occur around Christmas, the analyst can also encode the current month using a one-hot vector of size 12, for a total vector length of 32 raw inputs.
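
The encoding described above can be sketched as follows. The channel order, the 0-based category indices, and the function name `encode_period` are assumptions made for illustration; only the sizes (3 channels, 17 categories, 12 months) come from the text.

```python
CHANNELS = ["phone", "mail", "email"]  # three solicitation channels
N_CATEGORIES = 17                      # product categories
N_MONTHS = 12

def encode_period(solicitations, purchased_categories, month=None):
    """Encode one observation period as a list of 0/1 values.

    solicitations: channel names used during the period
    purchased_categories: category indices (0..16) bought during the period
    month: 1..12 to append a one-hot month indicator, or None to omit it
    """
    vec = [0] * (len(CHANNELS) + N_CATEGORIES)
    for channel in solicitations:
        vec[CHANNELS.index(channel)] = 1
    for category in purchased_categories:
        vec[len(CHANNELS) + category] = 1
    if month is not None:
        one_hot = [0] * N_MONTHS
        one_hot[month - 1] = 1
        vec += one_hot
    return vec

# An email solicitation and a purchase in category 4, observed in December.
x = encode_period(["email"], [4], month=12)
print(len(x), sum(x))
```

A customer's history is then simply the list of such vectors, one per observation period, fed to the network in chronological order.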

Annotation 7103911693580

#feature-engineering #lstm #recurrent-neural-networks #rnn

For natural language processing, an RNN would encode the sentence “A black cat jumped on the table” as a sequence of seven vectors (x_1, x_2, …, x_7), where each word would be represented as a single non-zero value in a sparse vector (Goodfellow et al., 2016). For instance, if we train a model with a vocabulary of 100,000 words, the first word “A” in the sentence would be encoded as a sparse vector of 100,000 numerical values, all equal to 0, except the first (corresponding to the word “A”), which would be equal to 1. The word “black” would be encoded as a sparse vector of 100,000 zeros, except the 12,853rd element (corresponding to the word “black”), which would be equal to 1, and so on.
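
A minimal sketch of this one-hot encoding is shown below. The positions of “a” and “black” follow the example in the text (converted to 0-based indices); the `word_index` dictionary itself is a hypothetical stand-in for a real vocabulary.

```python
VOCAB_SIZE = 100_000

# Hypothetical vocabulary lookup (0-based positions).
# The 1st element is index 0; the 12,853rd element is index 12,852.
word_index = {"a": 0, "black": 12_852}

def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given position."""
    vec = [0] * size
    vec[index] = 1
    return vec

sentence = ["a", "black"]
vectors = [one_hot(word_index[word], VOCAB_SIZE) for word in sentence]
print(len(vectors[1]), vectors[1].index(1))
```

In practice such vectors are stored as indices rather than as dense lists, since all but one entry is zero.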

Annotation 7103913266444

#feature-engineering #lstm #recurrent-neural-networks #rnn

The dimensionality of the vector is often reduced through word embedding, a technique used in natural language processing, and with little applicability to panel data analysis. We skip this discussion in the interest of space

Annotation 7103914839308

#feature-engineering #lstm #recurrent-neural-networks #rnn

While an RNN can carry forward useful information from one timestep to the next, it is much less effective at capturing long-term dependencies (Bengio, Simard, & Frasconi, 1994; Pascanu, Mikolov, & Bengio, 2013). This limitation turns out to be a crucial problem in marketing analytics.

Annotation 7103916412172

#feature-engineering #lstm #recurrent-neural-networks #rnn

The effect of a direct mailing does not end after the campaign is over and the customer has made her decision to respond or not. An advertising campaign or customer retention program can impact customers' behaviors for several weeks, even months. Customers tend to remember past events, at least partially. Hence, the effects of marketing actions tend to carry over into numerous subsequent periods (Lilien, Rangaswamy, & De Bruyn, 2013; Schweidel & Knox, 2013; Van Diepen et al., 2009). The LSTM neural network, which we introduce next, is a kind of RNN that has been modified to effectively capture long-term dependencies in the data.

Annotation 7103917985036

#feature-engineering #lstm #recurrent-neural-networks #rnn

The LSTM network forms a chain of repeating modules, like any RNN, but the modules, apart from the external recurrent function of the RNN, possess an internal recurrence (or self-loop), which lets the gradients flow for long durations without exploding or vanishing
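
The internal self-loop can be illustrated with a scalar LSTM cell. The sketch below uses the standard textbook gate equations, which the excerpt itself does not spell out, and every weight and bias in `params` is an arbitrary assumption chosen so that the forget gate stays nearly open and the input gate nearly closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, bias = p[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write
    c = f * c_prev + i * g           # internal recurrence (self-loop)
    h = o * math.tanh(c)             # hidden state passed to the next step
    return h, c

# Illustrative parameters: forget gate ~1, input gate ~0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for _ in range(10):
    h, c = lstm_step(0.0, h, c, params)
print(c)
```

With these settings the cell state remains close to its initial value of 1 after ten steps, because the update is an almost pure copy of the previous state. That additive path is what lets information, and gradients, persist over long spans.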

Annotation 7103919557900

#feature-engineering #lstm #recurrent-neural-networks #rnn

It is worth noting that although our study focuses on LSTM neural networks, there are other variants of the RNN as well, such as the Gated Recurrent Unit (GRU), which uses internal recurrence and a gating mechanism along with the external recurrence of the RNN (Cho et al., 2014; Chung, Gulcehre, Cho, & Bengio, 2014). However, research seems to suggest that none of the existing variants of the LSTM significantly improves on the vanilla LSTM neural network.

Annotation 7103921130764

#feature-engineering #lstm #recurrent-neural-networks #rnn

At each timestep, we submit relevant variables x, such as marketing actions (e.g., solicitations), customer behaviour (e.g., purchase occurrences), and seasonality indicators (e.g., month), in the form of a vector of dummy variables. In our illustration, the y variable is a vector of size one that indicates whether the customer has purchased during the following period. However, the dependent variable can easily include multiple indicators.

Annotation 7103922703628

#feature-engineering #lstm #recurrent-neural-networks #rnn

While training a model, the analyst aims at setting the parameters and hyperparameters such that the model reaches optimal capacity (Goodfellow et al., 2016) and therefore maximizes the chances that the model will generalize well to unseen data. Models with low capacity would underfit the training set and hence have a high bias. However, models with high capacity may overfit the training set and exhibit high variance. Representational capacity is the ability of the model to fit a wide range of functions. However, the effective capacity of a model might be lower than its representational capacity because of limitations and shortcomings, such as imperfect optimization or suboptimal hyperparameters (Goodfellow et al., 2016). To better match the model's effective capacity to the complexity of the task at hand, the analyst needs to tune both the parameters and the hyperparameters of the model. Given how sensitive LSTM models are to hyperparameter tuning, this area requires particular attention.

Annotation 7103924276492

#feature-engineering #lstm #recurrent-neural-networks #rnn

As the number of hyperparameters and their range grow, the search space becomes exponentially complex, and tuning the models manually or by grid-search becomes impractical. Bayesian optimization for hyperparameter tuning provides hyperparameters (step 1) iteratively based on previous performance (Shahriari, Swersky, Wang, Adams, & De Freitas, 2015). We use Bayesian optimization to search the hyperparameter space for our model extensively.
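
The growth of the search space can be made concrete by counting grid points. The hyperparameter names and candidate values below are purely hypothetical and are not the ones used in the source; the point is only that the number of configurations is the product of the number of values per hyperparameter.

```python
import math

# Hypothetical grid: four hyperparameters with five candidate values each.
grid = {
    "hidden_units": [16, 32, 64, 128, 256],
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.5],
    "batch_size": [16, 32, 64, 128, 256],
}

def grid_size(g):
    """Number of configurations an exhaustive grid search would train."""
    return math.prod(len(values) for values in g.values())

n_configs = grid_size(grid)
print(n_configs)  # 5 ** 4 = 625
```

Each additional five-valued hyperparameter multiplies the count by five, which is why an approach that chooses the next configuration based on earlier results becomes attractive.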

Annotation 7103926111500

#feature-engineering #lstm #recurrent-neural-networks #rnn

The logit model performs remarkably well at high lift values (i.e., 20%), whereas the random forest model shines at lower lift values (lift at 1%). This result might suggest that the best traditional model to deploy depends on the degree of targeting the analyst seeks. Random forest models are particularly good at identifying tiny niches of super-responsive donors, and therefore are well suited for ultra-precise targeting. Logit models generalize well. Hence, they appear more appropriate to target wider portions of the donor base.

Annotation 7103927684364

#feature-engineering #lstm #recurrent-neural-networks #rnn

Interestingly, the LSTM model beats both benchmark models across the board and performs well at all lift levels.

Annotation 7103929257228

#feature-engineering #lstm #recurrent-neural-networks #rnn

While an LSTM model does not depend on the analyst's ability to craft meaningful model features, traditional benchmarks do heavily rely on human expertise. Consequently, when an LSTM model shows superior results over a traditional response model, as we have shown in the previous illustration, we cannot ascertain whether it is due to the superiority of the LSTM model or to the poor performance of the analyst who designed the benchmark model. To alleviate that concern, we asked 297 graduate students in data science and business analytics from one of the top-ranked specialized masters in the world to compete in a marketing analytics prediction contest.

Each author participated and submitted multiple models as well, for a total of 816 submissions. With the LSTM model competing against such a wide variety of human expertise and modelling approaches, it becomes easier to disentangle the model performance from its human component

Annotation 7103930830092

#feature-engineering #lstm #recurrent-neural-networks #rnn

Several students (and the instructor) included a relative measure of the time gap compared to the contact's recency. For instance, an average time gap between donations of 300 days and a recency of 150 days give a ratio of 0.5. A ratio close to 1 indicates perfect timing for a solicitation. A value above 1 indicates the donor might have churned. Variants included the introduction of standard deviations in the computations (confidence interval) and mathematical transformations (log-transform, square root).
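
A minimal sketch of this feature is given below, assuming the ratio is recency divided by the average gap between consecutive donations (which reproduces the 150 / 300 = 0.5 example). The function names and the sample dates are made up for illustration.

```python
from datetime import date, timedelta

def average_gap_days(donation_dates):
    """Mean number of days between consecutive donations (needs >= 2 dates)."""
    ordered = sorted(donation_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

def timing_ratio(donation_dates, today):
    """Recency divided by the donor's average gap between donations."""
    recency = (today - max(donation_dates)).days
    return recency / average_gap_days(donation_dates)

# Made-up donor: three donations 300 days apart, last one 150 days ago.
first = date(2020, 1, 1)
donations = [first, first + timedelta(days=300), first + timedelta(days=600)]
today = first + timedelta(days=750)

ratio = timing_ratio(donations, today)
print(ratio)  # 150 / 300 = 0.5
```

This is exactly the kind of hand-crafted feature that the LSTM approach is meant to make unnecessary, since the raw sequence already contains the gaps.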

Annotation 7103932402956

#feature-engineering #lstm #recurrent-neural-networks #rnn

Some contestants linked the donors' ZIP codes to publicly-available Census bureau data to infer education attainment, income, number of children, age, etc., or have linked donors' first names to the average age pyramid of said first names in the population to infer donors' age

Annotation 7103933975820

#feature-engineering #lstm #recurrent-neural-networks #rnn

For this exercise, the authors developed two separate LSTM models. The first one predicted the likelihood that each donor was going to respond favorably to the solicitation (0/1), and we calibrated it on the entire calibration data (N = 61,928). The second LSTM model predicted the donation amount in case of donation, and we calibrated it on the individuals who donated in the calibration data (N = 6,456).

Annotation 7103935548684

#feature-engineering #lstm #recurrent-neural-networks #rnn

For the purpose of sequence generation, we grouped the data in bimonthly increments, i.e., two periods per month, for a total of 24 steps per calendar year (e.g., January 1–15 is period 1, January 16–31 is period 2).

Both LSTM models used the same sequences of raw indicators as inputs, namely:

1. Online solicitation (0/1)
2. Offline solicitation (0/1)
3. Online, one-off donation (0/1)
4. Online, automatic deduction (0/1)
5. Offline, one-off donation (0/1)
6. Offline, automatic deduction (0/1)
7. One-off donation amount (0/log(amount))
8. Automatic deduction amount (0/log(amount))

For instance, if a contact donates 50 € by check on February 4, and is solicited by email on February 11, the sequence data for that period (3rd period of the year) indicates “online solicitation = 1,” “offline, one-off donation = 1,” and “one-off donation amount = log(50),” with all the other indicators equal to 0.
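
The worked example above can be reproduced with a short sketch. The indicator names, the `period_of_year` and `encode` helpers, and the year used for the dates are assumptions; the split after the 15th of each month and the eight indicators follow the text, and `math.log` is the natural logarithm (the text does not state the base).

```python
import math
from datetime import date

# The eight raw indicators, in the order listed in the text.
INDICATORS = [
    "online_solicitation", "offline_solicitation",
    "online_one_off", "online_automatic",
    "offline_one_off", "offline_automatic",
    "one_off_amount", "automatic_amount",
]

def period_of_year(d):
    """Return 1..24: two periods per month, split after the 15th."""
    return (d.month - 1) * 2 + (1 if d.day <= 15 else 2)

def encode(events):
    """Build one period's input vector from (indicator, value) pairs."""
    values = dict.fromkeys(INDICATORS, 0.0)
    for name, value in events:
        values[name] = value
    return [values[name] for name in INDICATORS]

# 50 EUR check on February 4, email solicitation on February 11.
donation_day = date(2021, 2, 4)
solicitation_day = date(2021, 2, 11)
row = encode([
    ("offline_one_off", 1.0),
    ("one_off_amount", math.log(50)),
    ("online_solicitation", 1.0),
])
print(period_of_year(donation_day), period_of_year(solicitation_day), row)
```

Both events fall in period 3, so they end up in the same input vector.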

The only differences between the two independent LSTM models are: (a) the data we use to train the models and (b) the output functions. For the response model, the output is processed through a sigmoid function to ensure a probability between 0 and 1; for the amount model, the output is exponentiated to guarantee an amount prediction in the positive domain.
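
The two output functions can be sketched as follows. The function names are assumptions, and `z` stands for the raw, unbounded value produced by the last layer of the network.

```python
import math

def response_output(z):
    """Sigmoid: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def amount_output(z):
    """Exponential: maps any real value to a strictly positive amount."""
    return math.exp(z)

for z in (-3.0, 0.0, 3.0):
    print(z, response_output(z), amount_output(z))
```

Since the amount inputs are log-transformed, exponentiating the output also brings the prediction back to the currency scale: a raw output of log(50) corresponds to a predicted amount of 50.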

Annotation 7103937907980

#feature-engineering #lstm #recurrent-neural-networks #rnn

Brand Choice and Market Share Forecasting Using Scanner Data

Demand forecasting for products within a category is a critical task for retailers and brand managers alike. The multinomial logit model (MNL) is commonly used to predict brand choice and market share using marketing-mix and loyalty variables (Guadagni & Little, 1983). Artificial feedforward neural networks (ANN) have also been shown to effectively predict household brand choices, as well as brand market shares (Agrawal & Schorling, 1996). Since brand choices can be modeled as sequential choices, and data complexity increases exponentially with the number of brands (with interaction effects), LSTM neural networks offer suitable alternatives. Similar to our studies, we could encode brand choices and the decision environment as we encoded solicitations and donations: as a multidimensional vector. We conjecture that testing the performance of LSTM neural networks in the context of brand choices would constitute an exciting replication area.

Annotation 7103939480844

#feature-engineering #lstm #recurrent-neural-networks #rnn

The LSTM neural network topology is well-suited for modeling churn, especially in time-series format. However, its performance against standard churn prediction models remains an avenue for further research.

Annotation 7103941840140

#feature-engineering #lstm #recurrent-neural-networks #rnn

Clickstream Data

Online retailers routinely use clickstream data to predict online customer behavior. These retailers observe the clickstream data from a panel of customers and use the history of customers' browsing behavior to make predictions about browsing behaviors, purchasing propensities, or consumer interests. Marketing academics have leveraged the clickstream data of a single website to model the evolution of website-visit behavior (Moe & Fader, 2004a) and purchase-conversion behavior (Moe & Fader, 2004b).

Annotation 7103943413004

#feature-engineering #lstm #recurrent-neural-networks #rnn

Park and Fader (2004) leveraged internet clickstream data from multiple websites, such that relevant information from one website could be used to explain behavior on the other. The LSTM neural network would be well-suited for modeling online customer behavior across multiple websites since it can naturally capture inter-sequence and inter-temporal interactions from multiple streams of clickstream data without growing exponentially in complexity.

Annotation 7103944985868

#feature-engineering #lstm #recurrent-neural-networks #rnn

Third, while LSTM models offer a markedly improved solution to the problem of exploding gradients (over vanilla RNN models), they are not guaranteed to be shielded from it entirely. Facing such an issue, the analyst might need to rely on computational tricks, such as gradient clipping (Bengio, 2012), gradient scaling, or batch normalization (Bjorck, Gomes, Selman, & Weinberger, 2018)
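
Of the tricks mentioned, gradient clipping is the simplest to illustrate. The sketch below shows one common variant, clipping by global norm, on a plain list of gradient values; the function name and the threshold of 1.0 are assumptions, and deep learning libraries provide their own implementations.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so that their Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)       # small gradients pass through unchanged
    scale = max_norm / norm      # shrink all components by the same factor
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
print(clipped)
```

The direction of the gradient is preserved and only its length is capped, so an occasional exploding gradient cannot throw the parameters far off in a single update.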

Annotation 7103946558732

#feature-engineering #lstm #recurrent-neural-networks #rnn

Finally, the field of deep learning in general, and recurrent neural networks, in particular, is evolving rapidly. Many alternative model specifications and network architectures offer the promises of improvements over vanilla LSTM models. They have already been proven superior in some domains. Such alternative specifications include Gated Recurrent Units, BiLSTM (Siami-Namini, Tavakoli, & Namin, 2019), Multi-Dimensional LSTM (Graves & Schmidhuber, 2009), Neural Turing Machines (Graves, Wayne, & Danihelka, 2014), Attention-Based RNN and its various implementations (e.g., Bahdanau, Cho, & Bengio, 2014; Luong, Pham, & Manning, 2015), or Transformers (Vaswani et al., 2017). It is not clear that one architecture will lead systematically to the best possible performance. Lacking benchmarking studies, the analyst may be required to experiment with several models (although, as demonstrated in this paper, a simple LSTM model already provides excellent performance)

Annotation 7103948131596

#feature-engineering #lstm #recurrent-neural-networks #rnn

In this paper, we have shown that recent neural network architectures, traditionally used in natural language processing and machine translation, could effectively do away with the complicated and time-consuming step of feature engineering, even when applied to highly structured problems such as predicting the future behaviors of a panel of customers. We apply LSTM neural networks to predict customer responses in direct marketing and discuss their possible application in other contexts within marketing, such as market-share forecasting using scanner data, churn prediction, or predictions using clickstream data.

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103949704460

#feature-engineering #lstm #recurrent-neural-networks #rnn

There are examples of authors that used 274 features to predict customer behaviors in a non-contractual setting. One of the authors, who has extensive industry experience, has built predictive models with 600 features and more. Feature engineering is not only a time-consuming process, it is also error-prone, complex, and highly dependent on the analyst's domain knowledge (or, sometimes, lack thereof). On the other hand, LSTM neural networks rely on raw unsummarized data to predict customer behaviors and can be scaled easily to very complex settings involving multiple streams of data

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Flashcard 7103952588044

Tags

#DAG #causal #edx #has-images #inference

[unknown IMAGE 7096178707724]

Question

As you may have already noticed, the case-control design selects individuals based on their [...]. Women who did develop cancer are much more likely to be included in the study than women who did not develop cancer. Therefore, our causal graph will include a node for selection-- C-- an arrow from the outcome Y to C, and a box around C to indicate that the analysis is conditional on having been selected into the study, which means that we are only one arrow away from selection bias.

Answer

outcome

status	not learned	measured difficulty	37% [default]	last interval [days]
repetition number in this series	0	memorised on		scheduled repetition
scheduled repetition interval		last repetition or drill

Parent (intermediate) annotation

As you may have already noticed, the case-control design selects individuals based on their outcome. Women who did develop cancer are much more likely to be included in the study than women who did not develop cancer. Therefore, our causal graph will include a node for selection-- C--

Flashcard 7103954423052

Tags

#DAG #causal #edx #has-images

[unknown IMAGE 7093205732620]

Question

For example, suppose that the causal DAG includes an unmeasured common cause of A and Y, U and also a [...] variable L that is an effect of U.

Answer

measured

status	not learned	measured difficulty	37% [default]	last interval [days]
repetition number in this series	0	memorised on		scheduled repetition
scheduled repetition interval		last repetition or drill

Parent (intermediate) annotation

For example, suppose that the causal DAG includes an unmeasured common cause of A and Y, U and also a measured variable L that is an effect of U.

Flashcard 7103956258060

Tags

#DAG #causal #edx

Question

Two sources of bias:

- common cause (confounding)

- [...] (selection bias)

Answer

conditioning on common effect

status	not learned	measured difficulty	37% [default]	last interval [days]
repetition number in this series	0	memorised on		scheduled repetition
scheduled repetition interval		last repetition or drill

Parent (intermediate) annotation

Two sources of bias: - common cause (confounding) - conditioning on common effect (selection bias)

Flashcard 7103958093068

Tags

#DAG #causal #edx

Question

What is the backdoor path criterion?

This is a graphical rule that tells us whether we can identify the causal effect of interest if we know the causal DAG. And the rule is the following: we can identify the causal effect of A on Y if we have sufficient data to block all backdoor paths between A and Y. We sometimes refer to these variables that we use to eliminate the backdoor path as [...].

Answer

confounders

status	not learned	measured difficulty	37% [default]	last interval [days]
repetition number in this series	0	memorised on		scheduled repetition
scheduled repetition interval		last repetition or drill

Parent (intermediate) annotation

identify the causal effect of A on Y if we have sufficient data to block all backdoor paths between A and Y. We sometimes refer to these variables that we use to eliminate the backdoor path as confounders.

Annotation 7103959665932

#DAG #causal #edx

What is the backdoor path criterion?

This is a graphical rule that tells us whether we can identify the causal effect of interest if we know the causal DAG.

And the rule is the following:

we can identify the causal effect of A on Y if we have sufficient data to block all backdoor paths between A and Y
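
As a rough illustration of what "sufficient data to block all backdoor paths" buys in practice, the sketch below standardizes over a single measured confounder L. The counts are invented, the function names are hypothetical, and the sketch assumes that L alone blocks every backdoor path between a binary A and Y; it is not taken from the course material.

```python
# Toy counts, invented for illustration: (L, A) -> (people, people with Y = 1).
counts = {
    (0, 0): (80, 8),
    (0, 1): (20, 4),
    (1, 0): (20, 10),
    (1, 1): (80, 48),
}


def crude_risk(counts, a):
    """P(Y=1 | A=a), ignoring the confounder L."""
    n = sum(n_all for (_, arm), (n_all, _) in counts.items() if arm == a)
    y = sum(n_y for (_, arm), (_, n_y) in counts.items() if arm == a)
    return y / n


def adjusted_risk(counts, a):
    """Sum over l of P(Y=1 | A=a, L=l) * P(L=l), i.e. standardization over L."""
    total = sum(n_all for n_all, _ in counts.values())
    levels = {l for l, _ in counts}
    risk = 0.0
    for l in levels:
        # Marginal probability of stratum l (assumes binary A coded 0/1).
        p_l = sum(counts[(l, arm)][0] for arm in (0, 1)) / total
        n_all, n_y = counts[(l, a)]
        risk += p_l * (n_y / n_all)
    return risk


crude_difference = crude_risk(counts, 1) - crude_risk(counts, 0)
adjusted_difference = adjusted_risk(counts, 1) - adjusted_risk(counts, 0)
print(crude_difference, adjusted_difference)
```

In this toy table the crude risk difference is noticeably larger than the adjusted one, because L is associated with both A and Y.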

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Parent (intermediate) annotation

What is the backdoor path criterion? This is a graphical rule that tells us whether we can identify the causal effect of interest if we know the causal DAG. And the rule is the following: we can identify the causal effect of A on Y if we have sufficient data to block all backdoor paths between A and Y. We sometimes refer to these variables that we use to eliminate the backdoor path as confounders.

Annotation 7103962025228

[unknown IMAGE 7093205732620]

#DAG #causal #edx #has-images

In those cases, it is generally better to adjust for L

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Parent (intermediate) annotation

For example, suppose that the causal DAG includes an unmeasured common cause of A and Y, U and also a measured variable L that is an effect of U. In those cases, it is generally better to adjust for L, because even though adjusting for L will not eliminate all confounding by U, it will typically eliminate some of the confounding by U. In those cases we say that L is a surrogate confo

Annotation 7103963598092

[unknown IMAGE 7093205732620]

#DAG #causal #edx #has-images

In those cases, it is generally better to adjust for L, because even though adjusting for L will not eliminate all confounding by U, it will typically eliminate some of the confounding by U

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

The Keras deep learning library provides an implementation of TBPTT for training recurrent neural networks. The implementation is more restricted than the general version listed above. Specifically, the k1 and k2 values are equal to each other and fixed, that is, TBPTT(k1, k2) with k1 = k2 = k. This is realized by the fixed-sized three-dimensional input required to train recurrent neural networks like the LSTM.

The LSTM expects input data to have the dimensions: samples, time steps, and features. It is the second dimension of this input format, the time steps, that defines the number of time steps used for forward and backward passes on your sequence prediction problem
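
As a rough illustration of the three-dimensional input described above, the sketch below arranges a flat univariate sequence into samples, time steps, and features using plain Python lists. The helper name `to_3d` and the toy values are assumptions made for illustration; in practice a NumPy `reshape` would typically do the same job before the array is passed to Keras.

```python
def to_3d(sequence, time_steps):
    """Arrange a flat univariate sequence as [samples][time_steps][features=1]."""
    n_samples = len(sequence) // time_steps  # any incomplete tail is dropped
    return [
        [[sequence[s * time_steps + t]] for t in range(time_steps)]
        for s in range(n_samples)
    ]


sequence = [float(i) for i in range(10)]
batch = to_3d(sequence, time_steps=5)

# The second dimension is the number of time steps, which plays the role of k.
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (2, 5, 1)
```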

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103975918860

#deep-learning #keras #lstm #python #sequence

Careful choice must be given to the number of time steps specified when preparing your input data for sequence prediction problems in Keras.

The choice of time steps will influence both:

- The internal state accumulated during the forward pass.

- The gradient estimate used to update weights on the backward pass.

Note that by default, the internal state of the network is reset after each batch, but more explicit control over when the internal state is reset can be achieved by using a so-called stateful LSTM and calling the reset operation manually

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103978278156

#deep-learning #keras #lstm #python #sequence

Truncated Backpropagation Through Time, or TBPTT, is a modified version of the BPTT training algorithm for recurrent neural networks where the sequence is processed one time step at a time and periodically an update is performed back for a fixed number of time steps

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103979851020

#deep-learning #keras #lstm #python #sequence

... a naive method that splits the 1,000-long sequence into 50 sequences (say) each of length 20 and treats each sequence of length 20 as a separate training case. This is a sensible approach that can work well in practice, but it is blind to temporal dependencies that span more than 20 time steps.

— Training Recurrent Neural Networks, 2013

This means as part of framing your problem you must split long sequences into subsequences that are both long enough to capture relevant context for making predictions, but short enough to efficiently train the network
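
The naive split quoted above can be sketched in a few lines. This is a minimal illustration, assuming non-overlapping windows and a sequence whose length is a multiple of the window length; `split_sequence` is a hypothetical helper name, not a Keras function.

```python
def split_sequence(sequence, length):
    """Split a long sequence into consecutive, non-overlapping subsequences."""
    return [
        sequence[start:start + length]
        for start in range(0, len(sequence) - length + 1, length)
    ]


long_sequence = list(range(1000))
subsequences = split_sequence(long_sequence, length=20)

# 1,000 time steps become 50 separate training cases of 20 steps each.
print(len(subsequences), len(subsequences[0]))  # 50 20
```

Each subsequence is treated as an independent training case, so a dependency between, say, step 5 and step 45 falls in two different cases and cannot be learned from this framing.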

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103982210316

#deep-learning #keras #lstm #python #sequence

Chapter 3 How to Prepare Data for LSTMs

3.0.1 Lesson Goal
The goal of this lesson is to teach you how to prepare sequence prediction data for use with LSTM models. After completing this lesson, you will know:

- How to scale numeric data and how to transform categorical data.

- How to pad and truncate input sequences with varied lengths.

- How to transform input sequences into a supervised learning problem.

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103983783180

#deep-learning #keras #lstm #python #sequence

3.1 Prepare Numeric Data

The data for your sequence prediction problem probably needs to be scaled when training a neural network, such as a Long Short-Term Memory recurrent neural network. When a network is fit on unscaled data that has a range of values (e.g. quantities in the 10s to 100s) it is possible for large inputs to slow down the learning and convergence of your network, and in some cases prevent the network from effectively learning your problem. There are two types of scaling of your series that you may want to consider: normalization and standardization. These can both be achieved using the scikit-learn machine learning library in Python

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103986142476

#deep-learning #keras #lstm #python #sequence

3.1. Prepare Numeric Data

3.1.1 Normalize Series Data

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1. Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data. If your series is trending up or down, estimating these expected values may be difficult and normalization may not be the best method to use on your problem. If a value to be scaled is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1. You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values. You can normalize your dataset using the scikit-learn object MinMaxScaler
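
The arithmetic behind normalization, and the out-of-bounds case mentioned above, can be sketched without any library. This is a minimal illustration assuming the minimum and maximum are estimated from training data and differ from each other; the function names are hypothetical and this is not how MinMaxScaler is implemented internally.

```python
def fit_min_max(values):
    """Estimate the minimum and maximum observable values from training data."""
    return min(values), max(values)


def normalize(value, lo, hi, clip=False):
    """Rescale value so that lo maps to 0 and hi maps to 1 (assumes hi > lo)."""
    if clip:
        # Limit out-of-range observations to the pre-defined bounds.
        value = max(lo, min(hi, value))
    return (value - lo) / (hi - lo)


train = [10.0, 20.0, 30.0, 40.0, 50.0]
lo, hi = fit_min_max(train)

scaled = [normalize(v, lo, hi) for v in train]
outside = normalize(60.0, lo, hi)             # above the training maximum
clipped = normalize(60.0, lo, hi, clip=True)  # limited to the maximum first

print(scaled)            # [0.0, 0.25, 0.5, 0.75, 1.0]
print(outside, clipped)  # 1.25 1.0
```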

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103988501772

#deep-learning #keras #lstm #python #sequence

Good practice usage with the MinMaxScaler and other scaling techniques is as follows:

Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.

Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.

Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions. If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting. This can be done by calling the inverse_transform() function.

Below is an example of normalizing a contrived sequence of 10 quantities. The scaler object requires data to be provided as a matrix of rows and columns. The time series data is loaded as a Pandas Series

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103990074636

#deep-learning #keras #lstm #python #sequence

from pandas import Series
from sklearn.preprocessing import MinMaxScaler

# define contrived series
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
series = Series(data)
print(series)

# prepare data for normalization: the scaler expects a 2D array of rows and columns
values = series.values
values = values.reshape((len(values), 1))

# train the normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(values)
# data_min_ and data_max_ hold one entry per column; there is a single column here
print('Min: %f, Max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))

# normalize the dataset and print
normalized = scaler.transform(values)
print(normalized)

# inverse transform and print
inversed = scaler.inverse_transform(normalized)
print(inversed)

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103993220364

[unknown IMAGE 7103991909644]

#deep-learning #has-images #keras #lstm #python #sequence

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103994531084

#deep-learning #keras #lstm #python #sequence

3.1.2 Standardize Series Data

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. This can be thought of as subtracting the mean value or centering the data. Like normalization, standardization can be useful, and even required in some machine learning algorithms when your data has input values with differing scales. Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well behaved mean and standard deviation. You can still standardize your time series data if this expectation is not met, but you may not get reliable results.
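
The rescaling described above amounts to subtracting the mean and dividing by the standard deviation. The sketch below shows that arithmetic with the standard library only; the function names and the toy data are assumptions for illustration, and the population standard deviation is used here as a deliberate choice.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Estimate the mean and (population) standard deviation from training data."""
    return mean(values), pstdev(values)


def standardize(value, mu, sigma):
    """Center on the mean and scale to unit standard deviation."""
    return (value - mu) / sigma


def destandardize(z, mu, sigma):
    """Invert the transform, e.g. to report predictions in the original scale."""
    return z * sigma + mu


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_standardizer(train)

standardized = [standardize(v, mu, sigma) for v in train]
restored = [destandardize(z, mu, sigma) for z in standardized]
print(mu, sigma)
print(standardized)
```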

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103996890380

#deep-learning #keras #lstm #python #sequence

3.1.3 Practical Considerations When Scaling

There are some practical considerations when scaling sequence data.

Estimate Coefficients
You can estimate coefficients (min and max values for normalization or mean and standard deviation for standardization) from the training data. Inspect these first-cut estimates and use domain knowledge or domain experts to help improve these estimates so that they will be usefully correct on all data in the future.

Save Coefficients
You will need to scale new data in the future in exactly the same way as the data used to train your model. Save the coefficients used to file and load them later when you need to scale new data when making predictions.

Data Analysis
Use data analysis to help you better understand your data. For example, a simple histogram can help you quickly get a feeling for the distribution of quantities to see if standardization would make sense.

Scale Each Series
If your problem has multiple series, treat each as a separate variable and in turn scale them separately. Here, scale refers to a choice of scaling procedure such as normalization or standardization.

Scale At The Right Time
It is important to apply any scaling transforms at the right time. For example, if you have a series of quantities that is non-stationary, it may be appropriate to scale after first making your data stationary. It would not be appropriate to scale the series after it has been transformed into a supervised learning problem as each column would be handled differently, which would be incorrect.

Scale if in Doubt
You probably do need to rescale your input and output variables. If in doubt, at least normalize your data.
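
The "Save Coefficients" advice above could be followed in many ways; the sketch below writes the coefficients to a JSON file and reloads them before scaling a new value. The file name, the dictionary keys, and the helper names are all hypothetical choices made for this illustration.

```python
import json
import os
import tempfile


def save_coefficients(path, coefficients):
    """Write scaling coefficients to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(coefficients, handle)


def load_coefficients(path):
    """Read scaling coefficients back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Coefficients estimated at training time (toy values for normalization).
coefficients = {"min": 10.0, "max": 100.0}
path = os.path.join(tempfile.mkdtemp(), "scaler_coefficients.json")
save_coefficients(path, coefficients)

# Later, at prediction time: reload and scale new data in exactly the same way.
restored = load_coefficients(path)
new_value = 55.0
scaled = (new_value - restored["min"]) / (restored["max"] - restored["min"])
print(scaled)  # 0.5
```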

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7103999249676

#deep-learning #keras #lstm #python #sequence

3.2.1 How to Convert Categorical Data to Numerical Data

This involves two steps:

1. Integer Encoding.

2. One Hot Encoding.

Integer Encoding
As a first step, each unique category value is assigned an integer value. For example, red is 1, green is 2, and blue is 3. This is called label encoding or an integer encoding and is easily reversible. For some variables, this may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship. For example, ordinal variables like the place example above would be a good example where a label encoding would be sufficient
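
A minimal sketch of integer encoding, assuming integers are assigned from 1 in order of first appearance so that the result matches the red/green/blue example; the helper name is hypothetical.

```python
def fit_integer_encoding(categories):
    """Assign each unique category an integer, starting at 1, in first-seen order."""
    mapping = {}
    for category in categories:
        if category not in mapping:
            mapping[category] = len(mapping) + 1
    return mapping


colors = ["red", "green", "blue", "green", "red"]
mapping = fit_integer_encoding(colors)
encoded = [mapping[c] for c in colors]

# The encoding is easily reversible by inverting the mapping.
inverse = {number: category for category, number in mapping.items()}
decoded = [inverse[n] for n in encoded]

print(mapping)  # {'red': 1, 'green': 2, 'blue': 3}
print(encoded)  # [1, 2, 3, 2, 1]
```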

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7104000822540

#deep-learning #keras #lstm #python #sequence

One Hot Encoding

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
In this case, a one hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value. In the color variable example, there are 3 categories and therefore 3 binary variables are needed. A 1 value is placed in the binary variable for the color and 0 values for the other colors.
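
The one hot step can be sketched as follows, assuming the integer codes start at 1 as in the color example (red is 1, green is 2, blue is 3); the function names are hypothetical.

```python
def one_hot(value, n_categories):
    """Return a binary vector with a 1 in the position for this integer code."""
    vector = [0] * n_categories
    vector[value - 1] = 1  # integer codes are assumed to start at 1
    return vector


def one_hot_decode(vector):
    """Recover the integer code from a one hot vector."""
    return vector.index(1) + 1


integer_encoded = [1, 2, 3, 2]  # red, green, blue, green
binary = [one_hot(v, n_categories=3) for v in integer_encoded]
print(binary)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

Because every category gets its own binary variable, no ordering between red, green, and blue is implied.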

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Annotation 7104002395404

#deep-learning #keras #lstm #python #sequence

Sequence prediction problems must be re-framed as supervised learning problems. That is, data must be transformed from a sequence to pairs of inputs and outputs.
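
One common way to do this re-framing is a sliding window, sketched below under the assumption that each output is the single value that follows a fixed number of input steps. The helper name `to_supervised` and the parameter `n_in` are hypothetical.

```python
def to_supervised(sequence, n_in=1):
    """Turn a sequence into (input window, next value) pairs."""
    pairs = []
    for i in range(len(sequence) - n_in):
        window = sequence[i:i + n_in]   # the n_in previous observations
        target = sequence[i + n_in]     # the observation to predict
        pairs.append((window, target))
    return pairs


pairs = to_supervised([10, 20, 30, 40, 50], n_in=2)
for window, target in pairs:
    print(window, "->", target)
```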

status	not read	reprioritisations
last reprioritisation on		suggested re-reading day
started reading on		finished reading on

Edited, memorised or added to reading queue

on 12-Jul-2022 (Tue)
