## Discrete Choice Modeling


**William Greene**
Stern School of Business, New York University
Discrete Choice Modeling

**Bayesian Estimation**
• Philosophical underpinnings: the meaning of statistical information
• How to combine information contained in the sample with prior information

**Classical Inference**
(Diagram labels: Population; Measurement; Econometrics; Characteristics; Behavior Patterns; Choices)
• Imprecise inference about the entire population: sampling theory and asymptotics

**Bayesian Inference**
(Diagram labels: Population; Measurement; Econometrics; Characteristics; Behavior Patterns; Choices)
• Sharp, ‘exact’ inference about only the sample: the ‘posterior’ density

**Paradigms**
• Classical
  • Formulate the theory
  • Gather evidence
  • Evidence consistent with theory? Theory stands and waits for more evidence to be gathered
  • Evidence conflicts with theory? Theory falls
• Bayesian
  • Formulate the theory
  • Assemble existing evidence on the theory
  • Form beliefs based on existing evidence
  • (*) Gather new evidence
  • Combine beliefs with new evidence
  • Revise beliefs regarding the theory
  • Return to (*)

**On Objectivity and Subjectivity**
• Objectivity and “Frequentist” methods in Econometrics: the data speak
• Subjectivity and Beliefs
  • Priors
  • Evidence
  • Posteriors
• Science and the Scientific Method

**Foundational Result**
• A method of using new information to update existing beliefs about probabilities of events
• Bayes Theorem for events.
(Conceived for updating beliefs about games of chance)

**Likelihoods**
• (Frequentist) The likelihood is the density of the observed data conditioned on the parameters
• Inference based on the likelihood is usually “maximum likelihood”
• (Bayesian) A function of the parameters and the data that forms the basis for inference, not a probability distribution
• The likelihood embodies the current information about the parameters and the data

**The Likelihood Principle**
• The likelihood embodies ALL the current information about the parameters and the data
• Proportional likelihoods should lead to the same inferences, even given different interpretations.

**“Estimation”**
• Assembling information
• Prior information = out of sample. Literally prior or outside information
• Sample information is embodied in the likelihood
• Result of the analysis: “Posterior belief” = blend of prior and likelihood

**Bayesian Investigation**
• No fixed “parameters.” $\theta$ is a random variable.
• Data are realizations of random variables. There is a marginal distribution $p(\text{data})$
• Parameters are part of the random state of nature, $p(\theta)$ = distribution of $\theta$ independently (prior to) the data, as understood by the analyst. (Two analysts could legitimately bring different priors to the study.)
• Investigation combines sample information with prior information.
• Outcome is a revision of the prior based on the observed information (data)

**The Bayesian Estimator**
• The posterior distribution embodies all that is “believed” about the model.
• Posterior = $f(\text{model} \mid \text{data}) = L(\theta, \text{data})\, p(\theta) / p(\text{data})$, that is, likelihood times prior, divided by the marginal density of the data
• “Estimation” amounts to examining the characteristics of the posterior distribution(s).
  • Mean, variance
  • Distribution
  • Intervals containing specified probabilities

**Priors and Posteriors**
• The Achilles heel of Bayesian Econometrics
• Noninformative and Informative priors for estimation of parameters
  • Noninformative (diffuse) priors: How to incorporate the total lack of prior belief in the Bayesian estimator. The estimator becomes solely a function of the likelihood
  • Informative prior: Some prior information enters the estimator. The estimator mixes the information in the likelihood with the prior information.
• Improper and Proper priors
  • $p(\theta)$ is uniform over the allowable range of $\theta$
  • Cannot integrate to 1.0 if the range is infinite.
  • Salvation: improper, but noninformative priors will fall out of the posterior.

**Symmetrical Treatment of Data and Parameters**
• Likelihood is $p(\text{data} \mid \theta)$
• Prior summarizes nonsample information about $\theta$ in $p(\theta)$
• Joint distribution is $p(\text{data}, \theta)$
• $p(\text{data}, \theta) = p(\text{data} \mid \theta)\, p(\theta)$
• Use Bayes theorem to get $p(\theta \mid \text{data})$ = posterior distribution

**Priors: Where do they come from?**
• What does the prior contain?
• Informative priors: real prior information
• Noninformative priors
• Mathematical complications
• Diffuse
• Uniform
• Normal with huge variance
• Improper priors
• Conjugate priors

**Application**
Estimate $\theta$, the probability that a production process will produce a defective product.
Sampling design: Choose N = 25 items from the production line. D = the number of defectives.
Result of our experiment: D = 8.
Likelihood for the sample of data is $L(\theta \mid \text{data}) = \theta^{D}(1-\theta)^{25-D}$, $0 < \theta < 1$.
Maximum likelihood estimator of $\theta$ is q = D/25 = 0.32.
Asymptotic variance of the MLE is estimated by q(1 − q)/25 = 0.008704.

**Modern Bayesian Analysis**
Bayesian Estimate of Distribution of $\theta$
(Posterior mean was .333333; posterior variance was .007936)
• Observations = 5000
• Sample Mean = .334017
• Sample Variance = .007454
• Standard Deviation = .086336
• Skewness = .248077
• Kurtosis-3 (excess) = -.161478
• Minimum = .066214
• Maximum = .653625
• .025 Percentile = .177090
• .975 Percentile = .510028

**Bayesian Estimator**
First generation: Do the integration (math)

**Modern Bayesian Analysis**
• Multiple parameter settings
• Derivation of exact form of expectations and variances for $p(\theta_1, \theta_2, \ldots, \theta_K \mid \text{data})$ is hopelessly complicated even if the density is tractable.
• Strategy: Sample joint observations $(\theta_1, \theta_2, \ldots, \theta_K)$ from the posterior population and use marginal means, variances, quantiles, etc.
• How to sample the joint observations??? (Still hopelessly complicated.)

**Magic Tool: The Gibbs Sampler**
• Problem: How to sample observations from a population, $p(\theta_1, \theta_2, \ldots, \theta_K \mid \text{data})$.
• Solution: The Gibbs Sampler.
• Target: Sample from $f(x_1, x_2)$ = joint distribution
• Joint distribution is unknown or it is not possible to sample from the joint distribution.
• Assumed: Conditional distributions $f(x_1 \mid x_2)$ and $f(x_2 \mid x_1)$ are both known and samples can be drawn from both.
• Gibbs sampling: Obtain one draw from $(x_1, x_2)$ by many cycles between $x_1 \mid x_2$ and $x_2 \mid x_1$.
  • Start $x_{1,0}$ anywhere in the right range.
  • Draw $x_{2,0}$ from $x_2 \mid x_{1,0}$.
  • Return to $x_{1,1}$ from $x_1 \mid x_{2,0}$ and so on.
  • Several thousand cycles produce a draw
  • Repeat several thousand times to produce a sample
• Average the draws to estimate the marginal means.

**Application: Bivariate Normal**
• Obtain a bivariate normal sample (x, y) from $N[(0,0),(1,1,\rho)]$. N = 5000.
• Conditionals: $x \mid y$ is $N[\rho y, (1-\rho^2)]$; $y \mid x$ is $N[\rho x, (1-\rho^2)]$.
• Gibbs sampler: $y_0 = 0$.
• $x_1 = \rho y_0 + \sqrt{1-\rho^2}\, v$ where v is a N(0,1) draw
• $y_1 = \rho x_1 + \sqrt{1-\rho^2}\, w$ where w is a N(0,1) draw
• Repeat cycle 60,000 times. Drop first 10,000. Retain every 10th observation of the remainder.

**More General Gibbs Sampler**
• Objective: Sample joint observations on $\theta_1, \theta_2, \ldots, \theta_K$ from $p(\theta_1, \theta_2, \ldots, \theta_K \mid \text{data})$ (Let K = 3)
• Derive $p(\theta_1 \mid \theta_2, \theta_3, \text{data})$, $p(\theta_2 \mid \theta_1, \theta_3, \text{data})$, $p(\theta_3 \mid \theta_1, \theta_2, \text{data})$
• Gibbs cycles produce joint observations:
  0. Start $\theta_1, \theta_2, \theta_3$ at some reasonable values
  1. Sample a draw from $p(\theta_1 \mid \theta_2, \theta_3, \text{data})$ using the values of $\theta_2, \theta_3$ in hand
  2. Sample a draw from $p(\theta_2 \mid \theta_1, \theta_3, \text{data})$ using the draw at step 1 for $\theta_1$
  3. Sample a draw from $p(\theta_3 \mid \theta_1, \theta_2, \text{data})$ using the draws at steps 1 and 2
  4. Return to step 1. After a burn-in period (a few thousand cycles), start collecting the draws. The set of draws ultimately gives a sample from the joint distribution.
• Order within the chain does not matter.

**Strategy: Data Augmentation**
• Treat $y_i^*$ as unknown ‘parameters’ along with $\beta$
• ‘Estimate’ $\theta = (\beta, y_1^*, \ldots, y_N^*) = (\beta, y^*)$
• Draw a sample of R observations from the joint population $(\beta, y^*)$.
• Use the marginal observations on $\beta$ to estimate the characteristics (e.g., mean) of the distribution of $\beta \mid y, X$

**Gibbs Sampler Strategy**
• $p(\beta \mid y^*, (y, X))$. If $y^*$ is known, y is known, so $p(\beta \mid y^*, (y, X)) = p(\beta \mid y^*, X)$.
• $p(\beta \mid y^*, X)$ defines a linear regression with N(0,1) normal disturbances.
• Known result for $\beta \mid y^*$: $p(\beta \mid y^*, (y, X), \varepsilon \sim N[0, I]) = N[b^*, (X'X)^{-1}]$, where $b^* = (X'X)^{-1}X'y^*$
• Deduce a result for $y^* \mid \beta$

**Gibbs Sampler, Continued**
• $y_i^* \mid \beta, x_i$ is Normal$[x_i'\beta, 1]$
• $y_i$ is informative about $y_i^*$:
  • If $y_i = 1$, then $y_i^* > 0$; $p(y_i^* \mid \beta, x_i, y_i = 1)$ is truncated normal: $p(y_i^* \mid \beta, x_i, y_i = 1) = \phi(y_i^* - x_i'\beta)/\Phi(x_i'\beta)$ for $y_i^* > 0$. Denoted $N^{+}[x_i'\beta, 1]$
  • If $y_i = 0$, then $y_i^* < 0$; $p(y_i^* \mid \beta, x_i, y_i = 0)$ is truncated normal: $p(y_i^* \mid \beta, x_i, y_i = 0) = \phi(y_i^* - x_i'\beta)/[1 - \Phi(x_i'\beta)]$ for $y_i^* < 0$. Denoted $N^{-}[x_i'\beta, 1]$

**Gibbs Sampler**
• Preliminary: Obtain $X'X$, then L such that $LL' = (X'X)^{-1}$.
• Preliminary: Choose an initial value for $\beta$ such as $\beta_0 = 0$. Start with r = 1.
• ($y^*$ step) Sample N observations on $y^*(r)$ using $\beta_{r-1}$, $x_i$ and $y_i$ and the transformations for the truncated normal distribution.
• ($\beta$ step) Compute $b^*(r) = (X'X)^{-1}X'y^*(r)$. Draw the observation on $\beta(r)$ from the normal population with mean $b^*(r)$ and variance $(X'X)^{-1}$.
• Cycle between the two steps 50,000 times. Discard the first 10,000 and retain every 10th observation from the remaining 40,000.

**Frequentist and Bayesian Results**
Computation times: 0.37 seconds, 2 minutes

**Bayesian Model Estimation**
• Specification of conditional likelihood: $f(\text{data} \mid \text{parameters}) = L(\text{parameters} \mid \text{data})$
• Specification of priors: $g(\text{parameters})$
• Posterior density of parameters: $p(\text{parameters} \mid \text{data}) = f(\text{data} \mid \text{parameters})\, g(\text{parameters}) / f(\text{data})$
• Posterior mean = $E[\text{parameters} \mid \text{data}]$

**Bayesian Estimators**
• Bayesian “Random Parameters” vs. Classical Randomly Distributed Parameters
• Models of Individual Heterogeneity
  • Sample Proportion
  • Linear Regression
  • Binary Choice
  • Random Effects: Consumer Brand Choice
  • Fixed Effects: Hospital Costs

**A Random Effects Approach**
• Allenby and Rossi, “Marketing Models of Consumer Heterogeneity”
• Discrete Choice Model: Brand Choice
• Hierarchical Bayes
• Multinomial Probit
• Panel Data: Purchases of 4 brands of ketchup
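
**Illustrative sketch: the binary choice Gibbs sampler in code**
The two-step cycle from the “Gibbs Sampler” slides (a y* step drawing from truncated normals, then a β step drawing from a normal centered at b*) can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the results in these slides. It assumes a probit model with a constant and a single regressor, so the 2×2 matrix algebra can be written out by hand, and it uses simulated data with much shorter chains than the 50,000 cycles mentioned above. The function names `draw_truncated` and `probit_gibbs`, the simulated sample, and the chain settings are all hypothetical choices made for this example.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for cdf and inverse cdf

def draw_truncated(mu, positive, rng):
    """Draw y* from N(mu, 1) truncated to y* > 0 (positive) or y* < 0."""
    p0 = STD.cdf(-mu)                    # P(y* < 0) when y* ~ N(mu, 1)
    u = rng.random()
    p = p0 + u * (1.0 - p0) if positive else u * p0
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1);
    return mu + STD.inv_cdf(p)           # adequate only for moderate |mu|

def probit_gibbs(x, y, n_cycles=1500, burn_in=300, thin=4, seed=0):
    """Gibbs sampler for y = 1[b0 + b1*x + e > 0], e ~ N(0,1), flat prior."""
    rng = random.Random(seed)
    # Preliminary: (X'X)^-1 for X = [1, x], written out for the 2x2 case.
    s0, s1, s2 = float(len(x)), sum(x), sum(v * v for v in x)
    det = s0 * s2 - s1 * s1
    v00, v01, v11 = s2 / det, -s1 / det, s0 / det
    # Preliminary: Cholesky factor L with LL' = (X'X)^-1.
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)
    b0, b1 = 0.0, 0.0                    # initial value beta_0 = 0
    kept = []
    for r in range(n_cycles):
        # y* step: one truncated normal draw per observation.
        ystar = [draw_truncated(b0 + b1 * xi, yi == 1, rng)
                 for xi, yi in zip(x, y)]
        # beta step: b* = (X'X)^-1 X'y*, then beta ~ N[b*, (X'X)^-1].
        t0 = sum(ystar)
        t1 = sum(xi * ys for xi, ys in zip(x, ystar))
        m0 = v00 * t0 + v01 * t1
        m1 = v01 * t0 + v11 * t1
        z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        b0 = m0 + l00 * z0
        b1 = m1 + l10 * z0 + l11 * z1
        # Discard the burn-in, then keep every `thin`-th draw.
        if r >= burn_in and (r - burn_in) % thin == 0:
            kept.append((b0, b1))
    return kept

# Simulated data (assumed true values: intercept 0.5, slope 1.0).
data_rng = random.Random(42)
x = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
y = [1 if 0.5 + 1.0 * xi + data_rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]

draws = probit_gibbs(x, y)
post_mean_b0 = sum(d[0] for d in draws) / len(draws)
post_mean_b1 = sum(d[1] for d in draws) / len(draws)
print(f"posterior means: b0 = {post_mean_b0:.3f}, b1 = {post_mean_b1:.3f}")
```

The retained draws play the role of the “marginal observations on β” in the data augmentation strategy: their means estimate the posterior mean, and their empirical quantiles give intervals containing specified probabilities. Drawing the truncated normals by inverting the normal cdf is one simple option chosen here; it loses accuracy when |x'β| is large, where a rejection or tail-specific method would be safer.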