Important: Read this before posting to this forum

  1. This forum is for questions related to the use of Apollo. We will answer some general choice modelling questions too, where appropriate, and time permitting. We cannot answer questions about how to estimate choice models with other software packages.
  2. There is a very detailed manual for Apollo available at http://www.ApolloChoiceModelling.com/manual.html. This contains detailed descriptions of the various Apollo functions, and numerous examples are available at http://www.ApolloChoiceModelling.com/examples.html. In addition, help files are available for all functions, using e.g. ?apollo_mnl
  3. Before asking a question on the forum, users are kindly requested to follow these steps:
    1. Check that the same issue has not already been addressed in the forum - there is a search tool.
    2. Ensure that the correct syntax has been used. For any function, detailed instructions are available directly in Apollo, e.g. by using ?apollo_mnl for apollo_mnl
    3. Check the frequently asked questions section on the Apollo website, which discusses some common issues/failures. Please see http://www.apollochoicemodelling.com/faq.html
    4. Make sure that R is using the latest official release of Apollo.
  4. If the above steps do not resolve the issue, then users should follow these steps when posting a question:
    1. provide full details on the issue, including the entire code and output, including any error messages
    2. posts will not immediately appear on the forum, but will be checked by a moderator first. This may take a day or two at busy times. There is no need to submit the post multiple times.

Out of sample tests

Ask questions about post-estimation functions (e.g. prediction, conditionals, etc) or other processing of results.
JuliavB
Posts: 42
Joined: 18 Aug 2021, 13:36

Out of sample tests

Post by JuliavB »

Hi,

in some older literature I've seen that researchers investigate how well the chosen model can predict two choice tasks that were excluded from the analysis. The result of these investigations is something like "75% of the choices (of the two excluded tasks) can be predicted correctly by the model".
In recent papers I cannot find this approach anymore. Is it still common (and if yes, is there a way it is implemented in the Apollo package)? Or can this kind of statement about how precisely a model predicts choices be made in a different way?

Your advice is highly appreciated.
Thank you very much in advance.
J.
stephanehess
Site Admin
Posts: 998
Joined: 24 Apr 2020, 16:29

Re: Out of sample tests

Post by stephanehess »

Hi

this is out-of-sample validation. It's implemented in Apollo, but in a somewhat different way from what you describe. Have a look at apollo_outOfSample and the details in the manual and help file

Stephane
--------------------------------
Stephane Hess
www.stephanehess.me.uk
JuliavB
Posts: 42
Joined: 18 Aug 2021, 13:36

Re: Out of sample tests

Post by JuliavB »

Hi Stephane,

thank you for your input.
I've already implemented the out-of-sample testing in my analysis, with the result that the model does not seem to be overfitting the data.
But does that result of the out-of-sample test also say anything about how precisely the model predicts the choices? Or should the Rho² be used as an indicator for this?
And can Apollo perhaps additionally do an LR test against a completely random model (LL(0)) with the code below, to check how much better the model predicts choices compared to a random model?

apollo_lrTest("mnl_model", "LL (0)")
dpalma
Posts: 190
Joined: 24 Apr 2020, 17:54

Re: Out of sample tests

Post by dpalma »

Hi J,

The cross-validation routine in Apollo (apollo_outOfSample) only calculates the LL of each model in- and out-of-sample for each repetition; it does not calculate any measure of fit such as first preference recovery or market share recovery. What apollo_outOfSample does do is store all the estimated parameters and descriptions of the samples for you. So you could manually go through them, predict out-of-sample for each estimated model, and calculate some fit indicator. It would, however, require some coding on your part. We might include this functionality in the future, but for now I am afraid you would have to do it manually.

About the LR test against LL(0): Apollo does not do it automatically, but you can do it easily, as LL(0) is included in the estimation report for every model. You can then do the LR test as -2*(LL(0) - LL*) and compare against a chi-square distribution with as many degrees of freedom as there are parameters in your model.

Best wishes
David
JuliavB
Posts: 42
Joined: 18 Aug 2021, 13:36

Re: Out of sample tests

Post by JuliavB »

Hi David,

thank you very much for your advice.
As you mention that out-of-sample testing does not directly calculate any measure of fit, I am wondering whether the Rho² in the report of every model can then be seen as the main measure of fit? Or is there any other measure of fit included in Apollo, which I am not yet aware of, that is more common than the approach I would have to code manually?
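As a side note on the Rho² measure: the "Rho-squared vs equal shares" in the Apollo estimation report is McFadden's rho², i.e. 1 - LL(final)/LL(0). A quick check of this (a Python sketch purely for the arithmetic, using the LL values quoted further down in this thread):

```python
# McFadden's rho-squared vs the equal-shares model: 1 - LL(final) / LL(0)
ll_0 = -2507.03      # LL at equal shares, LL(0)
ll_final = -1960.24  # LL of the estimated model

rho_squared = 1 - ll_final / ll_0
print(f"rho-squared: {rho_squared:.4f}")  # 0.2181
```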

Regarding the LR test against LL(0) - based on your idea I've done the following:
LL at equal shares, LL(0): -2507.03
LL(final): -1960.24
For the degrees of freedom, I have to include all parameters, including mu and sigma for all random parameters, right?

According to your calculation, -2*(LL(0) - LL*) = 1093.58
The chi-square critical value for 10 degrees of freedom and 0.05 alpha is 18.307
So my model is highly significantly different from a completely random model, isn't it? But no quantitative statement can be drawn from this about how much better my model predicts choices than a completely random model, right?
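The arithmetic of this test can be sketched as follows (in Python rather than R, purely to check the numbers; 18.307 is the 0.95 chi-square quantile for 10 degrees of freedom):

```python
# Likelihood ratio test of the estimated model against the equal-shares model LL(0)
ll_0 = -2507.03      # LL at equal shares, LL(0)
ll_final = -1960.24  # LL(final) of the estimated model

lr_statistic = -2 * (ll_0 - ll_final)
print(f"LR statistic: {lr_statistic:.2f}")  # 1093.58

critical_value = 18.307  # chi-square, 10 degrees of freedom, alpha = 0.05
print("Reject the random model:", lr_statistic > critical_value)  # True
```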

Could this insight together with the Rho² measure be a proper analysis for measure of quality?
Last edited by JuliavB on 29 May 2023, 12:45, edited 1 time in total.
stephanehess
Site Admin
Posts: 998
Joined: 24 Apr 2020, 16:29

Re: Out of sample tests

Post by stephanehess »

Hi Julia

Your first message says "75% of the choices (of the two excluded tasks) can be predicted correctly by the model", but this is a misunderstanding in the literature. There is no such thing as correctly predicting a choice, as that would imply a deterministic outcome, while the models are probabilistic

choice model evaluation does not look at absolute goodness of fit, as there are many factors coming into it, not least the number of alternatives. We only focus on relative improvements, comparing different models on the same data against each other. One thing you could of course look at is the average probability for the chosen alternative. But again, there would be no formal test.

You could do this by running apollo_prediction on your model, looking at the average probability the model assigns to the chosen alternative, and then comparing that to a random model. This will be in line with the log-likelihood comparison but may be easier to interpret
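As an illustration of that comparison (a Python sketch with made-up probabilities; in practice these would come from apollo_prediction):

```python
# Average probability assigned to the chosen alternative vs. a random (equal-shares) model.
# Each row holds hypothetical predicted choice probabilities for one observation.
predicted_probs = [
    [0.60, 0.25, 0.15],
    [0.10, 0.70, 0.20],
    [0.30, 0.30, 0.40],
    [0.55, 0.35, 0.10],
]
chosen = [0, 1, 2, 0]  # index of the alternative actually chosen in each observation

avg_chosen_prob = sum(p[c] for p, c in zip(predicted_probs, chosen)) / len(chosen)
random_prob = 1 / len(predicted_probs[0])  # equal shares over 3 alternatives

print(f"model:  {avg_chosen_prob:.3f}")  # 0.562
print(f"random: {random_prob:.3f}")      # 0.333
```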

Stephane
--------------------------------
Stephane Hess
www.stephanehess.me.uk
JuliavB
Posts: 42
Joined: 18 Aug 2021, 13:36

Re: Out of sample tests

Post by JuliavB »

Hi Stephane,

thank you very much for this important clarification!

So just to make sure I understand you correctly: checking my model fit via Rho² and AIC/BIC, as well as LR tests against other model specifications, would be a proper and sufficient way to report the "qualities" of my models?

And does the testing against an LL(0) model with the following approach then add any value to the "quality evaluation" of my model - I suppose not?!

Regarding the LR test against LL(0) - I've done the following:
LL at equal shares, LL(0): -2507.03
LL(final): -1960.24
For the degrees of freedom, I have to include all parameters, including mu and sigma for all random parameters, right?

According to your calculation, -2*(LL(0) - LL*) = 1093.58
The chi-square critical value for 10 degrees of freedom and 0.05 alpha is 18.307
So my model is highly significantly different from a completely random model, isn't it? But no quantitative statement can be drawn from this about how much better my model predicts choices than a completely random model, right?

Best,
J.
Last edited by JuliavB on 02 Jul 2023, 14:36, edited 1 time in total.
stephanehess
Site Admin
Posts: 998
Joined: 24 Apr 2020, 16:29

Re: Out of sample tests

Post by stephanehess »

Hi Julia

tests allow you to say whether you can reject the null hypothesis that two models fit the data equally well, but I wouldn't go down the route of saying that model 2 is X times better than model 1 as the scale of this improvement is not meaningful.

This is why I suggested looking at average probability of correct prediction. That would e.g. tell you that for model 1, the average probability for the chosen alternative is P1, while for model 2, it is P2, and then you can say something about how close these are in terms of real-world implications.

Regarding the test against a random model, I have never really come across a model that doesn't reject the random one

Stephane
--------------------------------
Stephane Hess
www.stephanehess.me.uk