
prediction at individual level

Posted: 26 Jul 2020, 06:01
by zx9203
Dear all,

Following example 3 and the manual, it seems that the prediction function only works out predicted demand for each alternative at the aggregate level. Since my alternatives are unlabelled, I'm more interested in fit at the individual rather than the aggregate level.

Can anyone help me find the individuals who are badly predicted, so I can remove them from the dataset? I tried calculating utilities and probabilities using the estimates. That way, I can identify badly predicted cases at the observation level and remove the individuals with the most mismatches. But I wonder if there is a better way.

Thank you in advance!

Best,
Xian

Re: prediction at individual level

Posted: 26 Jul 2020, 09:41
by stephanehess
Hi Xian

the function apollo_prediction returns predictions at the level of individual observations, so it should be exactly what you want. The list returned by apollo_prediction contains a column called chosen.

You can also, after estimation, make the following call, which will give you the likelihoods at the individual level (rather than the observation level):

Lind=apollo_probabilities(model$estimate, apollo_inputs, functionality="estimate")
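Once you have that vector, ranking individuals by fit is a one-liner. A minimal sketch in base R (the toy Lind vector and the IDs below are made up, standing in for the actual output, which is one likelihood per individual with individual IDs as names):

```r
# Toy stand-in for the output of the call above: one likelihood
# per individual, with individual IDs as names
Lind <- c("101" = 0.42, "102" = 0.05, "103" = 0.31, "104" = 0.18)

# Work in logs and sort so the worst-fitting individuals come first
LLind <- sort(log(Lind))
head(LLind, 2)  # the two individuals the model explains least well
```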

Best wishes

Stephane

Re: prediction at individual level

Posted: 27 Jul 2020, 02:32
by zx9203
Dear Stephane,

Thank you so much! This works perfectly!

Best,
Xian

Re: prediction at individual level

Posted: 10 Aug 2020, 12:54
by cybey
Hi Xian,

maybe this post by Bryan Orme is also interesting for you?

Poor internal or external validity is not necessarily an indicator of poor response behaviour, as respondents may have little interest in the topic or may have been overwhelmed by the tasks. Consequently, they could simply choose anything, with the result that your model does not give a good prediction.

Best,
Nico

Re: prediction at individual level

Posted: 12 Aug 2020, 20:55
by stephanehess
Nico

thanks for bringing this post to our attention. However, I (and many fellow choice modellers, I believe) would fundamentally disagree with the suggestion to "clean from 15% to 30% of "bad" respondents from stated discrete choice".

Outliers are often the most useful respondents in a dataset, as they tell you there are people whose choices the model struggles to explain. This is often not the fault of the respondent but of the model. So outliers are a great opportunity for improving a model.

There is an excellent discussion of this topic in the Ben-Akiva and Lerman book.

Stephane

Re: prediction at individual level

Posted: 13 Aug 2020, 09:36
by cybey
Hi, Stephane,

I also find this 15-30 percent very high. In principle, however, I like the idea of combining several indicators of (potentially) poor response behaviour. One could set the speeding indicator very conservatively, e.g. faster than 33% of the median time, and combine these candidate respondents with another indicator, e.g. RLH. However - and this is probably your point - the fact that respondents may simply find the choice experiment uninteresting (e.g. a product they have no interest in) suggests that the answers of these respondents are still valid. On the other hand, in two data sets I found that half of the respondents identified in this way were also conspicuous on other indicators. For example, these respondents showed no variance in their responses to multi-item scale questions (Likert scale 1-7). The respondents identified in this way accounted for only 5% of the first data set and <10% of the second.

Best wishes
Nico

Re: prediction at individual level

Posted: 13 Aug 2020, 09:50
by stephanehess
Nico

time to completion is another tricky point. I know some analysts routinely remove "fast" respondents from the data. But it's much better to let the data speak and to understand the differences in behaviour across people, e.g. by including response time as an indicator (not as an explanatory variable) of response quality. Maybe the people who respond more quickly find the experiments easy but still make meaningful choices. Maybe the respondents who take longer are not really concentrating harder on your choice tasks but are watching TV at the same time, etc.

Stephane

Re: prediction at individual level

Posted: 13 Aug 2020, 09:58
by stephanehess
Xian

apollo_prediction returns probabilities at the observation level, not just the aggregate level. There is also a final column called chosen, which you can use for your purpose. Or you can use apollo_llFitsTest. Or my earlier suggestion to use Lind=apollo_probabilities(model$estimate, apollo_inputs, functionality="estimate")
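As an illustration of using the chosen column, here is a minimal sketch in base R. The pred data frame is a toy stand-in for the apollo_prediction output of a two-alternative choice model (the real output has one row per observation, with per-alternative probabilities and the probability of the chosen alternative):

```r
# Toy stand-in for apollo_prediction output: one row per observation,
# per-alternative probabilities plus the probability of the chosen alternative
pred <- data.frame(ID     = c(101, 101, 102, 102),
                   alt1   = c(0.60, 0.20, 0.55, 0.45),
                   alt2   = c(0.40, 0.80, 0.45, 0.55),
                   chosen = c(0.60, 0.20, 0.45, 0.55))

# Observations predicted worse than random (chosen prob below 1/J, here J = 2)
subset(pred, chosen < 1/2)
```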

But either way, as in my reply to Nico, it's not a good idea to "find out individuals who are badly predicted, so I can remove them from the dataset". Instead, use them to improve your model.

Stephane

Re: prediction at individual level

Posted: 25 Aug 2020, 03:09
by zx9203
Dear Stephane and Nico,

Thank you for bringing in the discussion about removing "bad" respondents. My concern is that I would like to compare likelihoods between different subsamples, but they originally contained different numbers of respondents. That's why I want to trim the outliers, to make the datasets balanced. Is there another way to compare likelihoods across different numbers of observations? And I'm confused about how to improve the model while keeping the outliers.

Thanks a lot!!!

Best,
Xian

Re: prediction at individual level

Posted: 28 Aug 2020, 11:56
by dpalma
Hi Xian,

If you have a different number of individuals in each sample, you can calculate the average likelihood per individual (or per observation) in each sample and compare those values. That way you control for the different number of individuals (or observations).
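For example, with hypothetical log-likelihoods and sample sizes (the numbers below are made up, just to show the normalisation):

```r
# Hypothetical final log-likelihoods and sample sizes for two subsamples
LL_A <- -850;  N_A <- 500
LL_B <- -1300; N_B <- 800

LL_A / N_A  # average log-likelihood per individual, subsample A: -1.7
LL_B / N_B  # subsample B: -1.625, i.e. the better fit per individual
```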

You can obtain the likelihoods at the observation level using apollo_prediction, and at the individual level by calling apollo_probabilities(model$estimate, apollo_inputs, functionality="estimate").

If you use apollo_prediction with a choice model, remember to use the probability reported in the "chosen" column of the output.

Best
David