
Non-overlapping choice sets

Posted: 25 Jul 2023, 21:15
by BTHopkins
Hi Stephanie,

I am just getting started on my journey with Apollo.

I am working with a dataset where individuals' choice sets depend on where they live and on the year. For people in different areas / years, the choice sets are non-overlapping. Some people can choose between a handful of products, and others between several hundred. That means there are tens of thousands of unique alternatives in the dataset. When I put the data in wide format, the dataset is much too large to use.

Since the choice sets are non-overlapping, I thought that I might be able to reuse choice IDs to reduce the size of the dataset. In other words, choice #1 for area / year A is not the same product as for area / year B. I don't see why this would cause an issue for calculating probabilities, but I wasn't sure if it would cause an issue in Apollo.

Does that make sense? Thanks for your help!

Re: Non-overlapping choice sets

Posted: 10 Aug 2023, 14:17
by dpalma
Hi,

Yes, you can do that without issues. That way of coding the alternatives is similar to how non-labelled data from a stated choice experiment is recorded.

For example, let us imagine you are modelling ice cream choice. Ice cream is described by both flavour and price. The flavours can be vanilla, chocolate, lemon and pineapple. However, not all flavours are available to every individual, because of where they live. You have two ways of coding this.

The first way is to code the data in labelled form. It would look as below, where cost_j and av_j are the cost and availability of alternative j. The problem with this approach is that if you have too many flavours (alternatives), you will have a lot of columns.


id cost_vani cost_choc cost_lemo cost_pine av_vani av_choc av_lemo av_pine
 1         6         9        NA        NA       1       1       0       0
 2         7         8         7        NA       1       1       1       0
 3         9         8        NA         7       1       1       0       1
…
The second approach is the "unlabelled" form, which would look as below. Here we have an additional attribute for each alternative, which is "flavour". So the alternative is not defined by its flavour; instead, each alternative is just a mute container, and the flavour becomes an attribute. Note that you will have to define as many alternatives as the maximum number of alternatives that any individual in your sample has available.


id flav_1 flav_2 flav_3 cost_1 cost_2 cost_3 av_1 av_2 av_3
 1   vani   choc     NA      6      9     NA    1    1    0
 2   vani   choc   lemo      7      8      7    1    1    1
 3   vani   choc   pine      9      8      7    1    1    1
 ...
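
For illustration, the conversion from the labelled to the unlabelled layout could be sketched as below. This is plain Python rather than Apollo code, and the function name `to_unlabelled` and the nested-dictionary input are assumptions made only for this sketch. It uses the three example rows shown above.

```python
# Labelled rows: one cost entry per flavour; None marks an unavailable flavour.
labelled = [
    {"id": 1, "cost": {"vani": 6, "choc": 9, "lemo": None, "pine": None}},
    {"id": 2, "cost": {"vani": 7, "choc": 8, "lemo": 7, "pine": None}},
    {"id": 3, "cost": {"vani": 9, "choc": 8, "lemo": None, "pine": 7}},
]


def to_unlabelled(rows):
    """Pack each person's available flavours into generic slots 1..n_slots."""
    # The number of slots is the size of the largest choice set in the sample.
    n_slots = max(
        sum(c is not None for c in r["cost"].values()) for r in rows
    )
    out = []
    for r in rows:
        # Keep only the flavours this person can actually choose.
        offered = [(f, c) for f, c in r["cost"].items() if c is not None]
        rec = {"id": r["id"]}
        for k in range(n_slots):
            flav, cost = offered[k] if k < len(offered) else (None, None)
            rec[f"flav_{k + 1}"] = flav      # flavour is now an attribute
            rec[f"cost_{k + 1}"] = cost
            rec[f"av_{k + 1}"] = int(flav is not None)  # 0 for unused slots
        out.append(rec)
    return out


unlabelled = to_unlabelled(labelled)
for rec in unlabelled:
    print(rec)
```

The number of columns now grows with the largest choice set rather than with the total number of distinct flavours.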
Best wishes
David

Re: Non-overlapping choice sets

Posted: 27 Aug 2023, 11:01
by cheriedavy
Yes, your approach of reusing choice IDs for non-overlapping choice sets based on area and year makes sense. This should help reduce the size of the dataset while still maintaining the distinction between different products in different contexts. As long as the choice IDs remain unique within their respective area and year combinations, it should not cause issues for calculating probabilities or using Apollo. It's a valid strategy to manage your dataset efficiently. Good luck with your work!

Re: Non-overlapping choice sets

Posted: 29 Aug 2023, 22:32
by BTHopkins
Thank you both for the responses! I didn't notice until now that you had replied.

Before you responded, I also tested a slimmed down version of the model I was running using the two ways of coding, and confirmed that they yield equivalent results.
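
A check of this kind can be sketched outside Apollo as follows. This is not the model that was actually run: the parameter values, the utility specification and the helper `mnl_prob` are hypothetical, and the data are person 3 from the ice cream example earlier in the thread, assumed here to choose pineapple.

```python
import math

# Hypothetical parameter values, for illustration only.
asc = {"vani": 0.0, "choc": 0.4, "lemo": -0.2, "pine": 0.1}
b_cost = -0.3


def mnl_prob(utilities, avail, chosen):
    """MNL probability of alternative `chosen` (0-based index),
    with the denominator summed over available alternatives only."""
    denom = sum(math.exp(v) for v, a in zip(utilities, avail) if a)
    return math.exp(utilities[chosen]) / denom


# Labelled coding: one alternative per flavour, fixed order.
flavours = ["vani", "choc", "lemo", "pine"]
cost_lab = [9.0, 8.0, 0.0, 7.0]   # 0.0 is a placeholder for the unavailable lemon
av_lab = [1, 1, 0, 1]
v_lab = [asc[f] + b_cost * c for f, c in zip(flavours, cost_lab)]
p_lab = mnl_prob(v_lab, av_lab, flavours.index("pine"))

# Unlabelled coding: generic slots, flavour stored as an attribute.
flav_unl = ["vani", "choc", "pine"]
cost_unl = [9.0, 8.0, 7.0]
av_unl = [1, 1, 1]
v_unl = [asc[f] + b_cost * c for f, c in zip(flav_unl, cost_unl)]
p_unl = mnl_prob(v_unl, av_unl, 2)  # pineapple sits in slot 3 for this person

print(p_lab, p_unl)
```

The chosen alternative has a different index in each coding, but the probability depends only on the utilities of the alternatives that are available, so the two values agree.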

Coding variables as an attribute seems to be useful for saving space in general. In my context, for example, I include a shifter for the company that sells an alternative. With J companies and K alternatives, that requires J x K dummy variables. But if I have a variable containing the name of the company instead, I can use the value of that variable in the utility function to refer to the correct shifter. That only requires K variables.
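
As a rough sketch of that idea in plain Python (not Apollo syntax), the company name stored with each alternative can be used to look up the matching shifter. The company names, parameter values and column names `comp_k` and `cost_k` are all hypothetical. The dummy-variable version is included only to show that both give the same utilities.

```python
# Hypothetical shifters (one parameter per company) and cost coefficient.
shifter = {"acme": 0.0, "bravo": 0.5, "coyote": -0.3}
b_cost = -0.2

# One observation with K = 3 alternatives; comp_k names the seller of alternative k.
row = {
    "comp_1": "bravo", "comp_2": "acme", "comp_3": "coyote",
    "cost_1": 4.0, "cost_2": 5.0, "cost_3": 3.0,
}


def utilities_by_lookup(row, n_alt):
    """Use the company name to pick the shifter: K name columns in the data."""
    return [
        shifter[row[f"comp_{k}"]] + b_cost * row[f"cost_{k}"]
        for k in range(1, n_alt + 1)
    ]


def utilities_by_dummies(row, n_alt):
    """Equivalent formulation with one dummy per company and alternative (J x K)."""
    out = []
    for k in range(1, n_alt + 1):
        dummies = {j: int(row[f"comp_{k}"] == j) for j in shifter}
        v = sum(shifter[j] * dummies[j] for j in shifter)
        out.append(v + b_cost * row[f"cost_{k}"])
    return out


v_lookup = utilities_by_lookup(row, 3)
v_dummy = utilities_by_dummies(row, 3)
print(v_lookup)
print(v_dummy)
```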