Hi Apollo Team,
I'm grateful for your support in managing this forum.
I've posted previously about my survey, but I am uploading this as I have a different question related to data processing.
I conducted a choice experiment-based survey regarding zero-emission truck choices, including battery electric trucks and hydrogen fuel cell electric trucks. A total of 54 freight companies participated. One of the important explanatory variables for specifying models is the annual revenue of the participating companies. In my survey questionnaire, this variable was presented with five options, and respondents were asked to select one: 1) <$10M, 2) $10-15M, 3) $15-30M, 4) >$30M, and 5) Decline to state (i.e., N/A). The issue is that 16% of respondents chose the "decline to state" option. How would you recommend handling these 'Not Available' data entries?
I have considered the following options, but would like to know which one might be recommended, what cautions should be exercised for each, or if there is a better way to handle this situation:
1. Treating the Annual Revenue variable as a categorical variable.
2. Assuming 'N/A' corresponds to an average annual revenue (e.g., $10-15M in my survey).
3. Excluding the observations with 'N/A' for the Annual Revenue variable (i.e., using 84% of the total observations).
4. Excluding the Annual Revenue variable from model specification.
Please let me know if I need to provide any additional information.
I'd greatly appreciate any guidance and insights you could share. Thank you very much!
Best regards,
YB
Handling 'Not Available' Data
Re: Handling 'Not Available' Data
Hi
In your case, I would recommend option 1. Even when I treat income as a continuous variable in my models, I do not exclude people with missing data, nor do I assign them to another category. Rather, I estimate a separate parameter for them. This is in line with your idea of treating the variable as categorical, where this happens anyway.
Stephane
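For readers following along, here is a minimal base R sketch of the "separate parameter for missing values" idea when the covariate is continuous. Everything in it is an assumption made for illustration: the revenue values, the variable names (rev_missing, rev_filled) and the parameter values are hypothetical, not taken from the survey or from any estimated model.

```r
# Hypothetical revenue values in $M; NA marks respondents who declined to state
revenue <- c(5, 12.5, NA, 22.5, 40, NA)

# Indicator for missing revenue, and a revenue column set to 0 where missing
rev_missing <- as.numeric(is.na(revenue))
rev_filled  <- ifelse(is.na(revenue), 0, revenue)

# Hypothetical parameter values, for illustration only (not estimates)
b_pcost        <- -0.30   # base purchase cost sensitivity
b_pcost_rev    <-  0.004  # shift per $M of reported revenue
b_pcost_rev_NA <-  0.05   # separate shift for respondents with missing revenue

# Respondent-specific purchase cost coefficient:
# reported revenue enters continuously, missing revenue gets its own shift
b_pcost_value <- b_pcost + b_pcost_rev * rev_filled + b_pcost_rev_NA * rev_missing

print(data.frame(revenue, rev_missing, b_pcost_value))
```

In an Apollo model, the two derived columns would be added to the database and the last expression would sit inside apollo_probabilities, with the three parameters estimated rather than fixed.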
Re: Handling 'Not Available' Data
Hi Stephane,
Thank you for sharing your insights. I believe I've understood your suggestion, and I've applied the approach you recommended. Specifically, I've incorporated this categorical variable into an interaction term with vehicle purchase costs. Could you please review what I've done and correct any errors?
The annual revenue (AR) variable in my dataset takes on one of the following values:
* annual_revenue == 1 (AR < $10M)
* annual_revenue == 2 (AR between $10M and $15M)
* annual_revenue == 3 (AR between $15M and $30M)
* annual_revenue == 4 (AR > $30M)
* annual_revenue == 5 (declined to state)
I've treated 'AR > $30M' as the reference category and applied shift terms, as shown below:
b_pcost_value = b_pcost + b_pcost_AR_less_than_10M*(annual_revenue==1) + b_pcost_AR_between_10M_15M*(annual_revenue==2) + b_pcost_AR_between_15M_30M*(annual_revenue==3) + b_pcost_AR_NA*(annual_revenue==5)
An example of the utility function for one alternative (battery electric vehicle) is shown below:
V[["bev"]] = asc_bev_value + b_pcost_value * bev_pcost + b_ocost_value * bev_ocost + b_range * bev_range + b_offsite_value * bev_offsite_binary + b_onsite_bev * bev_onsite
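As a quick check of how the shift terms behave, here is a minimal base R sketch. The revenue codes and parameter values are placeholders chosen for illustration, not survey data or estimates.

```r
# Hypothetical revenue codes for six respondents (1-4 = revenue bands, 5 = declined to state)
annual_revenue <- c(1, 2, 3, 4, 5, 1)

# Placeholder parameter values for illustration only (not estimates)
b_pcost                    <- -0.40
b_pcost_AR_less_than_10M   <- -0.10
b_pcost_AR_between_10M_15M <- -0.05
b_pcost_AR_between_15M_30M <- -0.02
b_pcost_AR_NA              <- -0.08

# Each comparison returns TRUE/FALSE, which R coerces to 1/0 in arithmetic,
# so exactly one shift term (or none, for category 4) is switched on per respondent
b_pcost_value <- b_pcost +
  b_pcost_AR_less_than_10M   * (annual_revenue == 1) +
  b_pcost_AR_between_10M_15M * (annual_revenue == 2) +
  b_pcost_AR_between_15M_30M * (annual_revenue == 3) +
  b_pcost_AR_NA              * (annual_revenue == 5)

print(data.frame(annual_revenue, b_pcost_value))
```

Respondents in category 4 receive b_pcost alone, which is what makes 'AR > $30M' the reference category; every other group receives b_pcost plus its own shift.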
With these settings, I've obtained the following estimation results. Only "b_pcost_AR_between_15M_30M" is significant at the 5% level.
                            Estimate  Std.err.  t-ratio(0)  Rob.std.err.  Rob.t-ratio(0)
b_pcost                        0.007     0.323       0.023         0.489           0.015
...
b_pcost_AR_less_than_10M      -0.268     0.363      -0.739         0.553          -0.485
b_pcost_AR_between_10M_15M    -0.123     0.483      -0.255         0.670          -0.184
b_pcost_AR_between_15M_30M    -1.868     0.792      -2.360         0.744          -2.511
b_pcost_AR_NA                 -0.673     0.478      -1.408         0.621          -1.083
I'm curious if there are alternative approaches for formulating the interaction term between this categorical variable and vehicle purchase costs. Also, I'm wondering if it's still meaningful to have obtained such a small number of significant estimates.
I'd greatly appreciate any suggestions or insights you could provide. Thank you very much!
Best regards,
YB
Re: Handling 'Not Available' Data
Hi
Your specification is correct, but your results are worrying: the effect in the reference group is positive, and the income effect is not monotonic.
Stephane
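Both concerns can be read off the estimates reported earlier in the thread. The following base R sketch adds each shift to b_pcost to get the implied purchase cost coefficient per revenue group. The numbers are copied from the posted output; the variable names and the simple monotonicity check are my own illustration, not an Apollo feature.

```r
# Estimates copied from the results table posted earlier in the thread
b_pcost <- 0.007
shifts <- c(
  "<$10M"   = -0.268,
  "$10-15M" = -0.123,
  "$15-30M" = -1.868,
  ">$30M"   =  0.000,  # reference category: no shift term
  "NA"      = -0.673
)

# Implied purchase cost coefficient for each group
implied <- b_pcost + shifts
print(round(implied, 3))

# Sign check: a purchase cost coefficient would normally be expected to be negative
print(implied > 0)

# Monotonicity check across the four ordered revenue bands (NA group excluded)
ordered_bands <- implied[c("<$10M", "$10-15M", "$15-30M", ">$30M")]
steps <- diff(ordered_bands)
is_monotonic <- all(steps >= 0) || all(steps <= 0)
print(is_monotonic)
```

The reference group (>$30M) is the only one with a positive implied coefficient, and the coefficients move up, then sharply down, then up again across the ordered bands, so the pattern is not monotonic in revenue.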
Re: Handling 'Not Available' Data
Hi Stephane,
Thank you for sharing your insights. In my dataset, there are 54 respondents, and here is the distribution of their annual revenue (AR):
AR        #respondents  fraction
<$10M               30       56%
$10-15M              6       11%
$15-30M              4        7%
>$30M                7       13%
NA                   7       13%
------------------------------------------
Total               54      100%
The majority fall into the <$10M category (56%). Each of the remaining categories holds only around 10% of respondents, and the number of respondents in each is small (fewer than 10). Might this have contributed to the positive effect in the reference category and the lack of monotonicity?
As an alternative approach, I simplified the specification as follows:
b_pcost_value = b_pcost*(AR_NA==0) + b_pcost_AR_NA*(AR_NA==1)
Using this, I obtained a negative value for the 'b_pcost' estimate, significant at the 10% level (t-ratio) or the 5% level (robust t-ratio):
                Estimate  Std.err.  t-ratio(0)  Rob.std.err.  Rob.t-ratio(0)
b_pcost           -0.332     0.198      -1.674         0.160          -2.073
b_pcost_AR_NA      0.002     0.484       0.004         0.307           0.006
Then, could this be a better approach? Any suggestions you have would be greatly appreciated.
Thank you!
YB
Re: Handling 'Not Available' Data
Hi
Given your small sample size, this is a good approach.
Stephane
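The significance levels quoted for b_pcost in the simplified model can be checked from the two reported t-ratios. This is a minimal base R sketch that assumes two-sided tests and the asymptotic normal approximation; the helper p_two_sided is a name introduced here for illustration.

```r
# t-ratios for b_pcost copied from the simplified model's posted output
t_classical <- -1.674
t_robust    <- -2.073

# Two-sided p-value under the asymptotic normal approximation (an assumption)
p_two_sided <- function(t) 2 * pnorm(-abs(t))

p_classical <- p_two_sided(t_classical)
p_robust    <- p_two_sided(t_robust)

print(round(c(classical = p_classical, robust = p_robust), 3))
```

Under these assumptions the classical p-value falls between 0.05 and 0.10 and the robust one falls below 0.05, which matches the levels quoted in the post. With only 54 respondents the normal approximation is itself rough, so these values are best read as indicative.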
Re: Handling 'Not Available' Data
Thank you, Stephane,
I greatly appreciate you sharing your insight on the issues in this post and another one as well. I will get back to you if I have further questions. Once again, thank you for organizing this forum, which is very helpful for making effective progress on my research!
YB