First of all, thanks for providing this very useful package and especially the detailed documentation and examples. I found the "step-by-step" approach extremely useful and the warning and error messages helped me to eliminate bugs (in my data) easily.
Overall goal
I am looking for help regarding the following problem. My final goal is to estimate a mixed logit model with around 10-15 parameters on a dataset with 6861 rows (individuals). The data is cross-sectional, i.e. I don't observe repeated choices per individual. Originally, me and my team had the goal to model each individual's choice set consisting of around 1,000 alternatives (in reality, these are granular locations). We are also hoping to estimate this model many times, in order to iteratively vary a fixed parameter of the model (not yet included).
Current Status and Problem
To test the package, I decided to start with 50 alternatives per individual. As a start, I wanted to estimate a MNL model to retrieve estimates I can later use as starting values. However, the (pre-)estimation of the MNL appears to be very demanding, and I am not exactly sure why and how it could be improved. It appears to me that the "only" non-standard aspect of my setting is the fact that each individual has a larger number of alternatives (here 50, later hopefully more) and that these choice sets of alternatives are mostly not overlapping (for the 6861 individuals with 50 choices each I arrive at 30521 distinct alternatives).
So far, the MNL estimation takes
on 1 core:
Time taken (hh:mm:ss) : 21:24:44.54
pre-estimation : 20:26:0.9
estimation : 00:27:11.54
post-estimation : 00:31:32.1
and on 2 cores:
Time taken (hh:mm:ss) : 135:01:0.02
pre-estimation : 109:39:17.24
estimation : 00:21:44.85
post-estimation : 24:59:57.93
So as you can see, the vast majority of time is pre and post estimation. So questions I have are:
- what can I do to reduce this pre and post estimation time?
- are there any recommended options, especially when proceed towards more alternatives and the mixed logit estimation, that I could benefit from?
I am aware of the speed test function of the package, which in principle is great but does not really help me to understand what the underlying issue is.
Any help would be greatly appreciated.
Best wishes,
Felix
Please find the code below. I add the output after it while omitting some parts of the "Overview of choices for MNL model component". I can send the full .txt file in a message at any time. My code is:
Code: Select all
### Clear memory
rm(list = ls())
### Set working directory
### Packages
library(tidyverse)
library(apollo)
library(data.table)
library(purrr)
# Load Data
load("data_output/choicedata_apollo_cutoff50.RDS")
# Model options -----
NCores = 1
name = paste0("MNL_realdata_3fits_cutoff50_cores",NCores)
description = "MNL model only 3 fit variables on real data, cutoff=50"
# Specify either search = 1 (in that case initial_betas are irrelevant)
# or search = 0 and initial_betas based on some stored values or guesses.
initial_betas = c(mu_partisan_fit = -1,
mu_income_fit = 1,
mu_race_fit = 1)
# Create directory for output
output_directory = paste0("output/", name)
dir.create(output_directory)
# Estimate ----
# Apollo Choice Modeling: MNL Model, Test Data
### Initialise code ----
apollo_initialise()
# Get number of cores of computer
# nCores = parallel::detectCores()
### Set core controls
apollo_control = list(
modelName = name,
modelDescr = description,
indivID = "caseid",
outputDirectory = output_directory,
calculateLLC=FALSE,
workInLogs=TRUE,
nCores = NCores #use few cores for the simple estimations, but use multithreading to catch potential errors.
)
# Define Model & Parameters ----
### Vector of parameters, including any that are kept fixed in estimation
apollo_beta = initial_betas
### Vector with names (in quotes) of parameters to be kept fixed at their starting value in apollo_beta, use apollo_beta_fixed = c() if none
apollo_fixed = c()
# Group and validaty inputs ----
apollo_inputs = apollo_validateInputs(database=database_wide)
apollo_inputs = c(apollo_inputs, J=J)
# note: J is the number of distinct alternatives in the dataset. It was created via the following code (on a dataset in long format)
# unique_alts <- unique(database_long$alt_number)
# J <- length(unique_alts)
# Define Model & Likelihood Function ----
apollo_probabilities=function(apollo_beta, apollo_inputs, functionality="estimate"){
### Attach inputs and detach after function exit
apollo_attach(apollo_beta, apollo_inputs)
on.exit(apollo_detach(apollo_beta, apollo_inputs))
### Create list of probabilities P
P = list()
### List of utilities: will be included in the ml_settings below.
V = list()
for(j in 1:apollo_inputs$J) V[[paste0("alt",j)]] = 0
for(j in 1:apollo_inputs$J) V[[paste0("alt",j)]] = V[[paste0("alt",j)]] + mu_partisan_fit * get(paste0("partisan_fit_",j)) + mu_income_fit * get(paste0("income_fit_",j)) + mu_race_fit * get(paste0("race_fit_",j))
### Define settings for MNL model component
mnl_settings = list(
alternatives = setNames(1:apollo_inputs$J, names(V)),
avail = setNames(apollo_inputs$database[,paste0("av",1:apollo_inputs$J)], names(V)),
choiceVar = choice_alt_number,
utilities = V
)
### Compute probabilities for MNL model component
P[["model"]] = apollo_mnl(mnl_settings, functionality)
### Prepare and return outputs of function
P = apollo_prepareProb(P, apollo_inputs, functionality)
return(P)
}
## Estimate model
assign(paste0("model_",name), apollo_estimate(apollo_beta, apollo_fixed, apollo_probabilities, apollo_inputs))
# Model Outputs ----
# FORMATTED OUTPUT (TO FILE, using model name)
apollo_saveOutput(get(paste0("model_",name)))
# FORMATTED OUTPUT (TO SCREEN)
apollo_modelOutput(get(paste0("model_",name)))
Model run by user using Apollo 0.3.4 on R 4.4.1 for Linux.
Please acknowledge the use of Apollo by citing Hess & Palma (2019)
DOI 10.1016/j.jocm.2019.100170
www.ApolloChoiceModelling.com
Model name : MNL_realdata_3fits_cutoff50_cores2
Model description : MNL model only 3 fit variables on real data, cutoff=50
Model run at : 2024-11-30 17:37:15.901963
Estimation method : bgw
Model diagnosis : Relative function convergence
Optimisation diagnosis : Maximum found
hessian properties : Negative definite
maximum eigenvalue : -167.544817
reciprocal of condition number : 0.0879228
Number of individuals : 6861
Number of rows in database : 6861
Number of modelled outcomes : 6861
Number of cores used : 2
Model without mixing
LL(start) : -33616.88
LL at equal shares, LL(0) : -26976.26
LL at observed shares, LL(C) : Not applicable
LL(final) : -25637.73
Rho-squared vs equal shares : 0.0496
Adj.Rho-squared vs equal shares : 0.0495
Rho-squared vs observed shares : Not applicable
Adj.Rho-squared vs observed shares : Not applicable
AIC : 51281.47
BIC : 51301.97
Estimated parameters : 3
Time taken (hh:mm:ss) : 135:01:0.02
pre-estimation : 109:39:17.24
estimation : 00:21:44.85
post-estimation : 24:59:57.93
Iterations : 15
Unconstrained optimisation.
Estimates:
Estimate s.e. t.rat.(0) Rob.s.e. Rob.t.rat.(0)
mu_partisan_fit -1.8452 0.07681 -24.02 0.08432 -21.88
mu_income_fit -0.9386 0.02312 -40.59 0.02590 -36.23
mu_race_fit -0.7618 0.05482 -13.90 0.06120 -12.45
WARNING: Some alternatives are never chosen in your data!
WARNING: Some alternatives are always chosen when available!
Overview of choices for MNL model component :
alt1 alt2 alt3 alt4 alt5 alt6 alt7 alt8 alt9
Times available 13 9 16 9 20 15 18 15.00 13
Times chosen 0 0 0 0 0 0 0 1.00 0
Percentage chosen overall 0 0 0 0 0 0 0 0.01 0
alt10 alt11 alt12 alt13 alt14 alt15 alt16
Times available 8 16 12 10 15 10 10
Times chosen 0 0 0 0 0 0 0
Percentage chosen overall 0 0 0 0 0 0 0
alt17 alt18 alt19 alt20 alt21 alt22 alt23
Times available 7.00 8 22 18 18.00 20 16
Times chosen 1.00 0 0 0 5.00 0 0
Percentage chosen overall 0.01 0 0 0 0.07 0 0
alt24 alt25 alt26 alt27 alt28 alt29 alt30
Times available 14 11 16 12 17 11 17
Times chosen 0 0 0 0 0 0 0
Percentage chosen overall 0 0 0 0 0 0 0
...
alt30511 alt30512 alt30513 alt30514 alt30515
Times available 2 1 5 7 1
Times chosen 0 0 0 0 0
Percentage chosen overall 0 0 0 0 0
alt30516 alt30517 alt30518 alt30519 alt30520
Times available 1 1 1 1e+00 1
Times chosen 0 0 0 1e+00 0
Percentage chosen overall 0 0 0 1e-02 0
alt30521
Times available 1
Times chosen 0
Percentage chosen overall 0
[ reached getOption("max.print") -- omitted 1 row ]
Classical covariance matrix:
mu_partisan_fit mu_income_fit mu_race_fit
mu_partisan_fit 0.005900 3.977e-05 -4.4966e-04
mu_income_fit 3.977e-05 5.3470e-04 1.5011e-04
mu_race_fit -4.4966e-04 1.5011e-04 0.003005
Robust covariance matrix:
mu_partisan_fit mu_income_fit mu_race_fit
mu_partisan_fit 0.007109 -2.019e-05 -0.001722
mu_income_fit -2.019e-05 6.7107e-04 3.5045e-04
mu_race_fit -0.001722 3.5045e-04 0.003746
Classical correlation matrix:
mu_partisan_fit mu_income_fit mu_race_fit
mu_partisan_fit 1.00000 0.02239 -0.1068
mu_income_fit 0.02239 1.00000 0.1184
mu_race_fit -0.10678 0.11842 1.0000
Robust correlation matrix:
mu_partisan_fit mu_income_fit mu_race_fit
mu_partisan_fit 1.000000 -0.009243 -0.3337
mu_income_fit -0.009243 1.000000 0.2210
mu_race_fit -0.333712 0.221034 1.0000
20 most extreme outliers in terms of lowest average per choice prediction:
row Avg prob per choice
23516 1.454843e-05
1708 6.548501e-05
21493 8.255964e-05
38269 9.717677e-05
16676 1.944270e-04
45681 3.019932e-04
44312 3.308913e-04
32562 4.128866e-04
32759 6.054975e-04
28952 6.447211e-04
4592 6.638346e-04
33270 7.613318e-04
23741 8.128866e-04
38883 9.383360e-04
41965 1.042226e-03
3827 1.080671e-03
21300 1.210480e-03
34844 1.434141e-03
36240 1.565656e-03
21989 1.577433e-03
Changes in parameter estimates from starting values:
Initial Estimate Difference
mu_partisan_fit -1.000 -1.8452 -0.8452
mu_income_fit 1.000 -0.9386 -1.9386
mu_race_fit 1.000 -0.7618 -1.7618
Settings and functions used in model definition:
apollo_control
--------------
Value
modelName "MNL_realdata_3fits_cutoff50_cores2"
modelDescr "MNL model only 3 fit variables on real data, cutoff=50"
indivID "caseid"
outputDirectory "output/MNL_realdata_3fits_cutoff50_cores2/"
calculateLLC "FALSE"
workInLogs "FALSE"
nCores "2"
debug "FALSE"
seed "13"
mixing "FALSE"
HB "FALSE"
noValidation "FALSE"
noDiagnostics "FALSE"
analyticHessian "FALSE"
memorySaver "FALSE"
panelData "FALSE"
analyticGrad "TRUE"
analyticGrad_manualSet "FALSE"
overridePanel "FALSE"
preventOverridePanel "FALSE"
noModification "FALSE"
Hessian routines attempted
--------------------------
numerical second derivative of LL (using numDeriv)
Scaling used in computing Hessian
---------------------------------
Value
mu_partisan_fit 1.8452117
mu_income_fit 0.9386108
mu_race_fit 0.7617979
apollo_probabilities
----------------------
function (apollo_beta, apollo_inputs, functionality = "estimate")
{
apollo_attach(apollo_beta, apollo_inputs)
on.exit(apollo_detach(apollo_beta, apollo_inputs))
P = list()
V = list()
for (j in 1:apollo_inputs$J) V[[paste0("alt", j)]] = 0
for (j in 1:apollo_inputs$J) V[[paste0("alt", j)]] = V[[paste0("alt",
j)]] + mu_partisan_fit * get(paste0("partisan_fit_",
j)) + mu_income_fit * get(paste0("income_fit_", j)) +
mu_race_fit * get(paste0("race_fit_", j))
mnl_settings = list(alternatives = setNames(1:apollo_inputs$J,
names(V)), avail = setNames(apollo_inputs$database[,
paste0("av", 1:apollo_inputs$J)], names(V)), choiceVar = choice_alt_number,
utilities = V)
P[["model"]] = apollo_mnl(mnl_settings, functionality)
P = apollo_prepareProb(P, apollo_inputs, functionality)
return(P)
}