Investigating Attitudes Towards Vegan Ice Cream

Author

Anna Ly

Published

December 1, 2024

1. Abstract

There are negative attitudes towards vegan food, particularly products designed to imitate traditional animal-based foods like dairy or meat. However, prior research indicates that opinions of vegan foods tend to improve after direct experience. While much of the existing research on vegan diets focuses on health impacts, efforts to model the predictors of stigma against veganism have been limited, often relying on basic statistical approaches. This analysis uses existing survey data to examine demographic factors that are associated with attitudes towards vegan ice cream. Two modeling approaches were applied: linear regression with elastic net regularization and Bayesian additive regression trees (BART). Both models performed similarly, revealing that attitudes toward meat consumption, racial identity, and political orientation were the strongest predictors of perceptions of vegan ice cream. Shifting attitudes towards vegan food is therefore challenging due to political factors and cultural sensitivities.

2. Introduction

In a simple study where participants rate popular desserts like ice cream or cake on a scale from 1 to 10, high ratings would typically be expected. However, adding the label “vegan” to the dessert name may elicit more negative perceptions from participants. Sekhon et al. [9] found that although people often perceive vegan food as less tasty, their opinions improve after actually trying it.

This analysis focuses on the Vegan Ice Cream dataset (https://osf.io/mazve/) collected by UCLA social psychologist Daniel L. Rosenfeld [8]. Initially, 500 U.S. adult participants were recruited using Amazon Mechanical Turk (MTurk), a crowdsourcing marketplace that allows businesses and researchers to outsource tasks virtually, such as survey participation.

Participants who failed an attention check, vegan participants, and those with unique responses (e.g., identifying as non-binary or indigenous North American, which were too infrequent for analysis) were excluded. The uploaded dataset includes 275 participants, but one additional entry had to be removed for analysis due to missing data.

Rosenfeld’s primary analysis involved t-tests comparing vegan ice cream against regular ice cream ratings, along with basic inferential linear regression. The objective here is to develop predictive models using statistical learning techniques, incorporating a training/test split and penalization methods outlined in the next section.

Understanding the stigma surrounding vegan food has practical benefits. Reducing meat and dairy has environmental benefits, and vegan diets are known to have the lowest carbon emissions [3]. The results may help vegan companies and advertisers improve marketing strategies and reduce global meat and dairy consumption.

3. Methods

Two groups are compared by creating and evaluating two models: participants who rated regular ice cream and those who rated vegan ice cream. While the primary focus is on the latter, factors influencing regular ice cream ratings are also analyzed to assess variable importance between the two models. The response variable is each participant’s average rating of ice cream attributes.

The problem is approached using the following methods:

Linear Regression with Elastic Net Regularization (ENET).
Bayesian Additive Regression Trees (BART) model.

Leave-one-out cross validation (LOOCV) is also used due to the relatively small sample size. 80% of the data is used for training, with the remaining 20% held out for comparing the two models.

The resulting split gives 104 data points in the training set for the vegan model and 27 in the test set. For the regular ice cream response, there are 114 in the training set and 29 in the test set.

3.1 Linear Regression with ENET

Linear regression estimates a linear relationship between a numeric response variable (typically written as $\mathbf{y}$) and explanatory variable(s) $\mathbf{X}$. The classical form is: \[y = X \beta + \epsilon\] where $\epsilon \sim \text{Normal}(0, \sigma^{2})$ accounts for the error between the expected and observed response, and $\beta$ is typically estimated by minimizing the residual sum of squares: \[\hat{\beta} = \text{min} \left( \sum_{i=1}^{n} \left( y_{i} - f(x_{i}) \right)^{2} \right)\] where $f(x_{i})$ denotes the predicted values of $y_{i}$.

There are two main issues with classical linear regression:

Over-fitting: there is a tendency to favor models with more variables.
All predictors are retained, including those that may be irrelevant for analysis.

Arbitrarily removing variables is not effective, so statistical methods for feature selection are preferred. LASSO (Least Absolute Shrinkage and Selection Operator) introduces a penalty term to the $\beta$ coefficients:

\[\hat{\beta}^{LASSO} = \text{min} \left\{ \sum_{i=1}^{n} \left( y_{i} - f(x_{i}) \right)^{2} + \lambda \sum_{k=1}^{p} |b_{k+1}| \right\}\]

where $\lambda$ is a penalty term that increases with model complexity to prevent overfitting. As $\lambda$ increases, some coefficients are set to zero, effectively performing model selection. The term $\sum_{k=1}^{p} |b_{k+1}| = ||\beta||_{1}$ is the $\ell_{1}$ norm.

Ridge regression squares the penalty terms instead, which shrinks coefficients toward zero without setting them exactly to zero:

\[\hat{\beta}^{RIDGE} = \text{min} \left\{ \sum_{i=1}^{n} \left( y_{i} - f(x_{i}) \right)^{2} + \lambda \sum_{k=1}^{p} b_{k+1}^{2} \right\}\]

Elastic Net Regularization is a mix between LASSO and ridge regression:

\[\hat{\beta}^{ENET} = \text{min} \left\{ \sum_{i=1}^{n} \left( y_{i} - f(x_{i}) \right)^{2} + \lambda (\alpha ||\beta||_{1} + (1 - \alpha)||\beta||_{2}) \right\}\]

where $\alpha$ denotes the amount of weight given to LASSO vs. ridge. A smaller value of $\alpha$ indicates a model that is mostly ridge, while a larger value indicates mostly LASSO. Both $\alpha$ and $\lambda$ are found using LOOCV.

3.2 BART Model

Understanding BART requires knowledge of boosting trees, likelihood estimation, empirical Bayesian methods, and algorithms like Metropolis-Hastings. The key ideas are summarized here.

In Bayesian statistics, the prior represents beliefs about some quantity before observing data, and the posterior distribution represents updated beliefs after observing data.

BART is similar to boosting trees, but incorporates a regularization prior to prevent overfitting. The prior is the structure of the regression trees before any modifications; these trees then get modified into posteriors. According to the original authors [4], the model is robust to small changes in the prior, and the default parameters are suitable for most uses.

Regression trees partition the predictor space into non-overlapping regions, making predictions based on observations within each region. Each split is represented by a node, where a decision is made to divide data based on a specific predictor. The terminal nodes (leaves) represent the final regions, where a single prediction is assigned. Splitting rules are determined by minimizing the residual sum of squares at each step.

Let $m$ denote the number of regression trees and $B$ the number of iterations for the BART algorithm. Initial trees generated early in the algorithm are typically not suitable, so $L$ burn-in samples are excluded from the final calculation.

The prior hyper-parameters determine how many nodes the initial trees have. The probability that the node at depth $d$ is non-terminal is $\alpha (1 + d)^{-\beta}$, with default $\alpha = 0.95, \beta = 2$. Another key feature is the use of empirical Bayes, where the prior structure is adjusted based on the data itself. After choosing prior parameters, initial trees are generated and then perturbated via the Metropolis-Hastings algorithm.

Let $\mathcal{T}^{b}_{k}(x)$ represent the prediction at $x$ for the $k$-th tree at the $b$-th iteration. Then:

Step 1: Initialize all trees to the average response: \[\mathcal{T}_{1}^{1}(X) = \ldots = \mathcal{T}_{m}^{1}(X) = \frac{1}{m} \overline{y}\]

Step 2: For $b = 2, \ldots, B$, for each tree $k$:

Compute the partial residual: \[r_{i} = y_{i} - \sum_{i=1}^{k-1} \mathcal{T}_{i}^{b}(x_{i}) - \sum_{i=k+1}^{m} \mathcal{T}_{i}^{b-1}(x_{i})\]
Randomly perturbate tree $k$ (pruning, adding nodes, or changing terminal node values), then use Metropolis-Hastings to accept or reject the change.

Step 3: Compute the mean after removing burn-in samples: \[\mathcal{T}(X) = \frac{1}{B - L} \sum_{i=L+1}^{B} \mathcal{T}^{i}(X)\]

This model is not limited by linearity assumptions, but sacrifices interpretability for flexibility and is computationally expensive. The bartMachine R package is used here; for more details, see the package vignette [6].

3.3 Methods for Comparisons

Models are compared using the mean squared error (MSE) on the test set: \[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}\]

Variable importance is also compared. For ENET, this is assessed by standardizing coefficients. For BART, inclusion proportions are used.

4. Results

4.1 Description of Variables

All variables included in the analysis are:

Quantitative: Likert-scale responses for 8 statements on attitudes toward eating meat (1 = strongly disagree, 7 = strongly agree), political stance (1 = liberal, 7 = conservative), and 4 statements on ice cream attributes (1 = very bad, 7 = very delicious). Self-reported age.
Categorical: Gender (binary), race/ethnicity (White, Black, Hispanic, Asian, Native American, Mixed, or Other), education, household income.

The primary objective is to understand what factors are associated with vegan ice cream ratings. In the original survey, participants rated both vegan and regular ice cream on four traits: taste, flavor, consistency, and texture. Although a multivariate response model was initially considered, the ratings across these traits were highly consistent within each participant (details in the next subsection). The analysis therefore simplifies by averaging across the four traits to create a single response variable.

For the categorical variables, one-hot encoding was applied, as this approach suits models that work best with numerical explanatory variables. Education and household income are ordinal but were not treated as numeric since the gaps between categories are not uniform.

For instance, education levels included (1) less than a high school diploma, (2) a high school diploma, (3) some college with no degree, and (4) an associate degree. It is unclear whether the difference between levels 1 and 2 is equivalent to the difference between levels 3 and 4.

The 8 survey statements on attitudes toward eating meat are listed below. Rosenfeld labeled them Carnism 1 to Carnism 8 [8]; these names are retained here for consistency.

Variable Name	Statement
Carnism 1	Humans should continue to eat meat because we’ve been doing it for thousands of years.
Carnism 2	Eating meat is better for my health.
Carnism 3	I’ve been eating meat my whole life, I could never give it up.
Carnism 4	The production of meat causes animals to suffer.
Carnism 5	Animals are dirty and deserve to be eaten.
Carnism 6	Not eating animals is a sign of weakness.
Carnism 7	I have the right to kill any animal I want.
Carnism 8	Animals aren’t intelligent enough to suffer in intensive confinement.

Summary tables for demographic variables:

Demographic Summary
Statistic	Conservatism	Age
Mean (SD)	3.78 (1.82)	39.76 (13.16)
Median (IQR)	4.00 (3.00)	37.00 (18.00)

Political Stance of Survey Participants. 1: Very Liberal, 7: Very Conservative
Political Stance	Count	Percentage (%)
1	31	11.31
2	52	18.98
3	46	16.79
4	46	16.79
5	36	13.14
6	43	15.69
7	20	7.30

Ethnicity of Survey Participants.
Ethnicity	Count	Percentage (%)
White / Caucasian	204	74.45
Black / African American	25	9.12
Hispanic or Latinx	9	3.28
Asian or Pacific Islander	30	10.95
Native American	5	1.82
Other	1	0.36

Education Level of Survey Participants.
Education	Count	Percentage (%)
Less than high school diploma	1	0.36
High school diploma or equivalent (e.g., GED)	31	11.31
Some college, but no degree	49	17.88
Associate degree	37	13.50
Bachelor’s degree	120	43.80
Graduate or professional degree	36	13.14

Income Level of Survey Participants.
Income	Count	Percentage (%)
$25,000 or less	46	16.79
$25,001 - $50,000	80	29.20
$50,001 - $80,000	80	29.20
$80,001 - $120,000	39	14.23
$120,001 - $200,000	20	7.30
$200,001 or more	9	3.28

4.2 Exploratory Data Analysis

Most of the data was previously transformed by the data collector. The attributes participants rated (texture, flavour, etc.) were highly consistent within each participant for both ice cream versions. The Cronbach’s alpha was $0.944$ for dairy ice cream and $0.955$ for vegan ice cream, both indicating excellent internal consistency. Averaging the four attribute ratings to form a single response variable is therefore reasonable.

Boxplots showing the attributes ratings for regular ice cream (left) and vegan ice cream (right).

Regular Ice Cream Evaluation
Statistic	Taste	Flavour	Texture	Consistency
Mean (SD)	6.08 (1.03)	6.09 (1.01)	5.82 (1.20)	5.95 (1.09)
Median (IQR)	6.00 (1.50)	6.00 (1.00)	6.00 (2.00)	6.00 (2.00)

Vegan Ice Cream Evaluation
Statistic	Taste	Flavour	Texture	Consistency
Mean (SD)	3.51 (1.53)	3.44 (1.51)	3.19 (1.51)	3.21 (1.53)
Median (IQR)	4.00 (2.00)	4.00 (2.00)	3.00 (2.00)	3.00 (2.00)

The survey contains 8 Likert questions on participant attitudes toward eating meat and animal welfare. For 7 of these questions, a higher value indicates a stronger belief in eating animals; Statement 4 was reversed for analysis.

Since multicollinearity is not a concern for elastic net or BART, the initial responses to each statement are retained rather than averaged. However, for a simple correlation analysis, the average of these attitudes is used (referred to as “average carnism attitude”).

There was a slight positive correlation between average carnism attitude and political stance ($\rho = 0.400$), meaning that more conservative participants tended to have higher carnism attitudes. Furthermore, there was a positive correlation between average carnism attitude and perceptions of regular ice cream ($\rho = 0.349$), but a negative correlation with perceptions of vegan ice cream ($\rho = -0.360$). Participants with higher carnism attitudes tended to rate regular ice cream higher and vegan ice cream lower.

4.3 Results for Linear Regression with ENET

For the vegan ice cream model, LOOCV recommended $\alpha = 0$, indicating pure Ridge Regression, with penalty $\lambda = 2.85$. The test set MSE was 0.91.

To standardize the $\beta$ coefficients, Gelman’s approach was followed: scaling predictors by twice the standard deviation of the corresponding explanatory variable [5]. The bar plot below shows scaled coefficients ordered by magnitude to indicate variable importance:

Variable importance based on standardized coefficients for vegan ice cream ratings (left) and regular ice cream ratings (right).

“Other” refers to individuals who identify as a race not listed in the survey’s dropdown options. Race is the strongest predictor of attitudes toward vegan ice cream. Specifically, individuals identifying as “Hispanic/Latinx” or “Other” exhibited more negative attitudes, while those identifying as “Black/African American” and high school students had a more positive impression.

For the regular ice cream model, LOOCV suggested $\alpha = 0.1$ (mostly ridge, with a small LASSO component), and $\lambda = 0.72$. The test set MSE was 1.05. With the partial LASSO component, some coefficients were reduced to zero. Interestingly, the results differed significantly from the vegan model: income and education appear connected to opinions about regular ice cream. Individuals earning over $120,000 per year expressed a highly positive perception of dairy ice cream, while those with a graduate school education had a more negative impression.

Initially, variable importance was assessed by standardizing coefficients by the ratio of the predictor standard deviation to the response standard deviation [2]. However, this approach is controversial for one-hot encoded variables. The variable importance from this method was almost identical to those from the BART model, discussed next.

4.4 Results for BART Model

The BART model used the default hyper-parameters from the bartMachine library: $\alpha = 0.95, \beta = 2, k = 2$, with $m = 50$ trees, $L = 250$ burn-in samples, and $B = 1250$ total iterations.

Variable importance is assessed using the inclusion proportion: the fraction of times a given predictor is selected as a splitting rule across all splits in the sum-of-trees model. The lollipop chart below shows variable importance for both models:

Variable importance based on inclusion proportions for vegan ice cream ratings (left) and regular ice cream ratings (right).

For the vegan ice cream model, attitudes toward eating meat (Carnism 1–8) and political stance are among the top predictors. Race is also prominent. In contrast, education and income have relatively little influence on vegan ice cream perceptions.

Some variables were not selected in any splitting decisions because they were absent from the corresponding training sets. There are relatively few respondents who identified as mixed race or as a race outside the provided options, and most participants had at least a high school education. Counts for these groups were shown in Section 4.1.

The test set MSE for the vegan ice cream model was 0.91 and for regular ice cream it was 1.11, closely matching the ENET results.

5. Conclusion

ENET produced an MSE of 0.91 for vegan ice cream and 1.05 for regular ice cream. BART produced an identical MSE of 0.91 for vegan ice cream but a slightly higher MSE of 1.11 for regular ice cream. The two models perform comparably, but BART is significantly more computationally expensive. The benefits of BART do not outweigh its computational cost here, and simpler ENET models are preferred.

In terms of variable importance, the two models differ slightly. ENET suggests race is the most influential factor, while BART ranks attitudes toward meat consumption first. Nonetheless, BART also identifies race as a secondary factor, which aligns with the ENET results.

The differences in variable importance could also be attributable to the different methods used to measure importance. The BART inclusion proportions and the standardized coefficients relative to the response variable produce very similar orderings, as noted in Section 4.3.

Since race, political orientation, and attitudes toward meat consumption shape perceptions of vegan ice cream, shifting attitudes toward plant-based foods would be complicated by cultural and political sensitivities. One potential approach could involve engaging social media influencers from diverse cultural and racial backgrounds to promote plant-based meals within their respective communities. Implementing such strategies, however, lies outside the scope of this report.

Like most studies, this analysis faces limitations. Assessing variable importance was complicated by inconsistencies in variable types: most variables were one-hot encoded, while others (Likert scales from 1 to 7) differ in scale. The small sample size also limited the analysis, with only 274 usable responses out of the initial 500, and even fewer for the vegan ice cream group.

Additionally, the demographic variables focus heavily on attitudes toward eating meat. Including factors such as religion or geographic location would likely make the findings more actionable for businesses and marketing departments.

6. References

[1]

Auguie, B (2017 ). gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra

[2]

Bring, J (1994 ). How to standardize regression coefficients. The American Statistician. Taylor & Francis. 48 209–13

[3]

Chai, B C, Voort, J R van der, Grofelnik, K, Eliasdottir, H G, Klöss, I and Perez-Cueto, F J A (2019 ). Which diet has the least environmental impact on our planet? A systematic review of vegan, vegetarian and omnivorous diets. Sustainability. 11. https://www.mdpi.com/2071-1050/11/15/4110

[4]

Chipman, H A, George, E I and McCulloch, R E (2010 ). BART: Bayesian additive regression trees. The Annals of Applied Statistics. Institute of Mathematical Statistics. 4 266–98

[5]

Gelman, A (2008 ). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine. Wiley Online Library. 27 2865–73

[6]

Kapelner, A and Bleich, J (2016 ). bartMachine: Machine learning with Bayesian additive regression trees. Journal of Statistical Software. 70 1–40

[7]

Rizopoulos, D (2006 ). Ltm: An r package for latent variable modelling and item response theory analyses. Journal of Statistical Software. 17 1–25. https://doi.org/10.18637/jss.v017.i05

[8]

Rosenfeld, D L, Rothgerber, H and Tomiyama, A J (2024 ). When meat-eaters expect vegan food to taste bad: Veganism as a symbolic threat. Group Processes & Intergroup Relations. SAGE Publications Sage UK: London, England. 27 453–68

[9]

Sekhon, T S, Soule, C A A and Ciao, A (2019 ). It’s not as bad as i thought it would be: Health halos and expectation disconfirmation effects on affective and cognitive evaluations of food products. Advances in Consumer Research. Association for Consumer Research. 47 834–5

[10]

Wickham, H, Averick, M, Bryan, J, Chang, W, McGowan, L D, François, R, Grolemund, G, Hayes, A, Henry, L, Hester, J, Kuhn, M, Pedersen, T L, Miller, E, Bache, S M, Müller, K, Ooms, J, Robinson, D, Seidel, D P, Spinu, V, Takahashi, K, Vaughan, D, Wilke, C, Woo, K and Yutani, H (2019 ). Welcome to the tidyverse. Journal of Open Source Software. 4 1686

[11]

Xie, Y (2024 ). Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/

[12]

(2024 ). GitHub copilot · your AI pair programmer. GitHub. https://github.com/features/copilot/

7. Supplementary Material

7.1 Loading the Data and Data Cleaning

Code

library("tidyverse")
library("knitr")
library("gridExtra")
library("ltm")
library("fastDummies")
library("caret")
library("rJava")
library("bartMachine")

dataset = read.csv("Study1b.csv")
dataset = na.omit(dataset)

dataset = dataset %>%
  mutate(
    average_icecream = rowMeans(across(c(Taste1, Taste2, Taste3, Taste4)), na.rm = TRUE),
    Carnism_4_rev = 8 - Carnism_4,
    average_carnism = rowMeans(across(c(
      Carnism_1, Carnism_2, Carnism_3, Carnism_4_rev,
      Carnism_5, Carnism_6, Carnism_7, Carnism_8
    )), na.rm = TRUE)
  )

dataset_original = dataset

dataset = dummy_cols(dataset, select_columns = c("Gender", "Race", "Education", "Income"), remove_first_dummy = FALSE)

dairy_data = dataset[dataset$Condition == "1", ]
vegan_data = dataset[dataset$Condition == "2", ]

create_summary_table1 = function(vector){
  mean_val = mean(vector); sd_val = sd(vector)
  median_val = median(vector); iqr_val = IQR(vector)
  mean_sd = sprintf("%.2f (%.2f)", mean_val, sd_val)
  median_iqr = sprintf("%.2f (%.2f)", median_val, iqr_val)
  return(c(mean_sd, median_iqr))
}

create_summary_table2 = function(data, columns_name, column_labels, cap_label){
  summary_table = t(apply(data[, columns_name], 2, create_summary_table1))
  summary_table = data.frame(
    Statistic = c("Mean (SD)", "Median (IQR)"),
    t(summary_table))
  colnames(summary_table) = column_labels
  return(kable(summary_table, caption = cap_label))
}

common_var_names = c("Carnism 1", "Carnism 2", "Carnism 3", "Carnism 5", "Carnism 6", "Carnism 7", "Carnism 8", "Politics", "Age", "Carnism 4", "Male", "Female", "White/Caucasian", "Black/African American", "Hispanic/Latinx", "Asian or Pacific Islander", "Mixed Race", "Other", "Less than HS", "High School", "Some College", "Associate", "Bachelor", "Graduate", "Less than $25,000", "$25,000 - $50,000", "$50,001 - $80,000", "$80,001 - $120,000", "$120,001 - $200,000", "$200,001 or more")

7.2 Tables for Exploratory Data Analysis

Code

columns_demo = c("Politics_1", "Age")
column_lab = c("Statistic", "Conservatism", "Age")
create_summary_table2(dataset, columns_demo, column_lab, "Demographic Summary")

n = nrow(dataset_original)
politics_table = dataset_original %>%
  count(Politics_1) %>%
  mutate(Percentage = round(n / sum(n) * 100, 2))
kable(politics_table, col.names = c("Political Stance", "Count", "Percentage (%)"),
      caption = "Political Stance of Survey Participants. 1: Very Liberal, 7: Very Conservative")

race_table = dataset_original %>%
  count(Race) %>%
  mutate(
    Ethnicity = recode(Race,
                       "1" = "White / Caucasian",
                       "2" = "Black / African American",
                       "3" = "Hispanic or Latinx",
                       "4" = "Asian or Pacific Islander",
                       "6" = "Native American",
                       "7" = "Other"),
    Percentage = round(n / sum(n) * 100, 2)
  ) %>%
  dplyr::select(Ethnicity, n, Percentage)
kable(race_table, col.names = c("Ethnicity", "Count", "Percentage (%)"),
      caption = "Ethnicity of Survey Participants.")

education_table = dataset_original %>%
  count(Education) %>%
  mutate(
    Education = recode(Education,
          "1" = "Less than high school diploma",
          "2" = "High school diploma or equivalent (e.g., GED)",
          "3" = "Some college, but no degree",
          "4" = "Associate degree",
          "5" = "Bachelor's degree",
          "6" = "Graduate or professional degree"),
    Percentage = round(n / sum(n) * 100, 2)
  ) %>%
  dplyr::select(Education, n, Percentage)
kable(education_table, col.names = c("Education", "Count", "Percentage (%)"), caption = "Education Level of Survey Participants.")

income_table = dataset_original %>%
  count(Income) %>%
  mutate(
    Income = recode(Income,
          "1" = "$25,000 or less",
          "2" = "$25,001 - $50,000",
          "3" = "$50,001 - $80,000",
          "4" = "$80,001 - $120,000",
          "5" = "$120,001 - $200,000",
          "6" = "$200,001 or more"),
    Percentage = round(n / sum(n) * 100, 2)
  ) %>%
  dplyr::select(Income, n, Percentage)
kable(income_table, col.names = c("Income", "Count", "Percentage (%)"), caption = "Income Level of Survey Participants.")

columns_icecream = c("Taste1", "Taste2", "Taste3", "Taste4")
columns_icelab = c("Statistic", "Taste", "Flavour", "Texture", "Consistency")
create_summary_table2(dairy_data, columns_icecream, columns_icelab, "Regular Ice Cream Evaluation")
create_summary_table2(vegan_data, columns_icecream, columns_icelab, "Vegan Ice Cream Evaluation")

7.3 Boxplots Comparing Ice Cream Ratings

Code

dairy_data_long = dairy_data %>%
  pivot_longer(cols = c(Taste1, Taste2, Taste3, Taste4),
               names_to = "Variable", values_to = "Value") %>%
  mutate(Variable = recode(Variable,
                         Taste1 = "Taste", Taste2 = "Flavour",
                         Taste3 = "Texture", Taste4 = "Consistency"))

vegan_data_long = vegan_data %>%
  pivot_longer(cols = c(Taste1, Taste2, Taste3, Taste4),
               names_to = "Variable", values_to = "Value") %>%
  mutate(Variable = recode(Variable,
                           Taste1 = "Taste", Taste2 = "Flavour",
                           Taste3 = "Texture", Taste4 = "Consistency"))

p3 = ggplot(dairy_data_long, aes(x = Variable, y = Value, fill = Variable)) + geom_boxplot() +
  labs(x = "Evaluation", y = "Rating (1: Very Bad, 7: Very Delicious)") +
  scale_fill_manual(values=c("#FF6666", "#6699FF", "#05DEB2", "#947aff")) +
  guides(fill="none")

p4 = ggplot(vegan_data_long, aes(x = Variable, y = Value, fill = Variable)) + geom_boxplot() +
  labs(x = "Evaluation", y = "Rating (1: Very Bad, 7: Very Delicious)") +
  scale_fill_manual(values=c("#FF6666", "#6699FF", "#05DEB2", "#947aff")) +
  guides(fill = "none")

grid.arrange(p3, p4, widths = c(2, 2), ncol=2)

7.4 Linear Regression with ENET

Code

remove_cols = c("StartDate", "EndDate", "Consent", "Gender", "Condition",
                "Carnism_4", "average_carnism", "Race", "Education", "Income",
                "Taste1", "Taste2", "Taste3", "Taste4")

dairy_data = dairy_data[, !names(dairy_data) %in% remove_cols]
vegan_data = vegan_data[, !names(vegan_data) %in% remove_cols]

set.seed(1)
vegan_samp = sort(sample(1:nrow(vegan_data), floor(0.8 * nrow(vegan_data)), replace = FALSE))
dairy_samp = sort(sample(1:nrow(dairy_data), floor(0.8 * nrow(dairy_data)), replace = FALSE))

vegan_train = vegan_data[vegan_samp, ]
dairy_train = dairy_data[dairy_samp, ]
vegan_test = vegan_data[-vegan_samp, ]
dairy_test = dairy_data[-dairy_samp, ]

ctrl = trainControl(method = "LOOCV")

dairy_elnet = train(
  average_icecream ~ ., data = dairy_train,
  method = "glmnet",
  tuneGrid = data.frame(alpha = 0.1, lambda = 0.72),
  trControl = ctrl
)

vegan_elnet = train(
  average_icecream ~ ., data = vegan_train,
  method = "glmnet",
  tuneGrid = data.frame(alpha = 0, lambda = 2.85),
  trControl = ctrl
)

vegan_coefs = coef(vegan_elnet$finalModel, s = vegan_elnet$bestTune$lambda)
dairy_coefs = coef(dairy_elnet$finalModel, s = dairy_elnet$bestTune$lambda)

standardized_coefficients = function(data, coefs_output){
  sds = apply(data, 2, sd)
  coef_2 = as.vector(coefs_output)
  names(coef_2) = rownames(coefs_output)
  x_sds = sds[names(coef_2)[-1]]
  standardized_coefs = coef_2[-1] / (2*x_sds)
  names(standardized_coefs) = names(coef_2)[-1]
  return(list(
    standardized_coefs = standardized_coefs,
    sorted_coefs = sort(abs(standardized_coefs))
  ))
}

vals1 = standardized_coefficients(vegan_train, vegan_coefs)
vals2 = standardized_coefficients(dairy_train, dairy_coefs)

LR_barplot = function(important_variables){
  df = data.frame(Variable = common_var_names, Coefficient = as.numeric(important_variables))
  df = na.omit(df)
  df = df[df$Coefficient != 0, ]
  df$AbsCoefficient = abs(df$Coefficient)
  df = df[order(-df$AbsCoefficient), ]
  ggplot(df, aes(x = reorder(Variable, AbsCoefficient), y = Coefficient, fill = Coefficient > 0)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    scale_fill_manual(values = c("red", "blue"), labels = c("Negative", "Positive")) +
    theme_minimal() +
    labs(x = "Variable", y = "Standardized Coefficient", fill = "Direction") +
    theme(legend.position = "top")
}

barplot1 = LR_barplot(vals1$standardized_coefs)
barplot2 = LR_barplot(vals2$standardized_coefs)
grid.arrange(barplot1, barplot2, widths = c(2, 2), ncol=2)

vegan_test_x = subset(vegan_test, select = -average_icecream)
dairy_test_x = subset(dairy_test, select = -average_icecream)
vegan_test_y = vegan_test$average_icecream
dairy_test_y = dairy_test$average_icecream

y_hat_dairy = predict(dairy_elnet, s = 0.72, dairy_test_x)
y_hat_vegan = predict(vegan_elnet, s = 2.85, vegan_test_x)

(sum((dairy_test_y - y_hat_dairy)^(2)))/(length(dairy_test_y))
(sum((vegan_test_y - y_hat_vegan)^(2)))/(length(vegan_test_y))

7.5 BART

The LOOCV BART code is computationally intensive (fits 218 BART models). Results were pre-computed and saved as RDS files. The code used is shown below.

Code

LOOCV_bart = function(train_data, train_y, test_data){
  predictions = matrix(0, nrow = nrow(test_data), ncol = nrow(train_data))
  variable_imp = matrix(0, nrow = nrow(train_data), ncol = ncol(train_data)-1)
  predictor_names = colnames(train_data)[-10]
  colnames(variable_imp) = predictor_names

  for(i in 1:nrow(train_data)){
    temp_train_data = train_data[-i, ]
    temp_train_y = train_y[-i]
    bart_model = bartMachine(X = temp_train_data[, -10],
                             y = temp_train_y, verbose = FALSE,
                             run_in_sample = FALSE)
    var_imp = investigate_var_importance(bart_model, plot = FALSE)$avg_var_props
    variable_imp[i, names(var_imp)] = var_imp
    predictions[, i] = predict(bart_model, test_data[, -10, drop = FALSE])
  }
  final_predictions = rowMeans(predictions)
  avg_variable_imp = colMeans(variable_imp)
  return(list(predictions = final_predictions, variable_importance = avg_variable_imp))
}

set.seed(1)
y_hat_dairy_bart = LOOCV_bart(dairy_train, dairy_train$average_icecream, dairy_test)
set.seed(1)
y_hat_vegan_bart = LOOCV_bart(vegan_train, vegan_train$average_icecream, vegan_test)

saveRDS(y_hat_dairy_bart, "bart_dairy_results.rds")
saveRDS(y_hat_vegan_bart, "bart_vegan_results.rds")

create_lollipop_chart = function(var_imp_vector){
  importance_df = data.frame(Variable = common_var_names, Importance = var_imp_vector)
  importance_df = importance_df[order(importance_df$Importance, decreasing = TRUE), ]
  ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
    geom_segment(aes(xend = Variable, yend = 0), color = "gray") +
    geom_point(color = "blue", size = 3) +
    coord_flip() +
    labs(x = "Variables", y = "Importance") +
    theme_minimal()
}

p1 = create_lollipop_chart(y_hat_vegan_bart$variable_importance)
p2 = create_lollipop_chart(y_hat_dairy_bart$variable_importance)
grid.arrange(p1, p2, widths = c(2, 2), ncol=2)

vegan_test_x = subset(vegan_test, select = -average_icecream)
dairy_test_x = subset(dairy_test, select = -average_icecream)
vegan_test_y = vegan_test$average_icecream
dairy_test_y = dairy_test$average_icecream

y_hat_dairy = y_hat_dairy_bart$predictions
y_hat_vegan = y_hat_vegan_bart$predictions

(sum((dairy_test_y - y_hat_dairy)^(2)))/(length(dairy_test_y))
(sum((vegan_test_y - y_hat_vegan)^(2)))/(length(vegan_test_y))