Logistic Regression, Random Forests, and Support Vector Machines
Wind turbines are often promoted as an environmentally friendly energy source. However, bird and bat carcasses frequently accumulate near these turbines, raising concerns about their impact on wildlife.
Carcass searches are typically conducted by humans, often accompanied by trained detection dogs. Dogs are generally more effective than humans, likely due to their superior sense of smell [1], but other factors such as carcass size and weather conditions may also influence detection rates.
Domínguez del Valle et al. [3] conducted a controlled field experiment in Spain (January–March) in which bird carcasses were systematically placed near wind turbines. Two human searchers and three trained dog handlers each followed a predetermined path independently, with a handler present to verify each detection. The dataset is publicly available on Data Dryad [2] and includes detection outcomes, searcher type and ID, turbine type, carcass wingspan (cm), vegetation cover, temperature, and wind speed.
This study applies three statistical learning methods (logistic regression, random forests, and support vector machines) to classify whether a carcass is detected and to identify the explanatory variables that most influence detection. Elastic net regularization is used for variable selection.
Three categorical variables (searcher type, individual searcher ID, and turbine type) were one-hot encoded using the fastDummies package [5]. Search date was excluded, as searches spanned roughly three months and most dates contained only four observations, making it unlikely to contribute meaningful information.
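The encoding step can be sketched as follows. This is a minimal illustration on toy data, assuming column names `type`, `ind`, and `turb` (chosen so that the resulting indicator names resemble those in the coefficient table); the actual Dryad column names may differ.

```r
library(fastDummies)

# Toy data; column names and levels are assumptions, not the Dryad file's
searches <- data.frame(
  type = c("dog", "human", "dog", "human"),
  ind  = c("d1", "h1", "d2", "h2"),
  turb = c("t1", "t2", "t3", "t1"),
  size = c(20, 35, 60, 110)
)

# One indicator column per level; the original categorical columns are dropped
encoded <- dummy_cols(
  searches,
  select_columns = c("type", "ind", "turb"),
  remove_selected_columns = TRUE
)
names(encoded)
```

Keeping an indicator for every level (rather than dropping a reference level) makes some columns linearly dependent, which is why an unpenalized model can later report NA for some of them.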
The dataset was split 80/20 into training and test sets, stratified by detection outcome.
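A stratified 80/20 split can be sketched in base R as below. The outcome vector is simulated and the seed is arbitrary; the report does not state which function was used, so this is only one plausible way to do it.

```r
set.seed(1)  # arbitrary seed, for reproducibility of the sketch only

# Toy outcome vector (hypothetical class counts)
detected <- factor(rep(c("No", "Yes"), times = c(40, 110)))

# Sample 80% of the row indices within each outcome class
train_idx <- unlist(lapply(split(seq_along(detected), detected), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))
test_idx <- setdiff(seq_along(detected), train_idx)

# Class proportions should match closely between the two sets
prop.table(table(detected[train_idx]))
prop.table(table(detected[test_idx]))
```

Sampling within each class keeps the share of detected carcasses the same in both sets, which matters when one outcome is much more common than the other.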
grid.arrange(boxplot1, boxplot2, boxplot3, ncol=3, nrow=1, widths=c(1, 1, 1.6))
grid.arrange(density1, density2, density3, ncol=3, nrow=1, widths=c(1, 1, 1.6))
Temperature and wind speed distributions are nearly identical between detected and missed carcasses, suggesting limited predictive value. Wingspan shows modest separation and was ultimately retained in the final model.
Variable selection was performed using elastic net regularization via the glmnet package [9]. Elastic net combines the LASSO \(\ell_1\) penalty (which zeros out coefficients entirely) with the ridge \(\ell_2\) penalty (which shrinks them):
\[\hat{\beta}^{\text{ENET}} = \underset{\beta}{\text{argmin}} \left\{ -\ell(\beta) + \lambda \left( \alpha \|\beta\|_1 + (1-\alpha)\|\beta\|_2^2 \right) \right\}\]
Five-fold cross-validation selected \(\alpha = 0.2\) (closer to ridge than to LASSO) and \(\lambda \approx 0.18\) as the tuning parameters. The selected variables are carcass wingspan, searcher type (dog or human), specific searcher identifiers (Dog 2, Human 1, Human 2), and Turbine 3. Temperature and wind speed were not selected, which is consistent with the exploratory plots. The same variable set is used across all three models to allow a direct comparison.
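The tuning procedure can be sketched as a grid search over \(\alpha\), with `cv.glmnet` choosing \(\lambda\) for each value. The data here are simulated, and the \(\alpha\) grid, seed, and predictor names are assumptions, since the report does not list them.

```r
library(glmnet)

set.seed(1)  # arbitrary seed for the sketch
n <- 300
# Simulated predictors (hypothetical): two informative, two pure noise
x <- cbind(size = rnorm(n, 40, 15), type_dog = rbinom(n, 1, 0.5),
           temp = rnorm(n, 10, 4), wind = rnorm(n, 5, 2))
y <- rbinom(n, 1, plogis(-1.5 + 0.02 * x[, "size"] + 2 * x[, "type_dog"]))

# Fix the fold assignment so every alpha is scored on the same 5 folds
foldid <- sample(rep(1:5, length.out = n))
alphas <- seq(0, 1, by = 0.2)

cv_fits <- lapply(alphas, function(a) {
  cv.glmnet(x, y, family = "binomial", alpha = a, foldid = foldid)
})

# Pick the alpha whose best lambda gives the lowest CV binomial deviance
cv_err <- sapply(cv_fits, function(f) min(f$cvm))
best <- which.min(cv_err)
best_alpha <- alphas[best]
best_lambda <- cv_fits[[best]]$lambda.min

# Non-zero coefficients at the chosen (alpha, lambda) are the selected variables
coef(cv_fits[[best]], s = "lambda.min")
```

Note that glmnet scales the ridge term by 1/2, so its \(\lambda\) is not numerically identical to the one in the formula above, although \(\alpha\) plays the same mixing role.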
Logistic regression models the log-odds of a successful detection as a linear function of the selected predictors:
\[\ln\!\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
where \(\pi = P(\text{Detected} = \text{Yes} \mid \mathbf{X})\). An observation is classified as detected if the predicted probability exceeds 0.5. The model was fit using base R [8].
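A minimal sketch of this fit and the 0.5 classification rule, using `glm` from base R on simulated data. The variable names mirror the coefficient table, but the data and the seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed for the sketch
n <- 200
# Simulated training data (hypothetical values)
train <- data.frame(size = rnorm(n, 40, 15), type_dog = rbinom(n, 1, 0.5))
train$detected <- rbinom(n, 1, plogis(-1.9 + 0.02 * train$size + 2.3 * train$type_dog))

# Binomial GLM with the default logit link
fit <- glm(detected ~ size + type_dog, data = train, family = binomial)

# Predicted probabilities, then the 0.5 cut-off described above
new_obs <- data.frame(size = c(20, 60), type_dog = c(0, 1))
p_hat <- predict(fit, newdata = new_obs, type = "response")
pred_class <- ifelse(p_hat > 0.5, "Yes", "No")
```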
Random forests build an ensemble of decision trees, each trained on a bootstrap sample, with \(\lfloor\sqrt{p}\rfloor = 2\) randomly selected predictors (from 7 total) considered at each split. Predictions are made by majority vote across 3,000 trees [6]. Variable importance is measured as the mean decrease in Gini impurity: the total reduction in node impurity from splits on a given predictor, averaged over all trees.
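The configuration described above can be sketched with the randomForest package. The data are simulated with hypothetical predictor names, so only the `ntree` and `mtry` settings are taken from the text.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for the sketch
n <- 300
# Simulated predictors (hypothetical names); the report used 7 encoded predictors
x <- data.frame(size = rnorm(n, 40, 15), type_dog = rbinom(n, 1, 0.5),
                temp = rnorm(n, 10, 4), wind = rnorm(n, 5, 2))
y <- factor(rbinom(n, 1, plogis(-1.5 + 0.02 * x$size + 2 * x$type_dog)),
            labels = c("No", "Yes"))

# mtry = floor(sqrt(p)) predictors tried at each split; 3,000 trees as in the text
rf <- randomForest(x = x, y = y, ntree = 3000, mtry = floor(sqrt(ncol(x))))

# type = 2 returns the mean decrease in node impurity (Gini for classification)
imp <- importance(rf, type = 2)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```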
SVMs find the hyperplane that maximises the margin between classes, with a cost parameter \(C\) controlling the trade-off between margin width and misclassification. A linear kernel was used, with \(C\) tuned over {0.001, 0.01, 0.1, 1, 5, 10, 100} via cross-validation; \(C = 0.01\) was selected. The model was fit with the e1071 package [7]. Variable importance was approximated by the absolute value of each predictor's component of the SVM weight vector [4].
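The tuning and weight extraction can be sketched with e1071 as follows. The data are simulated, and the 5-fold setting is an assumption, since the text does not state the number of folds.

```r
library(e1071)

set.seed(1)  # arbitrary seed for the sketch
n <- 300
# Simulated data with hypothetical predictor names
dat <- data.frame(size = rnorm(n, 40, 15), type_dog = rbinom(n, 1, 0.5),
                  temp = rnorm(n, 10, 4))
dat$detected <- factor(rbinom(n, 1, plogis(-1.5 + 0.02 * dat$size + 2 * dat$type_dog)),
                       labels = c("No", "Yes"))

# Cross-validated grid search over the cost values listed in the text
tuned <- tune(svm, detected ~ ., data = dat, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)),
              tunecontrol = tune.control(cross = 5))  # fold count is an assumption
best_svm <- tuned$best.model

# For a linear kernel, w = sum_i (alpha_i * y_i) * x_i, on the scaled predictors
w <- drop(t(best_svm$coefs) %*% best_svm$SV)
sort(abs(w), decreasing = TRUE)
```

Because `svm` standardizes the predictors by default, the weights are on a common scale, which makes their magnitudes roughly comparable across predictors.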
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.8625 | 0.3904 | -4.7707 | <0.0001 |
| size | 0.0177 | 0.0055 | 3.2204 | 0.0013 |
| type_dog | 2.3331 | 0.3461 | 6.7412 | <0.0001 |
| ind_d2 | 0.1742 | 0.2315 | 0.7524 | 0.4518 |
| ind_h1 | -0.5653 | 0.5314 | -1.0639 | 0.2874 |
| turb_t3 | -15.8002 | 620.2929 | -0.0255 | 0.9797 |
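Holding all other variables constant, having a dog as the searcher multiplies the odds of successful detection by \(e^{2.33} \approx 10\). Wingspan also has a statistically significant positive effect: each additional cm multiplies the odds by \(e^{0.0177} \approx 1.02\). The Turbine 3 estimate has a very large standard error, which typically indicates (quasi-)complete separation, so that coefficient should not be interpreted. The human searcher type and Human 2 indicators return NA because one-hot encoding makes them collinear with the other indicators, as expected; they are omitted from the table.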
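The odds ratios follow from exponentiating the coefficients in the table above. A short worked computation (the 10 cm increment is chosen only for illustration):

```r
# Coefficients copied from the table above (log-odds scale)
b_dog  <- 2.3331
b_size <- 0.0177

or_dog     <- exp(b_dog)        # odds ratio, dog vs. human searcher
or_size_1  <- exp(b_size)       # odds ratio per additional cm of wingspan
or_size_10 <- exp(10 * b_size)  # odds ratio per additional 10 cm

round(c(dog = or_dog, size_1cm = or_size_1, size_10cm = or_size_10), 2)
# dog 10.31, size_1cm 1.02, size_10cm 1.19
```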
par(mfrow=c(1, 2))
dotchart(var_imp2_df$Importance, labels=var_imp2_df$Variable,
         xlab="Mean Decrease in Gini (RF)", bg="steelblue")
dotchart(rev(w), labels=rev(new_labels2),
         xlab="Weight Magnitude (SVM)", bg="coral")
Both the random forest and logistic regression agree that searcher type (dog vs. human) is the dominant predictor, followed by wingspan. The SVM weights differ somewhat, as wingspan is ranked lower, but searcher type remains the most important variable across all three methods.
All three models produce identical confusion matrices on the held-out test set:
| | Predicted: No | Predicted: Yes |
|---|---|---|
| Actual: No | 19 | 26 |
| Actual: Yes | 5 | 93 |
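The summary rates implied by this table can be computed directly. The sketch below uses only the four counts shown above.

```r
# Confusion matrix from the table above (rows = actual, columns = predicted)
cm <- matrix(c(19, 26,
                5, 93),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # (19 + 93) / 143
error_rate  <- 1 - accuracy                         # (26 + 5) / 143
sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # detected carcasses predicted as detected
specificity <- cm["No", "No"] / sum(cm["No", ])     # missed carcasses predicted as missed

round(c(accuracy = accuracy, error = error_rate,
        sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.783, error 0.217, sensitivity 0.949, specificity 0.422
```

For context, predicting "Yes" for every test observation would already be correct for 98 of 143 cases (about 68.5%), so the models' gain over that baseline comes mainly from the 19 correctly identified misses.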
The convergence across methods is notable. When a single categorical variable dominates prediction (searcher type), models of very different complexity tend to find the same decision boundary. The convergence was confirmed to be genuine rather than a coding error by verifying that alternative train/test splits produce small but non-zero differences across models.
The strongest predictor of accurate detection was whether the searcher was a dog, aligning with the findings from the original study [3]. Holding all other variables constant, having a dog as the searcher increases the odds of successful detection by approximately 10 times.
The misclassification rate of 21.7% (31 of 143 test observations) is acceptable for a model built mostly on categorical predictors, although most of the errors are false positives: 26 of the 45 missed carcasses were predicted as detected. With only five distinct searchers, the results may also reflect those individuals rather than generalizing to all dogs and humans.
Since all three models perform identically and logistic regression is the most interpretable and computationally efficient, it is the most suitable model for this dataset.
Future studies may consider examining what factors allow dogs to be more successful at detection than humans, such as breed, training method, or handler experience.