Predicting Wildlife Carcass Detection Near Wind Turbines

Logistic Regression, Random Forests, and Support Vector Machines

Author

Anna Ly

Published

April 1, 2025

Background

Wind turbines are often promoted as an environmentally friendly energy source. However, bird and bat carcasses frequently accumulate near these turbines, raising concerns about their impact on wildlife.

Carcass searches are typically conducted by humans, often accompanied by trained detection dogs. Dogs are generally more effective than humans, likely due to their superior sense of smell [1], but other factors such as carcass size and weather conditions may also influence detection rates.

Domínguez del Valle et al. [3] conducted a controlled field experiment in Spain (January–March) in which bird carcasses were systematically placed near wind turbines. Two human searchers and three trained dog handlers each followed a predetermined path independently, with a handler present to verify each detection. The dataset is publicly available on Data Dryad [2] and includes detection outcomes, searcher type and ID, turbine type, carcass wingspan (cm), vegetation cover, temperature, and wind speed.

This study applies three statistical learning methods (logistic regression, random forests, and support vector machines) to classify whether carcasses are successfully detected and to identify the key explanatory variables influencing detection. Elastic net regularization is used for variable selection.

Data Preparation

Three categorical variables (searcher type, individual searcher ID, and turbine type) were one-hot encoded using the fastDummies package [5]. Search date was excluded, as searches spanned roughly three months and most dates contained only four observations, making it unlikely to contribute meaningful information.
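The encoding step, and why it later produces NA coefficients, can be illustrated with a small sketch. It uses base R's `model.matrix()` as a stand-in for fastDummies, on simulated searcher labels rather than the study data; the column names it produces (`typedog`, `indh1`, and so on) are therefore not the ones used in the original analysis.

```r
# Hypothetical stand-in data: five searchers nested in two searcher types.
set.seed(1)
searchers <- data.frame(
  ind = factor(sample(c("d1", "d2", "d3", "h1", "h2"), 200, replace = TRUE))
)
searchers$type <- factor(ifelse(grepl("^d", searchers$ind), "dog", "human"))

# Full one-hot encoding: one 0/1 column per level, no reference level dropped.
# (Base-R stand-in for fastDummies; an assumption, not the author's code.)
type_dummies <- model.matrix(~ type - 1, data = searchers)
ind_dummies  <- model.matrix(~ ind - 1, data = searchers)
encoded <- cbind(type_dummies, ind_dummies)
colnames(encoded)

# The columns are linearly dependent: typedog + typehuman = 1 in every row,
# and typehuman = indh1 + indh2. A regression with an intercept cannot
# estimate all of them, which is why some coefficients come back as NA.
design_rank <- qr(cbind(1, encoded))$rank
design_rank   # smaller than ncol(encoded) + 1
```

Dropping one reference level per factor would remove the redundancy, at the cost of a penalized selection step that depends on which level is treated as the reference.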

The dataset was split 80/20 into training and test sets, stratified by detection outcome.
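A stratified split of this kind can be sketched in base R as follows. The data frame `carcass`, its columns, and the seed are hypothetical and simulated here; the original analysis may have used a different helper function.

```r
# Sketch of a stratified 80/20 split in base R.
# `carcass` is simulated stand-in data, NOT the Dryad dataset.
set.seed(1)
carcass <- data.frame(
  detected = factor(sample(c("No", "Yes"), 200, replace = TRUE,
                           prob = c(0.3, 0.7))),
  size     = runif(200, 20, 120)   # hypothetical wingspan in cm
)

# Sample 80% of the row indices within each outcome class.
rows_by_class <- split(seq_len(nrow(carcass)), carcass$detected)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train <- carcass[train_idx, ]
test  <- carcass[-train_idx, ]

# Class proportions should be close in both sets.
prop.table(table(train$detected))
prop.table(table(test$detected))
```

Sampling within each class keeps the share of detected carcasses nearly equal in the training and test sets, which matters when one class is much smaller than the other.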

library(gridExtra)  # provides grid.arrange()
grid.arrange(boxplot1, boxplot2, boxplot3, ncol=3, nrow=1, widths=c(1, 1, 1.6))

Distributions of continuous predictors by detection outcome. Little separation is visible for temperature and wind speed.

grid.arrange(density1, density2, density3, ncol=3, nrow=1, widths=c(1, 1, 1.6))

Density plots of continuous predictors by detection outcome.

Temperature and wind speed distributions are nearly identical between detected and missed carcasses, suggesting limited predictive value. Wingspan shows modest separation and was ultimately retained by the variable selection step.

Methods

Variable Selection via Elastic Net

Variable selection was performed using elastic net regularization via the glmnet package [9]. Elastic net combines the LASSO \(\ell_1\) penalty (which can shrink coefficients exactly to zero) with the ridge \(\ell_2\) penalty (which shrinks them toward zero without eliminating them):

\[\hat{\beta}^{\text{ENET}} = \underset{\beta}{\text{argmin}} \left\{ -\ell(\beta) + \lambda \left( \alpha \|\beta\|_1 + (1-\alpha)\|\beta\|_2^2 \right) \right\}\]

Using 5-fold cross-validation, the optimal tuning parameters were \(\alpha = 0.2\) (mostly ridge) and \(\lambda \approx 0.18\). The selected variables include carcass wingspan, searcher type (dog or human), specific searcher identifiers (Dog 2, Human 1, Human 2), and Turbine 3. Temperature and wind speed were not selected, which is consistent with the exploratory plots. The same variable set is used across all three models to allow a direct comparison.
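The tuning procedure can be sketched with `cv.glmnet()` as below. The data are simulated, and the grid of \(\alpha\) values and the seed are assumptions, since the text does not state which grid was searched. Note that glmnet scales the penalty slightly differently from the expression above (it halves the ridge term and averages the log-likelihood), so tuned values of \(\lambda\) are specific to that parameterization.

```r
library(glmnet)

# Simulated stand-in data (NOT the study data): 7 predictors, binary outcome.
set.seed(1)
n <- 300
x <- matrix(rnorm(n * 7), nrow = n,
            dimnames = list(NULL, paste0("x", 1:7)))
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x[, 1] + 0.8 * x[, 2]))

# Fix the fold assignment so every alpha is scored on the same 5 folds.
foldid <- sample(rep(1:5, length.out = n))
alphas <- seq(0, 1, by = 0.2)   # assumed grid

cv_fits <- lapply(alphas, function(a) {
  cv.glmnet(x, y, family = "binomial", alpha = a, foldid = foldid)
})

# Pick the alpha whose best lambda gives the lowest cross-validated deviance.
cv_err <- sapply(cv_fits, function(fit) min(fit$cvm))
best   <- which.min(cv_err)
best_alpha  <- alphas[best]
best_lambda <- cv_fits[[best]]$lambda.min

# Predictors with nonzero coefficients at (alpha, lambda) are "selected".
beta <- as.matrix(coef(cv_fits[[best]], s = "lambda.min"))[, 1]
selected <- setdiff(names(beta)[beta != 0], "(Intercept)")
selected
```

Reusing one `foldid` vector across all values of \(\alpha\) keeps the comparison fair, because each candidate is evaluated on identical folds.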

Logistic Regression

Logistic regression models the log-odds of a successful detection as a linear function of the selected predictors:

\[\ln\!\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

where \(\pi = P(\text{Detected} = \text{Yes} \mid \mathbf{X})\). An observation is classified as detected if the predicted probability exceeds 0.5. The model was fit using base R [8].
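As a minimal sketch of this model and the 0.5 classification rule, the example below fits `glm()` to simulated data. The variable names mirror two of the predictors in this report, but the values and coefficients are invented for illustration.

```r
# Simulated stand-in data (NOT the study data).
set.seed(1)
n <- 400
dat <- data.frame(
  type_dog = rbinom(n, 1, 0.6),   # 1 = dog, 0 = human (hypothetical coding)
  size     = runif(n, 20, 120)    # hypothetical wingspan in cm
)
dat$detected <- rbinom(n, 1,
                       plogis(-1.5 + 2 * dat$type_dog + 0.015 * dat$size))

# Logistic regression: log-odds of detection as a linear function of predictors.
fit <- glm(detected ~ type_dog + size, data = dat, family = binomial)

# Predicted probabilities, then the 0.5 cut-off described above.
p_hat <- predict(fit, newdata = dat, type = "response")
pred  <- ifelse(p_hat > 0.5, "Yes", "No")

table(Actual = ifelse(dat$detected == 1, "Yes", "No"), Predicted = pred)
```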

Random Forests

Random forests build an ensemble of decision trees, each trained on a bootstrap sample, with \(\lfloor\sqrt{p}\rfloor = 2\) randomly selected predictors (from \(p = 7\) total) considered at each split. Predictions are made by majority vote across 3,000 trees [6]. Variable importance is measured as the mean decrease in Gini impurity: the total reduction in node impurity from splits on a given predictor, averaged over all trees.
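A forest with these settings can be sketched with the randomForest package as follows. The data are simulated, with one deliberately strong predictor, so the importance ranking shown is a property of the simulation and not of the carcass data.

```r
library(randomForest)

# Simulated stand-in data (NOT the study data): 7 predictors, x1 is informative.
set.seed(1)
n <- 300
x <- as.data.frame(matrix(rnorm(n * 7), nrow = n))
names(x) <- paste0("x", 1:7)
y <- factor(ifelse(rbinom(n, 1, plogis(2 * x$x1)) == 1, "Yes", "No"))

# 3,000 trees, floor(sqrt(7)) = 2 candidate predictors per split.
rf <- randomForest(x = x, y = y, ntree = 3000, mtry = floor(sqrt(ncol(x))))

# type = 2 requests the node-impurity measure (mean decrease in Gini).
imp <- importance(rf, type = 2)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```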

Support Vector Machines

SVMs find the hyperplane that maximizes the margin between classes, with a cost parameter \(C\) controlling the trade-off between margin width and misclassification. A linear kernel was used, with \(C\) tuned over {0.001, 0.01, 0.1, 1, 5, 10, 100} via cross-validation; \(C = 0.01\) was selected. The e1071 package [7] was used. Variable importance was approximated by the magnitude of each predictor's component of the SVM weight vector [4].
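The tuning and the weight-based importance can be sketched with e1071 as below, on simulated data. The number of cross-validation folds is not stated in the text, so the sketch simply uses the default of `tune()`. The weights are recovered from the fitted object's `coefs` and `SV` components, which is a common recipe for linear kernels and not necessarily the exact code used here.

```r
library(e1071)

# Simulated stand-in data (NOT the study data): x1 drives the class label.
set.seed(1)
n <- 300
x <- matrix(rnorm(n * 3), nrow = n,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- factor(ifelse(x[, 1] + 0.3 * x[, 2] + rnorm(n, sd = 0.5) > 0, "Yes", "No"))

# Tune C over the grid from the text (tune() defaults to 10-fold CV).
tuned <- tune(svm, train.x = x, train.y = y, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
fit <- tuned$best.model

# For a linear kernel, w = sum_i (alpha_i * y_i) * x_i. e1071 stores
# alpha_i * y_i in `coefs` and the (scaled) support vectors in `SV`,
# so w is expressed on the scaled feature space.
w <- drop(t(fit$coefs) %*% fit$SV)
sort(abs(w), decreasing = TRUE)
```

Because `svm()` standardizes the predictors by default, the weight magnitudes are comparable across variables measured in different units.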

Results

Logistic Regression Coefficients

Logistic regression coefficients on selected variables. Variables with NA are omitted due to perfect collinearity from one-hot encoding.
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-1.8625	0.3904	-4.7707	<0.0001
size	0.0177	0.0055	3.2204	0.0013
type_dog	2.3331	0.3461	6.7412	<0.0001
ind_d2	0.1742	0.2315	0.7524	0.4518
ind_h1	-0.5653	0.5314	-1.0639	0.2874
turb_t3	-15.8002	620.2929	-0.0255	0.9797

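Holding all other variables constant, having a dog as the searcher multiplies the odds of successful detection by \(e^{2.33} \approx 10\). Wingspan also has a statistically significant positive effect: each additional centimeter multiplies the odds by \(e^{0.0177} \approx 1.02\). The Turbine 3 estimate has a very large standard error, a pattern that typically indicates (quasi-)complete separation, so its coefficient should not be interpreted on its own. The human searcher type and Human 2 indicators show NA due to collinearity from one-hot encoding, as expected.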
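The odds-ratio arithmetic can be checked directly from the table. The sketch below also adds an approximate 95% Wald interval, which is not reported in the original table and is included only to show the uncertainty around the point estimate.

```r
# Values copied from the coefficient table above.
est <- 2.3331   # type_dog estimate (log-odds scale)
se  <- 0.3461   # its standard error

# Exponentiating a log-odds coefficient gives an odds ratio.
odds_ratio <- exp(est)

# Approximate 95% Wald interval, computed on the log-odds scale first.
wald_ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

round(odds_ratio, 1)
round(wald_ci, 1)
```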

Feature Importance

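# Two panels side by side: random forest (left) and SVM (right)
par(mfrow=c(1, 2))
# Random forest importance: mean decrease in Gini per variable
dotchart(var_imp2_df$Importance, labels=var_imp2_df$Variable,
         xlab="Mean Decrease in Gini (RF)", bg="steelblue")
# SVM weight magnitudes, reversed so the first variable appears at the top
dotchart(rev(w), labels=rev(new_labels2),
         xlab="Weight Magnitude (SVM)", bg="coral")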

Feature importance from the random forest (left, mean decrease in Gini) and SVM (right, weight magnitude). Higher values indicate greater predictive contribution.

The random forest and logistic regression agree that searcher type (dog vs. human) is the dominant predictor, followed by wingspan. The SVM weights differ somewhat, in that wingspan is ranked lower, but searcher type remains the most important variable across all three methods.

Prediction Accuracy

All three models produce identical confusion matrices on the held-out test set:

Confusion matrix (all three models). Overall misclassification rate: 21.7%. Corrected Rand index: 0.27.
	Predicted: No	Predicted: Yes
Actual: No	19	26
Actual: Yes	5	93
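The summary figures in the caption can be reproduced from the four cell counts. The sketch below does this in base R; the majority-class baseline is an added reference point, not a quantity reported in the original analysis.

```r
# Confusion matrix copied from the table above (rows = actual, cols = predicted).
cm <- matrix(c(19, 26,
                5, 93),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("No", "Yes"),
                             Predicted = c("No", "Yes")))

n <- sum(cm)
misclass <- 1 - sum(diag(cm)) / n   # share of off-diagonal counts
baseline <- min(rowSums(cm)) / n    # error from always predicting the majority class

# Corrected (adjusted) Rand index from the contingency table.
n_pairs  <- function(k) k * (k - 1) / 2
index    <- sum(n_pairs(cm))
expected <- sum(n_pairs(rowSums(cm))) * sum(n_pairs(colSums(cm))) / n_pairs(n)
maximum  <- (sum(n_pairs(rowSums(cm))) + sum(n_pairs(colSums(cm)))) / 2
ari <- (index - expected) / (maximum - expected)

round(c(misclassification = misclass, baseline = baseline, ARI = ari), 3)
```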

The convergence across methods is notable. When a single categorical variable dominates prediction (searcher type), models of very different complexity tend to find the same decision boundary. The convergence was confirmed to be genuine rather than a coding error by verifying that alternative train/test splits produce small but non-zero differences across models.

Discussion

The strongest predictor of successful detection was whether the searcher was a dog, in line with the findings of the original study [3]. Holding all other variables constant, a dog searcher multiplies the odds of successful detection by approximately 10.

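A misclassification rate of 21.7% improves on the 31.5% that would result from always predicting a detection (45 of the 143 test carcasses were missed), which is reasonable given that the elastic net selected mostly categorical variables. Most of the errors are false positives: 26 of the 45 missed carcasses were predicted as detected. With only five distinct searchers, the results may also reflect those individuals rather than generalizing to all dogs and humans.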

Since all three models perform identically and logistic regression is the most interpretable and computationally efficient, it is the most suitable model for this dataset.

Future studies may consider examining what factors allow dogs to be more successful at detection than humans, such as breed, training method, or handler experience.

References

[1]

Arnett, E B (2006). A preliminary evaluation on the use of dogs to recover bat fatalities at wind energy facilities. Wildlife Society Bulletin. Wiley Online Library. 34 1440–5

[2]

Domínguez del Valle, J, Cervantes Peralta, F and Jaquero Arjona, M I (2020). Data from: Factors affecting carcass detection at wind farms using dogs and human searchers. Data Dryad. https://doi.org/10.5061/dryad.n02v6wwtx

[3]

Domínguez del Valle, J, Cervantes Peralta, F and Jaquero Arjona, M I (2020). Factors affecting carcass detection at wind farms using dogs and human searchers. Journal of Applied Ecology. Wiley Online Library. 57 1926–35

[4]

Guyon, I and Elisseeff, A (2003). An introduction to variable and feature selection. Journal of Machine Learning Research. 3 1157–82

[5]

Kaplan, J (2024). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. https://CRAN.R-project.org/package=fastDummies

[6]

Liaw, A and Wiener, M (2002). Classification and regression by randomForest. R News. 2 18–22. https://CRAN.R-project.org/doc/Rnews/

[7]

Meyer, D, Dimitriadou, E, Hornik, K, Weingessel, A and Leisch, F (2024). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://CRAN.R-project.org/package=e1071

[8]

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

[9]

Tay, J K, Narasimhan, B and Hastie, T (2023). Elastic net regularization paths for all generalized linear models. Journal of Statistical Software. 106 1–31