Projects

Statistical learning, data science, and software projects from research and graduate coursework.

Featured

🔒
Extended the evaluation of BalDRO, a robust optimisation-based machine unlearning method, in two directions: testing alternative forget objectives (WGA and TNPO) and running experiments across four additional model families (Llama-3.2-1B, Llama-3.1-8B, Qwen3-8B, Mistral-7B). Using the TOFU benchmark, the clearest result is that BalDRO + NPO substantially recovers forgetting on Llama-3.1-8B where plain NPO nearly fails. Results on other models and objectives are mixed and preliminary.
Python PyTorch HuggingFace Machine Unlearning LLMs BlueDot In Progress
🌾
Fit log-normal linear mixed models (LMMs) to ecological camera trap data from two sites in Germany, modelling distance to the nearest settlement as a function of landscape features and predator counts. Compared three estimation approaches (lme4 REML, blme MAP Bayesian, and brms full MCMC), all producing consistent results. Distance to the nearest road was the most strongly associated predictor.
R Linear Mixed Models lme4 Bayesian brms Ecology
🌿
Applied elastic net regularization for variable selection, then compared logistic regression, random forests, and SVMs for binary classification of successful bird/bat carcass detections near Spanish wind farms. Using real ecological field data (~400 observations), the analysis finds that dog searchers are approximately 10× more likely to successfully detect a carcass than human searchers; searcher type is the dominant predictor across all three models.
R Binary Classification Elastic Net Random Forest SVM Logistic Regression Ecology
📊
Built and deployed a full-stack R Shiny web application for Bayesian diagnostic model comparison, developed during a research assistantship at the University of Toronto. The app enables researchers to run ROC/AUC analysis and relative belief ratio inference across multiple models and sampling regimes without writing code.
R Shiny Bayesian ROC/AUC Statistical Software Deployed
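For readers unfamiliar with the AUC statistic the app reports, here is a minimal pure-Python sketch of one standard way to compute it: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The function name `roc_auc` and the toy data are illustrative assumptions; this is not the app's code, which is written in R.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (hypothetical): two negatives, two positives.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; a rank-based computation would be the usual choice for large samples.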

Additional Analyses

🍇
Compared gradient boosting, bagging, and random forests for distinguishing two Turkish raisin varieties (Kecimen and Besni) from 7 image-derived morphological features. All three methods achieve ~80–82% accuracy; bagging and random forests outperform boosting. Major axis length and perimeter are the most discriminative features.
R Ensemble Methods Gradient Boosting Random Forest Feature Importance
🫘
Trained a feedforward neural network with softmax output to classify 13,611 dry bean samples across 7 varieties using 16 morphological features. Hyperparameters were tuned via 5-fold cross-validation (size = 20, decay = 8). Achieved 81.9% overall accuracy; Bombay beans proved challenging due to class imbalance despite strong visual separability in exploratory analysis.
R Neural Networks Multiclass Classification Cross-Validation Class Imbalance
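As a small illustration of what a softmax output layer does, the sketch below turns a vector of raw network outputs into class probabilities and picks the most probable class. The names `softmax` and `predict_class`, the logits, and the class labels are all hypothetical; the project itself was done in R, so this is not its code.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, class_names):
    """Return the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


if __name__ == "__main__":
    # Hypothetical logits for three placeholder classes.
    label, prob = predict_class([0.1, 2.0, -1.0], ["a", "b", "c"])
    print(label, round(prob, 3))
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged but avoids overflow for large scores.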
🌽
Analyzed GeoTIFF crop inventory data from Agriculture and Agri-Food Canada across Saskatchewan and Manitoba (2009–2013), tracking land use for corn and soybeans. Corn production increased roughly 7-fold over the period; most growth occurred in Manitoba. An interactive Shiny app extends the analysis to additional crop types with user-selectable plots and colour schemes.
R Geospatial terra ggplot2 Shiny
🎮
Clustered 721 Pokémon species by six battle stats (HP, Attack, Defense, Special Attack, Special Defense, Speed) using agglomerative clustering, K-means, and PCA-based hierarchical clustering. All three methods consistently suggested two clusters, with Rand indices above 0.85 across all pairwise comparisons. Ward's linkage was the most stable across linkage types.
Python Clustering K-Means PCA Agglomerative scikit-learn
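The Rand index used to compare clusterings can be sketched in a few lines of plain Python: it is the share of point pairs on which two labelings agree, meaning both put the pair in the same cluster or both put it in different clusters. This sketch assumes the unadjusted Rand index; the function name `rand_index` and the toy labels are illustrative, not taken from the project.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index between two cluster labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    if len(labels_a) < 2:
        raise ValueError("need at least two points to form a pair")

    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:  # both "together" or both "apart"
            agree += 1
        total += 1
    return agree / total


if __name__ == "__main__":
    # Identical partitions with swapped label names still score 1.0.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because only co-membership is compared, the index is unaffected by how the cluster labels are numbered, which is what makes it suitable for comparing K-means against hierarchical methods.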
🍦
Compared elastic net regularization (ENET) and Bayesian additive regression trees (BART) for predicting vegan ice cream ratings from U.S. survey respondents (n = 274). Both models achieved an MSE of 0.91 on the vegan ice cream test set. Attitudes toward meat consumption, racial identity, and political orientation were the strongest predictors; ENET was preferred given comparable accuracy and much lower computational cost.
R Elastic Net BART Ridge Regression LOOCV Feature Importance
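For reference, the elastic net objective in its common glmnet-style parameterisation is shown below. The notation is an assumption for illustration ($\lambda$ for the overall penalty strength, $\alpha$ for the lasso/ridge mix), not taken from the project write-up; setting $\alpha = 0$ gives ridge regression and $\alpha = 1$ gives the lasso.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2
  + \alpha\,\lVert \beta \rVert_1 \right]
```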