Automated Yield Monitor Data Post-processing Pipeline via Explainable Model Benchmarking and Stacked Ensemble
ORCID
Drewry: https://orcid.org/0000-0003-3221-4364
MSU Affiliation
College of Agriculture and Life Sciences; Department of Agricultural and Biological Engineering; James Worth Bagley College of Engineering
Creation Date
2026-01-15
Abstract
Accurate yield data post-processing is a key component of agricultural field management and precision farming analyses. Traditional approaches to post-processing yield monitor data rely on rule-based filtering, manual inspection, and thresholding, which are time-consuming, inconsistent, and dependent on expert knowledge. This study demonstrates the viability of machine learning for automated, non-expert detection of erroneous data points, thereby increasing the scalability of yield map generation from raw yield monitor data. Historical yield data (4.6 million data points) were collected from 326 soybean and corn fields in the Delta region of Mississippi. Extensive feature engineering was conducted to derive spatial, operational, and geometric features to enrich model learning. Eight machine learning algorithms (Decision Tree, Random Forest, XGBoost, CatBoost, K-Nearest Neighbors, Artificial Neural Network, Naïve Bayes, and SGDClassifier) were trained with Bayesian-optimized hyperparameters and evaluated across multiple resampling strategies. CatBoost emerged as the best-performing model on the raw feature set, achieving an F1-score of 0.77. Random Forest (F1 = 0.76), XGBoost (F1 = 0.74), and Decision Tree (F1 = 0.72) also performed competitively on the raw dataset, though they fell slightly short of CatBoost's score. Evaluation on unseen fields demonstrated the model's ability to locate error-prone regions, though isolated false negatives were observed. A stacked ensemble model using XGBoost as the meta-learner slightly improved the F1-score (to 0.78), but gains were limited by high prediction correlation among the top base learners and the suboptimal performance of weaker classifiers. SHAP-based interpretation of CatBoost and XGBoost revealed that their predictions aligned well with domain knowledge.
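The models above are compared by F1-score, the harmonic mean of precision and recall, which suits this task better than accuracy if erroneous points are much rarer than valid ones. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes binary labels in which 1 marks an erroneous yield point, and it assumes F1 is computed for that class; the abstract states neither. The function name and the toy labels are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score of the `positive` class for two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of flagged points that are truly erroneous
    recall = tp / (tp + fn)     # share of erroneous points that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (assumption): 1 = erroneous point, 0 = valid point.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

In this toy case one erroneous point is missed (a false negative) and one valid point is wrongly flagged (a false positive), so precision and recall are both 3/4.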
Publication Date
2025-12-05
Publication Title
Computers and Electronics in Agriculture
Publisher
Elsevier
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
Recommended Citation
Uddin, M., Samiappan, S., & Drewry, J. L. (2026). Automated yield monitor data post-processing pipeline via explainable model benchmarking and stacked ensemble. Computers and Electronics in Agriculture, 241, 111278. https://doi.org/10.1016/j.compag.2025.111278