Da Corte, Miguel
Baptista, Jorge
2026-06-02
2026-06-02
2025
978-989-758-746-7
http://hdl.handle.net/10400.1/29075
This study investigates the adequacy of Machine Learning (ML)-based systems, specifically ACCUPLACER, compared to human rater classifications within U.S. Developmental Education. A corpus of 100 essays was assessed by human raters using 6 linguistic descriptors, with each essay receiving a skill-level classification. These classifications were compared to those automatically generated by ACCUPLACER. Disagreements among raters were analyzed and resolved, producing a gold standard used as a benchmark for modeling ACCUPLACER’s classification task. A comparison of skill levels assigned by ACCUPLACER and humans revealed a “weak” Pearson correlation (ρ = 0.22), indicating a significant misplacement rate and raising important pedagogical and institutional concerns. Several ML algorithms were tested to replicate ACCUPLACER’s classification approach. Using the Chi-square (χ²) method to rank the most predictive linguistic descriptors, Naïve Bayes achieved 81.1% accuracy with the top-four ranked features. These findings emphasize the importance of refining descriptors and incorporating human input into the training of automated ML systems. Additionally, the gold standard developed for the 6 linguistic descriptors and overall skill levels can be used to (i) assess and classify students’ English (L1) writing proficiency more holistically and equitably; (ii) support future ML modeling tasks; and (iii) enhance both student outcomes and higher education efficiency.
eng
Developmental Education (DevEd)
Automatic writing assessment systems
English (L1) writing proficiency assessment
Natural language processing (NLP)
Machine-learning (ML) models
Toward consistency in writing proficiency assessment: mitigating classification variability in developmental education
conference object
10.5220/0013353900003932
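
The Chi-square ranking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the descriptor names, the binary descriptor scores, and the two skill levels below are hypothetical toy values chosen only to show how a χ² statistic can order descriptors by how strongly they are associated with the assigned level. The helper names `chi_square` and `rank_descriptors` are likewise assumptions.

```python
from collections import Counter


def chi_square(feature_values, labels):
    """Pearson chi-square statistic for the contingency table of one
    categorical feature against the class labels."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Expected cell count if feature and label were independent.
            expected = value_count * label_count / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def rank_descriptors(rows, labels, names, top_k):
    """Return the top_k descriptor names, highest chi-square score first."""
    scores = {
        name: chi_square([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: one row per essay, one column per descriptor.
names = ["descriptor_a", "descriptor_b", "descriptor_c"]
rows = [
    (1, 0, 1),
    (1, 1, 0),
    (1, 0, 1),
    (0, 1, 1),
    (0, 0, 0),
    (0, 1, 1),
]
labels = [2, 2, 2, 1, 1, 1]  # hypothetical skill levels

top = rank_descriptors(rows, labels, names, top_k=2)
print(top)
```

In this toy data `descriptor_a` separates the two levels perfectly and so ranks first, while `descriptor_c` is distributed identically across both levels and scores zero. The retained descriptors would then be passed to a classifier such as Naïve Bayes, which is outside the scope of this sketch.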