Summary
Small and Medium-sized Enterprises (SMEs) often record transactions with non‑standard, context-specific descriptions, which makes automated categorisation hard and limits the usefulness of cash‑flow lending models. This paper presents a practical pipeline that: (1) uses a large language model to generate class‑balanced synthetic transaction text to address data scarcity and imbalance; (2) fine‑tunes a financial-domain model (FinBERT) on the enriched dataset; and (3) calibrates probabilities with temperature scaling so that a predicted class probability matches the observed frequency with which that prediction is correct. Using Open Banking data, we demonstrate strong performance, with particularly high accuracy on high-confidence predictions, and robust generalisation across firms.
Authors: Pietro Alessandro Aluffi (University of Warwick, Navrisk), Brandi Jess (Navrisk), Marya Bazzi (University of Warwick, sea.dev, SME Capital), Kate Kennedy (SME Capital, Navrisk), Matt Arderne (sea.dev, SME Capital), Daniel Rodrigues (SME Capital, Navrisk), Martin Lotz (University of Warwick)
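
As an illustration of step (3) in the summary, the following is a minimal sketch of temperature scaling, not the authors' implementation. It assumes that a fine-tuned classifier has already produced raw logits for a held-out validation set. The function names (`softmax`, `mean_nll`, `fit_temperature`), the search bounds, and the toy logits and labels are all hypothetical and chosen only for illustration. A single scalar T is fitted by minimising the validation negative log-likelihood, and every logit vector is divided by T before the softmax.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logits_batch, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, label in zip(logits_batch, labels):
        loss -= math.log(softmax(logits, temperature)[label])
    return loss / len(labels)


def fit_temperature(logits_batch, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log(T) for the T that minimises mean_nll.

    The bounds lo and hi are illustrative assumptions. The loss is unimodal
    in T (it is convex in 1/T), so a bracketing search is sufficient.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if mean_nll(logits_batch, labels, math.exp(c)) < mean_nll(
            logits_batch, labels, math.exp(d)
        ):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2)


# Hypothetical validation logits for three categories; two of the six
# confident predictions are wrong, i.e. the model is overconfident.
val_logits = [
    [8.0, 1.0, 0.0],
    [7.0, 0.5, 0.0],
    [0.0, 6.0, 1.0],
    [1.0, 0.0, 9.0],
    [6.0, 0.0, 1.0],
    [0.5, 7.5, 0.0],
]
val_labels = [0, 1, 1, 2, 2, 1]

T = fit_temperature(val_logits, val_labels)
calibrated = [softmax(row, T) for row in val_logits]
print(f"fitted temperature: {T:.3f}")
print("max probability before:", round(max(softmax(val_logits[1])), 3))
print("max probability after: ", round(max(calibrated[1]), 3))
```

Because dividing by a positive scalar does not change the ordering of the logits, the predicted category for each transaction is unchanged; only the confidence attached to it moves. In a real setting the temperature would be fitted on validation data that was not used for fine-tuning and then applied to test-time logits.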