Categorising SME Bank Transactions with Machine Learning and Synthetic Data Generation

Aug 7, 2025

Marya Bazzi

A practical pipeline for classifying SME bank transactions. It combines synthetic data augmentation, FinBERT fine-tuning, and temperature-scaling calibration to handle business-context-specific transaction descriptions and improve automated categorisation for cash-flow lending models.

Summary

Small and Medium-sized Enterprises (SMEs) often have non-standard, context-specific transaction descriptions, which makes automated categorisation hard and limits the usefulness of cash-flow lending models. This paper presents a practical pipeline that: (1) uses a large language model to generate class-balanced synthetic transaction text to address data scarcity and class imbalance; (2) fine-tunes a financial-domain language model (FinBERT) on the enriched dataset; and (3) calibrates probabilities with temperature scaling so that predicted class probabilities reflect real-world label frequencies. Using Open Banking data, we demonstrate strong performance, with particularly high accuracy on confident predictions, and the results indicate robust generalisation across firms.
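To make step (3) concrete, here is a minimal sketch of temperature scaling in plain NumPy. It is an illustration under stated assumptions, not the paper's implementation: the logits are randomly generated stand-ins for a classifier's validation outputs, the helper names (`softmax`, `nll`, `fit_temperature`) are hypothetical, and the temperature is found by a simple grid search, where a real pipeline might use a gradient-based optimiser instead.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a positive temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature=1.0):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits, temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimises validation NLL (grid search)."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 400)  # assumed search range
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Synthetic stand-in for held-out validation outputs (assumption, not real data):
# labels are drawn from softmax(true_logits), then the "model" logits are made
# three times too sharp, i.e. deliberately overconfident.
rng = np.random.default_rng(0)
n_samples, n_classes = 5000, 4
true_logits = rng.normal(size=(n_samples, n_classes))
true_probs = softmax(true_logits)
labels = np.array([rng.choice(n_classes, p=p) for p in true_probs])
model_logits = 3.0 * true_logits

T = fit_temperature(model_logits, labels)
nll_before = nll(model_logits, labels, 1.0)
nll_after = nll(model_logits, labels, T)
print(f"fitted temperature: {T:.2f}")
print(f"validation NLL before: {nll_before:.3f}, after: {nll_after:.3f}")
```

Dividing every logit by the same positive temperature leaves the predicted class unchanged and only softens or sharpens the probabilities, so calibration of this kind affects confidence thresholds rather than accuracy. The temperature should be fitted on held-out data, not on the training set.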

Authors: Pietro Alessandro Aluffi (University of Warwick, Navrisk), Brandi Jess (Navrisk), Marya Bazzi (University of Warwick, Sea.dev, SME Capital), Kate Kennedy (SME Capital, Navrisk), Matt Arderne (Sea.dev, SME Capital), Daniel Rodrigues (SME Capital, Navrisk), Martin Lotz (University of Warwick)

Read the full paper →