Sheetpedia

A 300K-Spreadsheet Corpus for Spreadsheet Intelligence and LLM Fine-Tuning

Zailong Tian1,2, Zhuoheng Han1, Houfeng Wang1*, Lizi Liao2

1Peking University 2Singapore Management University
Accepted to NeurIPS 2025 (Spotlight, DB track)

Spreadsheets are everywhere, but AI models still struggle to understand their mix of data and formulas. We introduce Sheetpedia, a corpus of over 290,000 real-world spreadsheets built to address this gap. To demonstrate its value, we define two tasks, translating natural language into cell ranges (NL2SR) and into formulas (NL2Formula), and fine-tune models on the corpus, reaching 97.5% and 71.7% accuracy, respectively, which are state-of-the-art results on these tasks. Sheetpedia paves the way for smarter, more intuitive spreadsheet tools.

A Glimpse into the Corpus

[Figure: Formula Co-occurrence Network]
[Figure: Formula Pattern Word Cloud]

Putting Sheetpedia to the Test

[Figure: Task Examples: NL to Semantic Range (NL2SR) & NL to Formula (NL2Formula)]
[Figure: Heatmap of Fine-Tuning Performance on LLaMA-3.1-8B]
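
To make the two tasks concrete, the sketch below shows what an NL2SR and an NL2Formula instance might look like and how predictions could be scored. Everything in it is an illustrative assumption rather than the Sheetpedia format or evaluation protocol: the `Example` record, the sample queries and targets, and the normalized exact-match metric are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical task record; not the actual Sheetpedia schema."""
    task: str    # "NL2SR" or "NL2Formula"
    query: str   # natural-language request
    target: str  # gold cell range or formula


def normalize(text: str) -> str:
    """Drop whitespace and case differences before comparing.

    This is a simplification: it would also alter string literals
    that appear inside a formula.
    """
    return "".join(text.split()).upper()


def exact_match_accuracy(examples, predictions):
    """Fraction of predictions equal to their targets after normalization."""
    if not examples:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(ex.target)
        for ex, pred in zip(examples, predictions)
    )
    return hits / len(examples)


# Made-up instances, one per task.
examples = [
    Example("NL2SR", "Select the sales figures for the first quarter", "B2:D2"),
    Example("NL2Formula", "Total of column B from row 2 to row 10", "=SUM(B2:B10)"),
]
predictions = ["b2:d2", "=SUM(B2:B11)"]

print(exact_match_accuracy(examples, predictions))  # 0.5
```

Exact match is only one possible scoring choice; an evaluation could instead execute the predicted formula and compare results, which would credit formulas that are written differently but compute the same value.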