Sheetpedia: A 300K-Spreadsheet Corpus

Spreadsheets are everywhere, but AI models still struggle to interpret their mix of data and formulas. We introduce Sheetpedia, a corpus of over 290,000 real-world spreadsheets built to address this challenge. To demonstrate its usefulness, we define two new tasks, translating natural language to cell ranges (NL2SR) and to formulas (NL2Formula), and fine-tune models on the corpus to achieve state-of-the-art results, reaching 97.5% and 71.7% accuracy, respectively. Sheetpedia fills a crucial gap in spreadsheet-focused data, paving the way for smarter, more intuitive spreadsheet tools.

A Glimpse into the Corpus

Figure: Formula co-occurrence network

Figure: Word cloud of formula patterns

Putting Sheetpedia to the Test

Figure: Examples of the NL2SR (NL to Semantic Range) and NL2Formula (NL to Formula) tasks

Figure: Heatmap of fine-tuning performance on LLaMA-3.1-8B