How We Measure What We Can’t See: Crop Classification Accuracy Across 20 Countries
Why measurement is the product
Crop intelligence is only as useful as the confidence you can place in it. When a trading desk uses planted-area figures to size a position, or an input manufacturer plans sales territories based on crop mix data, the margin for error is the margin for loss.
And yet, most crop intelligence providers don’t publish how they validate their numbers. Accuracy claims are vague (“over 90%”), evaluation scope is unclear, and the testing methodology is rarely explained.
At Hyperplan, we believe that the way you measure is as important as what you measure. We run a structured, repeatable validation framework across every country, every crop, every season. And we do it in-season — not as a retrospective exercise months after the data was already used.
This article explains how.
How we classify crops
For each agricultural field in our database, our model answers one question: what’s growing here this season?
It does this by processing satellite time series at the individual parcel level. Each field is observed multiple times throughout the growing season through Sentinel-2 imagery, capturing visible light, near-infrared, and vegetation indices at up to 25 points in time per year. The model watches how each field’s spectral signature changes across the season: in the northern hemisphere, winter wheat and winter barley look identical in January but diverge by April. Sunflower and maize are indistinguishable in March but distinct by July.
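To make "vegetation indices" concrete, here is a minimal sketch of NDVI, a widely used index built from the red and near-infrared bands (B4 and B8 on Sentinel-2). NDVI serves only as a familiar example: the choice of index, the function name, and the reflectance values are illustrative assumptions, not a description of the production pipeline.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # avoid division by zero on empty or masked pixels
    return (nir - red) / total


# Hypothetical field-average reflectances for one parcel: (date, red, nir).
observations = [
    ("2025-03-10", 0.10, 0.20),
    ("2025-05-05", 0.06, 0.38),
    ("2025-07-01", 0.05, 0.45),
]

# One NDVI value per acquisition date gives the time series a model can read.
ndvi_series = [(date, ndvi(nir, red)) for date, red, nir in observations]
for date, value in ndvi_series:
    print(f"{date}: NDVI = {value:.2f}")
```

In this made-up series the index rises as the canopy develops, which is the kind of seasonal shape that separates one crop from another.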
The architecture is a recurrent neural network (RNN) — a type of deep learning model specifically designed to process sequences. It reads the satellite observations in order, building up a picture of what each field is growing as more data arrives. Critically, it doesn’t need to wait for the full season of data: the model produces a classification and a confidence score at each timestep, so it can identify winter crops as early as January and spring crops by mid-season.
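The idea of emitting a label and a confidence score at every timestep can be sketched with the simplest possible recurrence. This is a toy under stated assumptions: a plain tanh recurrent cell with untrained random weights, three crop classes, and four made-up features per observation. The production cell type, feature set, and trained weights are not described here, so every name and value below is hypothetical.

```python
import math
import random

CROPS = ["wheat", "barley", "maize"]  # hypothetical class list


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def classify_over_time(observations, w_in, w_rec, w_out):
    """Read observations in order; return (label, confidence) after each one."""
    hidden = [0.0] * len(w_rec)  # the state carries what has been seen so far
    timeline = []
    for features in observations:
        drive = matvec(w_in, features)
        memory = matvec(w_rec, hidden)
        hidden = [math.tanh(d + m) for d, m in zip(drive, memory)]
        probs = softmax(matvec(w_out, hidden))
        best = max(range(len(CROPS)), key=lambda i: probs[i])
        timeline.append((CROPS[best], probs[best]))
    return timeline


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def random_matrix(rows, cols):
    """Placeholder weights; a real model would learn these from ground truth."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


N_FEATURES, N_HIDDEN = 4, 8
w_in = random_matrix(N_HIDDEN, N_FEATURES)
w_rec = random_matrix(N_HIDDEN, N_HIDDEN)
w_out = random_matrix(len(CROPS), N_HIDDEN)

# Three made-up observations of one field (e.g. band values plus an index).
season = [
    [0.05, 0.08, 0.30, 0.58],
    [0.04, 0.06, 0.45, 0.76],
    [0.06, 0.10, 0.35, 0.56],
]

timeline = classify_over_time(season, w_in, w_rec, w_out)
for step, (crop, confidence) in enumerate(timeline, start=1):
    print(f"after observation {step}: {crop} ({confidence:.2f})")
```

Because the weights are random, the labels printed here carry no meaning. The point is the shape of the output: one prediction per observation, available as soon as that observation arrives.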
To help distinguish crops that look spectrally similar (wheat vs. barley, for instance), the model can also use auxiliary data: historical crop rotation patterns, or regional crop statistics that act as a prior on what’s likely growing in a given area.
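One simple way a regional prior could influence an ambiguous prediction is to reweight the model's class probabilities by the prior and renormalize. How the auxiliary data actually enters the model is not specified here, so the function name, the reweighting scheme, and the numbers below are assumptions for illustration.

```python
def apply_prior(model_probs: dict, prior: dict) -> dict:
    """Reweight class probabilities by a prior, then renormalize to sum to 1."""
    weighted = {crop: p * prior.get(crop, 0.0) for crop, p in model_probs.items()}
    total = sum(weighted.values())
    if total == 0:
        return dict(model_probs)  # prior excludes every class: keep the model output
    return {crop: w / total for crop, w in weighted.items()}


# Hypothetical case: the satellite signal alone barely separates the two crops...
model_probs = {"wheat": 0.48, "barley": 0.52}
# ...but regional statistics say wheat is four times more common in this area.
regional_prior = {"wheat": 0.80, "barley": 0.20}

posterior = apply_prior(model_probs, regional_prior)
for crop, prob in posterior.items():
    print(f"{crop}: {prob:.2f}")
```

With a near-tie from the imagery, the prior decides the call. When the imagery is decisive, the same prior would move the result far less.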
Each country gets its own model, calibrated to local crop calendars, climate patterns, and the specific set of crops grown there. A model trained for France won’t be applied to Romania — the pedo-climatic conditions are too different. This is expensive to build, but it’s what separates production-grade crop classification from research prototypes.
Area-level accuracy: does the total area match official statistics?
The question
When we sum up all the fields classified as wheat in France, does the total match what the French government reports? If we say there are 4,400 thousand hectares of wheat, and Eurostat says 4,454 thousand hectares, that’s a 1% gap — well within the margin any commercial decision can tolerate.
Why it matters
This is the metric that matters most for market sizing, procurement planning, and production forecasting. When a trader needs to estimate Ukraine’s sunflower supply, or a cooperative needs to plan silo allocation by crop, what they need is a reliable total — not necessarily the label on every single field.
How we measure it
We compare our total estimated crop area per country against published national statistics from official sources: Eurostat for EU countries, StatCan for Canada, SSSU for Ukraine, APIA for Romania, and others. The result is a percentage deviation per crop per country. Even if a model occasionally misclassifies individual fields, the errors can cancel out at the aggregate level: a wheat field incorrectly labeled as barley might be offset by a barley field labeled as wheat elsewhere.
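The deviation calculation and the cancellation effect can be sketched in a few lines. The France wheat figures are the ones quoted earlier in this section; the four-field example and all function names are hypothetical.

```python
def area_deviation_pct(estimated: float, official: float) -> float:
    """Signed percentage gap between an estimated and an official area."""
    return 100.0 * (estimated - official) / official


# France wheat, in thousand hectares, as quoted above.
france_wheat_gap = area_deviation_pct(4400, 4454)
print(f"France wheat: {france_wheat_gap:+.1f}%")

# Hypothetical fields: (area in hectares, true crop, predicted crop).
fields = [
    (10.0, "wheat", "barley"),   # wheat mislabeled as barley
    (10.0, "barley", "wheat"),   # barley mislabeled as wheat
    (25.0, "wheat", "wheat"),
    (15.0, "barley", "barley"),
]


def total_area(rows, crop, column):
    """Sum field areas where the given column (1 = truth, 2 = prediction) is `crop`."""
    return sum(row[0] for row in rows if row[column] == crop)


true_wheat = total_area(fields, "wheat", 1)
predicted_wheat = total_area(fields, "wheat", 2)
misclassified = sum(1 for _, truth, pred in fields if truth != pred)

print(f"misclassified fields: {misclassified} of {len(fields)}")
print(f"wheat area gap: {area_deviation_pct(predicted_wheat, true_wheat):+.1f}%")
```

Two of the four fields carry the wrong label, yet the wheat total is unchanged. That is why area-level and field-level accuracy are measured separately.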
Results
Below is a selection of country-level comparisons, shown as Hyperplan kha / Official kha (Δ%). Green means within 7%, yellow means 7–15%, orange means above 15%.
| Country | Source | Wheat | Maize | Barley | Rapeseed | Sunflower |
|---|---|---|---|---|---|---|
| France | Eurostat | 4,399/4,454 (-1%) | 2,893/2,869 (+1%) | 1,815/1,808 (0%) | 1,328/1,327 (0%) | 794/754 (+5%) |
| Bulgaria | Eurostat | 1,194/1,195 (0%) | 542/531 (+2%) | 198/196 (+1%) | 66/64 (+3%) | 933/929 (0%) |
| Romania | APIA | 1,933/2,010 (-4%) | 1,530/1,592 (-4%) | 536/557 (-4%) | 508/527 (-4%) | 1,220/1,269 (-4%) |
| Germany | Eurostat | 2,739/2,615 (+5%) | 2,635/2,547 (+3%) | 1,725/1,660 (+4%) | 1,138/1,088 (+5%) | 53/51 (+4%) |
| Ukraine | SSSU | 5,080/4,911 (+3%) | 4,075/4,264 (-4%) | 1,422/1,408 (+1%) | 1,184/1,264 (-6%) | 5,318/4,988 (+7%) |
| Canada | StatCan | 11,973/10,940 (+9%) | 1,942/1,953 (-1%) | 2,557/2,483 (+3%) | 9,394/8,750 (+7%) | — |
| Czechia | Eurostat | 773/776 (0%) | 315/311 (+1%) | 316/317 (0%) | 346/343 (+1%) | — |
| Denmark | Eurostat | 476/476 (0%) | 192/192 (0%) | 592/571 (+4%) | 181/181 (0%) | — |
Values in thousand hectares. Official sources: Eurostat, StatCan, SSSU, APIA.
The picture is remarkably clean. France is within 0–1% for wheat, maize, barley, and rapeseed, with sunflower at +5%. Bulgaria is within 0–3%. Romania is consistently at -4% across the board, a systematic offset likely linked to boundary coverage rather than classification error. Czechia is within 0–1%, and Denmark is spot-on for every crop except barley (+4%).
For the commercial use cases that depend on area estimation, these are production-grade numbers.
Field-level accuracy: did we get the right crop on each field?
The question
If we say a specific field is growing wheat, is that correct? And among all the fields that really are growing wheat, how many did we correctly identify?
Why it matters
This is the metric that matters most for parcel-level use cases: commercial targeting, farm-level reporting, field-by-field yield estimation, and any workflow where you need to know what a specific farmer is growing. A trader might tolerate aggregate-level noise, but a sales rep visiting a farm needs the field-level classification to be right.
It also underpins API use cases: crop classification delivered at field level feeds directly into farm management systems and decision platforms used by agribusinesses.
How we measure it
We use a metric called F1 score. In plain terms, it combines two questions into one number: among all the fields we labeled as crop X, how many really were crop X (precision)? And among all the fields that really were crop X, how many did we find (recall)? F1 is the harmonic mean of precision and recall, so it is high only when both are high. It ranges from 0 to 1. Anything above 0.70 is considered strong for satellite-based crop classification. Above 0.90 is exceptional.
We compute F1 for every crop in every country, validated against official government ground truth data — the same data that governments use for subsidy administration. This isn’t self-assessment.
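For readers who want the arithmetic, here is a minimal sketch of per-crop precision, recall, and F1 computed from field labels. The six-field sample and the function name are hypothetical; only the metric definitions are standard.

```python
def f1_per_crop(true_labels, predicted_labels):
    """Return precision, recall and F1 for every crop seen in either list."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for crop in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == crop and p == crop)  # correctly labeled
        fp = sum(1 for t, p in pairs if t != crop and p == crop)  # labeled crop, was not
        fn = sum(1 for t, p in pairs if t == crop and p != crop)  # was crop, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[crop] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical ground truth and model output for six fields.
truth = ["wheat", "wheat", "wheat", "barley", "barley", "maize"]
predicted = ["wheat", "wheat", "barley", "barley", "wheat", "maize"]

scores = f1_per_crop(truth, predicted)
for crop, s in scores.items():
    print(f"{crop}: precision={s['precision']:.2f} recall={s['recall']:.2f} f1={s['f1']:.2f}")
```

In this sample, one wheat field and one barley field are swapped. The swap lowers F1 for both crops even though, as with the area totals, the counts per crop still match.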
Results
The table below shows field-level F1 scores across our largest countries. Green (> 0.70) = strong, yellow (0.60–0.70) = moderate, orange (< 0.60) = model being improved.
| Country | Wheat | Maize | Barley | Rapeseed | Soybean | Sunflower | Sugarbeet |
|---|---|---|---|---|---|---|---|
| 🇫🇷 France | 🟢 0.96 | 🟢 0.92 | 🟢 0.91 | 🟢 0.96 | 🟢 0.87 | 🟢 0.90 | 🟢 0.85 |
| 🇩🇪 Germany | 🟢 0.99 | 🟢 0.92 | 🟢 0.97 | 🟢 0.99 | 🟢 0.74 | 🟢 0.85 | 🟢 0.97 |
| 🇧🇬 Bulgaria | 🟢 0.94 | 🟢 0.85 | 🟢 0.87 | 🟢 0.87 | N/A | 🟢 0.92 | N/A |
| 🇨🇿 Czechia | 🟢 0.93 | 🟢 0.99 | 🟡 0.69 | 🟢 0.91 | 🟢 0.94 | 🟢 0.93 | 🟢 0.96 |
| 🇷🇴 Romania | 🟢 0.91 | 🟢 0.83 | 🟢 0.84 | 🟢 0.87 | 🟢 0.80 | 🟢 0.82 | 🟢 0.98 |
| 🇺🇦 Ukraine | 🟢 0.85 | 🟢 0.84 | 🟡 0.67 | 🟢 0.75 | 🟢 0.87 | 🟢 0.84 | 🟢 0.93 |
| 🇬🇧 United Kingdom | 🟢 0.89 | 🟢 0.75 | 🟢 0.93 | 🟢 0.93 | N/A | N/A | 🟢 0.94 |
| 🇨🇦 Canada | 🟡 0.70 | 🟢 0.88 | 🟢 0.77 | 🟢 0.92 | 🟢 0.86 | 🟢 0.96 | N/A |
| 🇿🇦 South Africa | WIP | 🟢 0.81 | WIP | WIP | 🟡 0.68 | WIP | N/A |
F1 scores validated against official ground truth (LPIS, StatCan, APIA). N/A = crop not significant in country or not evaluated. WIP = work in progress.
France and Germany are at 0.85–0.99 for every crop shown except German soybean (0.74), effectively matching what a human surveyor would achieve. Bulgaria, Czechia, and Romania are at 0.80 or above for every crop shown except Czech barley (0.69).
A note on barley: barley is the crop that most often falls into the yellow band (0.69 in Czechia, 0.67 in Ukraine), because its spectral signature closely resembles wheat for much of the growing season. This is where our use of prior data becomes particularly valuable: historical crop rotation patterns and regional statistics help the model disambiguate barley from wheat in-season, especially when the satellite signal alone is insufficient. Barley accuracy improves significantly as the season progresses and more observations accumulate.
We don’t hide the yellow or work-in-progress cells. A crop intelligence provider that only shows you their best numbers isn’t telling you the full story.
Deep dive example: Bulgaria in-season accuracy evolution
Every season tells its own story: some crops lock in early, others keep the market guessing until harvest. That’s precisely why static accuracy benchmarks miss the point. Here’s where our obsession with measurement becomes most visible. We don’t just validate once at the end of the season. We track accuracy at every cutoff date as new satellite data arrives, showing exactly when each crop becomes reliably classifiable.
The table below shows Bulgaria’s F1 score evolution across the 2025 season, from pre-season (December 2024) to end-of-season (December 2025):
| Crop | Dec '24 | Feb | Mar | Apr | May | Jun | Jul | Dec '25 |
|---|---|---|---|---|---|---|---|---|
| Weighted F1 | 0.69 | 0.73 | 0.75 | 0.76 | 0.76 | 0.79 | 0.86 | 0.88 |
| Winter wheat | 0.85 | 0.89 | 0.91 | 0.92 | 0.93 | 0.94 | 0.94 | 0.94 |
| Sunflower | 0.69 | 0.72 | 0.73 | 0.71 | 0.71 | 0.74 | 0.87 | 0.92 |
| Maize | 0.40 | 0.43 | 0.45 | 0.43 | 0.40 | 0.51 | 0.78 | 0.85 |
| Rapeseed | 0.77 | 0.82 | 0.81 | 0.81 | 0.85 | 0.87 | 0.88 | 0.87 |
| Winter barley | 0.21 | 0.34 | 0.47 | 0.66 | 0.67 | 0.74 | 0.81 | 0.72 |
Bulgaria 2025 backtest. F1 validated against official ground truth at each cutoff date.
Read left to right and you see the season come into focus:
Winter wheat is already at 0.85 in December — before the new year starts. By March it’s at 0.91. The model identifies winter wheat confidently and early, because its phenological signature is distinct by late autumn.
Maize starts at 0.40 — essentially unusable — because it hasn’t been planted yet. It jumps to 0.78 in July when the crop reaches full canopy, and finishes at 0.85. This is the expected pattern: spring crops can’t be classified until they exist in the field.
Sunflower follows a similar trajectory: it hovers around 0.70 through June, then jumps to 0.87 in July and reaches 0.92 by season end.
Winter barley is the hardest crop in Bulgaria: spectrally close to wheat, with a smaller planted area that makes it harder for the model to learn from. It starts at 0.21 in December, climbs steadily as prior data and new satellite observations accumulate, and reaches 0.81 by July, before easing to 0.72 at the end-of-season cutoff.
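The same reading can be done programmatically: for each crop, find the first cutoff from which F1 stays above a chosen threshold. The scores below are copied from the Bulgaria 2025 table; the function name and the "stays above for the rest of the season" rule are assumptions made for this sketch, with 0.70 taken from the "strong" level defined earlier.

```python
CUTOFFS = ["Dec '24", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Dec '25"]

# Per-crop F1 at each cutoff, from the Bulgaria 2025 backtest table.
BULGARIA_2025 = {
    "Winter wheat":  [0.85, 0.89, 0.91, 0.92, 0.93, 0.94, 0.94, 0.94],
    "Sunflower":     [0.69, 0.72, 0.73, 0.71, 0.71, 0.74, 0.87, 0.92],
    "Maize":         [0.40, 0.43, 0.45, 0.43, 0.40, 0.51, 0.78, 0.85],
    "Rapeseed":      [0.77, 0.82, 0.81, 0.81, 0.85, 0.87, 0.88, 0.87],
    "Winter barley": [0.21, 0.34, 0.47, 0.66, 0.67, 0.74, 0.81, 0.72],
}


def first_reliable_cutoff(scores, threshold=0.70):
    """Earliest cutoff from which every later score is above the threshold."""
    reliable_from = None
    for cutoff, score in zip(CUTOFFS, scores):
        if score > threshold:
            if reliable_from is None:
                reliable_from = cutoff  # candidate start of a reliable run
        else:
            reliable_from = None  # dipped back below: discard the candidate
    return reliable_from


for crop, scores in BULGARIA_2025.items():
    print(f"{crop}: reliable from {first_reliable_cutoff(scores)}")
```

Raising the threshold changes the answer: at 0.80, maize only qualifies at the final cutoff, and winter barley never qualifies because it drops to 0.72 at season end.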
And this isn’t just retrospective analysis. We’re running the same validation live on the 2026 season right now. As of mid-April 2026, Bulgaria’s weighted F1 is already at 0.83 — and the crop-level breakdown shows the model is performing in real time, not only in hindsight:
Winter wheat reaches 0.94 by April, a level it did not reach until June in 2025. Sunflower is at 0.86. Rapeseed is at 0.85. And winter barley, the hardest crop in the Bulgarian mix, is already at 0.87 in April 2026, compared to 0.66–0.67 at the April and May 2025 cutoffs. That’s a roughly 20-point year-over-year improvement on the single crop that historically gives satellite classification the most trouble.
Maize sits at 0.65, which is expected, since spring planting is still in progress and the crop hasn’t reached canopy closure yet. Based on the 2025 trajectory (0.78 in July, 0.85 at season end), we expect maize to approach 0.80 by July.
The season is still in progress, and the numbers are already actionable. This is what in-season validation looks like: not a promise of future accuracy, but a live, measurable curve that clients can track alongside their own planning cycles.
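A weighted F1 rolls the per-crop scores into one number. The weighting scheme is not defined in this article, so the sketch below assumes a common convention: each crop's F1 is weighted by its share of ground-truth fields. The per-crop scores are the April 2026 values quoted above; the shares are invented for illustration, so the result is not expected to reproduce the 0.83 figure.

```python
def weighted_f1(f1_by_crop: dict, weight_by_crop: dict) -> float:
    """Average per-crop F1 scores, weighting each crop by its share of fields."""
    total_weight = sum(weight_by_crop[crop] for crop in f1_by_crop)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(f1 * weight_by_crop[crop] for crop, f1 in f1_by_crop.items())
    return weighted_sum / total_weight


# Per-crop F1 for Bulgaria as of mid-April 2026, as quoted in the text.
april_2026_f1 = {
    "winter wheat": 0.94,
    "sunflower": 0.86,
    "rapeseed": 0.85,
    "winter barley": 0.87,
    "maize": 0.65,
}

# Hypothetical share of ground-truth fields per crop (percent). Not real data.
assumed_field_share = {
    "winter wheat": 50,
    "sunflower": 25,
    "maize": 15,
    "rapeseed": 5,
    "winter barley": 5,
}

overall = weighted_f1(april_2026_f1, assumed_field_share)
print(f"weighted F1 under the assumed shares: {overall:.2f}")
```

The design consequence is that a dominant crop drives the headline number, which is why the per-crop rows matter as much as the weighted one.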
Why this matters for our clients
In-season validation is not an academic exercise. It’s what allows us to tell a client: “your wheat area estimates for Bulgaria are reliable today, your maize estimates will be reliable by July.” That’s a fundamentally different conversation than “our data is generally 90% accurate.”
It means an input manufacturer can trust the territory-level crop mix data they’re using to plan sales campaigns in April. It means a trader can start building supply estimates for winter crops in January, with known confidence levels. It means a cooperative can plan procurement logistics months before harvest, with visibility into exactly how the accuracy curve will improve.
And when we see orange cells in our validation tables — a crop that’s underperforming in a specific country — we don’t hide it. We flag it, explain why, and track the model improvement cycle until it’s resolved. That’s what measurement as a discipline looks like.
The numbers behind the numbers
20 countries. 11 crop classes. 100M+ ground truth datapoints powering the models. Area estimates within 0–7% of official statistics across most countries and crops. Field-level F1 above 0.80 for major crops in core geographies.
These are the numbers behind the insights. Not marketing claims — measured, documented, reproducible results that we track in-season and update with every new satellite pass.
If you want to see what this looks like on your crops and geographies, we’ll show you the full validation dashboard — not a slide deck.
Reach out at hyperplan.ag/contact or DM us directly. We’ll walk you through the accuracy numbers for the countries and crops that matter to your business.