The Hardest Thing to Copy in AI Is Not the Model. It Is the Data.
Why the dataset, not the architecture, is the advantage that compounds.
In AI, models propagate quickly. Data does not. Architectures get published and training methods get shared, but a competitor cannot download the real-world data you spent years collecting.
Over the past two articles I wrote about what it means to deploy AI inside physical systems, and why that AI has to respect physics rather than just recognise patterns. Physics gives the model boundaries. Data gives it experience. This article addresses the part of the problem that receives the least attention and determines outcomes most: the data.
There is a common assumption that the competitive advantage in AI lives in the model. The architecture, the number of parameters, the cleverness of the training method. Those things matter. But they are also the most portable. Architectures are published. Training techniques are presented at conferences and copied within months. If your advantage is your model design, your advantage has a short shelf life.
The thing that does not transfer is the data. And for AI that operates physical systems, the data is the hardest asset in the world to assemble.
Why physical-system data is so hard to get
A language model can be trained on text that already exists. The internet is full of it. An image model can learn from billions of labelled pictures. The raw material is abundant and, for the most part, already collected.
Battery data is the opposite. It does not exist until you generate it. Every useful data point comes from a real cell, cycled under real conditions, measured over real time. To understand how a cell behaves across its life you have to age it across its life. Time is an irreducible constraint. A dataset that captures how cells degrade over years takes years to build.
It is also capital- and time-intensive. Lab cycling requires equipment, controlled environments, and months of continuous testing per cell across a matrix of chemistries, temperatures, and load profiles. Field data requires deployed systems in real vehicles and real energy installations, instrumented and collected over the operating life of the asset.
Eatron has been collecting both for years. We monitor tens of thousands of vehicles and assets in the field, across hundreds of millions of kilometres of real driving, alongside years of controlled lab campaigns that capture how cells age, degrade and fail under stress. The result is a structured, multi-source battery dataset built from three distinct sources: lab truth, field data, and rare-event data. None of it came from single download. It came from years of work that cannot be compressed.
The data nobody else has: rare events
Volume is only part of the story. The more important part is what is inside the data.
The dataset comprises three complementary components. The first is lab truth: ground-truth data generated under controlled stress, where cells are deliberately aged and pushed to failure across chemistries and temperatures. The second is fleet data: the messy, real-world behaviour of assets in the field, with real duty cycles, real climate, telemetry gaps and genuine faults. The third is the one that is hardest to get, and the one that matters most.
Model performance in safety-critical systems is dominated by rare events. Most battery AI initiatives fail not because of model capability, but because their datasets systematically exclude the failure modes they are expected to detect. The early signature of a cell beginning to fail, thermal events, the subtle anomalies that precede a real problem. These are, by definition, rare. Most datasets do not contain them, because most operators have never observed them, and you cannot easily replicate them on demand.
“You cannot confidently train a model to recognise a failure it has never seen. The rare events are the data that matters most, and the data hardest to get.”
Eatron has this data because it has been in the field at scale for years, with fault-labelled subsets collected across real fleets. This combination of scale, field exposure, and labelled edge-case data is structurally difficult to replicate. It requires multi-year data accumulation under real operating conditions, which cannot be accelerated or synthetically reproduced.
What the data makes possible: faster deployment and minimal testing
A rare-event-rich data asset is not valuable in itself. It is valuable because of what it lets the model do.
Most Battery Management Systems require weeks of per-asset calibration. Before the model can be trusted on a new cell or a new pack, it has to be tested and tuned against that specific hardware. That is slow and capital- and time-intensive, and it creates a scaling bottleneck.
Because our proprietary foundational model is pre-trained on a sufficiently diverse dataset to generalise, it can produce state-of-charge / health and remaining-useful-life estimates from raw voltage, current, and temperature with far less battery-specific work. This removes cell-specific testing as a primary bottleneck in scaling deployment. Exposure to diverse degradation pathways and failure precursors allows the model to learn invariant representations of battery behaviour, rather than overfitting to specific cells or duty cycles. This is what allows the model to generalize across chemistries, ageing profiles, and duty cycles without requiring per-asset recalibration.
We recently put this to the test with an Asian partner on an LFP cell/pack the AI model had not seen before. The chemistry was familiar, but this specific cell was new to the model, with no prior characterisation. The model was evaluated under degraded sensing conditions representative of real deployment, including sensor noise and initial state errors. The model maintained accuracy across a range of real-world duty cycles, from grid services to EV driving. The one area that needed work was cold-temperature performance, which was expected and was closed with a small amount of targeted fine-tuning.
The dataset enables a model that can generalise to a new battery far faster than a model trained on narrower data, under conditions that would break it otherwise.
Why this is a moat that grows
The most important property of a data advantage is that it compounds.
Every system we deploy generates more field data. Diversity of accumulated field data enables invariant representation learning. Improved model performance increases deployment probability. More deployments generate more data, including more of the rare events that are so hard to capture. The advantage compounds over time. In physical systems, this advantage compounds more slowly than in purely digital domains, but with significantly higher defensibility because data acquisition is constrained by real-world time and deployment cycles.
“A model advantage can be copied in months. A data advantage built over years, growing with every deployment, cannot.”
This is why I think the industry pays too much attention to model architecture and not enough to the data underneath it. For AI that operates in the physical world, the data asset is the durable advantage. It is the part that took years, that captures events nobody else has seen, and that gets stronger every time a system goes live.
Architecture is how the model thinks. Data defines the model’s operating knowledge. In safety-critical physical systems, that is the thing you cannot afford to get wrong, and the thing a competitor cannot quickly reproduce. For organisations building AI for physical systems, the primary risk is not choosing the wrong model, but building on insufficient data.
Next time our CTO writes about the architecture itself: how we built a network that respects the laws of physics, and why that design is what makes the data usable in real time on embedded hardware.
Dr Umut Genc is co-founder and CEO of Eatron Technologies, a UK-based deep-tech company specialising in AI-powered Battery Optimisation Software for mobility and energy applications. Eatron’s Battery Optimisation Software is production-validated across automotive OEMs, two-wheeler platforms, and grid-scale energy storage systems.