Autonomy: Is Data a Big Deal?
By Sahil Potnis
February 13, 2025
Prelude
In the world of cutting-edge technology, from the simplest automation to the most advanced Artificial Intelligence (AI) applications, our global corpus of machines emits on average more than 400 million terabytes[1] of data every single day. While it took us ~2.5 million years to harness fire, it took merely 66 years to go from the first flight to landing on the moon[2]. This exponential progress has its counterpart in Autonomy and in the impact it has had at a global scale on transportation, manufacturing, defense, and mobility in general. Millions of years of evolutionary biology, from Homo Erectus to Homo Technologies, coupled with cognitive adaptation and muscle memory, have helped us learn new skills. Take driving a car, for example, a skill that can be learned in as little as two days! What lies at the heart of this development of human civilization is the same micro-unit that trains our machines, robots, and Autonomous Vehicles (AV): data.
The human brain is the most sophisticated neural network. It analyzes patterns within data, aggregates collected experiences, and uses them contextually to make decisions. Autonomous Systems (or Autonomy) do exactly the same. I’m not only talking about the obvious aspect of training neural networks, but about the entire data value chain necessary to convert a human-supervised application into a fully capable, commercialized, hands-free Autonomous solution. From crafting a smart training data collection strategy, to streamlining feedback from the field, to deploying simulation to test at volume (and cheaply so), every single step in the process radiates niche data that needs to be propagated back into the product development matrix. A good analogy is an automotive gearbox (pun intended): tiny flywheels feeding into bigger flywheels, connected to a driving shaft, and so on. The time a technology takes to mature is a direct reflection of this “gearbox efficiency factor”, and data arguably plays the most important role as the necessary lubricant.
Let’s double-click on why it is a big deal.
Phase 1: Prove It Works
From “Stanley the robot” winning the 2nd DARPA Grand Challenge[3] in 2005 to Waymo’s consistent market expansion in 2025, our Autonomy index has macro-inflated over the last couple of decades. Productizing research and converting a strong technology conviction into a commercial reality takes a lot of good engineering backed by a strong data signal. In my decade of first-hand exposure to this evolution, I have very rarely seen an automotive platform designed specifically for Autonomy in its first iteration. It takes several hits (and misses) to figure out the sensor suite, compute requirements, driving controls, and data format needed to build a true system that can lift off and generate meaningful results. And that is before accounting for the complicated supply chain and logistics behind this massive uphill engineering task. The landscape is shifting positively, with more purpose-built platforms for autonomous driving that are equipped to provide SAE L2-L3[4] support functions, and with scope to extend to L4-L5 automated driving through strategic technology partnerships.
New platform bring-up gets simpler with each iteration as the output data becomes richer and more meaningful to Autonomy development. Problems start shifting from sensor point cloud density, basic vehicular controls, and task latency toward raw driving behavior. Voilà! There we have our first prototype, traversing a straight line or a small loop from A to B on a closed course without any human intervention. This is all heavily simplified, of course, to keep the length of the article in check. The point is that the gritty picture it paints makes clear how critical it is to package and structure data from the get-go when building prototypes. Bench development of individual components has become more organized with state-of-the-art hardware-software integration (HSI) tools; calibration is more a routine than a research process; it takes much less effort to plug ROS output data into a neat visualization application than to develop one from scratch; and off-the-shelf data ingest and management solutions are plentiful.
General-purpose technologies like cloud engineering, data pipelines, web GPUs, and full-stack development have solidified to help us solve the real Autonomy problem. Foundational data models and GenAI are taking us several steps further in real-world behavior interpretation. This is how we keep riding new technology waves. The ecosystem of data experts is stronger than ever, which brings us to the next segment: now that you have data at your fingertips, how do you optimize engineering operations to move measurably quicker and build a verifiable, launch-worthy product?
Phase 2: Develop. Fail. Learn and Repeat.
I remember that almost a year back, a horse galloping on I-95 made headlines[5] across the US. Now imagine an autonomous truck driving at 70 MPH next to it. Do you think its Perception stack can handle this situation? We, or at least the Equus caballus, would most certainly hope so! It’s a no-brainer that as humans, we would slow down or change lanes to get further away from the stray horse and reduce the probability of conflict. The autonomous truck in our hypothetical example need not have a hyper-specific response to such a situation, as long as it can handle anomalies safely and predictably. These long-tail scenarios, or edge cases, are true gold for data-driven ML model development.
The simplified flow chart above holds for supervised learning systems, where the first step is to figure out which model attributes need attention. That decision then gets multiplexed into a structured data collection >> curation >> annotation strategy. The opportunity (time) cost of this process is invariably high, and hence a scientific approach to this data-driven effort-impact problem is a must. Material advancements in nuanced annotation tooling platforms, with technical solutions as offered by companies like DDD, have made this process highly predictable, cost- and quality-efficient, and democratized. As with ML model development, a few other data-centric areas remain critically important to talk about. Let’s take a couple of examples.
Performance Evaluation: Feedback from the field is indispensable for any learned behavior system, especially Autonomy. In a nutshell, performance evaluation is the frequent activity of aggregating output from a range of test modalities (simulation, test track, public roads, HIL benches) into a crystallized set of priorities for improving product performance. This involves predictive analysis, what-if scenarios, and data-driven failure and defect management to remove any delays in improving the system's performance. I truly believe that for any Autonomy product to succeed, its performance evaluation strategy needs to be spot on; otherwise countless cycles are wasted figuring out how to measure performance, what problems to fix, by when, and why.
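To make the aggregation idea concrete, here is a minimal Python sketch of rolling findings from several test modalities into one ranked priority list. Everything in it is an assumption for illustration: the `Finding` record, the issue labels, the 1 to 5 severity scale, and the severity-times-occurrences score are hypothetical and not a description of any particular team's tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One observed issue from one test modality (hypothetical schema)."""
    modality: str      # e.g. "simulation", "test_track", "public_road", "hil_bench"
    issue: str         # illustrative issue label
    severity: int      # assumed scale: 1 (minor) to 5 (safety-critical)
    occurrences: int   # how many times the issue was seen in that modality


def prioritize(findings):
    """Rank issues by total severity-weighted occurrences across all modalities.

    Ties are broken by how many distinct modalities reported the issue,
    then alphabetically so the ordering is deterministic.
    """
    scores = defaultdict(int)
    seen_in = defaultdict(set)
    for f in findings:
        scores[f.issue] += f.severity * f.occurrences
        seen_in[f.issue].add(f.modality)
    ranked = sorted(scores, key=lambda i: (-scores[i], -len(seen_in[i]), i))
    return [(issue, scores[issue], sorted(seen_in[issue])) for issue in ranked]


findings = [
    Finding("simulation", "late_braking", 4, 10),
    Finding("public_road", "late_braking", 4, 2),
    Finding("test_track", "lane_drift", 2, 5),
    Finding("hil_bench", "sensor_dropout", 5, 1),
]

for issue, score, modalities in prioritize(findings):
    print(f"{issue}: score={score}, seen in {modalities}")
```

A real scoring scheme would likely weigh exposure (miles or hours per modality) and safety criticality far more carefully. The point of the sketch is only that a single, shared ranking answers "what to fix first" across all test sources.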
Simulation Operations: Another complementary area, or flywheel as we called it earlier, is Simulation: a product that represents the true physical world of any system in a digital environment. Millions or even billions of scenarios can be simulated in a short period of time, and the time saved matters more than the raw number. Companies providing simulation tech as a service or platform have greatly appreciated the product-worthy nature of this vertical. From primitive synthetic sims to advanced neural sims, the goal all along has been to build solid evidence for the verifiability of the AI system. Top-of-the-line players have figured out how to build the sim engine, scale the infrastructure, spin out analysis workstreams, converge the learnings back, and, finally, improve the product.
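As a toy illustration of how scenario counts multiply, the Python sketch below sweeps a small parameter grid for an obstacle-ahead case like the stray horse and checks whether braking alone would suffice. It is a deliberately simple kinematic model, not a simulator: the 0.5 s reaction latency, the deceleration values, and the grid itself are assumed numbers chosen for illustration.

```python
from itertools import product

MPH_TO_MPS = 0.44704  # exact miles-per-hour to meters-per-second factor


def stopping_distance_m(speed_mps, decel_mps2, latency_s=0.5):
    """Distance traveled during the reaction latency plus constant-deceleration braking."""
    return speed_mps * latency_s + speed_mps ** 2 / (2 * decel_mps2)


def sweep(speeds_mph, obstacle_distances_m, decels_mps2):
    """Evaluate every combination of the three parameter lists."""
    results = []
    for mph, dist, decel in product(speeds_mph, obstacle_distances_m, decels_mps2):
        needed = stopping_distance_m(mph * MPH_TO_MPS, decel)
        results.append({
            "speed_mph": mph,
            "obstacle_m": dist,
            "decel_mps2": decel,
            "can_stop": needed <= dist,
        })
    return results


# Assumed grid: 3 speeds x 3 obstacle distances x 2 achievable decelerations.
results = sweep([50, 60, 70], [60, 100, 140], [3.0, 5.0])
failures = [r for r in results if not r["can_stop"]]
print(f"{len(failures)} of {len(results)} scenarios cannot be resolved by braking alone")
```

Three values per axis already give 18 scenarios; adding weather, lighting, traffic, and actor behavior as further axes is how the count climbs into the millions, which is why turnaround time becomes the metric that matters.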
Machine Learning Model Development, Performance Evaluation, and Simulation are the top three continuous learning feedback loops which, in my opinion, remain fundamental to developing a safer, more predictable autonomous product. The job, however, is not done yet: transferring this tech into the hands of the end user remains a key step and a long(er) pole than some of us had originally anticipated.
Phase 3: The Launch
Operational muscle helps catapult Autonomy’s commercial deployment once the technology is ready for launch. Locking in the operational recipe plays a very important role in reaching a holistic “all systems ready for launch” program status. Taking a step back, in the last 5 years or so, vertical integration of the commercial model has taken shape nicely and, frankly, taken priority over the over-emphasized silos of early market entry advantage. This has led OEMs, Tier-1 suppliers, ridesharing platforms, and technology champions to partner together, diversifying the overall deployment risk. Data is at the forefront of planning such joint fleet operations, from command (control) center management and remote assistance to planning a normalized exposure of your product to the target Operational Design Domain (ODD). I have massive respect for the teams managing CONOPS and field support services to preserve business continuity for applications like robotaxis. A substantial variable in this equation is a Human-Robot UXR problem, and data once again is a key catalyst in solving for the unknowns.
From the simplest of fleet management problems to the more involved ODD expansion needs, Autonomy development and its necessary commercialization are backed by data, by tools that ingest the data, by workforces that transform the data, and by engineers who act on the data. We have made great strides in these areas over the past several years, but the job is surely not done yet.
In Conclusion
Data-driven development is more than just an acceptance that data is the key enabler for building Autonomy; it’s the actual work of building the necessary infrastructure (tech + people) to cycle through the data selectively, and with the right judgment, to propel progress.
DDD’s Autonomy Solutions are here to help you reach those ends faster and make a quicker impact. We’re moving on to something new, more exciting, and cutting-edge in the coming days. Get in touch and don’t miss out!
Is data a big deal? Most certainly so.
Reference Links