The Case for Smarter Autonomy V&V

Author: Sahil Potnis
Co-Author: Renae E Gregoire

1 April, 2025

No One-Size-Fits-All Model

Autonomous systems, be they robotaxis, AV trucks, or UAV drones, don’t have the luxury of a learning curve once they are fully deployed. To handle expected (and unexpected) scenarios as safely and predictably as a human would, these systems must demonstrate flawless behavior every single time, with no room for guesswork. Can they learn from the environment to refine their behavior? Of course, but there’s a meaningful difference between learning to smooth out unprotected left turns and learning not to hit a curb. You get it.

There are often philosophical and technological debates about what constitutes an ideal playbook for validating fully autonomous behavior. The reality is that there is no single playbook, nor should there be. Standards and guidelines help teams converge faster and put guardrails around the problem, but they shouldn’t dilute innovative methods for proving a system is safe(r). Honestly, that recipe varies with the product technology, operational design domain (ODD), safety case claims and subclaims, and concept of operations (CONOPS) of every company.

What matters in the end are a few specifics:

  • Can you demonstrate qualitatively and quantitatively that your product meets the desired safety case claims (in whichever bounding box you want to draw)?

  • Have you built sufficient evidence from your final product validation to bridge the gap between saying “our product is safe” and “actually proving it”?

  • Have you built, or can you build, public trust on top of the first two through transparency?

It’s much easier said than done. I can say that first-hand.

Validating AI systems with deep learning and neural networks is a tough problem. It doesn’t follow the traditional automotive or software systems validation approach. You’re no longer working with deterministic inputs and closed-form logic. You’re working with stochastic, black-box behavior, statistical probabilities, and edge cases that rarely repeat. Yet you’re still on the hook to demonstrate functional safety, regulatory compliance, and operational reliability under every possible ODD condition.

Let’s break down the problem further.

Chain of Thought: Starting with Continuous Validation

There’s a reason why continuous validation is replacing end-of-line testing across the autonomy industry. Early-stage issues cost less to fix. They’re also easier to isolate and more informative for engineers. Wait too long, and you may have to deal with system-wide rework, safety case rewrites, or worse — a product recall that makes headlines.

The stakes are even higher when building for Level 3 or Level 4 autonomy. These systems must take complete control in real-world environments. The only way to get there confidently is to validate continually — from unit test to public test, from synthetic sim to closed course to shadow-mode deployment. It’s a regression-driven approach: if you wait for a completely signed-off system module before validating it within your test corpus, it simply takes too long, and there’s no guarantee you won’t find issues that force you to restart the process.

At DDD, we see this shift up close every day. Teams aren’t validating autonomy as a last step. They're baking verification and validation (V&V) into every sprint, every milestone, and every release. That’s what it takes to succeed in today’s market. Let’s break the problem down even further.

  1. Define the validation scope – component, sub-system, system, or even a single autonomous behavior.

  2. Highlight the constraints – Write down operating conditions to ensure the floor and ceiling of your tests are locked in. This is crucial. Just because you can test for everything doesn’t mean you need to. 

  3. Create test campaigns – simulation, HIL, or closed-course tests, for example; it’s important not to boil the ocean here for an early signal. If it’s not the final validation, you get another shot.

  4. Draw the classic requirements traceability matrix (RTM) – Document your findings in the RTM and feed them back into the engineering backlog.

  5. Rinse and repeat…
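To make step 4 a bit more concrete, here is a minimal Python sketch of what an early-stage RTM can look like. The requirement IDs, scenario names, and statuses are hypothetical and purely illustrative, not a prescribed schema; the point is that linking requirements to test results makes failing or uncovered requirements visible so they can flow back into engineering.

```python
from dataclasses import dataclass, field

@dataclass
class RTMEntry:
    """One row of a requirements traceability matrix (RTM)."""
    requirement_id: str          # e.g. "REQ-PLAN-012" (hypothetical ID scheme)
    description: str
    test_ids: list = field(default_factory=list)   # campaigns that exercise it
    results: dict = field(default_factory=dict)    # test_id -> "pass" / "fail"

    def status(self) -> str:
        if not self.test_ids:
            return "uncovered"
        if any(self.results.get(t) == "fail" for t in self.test_ids):
            return "failing"
        if all(self.results.get(t) == "pass" for t in self.test_ids):
            return "passing"
        return "in_progress"

# Hypothetical entries for illustration only.
rtm = [
    RTMEntry("REQ-PLAN-012", "Yield to pedestrians in marked crosswalks",
             test_ids=["SIM-034", "TRACK-007"],
             results={"SIM-034": "pass", "TRACK-007": "fail"}),
    RTMEntry("REQ-PERC-045", "Detect cyclists at night within 60 m",
             test_ids=["SIM-101"],
             results={"SIM-101": "pass"}),
]

# Feed failing or uncovered requirements back into the engineering backlog (step 4).
for entry in rtm:
    if entry.status() in ("failing", "uncovered"):
        print(f"{entry.requirement_id}: {entry.status()} -> open engineering issue")
```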

Teams often misconstrue validation as a final, pre-release step. That is true if you’re doing product validation for a commercial deployment, but it doesn’t mean you can’t apply validation principles early and in smaller, incremental scopes. More often than not, what breaks complex distributed systems is the interface – any amount of testing that exposes interface definition problems and detects dependency failures will always benefit the system architecture and the final product validation.

V&V is a process. Not a final destination. The goal is to deploy proven technology.

Basic Tenets of a Healthy Validation Pipeline

Figure 1. Simplified Validation Pipeline for Autonomous Systems

Synthetic Data Generation

You can’t validate what you can’t simulate.

That’s why scenario-based testing has become the gold standard for autonomy V&V. It allows developers to test real-world and edge-case interactions across varied ODDs long before a system hits the road. The progress the industry has made on world foundation models over the last 12 months is staggering. What was a limiting factor five years ago, the inability to test across restrictive ODD conditions, has given way to hyper-realistic, neurally simulated, actor-overlaid scenes that feel straight out of fiction.

What does this mean? The tools exist, but system integrators and end-user applications must catch up to make the most of them. This advancement solves the software problem but, interestingly, complicates the systems engineering problem: you need to pick and choose which scenarios matter, why, and how each test builds your confidence in the system under test (SUT). You can now operate for 1,000 hours in Japan and a few hundred hours in the US and drive (nearly) perfectly in, say, the UK. The additive nature of unsupervised learning on synthetic data greatly simplifies the model-training problem. But I digress…

The bottom line is that you should supplement your real-world testing with tools like NVIDIA Cosmos to turbocharge data curation. Then couple that with a scenario ontology and ODD constraint definitions to create a dataset that gives you an unprecedented edge in building V&V scenario sets.

State-of-the-Art Simulation

Almost every autonomous systems company uses digital twin validation to increase confidence in simulation-based results by creating virtual replicas of physical systems and correlating their behavior against real-world performance. This validation layer assures that simulation outcomes are trustworthy enough to inform safety decisions at scale. There’s plenty of public domain literature on simulation realism and correlation scores. We won’t go into those details. That said:

  • Input information (II): What do you need to build in simulations and why? Is the data sourced from real-world (log-based) sensor models or the output of ODD characterization studies?

  • Input constraints (IC): Which simulation aspects give you the most realistic behavior, and, consequently, which parts of the ODD are left with open gaps?

  • Desired outcomes (DO): What is the intent of the simulation tests for your SUT, i.e., behaviors you’re learning to expect, or items to cross off the known-risks quadrant of SOTIF?

When you plug II + IC + DO into any tool, it should spit out a spectrum of simulated tests that help you build at volume. If you can crack the simulation-in-the-loop development workflow, along with the rest of the tenets in this section, you’re almost guaranteed to succeed faster than others.
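To make the II + IC + DO framing concrete, here is a minimal sketch that expands input information under input constraints into a grid of simulation test cases tagged with a desired outcome. The parameter names, values, and the constraint itself are hypothetical and not tied to any specific simulator.

```python
from itertools import product

# Input information (II): scenario parameters sourced from logs or ODD studies.
input_information = {
    "ego_speed_mps": [5, 10, 15],
    "time_of_day": ["day", "dusk", "night"],
    "pedestrian_intent": ["crossing", "waiting"],
}

# Input constraints (IC): the slice of the ODD the simulator models realistically.
def within_constraints(case: dict) -> bool:
    # Hypothetical constraint: night-time crossing-pedestrian models not yet validated.
    return not (case["time_of_day"] == "night" and case["pedestrian_intent"] == "crossing")

# Desired outcome (DO): what each test is meant to demonstrate.
desired_outcome = "ego yields before the crosswalk"

keys = list(input_information)
test_cases = [
    {**dict(zip(keys, values)), "expected": desired_outcome}
    for values in product(*input_information.values())
    if within_constraints(dict(zip(keys, values)))
]

print(f"Generated {len(test_cases)} simulation cases")
print(test_cases[0])
```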

As scenario libraries grow, managing them becomes a discipline in itself. Enter simulation operations (SimOps) — a formal practice that includes scenario lifecycle management, suite health monitoring, and adversarial testing orchestration. SimOps ensures testing is systematic, repeatable, and aligned with your development goals as they evolve.

At DDD, we begin with what’s real: logs from the field, annotated sensor data, and even moments when a safety driver took over. From that foundation, we build what you need — synthetic scenarios with adjustable variables like speed, visibility, or pedestrian intent. We reconstruct real-world incidents in simulation with near pixel-perfect accuracy. We also design adversarial tests meant to push systems to the limits, on purpose, so we can see how they recover under pressure.

Data-Driven Development

It’s great to have tools at your disposal that can simulate the real world, traffic behavior, agent reactions, etc., but ultimately, how you use the output data for inference and performance improvements is where things make or break.

One useful mental model categorizes your autonomy capabilities into standard scenario metrics. With every new software or hardware release, those metrics shift against your baseline regression test suites. You can use this data to close the loop with scenario generation, simulation, and public-road testing campaigns and improve the next release. This leans more toward systematic verification, but the principles are the same for overall product validation; instead of sub-system metrics, you swap in system-level KPIs such as system latency, field assists, and safety incidents.
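As a rough sketch of that mental model, the snippet below compares a candidate release’s metrics against a baseline regression suite and flags regressions to feed back into scenario generation and road-test planning. The metric names, values, and tolerance are hypothetical, chosen only for illustration.

```python
# Hypothetical per-scenario metrics, keyed by metric name (units are illustrative).
baseline = {"unprotected_left_success_rate": 0.97, "mean_min_ttc_s": 2.4, "field_assists_per_1k_km": 1.2}
candidate = {"unprotected_left_success_rate": 0.95, "mean_min_ttc_s": 2.6, "field_assists_per_1k_km": 1.4}

# For some metrics higher is better; for others lower is better.
higher_is_better = {"unprotected_left_success_rate": True, "mean_min_ttc_s": True, "field_assists_per_1k_km": False}
tolerance = 0.01  # allowable relative degradation before flagging a regression

for metric, base in baseline.items():
    new = candidate[metric]
    delta = (new - base) / base
    regressed = delta < -tolerance if higher_is_better[metric] else delta > tolerance
    status = "REGRESSION" if regressed else "ok"
    print(f"{metric}: baseline={base}, candidate={new}, delta={delta:+.1%} [{status}]")
```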

Well-Oiled Data Infrastructure

The validation pipeline can be extremely efficient or severely laggy, depending on the data infrastructure behind it. Feeding the test systems with your inputs and executing them precisely has to happen quickly. Otherwise, the temporal costs stack up and become a perfect recipe for skipping the continuous validation I mentioned earlier.

Data ingestion, cloud uploads, scenario creation and simulation execution time, and automated triage with manual QA checks are all worth optimizing to streamline the data infrastructure.
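One simple way to keep that pipeline honest is to measure where the time actually goes. The sketch below (stage names and durations are hypothetical) totals per-stage latency so the slowest link in the ingestion-to-triage chain is visible.

```python
from collections import defaultdict

# Hypothetical timing records from one day of pipeline runs: (stage, seconds).
timings = [
    ("ingestion", 420), ("cloud_upload", 1800), ("scenario_creation", 300),
    ("simulation", 2700), ("auto_triage", 240), ("manual_qa", 900),
    ("ingestion", 380), ("cloud_upload", 2100), ("simulation", 2500),
]

totals = defaultdict(float)
for stage, seconds in timings:
    totals[stage] += seconds

# Report stages from slowest to fastest so the bottleneck is obvious.
for stage, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>18}: {total / 3600:.1f} h total")
```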

Predictive Performance Modeling 

There have been plenty of advances in building predictive statistical models based on your system’s behavioral attributes, the nature of a change, and its expected performance improvement. These models help shape the product roadmap and ensure that the V&V runway is sufficient to complete the testing, analysis, conclusions, and changes.
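As a minimal sketch of the idea, the example below fits an ordinary least-squares model that relates release attributes to observed KPI improvement and then scores a planned release. The feature names, history, and numbers are all hypothetical; a real predictive performance model would be far richer.

```python
import numpy as np

# Hypothetical history: each row describes a past release as
# [num_model_changes, training_data_hours_added, num_bug_fixes],
# and y is the observed improvement in a system-level KPI.
X = np.array([
    [3, 200, 5],
    [1, 50, 12],
    [4, 400, 2],
    [2, 150, 8],
], dtype=float)
y = np.array([0.30, 0.10, 0.45, 0.20])

# Ordinary least squares with an intercept term.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predict the expected KPI improvement of a planned release, to sanity-check
# whether the remaining V&V runway matches the size of the change.
planned = np.array([2, 300, 6, 1], dtype=float)
print(f"Predicted KPI improvement: {planned @ coef:.2f}")
```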

Towards More Automation

Simulation generates data. Edge cases generate questions. But the most powerful insights come from understanding why something failed and where the failure traces back to.

Modern V&V teams need systems that do more than detect anomalies. They need infrastructure for root-cause defect analysis — tools and workflows that isolate what went wrong, why it went wrong, and what to fix.

We’re seeing a fast pivot away from traditional human-led triage, which doesn’t scale. In its place? Automated log analysis, tagged defect taxonomies, and closed-loop workflows that route issues back to development with semantic context. DDD has helped build such taxonomies for leading autonomy customers, and we continue to refine them as systems grow more complex.
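To illustrate the shape of such a workflow, here is a deliberately simple, rule-based sketch that tags anomaly log lines against a defect taxonomy and routes them to an owning team. The taxonomy, keywords, and team names are hypothetical; production systems typically use learned classifiers rather than keyword matching, but the routing idea is the same.

```python
# Hypothetical defect taxonomy: tag -> owning team.
TAXONOMY = {
    "perception_dropout": "perception",
    "late_brake": "planning",
    "localization_jump": "localization",
}

# Hypothetical keyword rules standing in for a learned classifier.
KEYWORDS = {
    "perception_dropout": ["no detections", "sensor timeout"],
    "late_brake": ["hard brake", "ttc below threshold"],
    "localization_jump": ["pose discontinuity", "map mismatch"],
}

def triage(log_message: str) -> tuple[str, str]:
    """Return (defect_tag, owning_team) for an anomaly log line."""
    text = log_message.lower()
    for tag, words in KEYWORDS.items():
        if any(w in text for w in words):
            return tag, TAXONOMY[tag]
    return "unclassified", "triage_queue"  # falls back to human review

print(triage("Frame 1042: sensor timeout on front lidar, no detections for 800 ms"))
print(triage("Unexpected yaw oscillation at 42 mph"))
```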

Coverage Is Both a Metric and a Deployment Decision

Coverage is a word that gets tossed around a lot. In autonomy, it means one thing: Have we tested enough to deploy with confidence? You can’t answer that question with line counts or percentage bars. You need:

  • Complete ODD analysis (day/night, urban/highway, dry/snow/low-visibility).

  • Scenario frequency distribution (not just edge cases but also the right mix of nominal vs. critical interactions).

  • Subsystem validation (perception, prediction, planning, control, fallback logic).

  • Human touchpoints (remote ops, customer support, fail-safes).

Teams preparing for geographic expansion also use region-specific scenarios to simulate local conditions, traffic patterns, and regulatory environments. This targeted approach helps adapt AV behavior to new markets quickly and safely.

The right approach combines simulation, structured closed-course testing, and targeted real-world validation — all mapped back to coverage goals. We help teams define those goals and hit them systematically.
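As a minimal sketch of what "mapped back to coverage goals" can mean in practice, the example below crosses a few ODD dimensions into a grid and reports which combinations have no passing test yet. The dimensions, conditions, and executed set are hypothetical; a real coverage model would weight conditions by exposure and risk rather than treating every cell equally.

```python
from itertools import product

# Hypothetical ODD dimensions and the conditions each must cover.
odd_dimensions = {
    "lighting": ["day", "night"],
    "road_type": ["urban", "highway"],
    "weather": ["dry", "rain", "snow"],
}

# Condition combinations already exercised by passing tests.
executed = {
    ("day", "urban", "dry"), ("day", "highway", "dry"),
    ("night", "urban", "dry"), ("day", "urban", "rain"),
}

all_cells = set(product(*odd_dimensions.values()))
missing = sorted(all_cells - executed)

coverage = len(executed & all_cells) / len(all_cells)
print(f"ODD coverage: {coverage:.0%} ({len(missing)} conditions untested)")
for cell in missing:
    print("  missing:", dict(zip(odd_dimensions, cell)))
```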

Traceability: The Quiet Power Move in V&V

Most engineers don’t get excited about traceability. But safety auditors, regulatory bodies, and deployment leads do — and with good reason.

Traceability is a provable link from requirement to scenario to test to result. It’s the backbone of ISO 26262 and SOTIF compliance. It’s the reason regulators say yes.

And here’s the reality: The more complex your autonomy stack, the harder it is to trace. At DDD, we help clients build end-to-end traceability frameworks that embed links early and preserve them as systems evolve. That work includes:

  • Mapping requirements to acceptance criteria in simulation and real-world testing.

  • Tagging results with scenario IDs, environment variables, and sensor configurations.

  • Correlating anomalies to architectural layers (sensor, fusion, planning, actuation).

  • Making sure every test result directly reinforces a safety case element.
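As a small illustration of the tagging idea in the list above, here is what a single traceability record might look like when persisted alongside a test artifact. Every field name, ID, and URI below is hypothetical; the point is that each result carries enough context to walk the requirement-to-result chain in either direction.

```python
import json

# Hypothetical traceability record: every field here is illustrative.
result_record = {
    "requirement_id": "REQ-PLAN-012",
    "safety_case_claim": "SC-3.2 vehicle yields to vulnerable road users",
    "scenario_id": "SCN-CROSSWALK-0147",
    "environment": {"lighting": "dusk", "weather": "rain", "road_type": "urban"},
    "sensor_config": "lidar_v4 + 8_camera_rig",
    "architecture_layer": "planning",
    "result": "pass",
    "artifact_uri": "s3://example-bucket/runs/2025-03-14/scn-0147.bag",  # placeholder
}

# Persisting the record with the test artifact keeps the requirement ->
# scenario -> test -> result chain auditable by regulators and safety leads.
print(json.dumps(result_record, indent=2))
```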

With traceability in place, teams can know what passed, what matters, and how to prove it to anyone who asks — regulator, insurer, or the public.

Automation isn’t the Goal; Understanding is

It’s easy to think automation is the final destination in V&V. However, the real goal is understanding. Understanding is why human-in-the-loop (HITL) validation remains essential — especially in early-stage autonomy development or when AI systems behave unexpectedly. No matter how advanced the model, there will be edge cases where human judgment is faster, sharper, and more adaptable.

At DDD, we balance automation and human review by:

  • Using AI to flag anomalies in massive data streams.

  • Routing ambiguous or high-impact cases to expert human reviewers.

  • Feeding human-labeled data back into the model for continuous improvement.

  • Creating feedback loops that combine speed with insight.
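A stripped-down sketch of that routing logic appears below: anomalies the model scores confidently are auto-resolved, while ambiguous or high-impact cases go to expert reviewers whose labels later feed retraining. The event IDs, confidence threshold, and severity labels are hypothetical.

```python
# Hypothetical anomaly events: (event_id, model_confidence, severity).
events = [
    ("evt-001", 0.98, "low"),
    ("evt-002", 0.55, "high"),   # ambiguous and high impact -> human review
    ("evt-003", 0.91, "high"),
    ("evt-004", 0.40, "low"),
]

CONFIDENCE_THRESHOLD = 0.9
human_review_queue, auto_resolved = [], []

for event_id, confidence, severity in events:
    # Route ambiguous or high-impact cases to expert reviewers; auto-resolve the rest.
    if confidence < CONFIDENCE_THRESHOLD or severity == "high":
        human_review_queue.append(event_id)
    else:
        auto_resolved.append(event_id)

print("Human review:", human_review_queue)   # labels later feed model retraining
print("Auto-resolved:", auto_resolved)
```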

We also design performance evaluation workflows that integrate feedback from both onboard and offboard sources so that each iteration of the autonomy stack gets evaluated against business goals, technical benchmarks, and safety criteria.

This work is especially powerful for safety-critical edge cases where pedestrian intent, cyclist behavior, or sensor dropouts challenge even the most advanced AV stack. With our HITL workflows, nothing gets missed. No bad decision gets a free pass.

The Bottom Line

Looking ahead, V&V will only grow more dynamic. Engineers will regenerate entire worlds from a single log file and spin off dozens of test variants with synthetic weather, lighting, occlusion, and pedestrian behaviors. Simulations will be stress-tested by agents designed to provoke failure, not avoid it — because that’s where the system shows its true limits.

Think prompt engineering but for autonomy: crafting inputs that reveal how the model reasons under pressure, testing not just behavior but reasoning, not just capability but recoverability. It’s what makes this space exciting. And it’s why companies that invest in smarter V&V now will move faster, scale safer, and lead longer.

Verification and validation is a strategy, not a checkbox. Done right, it’s your moat, your launchpad, your competitive edge.

At DDD, we can help you validate systems, prove system readiness, improve system reliability, and shorten the road to system deployment. From automated performance analysis to HITL review, from scenario curation to traceability infrastructure — it’s the work we do every day.

And we’d love to help you do it smarter.

