Dependable AI systems for real-world impact
João Galego $$\left|\text{🧠}\right>$$
Head of AI @ CSW
Invited Professor @ ISEG
$ whoami
MSc Physics
PgDip Forensics*
PhD Cognitive Science / ABD**
* Not-so-fun fact: I once performed an autopsy
** Dropped out to live life and have fun doing it
Lead ML Engineer
Solutions Architect
Head of AI
Break things at scale
Build things faster
Make brains* go brrr
* all brain types welcome!
great demos, fragile products
and why models aren't the problem
models $\rightarrow$ systems $\rightarrow$ society
building systems people can trust
what you might be wondering,
but were afraid to ask
Machine Learning that matters by Kiri Wagstaff

a colleague pulled me aside and said
"what you do is not engineering"
Offense
Denial
I owe them an apology
They were right
to set the record straight
This year, global spending on AI
will reach $300B growing 4.2x faster
than average IT spend.
34% of enterprises have deployed
AI in production and 22% will
deploy in the next 12 months.
Generative AI will increase
the impact of all AI by 15 to 40%
across all industries.
When it comes to AI adoption,
64% of companies lack a clear roadmap
with measurable goals.
67% of organizations expect
to maintain or increase AI spending, yet
only 21% report any proven outcomes.
86% of all AI projects fail to deliver,
while 50% never make it to production.



Source: Adapted from Sculley et al. (2015)

At any AI conference, you'll hear about:
Real-world impact isn't about intelligence.
It's about RELIABILITY.
Can we build AI?
Can we trust it when it matters?

prod...
Who wants to take a guess?

Source: Pan et al. (2025)
Because AI is already everywhere
that matters most
Why it's a terrible, horrible, no good idea
Nuclear energy and AI are converging
to shape the future
Operations / predictive maintenance
Design / reactor modelling
Safety / accident simulation
Safeguards / surveillance footage analysis
"Reassuringly, despite its brilliance, AI still needs a human to make sure it is right and impartial, and to understand the politics behind a safeguards footnote"
What it is & why it's a bad idea
quietly running in the background
until something goes wrong
A system whose failure may cause
real accidents happen!
Lost a spacecraft because one team
used metric and the other used imperial 📏
Killed 28 soldiers due to a cumulative
rounding error in the system’s software 🎯
Lost $440M in 30 minutes
after deploying buggy code 💸
Spaghetti code broke the brakes 🚗
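The cumulative rounding error is easy to reproduce with a little arithmetic. A minimal sketch, assuming the widely reported account of the 1991 Patriot failure: a clock counting tenths of a second, with 0.1 chopped to 23 fractional bits, left running for about 100 hours. The names `FRACTION_BITS`, `chopped` and `clock_drift_seconds` are illustrative and not taken from the original system.

```python
from fractions import Fraction

# Assumption: 0.1 s ticks stored with 23 fractional bits (chopped, not rounded),
# following the widely reported analysis of the incident.
FRACTION_BITS = 23
TICK = Fraction(1, 10)


def chopped(value: Fraction, bits: int) -> Fraction:
    """Truncate a non-negative value to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(int(value * scale), scale)


def clock_drift_seconds(hours: int) -> float:
    """Accumulated clock error after counting 0.1 s ticks for `hours` hours."""
    ticks = hours * 3600 * 10
    error_per_tick = TICK - chopped(TICK, FRACTION_BITS)
    return float(error_per_tick * ticks)


if __name__ == "__main__":
    # Under these assumptions the drift comes out at roughly a third of a second.
    print(f"drift after 100 h: {clock_drift_seconds(100):.4f} s")
```

Each tick is off by less than a ten-millionth of a second. The error only matters because nobody resets the clock.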
At least, not in critical systems
“Do you code with your
loved ones in mind?”
― Emily Durie-Johnson, Strategies for Developing Safety-Critical Software in C++
Is it really a good idea to bring AI to critical systems?
Same input, same output... always
Rules are explicit and readable
Bugs have clear causes and fixes

Source: Andersson et al. (2024)
You write the rules
You know what it will do
You know why it broke
Source: He et al. (2025)
You shift the agency to data:
The data wrote the rules
Change the data, change the behavior
Garbage in, garbage out
You didn't write the rules
You don't always know what it will do
You don't always know why it broke
and complexity breaks things.
Models will make mistakes
or when is a stop sign not like a stop sign?
False positives posed traffic risks to drivers
Polite software that 'moved out of the way'
by illegal passing. 🚌
often in strange and unpredictable ways
Let's turn to the ECSS ML handbook...

Do NOT build AI
just because you have data.
Do NOT use AI
just because you can.
Instead, constrain it so it can't be unsafe.

Source: Delseny et al. (2021) / DEEL

The doer optimizes for performance.
The doer can be low SIL ⬇️
The checker must be high SIL 🚨
Safety Integrity Levels (SIL)
Development Assurance Levels (DAL)
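The doer/checker split can be sketched in a few lines. This is a toy under stated assumptions: `Command`, `MAX_SAFE_SPEED_MPS`, `toy_doer` and `guarded` are hypothetical names, and the speed limit stands in for whatever safety envelope the hazard analysis produces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Command:
    speed_mps: float


# Hypothetical safety envelope; a real one comes from the hazard analysis.
MAX_SAFE_SPEED_MPS = 10.0
SAFE_STOP = Command(speed_mps=0.0)


def checker(cmd: Command) -> bool:
    """Small, explicit rule that can be reviewed and verified on its own."""
    return 0.0 <= cmd.speed_mps <= MAX_SAFE_SPEED_MPS


def guarded(doer: Callable[[dict], Command], observation: dict) -> Command:
    """Run the doer, but only let its output through if the checker agrees."""
    try:
        proposal = doer(observation)
    except Exception:
        return SAFE_STOP  # a crashing doer must not take the system down
    return proposal if checker(proposal) else SAFE_STOP


def toy_doer(observation: dict) -> Command:
    """Stand-in for an ML model: proposes a speed from the free distance."""
    return Command(speed_mps=observation["free_distance_m"] / 2.0)


if __name__ == "__main__":
    print(guarded(toy_doer, {"free_distance_m": 10.0}))   # within the envelope
    print(guarded(toy_doer, {"free_distance_m": 100.0}))  # rejected by the checker
```

The doer can be as clever as it likes. Only the checker, which is short enough to read in one sitting, needs the high-assurance treatment.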
Is this application DAL-C or DAL-D certifiable?

Source: Adapted from Dmitriev et al. (2023)
Different versions of ML models and/or their inputs are used in a system to improve the output reliability.

Source: Adapted from Machida (2019)

Source: Adapted from Machida (2019)
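One way to read the N-version idea is as a vote across independently built models. A minimal sketch, assuming models are plain callables that return a hashable label; `n_version_predict` and `quorum` are hypothetical names, not an API from the cited work.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence

Model = Callable[[object], Hashable]


def n_version_predict(
    models: Sequence[Model], x: object, quorum: int
) -> Optional[Hashable]:
    """Return the label that at least `quorum` models agree on, else None."""
    if not models:
        return None
    votes = Counter(model(x) for model in models)
    label, count = votes.most_common(1)[0]
    return label if count >= quorum else None


if __name__ == "__main__":
    # Toy stand-ins for three model versions; one of them disagrees.
    versions = [lambda x: "stop", lambda x: "stop", lambda x: "yield"]
    print(n_version_predict(versions, "image", quorum=2))
    print(n_version_predict(versions, "image", quorum=3))
```

Returning `None` instead of the most popular label is a design choice: no agreement means the caller has to fall back, not guess.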

Source: Flad (2026)

Source: Phan et al. (2019)

Safety Architecture for Autonomous Vehicles

"While all of Helsing’s work primarily focused on software model training, integration with Gripen E APIs and testing, Saab actually set the groundwork for operating a software-defined aircraft several years ago with an overhaul to the Gripen’s avionics."

"Gripen’s avionics system separates 10% of the aircraft's flight critical management codebase from 90% of its tactical management code, resulting in avionics that are 'hardware agnostic'."
"Many of the well-known approaches used to ensure the reliability of software are difficult or impossible to apply to AI-based software, where models are created from data rather than hand-coded by software developers. This creates friction in the commissioning and development of AI-based software, because it is unclear what criteria will be used to assure it. The potential worst case is that assurance of systems involving AI are subject to a matrix of both poorly-fitting existing requirements and new but underspecified AI-related requirements."
AI is part of the system.
So test it like it is.
The ECSS ML handbook suggests checking:
Known cases (the expected)
Coverage (the internals)
Edge cases (the unknown)
Adversarial cases (the hostile)


Source: EASA / Daedalean (2024)
Mathematically prove that
certain behaviors cannot happen.
* Oldie, but goodie!
Systems that maintain an ongoing interaction
with the environment, as opposed to computing
some final value on termination.
but when they do it.
"Software engineering is programming integrated over time."
Winters, Manshreck & Wright (2020)
$$\texttt{SWE} = \int \texttt{Programming} ~dt$$
$$f \mapsto \texttt{E}[f] = \int^{\min[\text{EOL}, ~+\infty]}_{\max[-\infty, ~\text{idea}]} f ~dt$$
Ensure that certain properties hold at all times.
bad thing never happens
$$\square ~\neg \texttt{bad}$$
good thing eventually happens
$$\diamond ~\texttt{good}$$
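Both properties can be checked against a recorded trace. A minimal sketch, assuming a finite list of states; `always` and `eventually` are hypothetical helper names. A finite trace can refute a safety property, but it can only suggest that a liveness property holds.

```python
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")


def always(pred: Callable[[State], bool], trace: Iterable[State]) -> bool:
    """Finite-trace reading of 'box pred': pred holds in every state."""
    return all(pred(state) for state in trace)


def eventually(pred: Callable[[State], bool], trace: Iterable[State]) -> bool:
    """Finite-trace reading of 'diamond pred': pred holds in some state."""
    return any(pred(state) for state in trace)


def bad(state: dict) -> bool:
    """Illustrative 'bad thing': temperature above a made-up limit."""
    return state["temp_c"] > 100


def good(state: dict) -> bool:
    """Illustrative 'good thing': the job has completed."""
    return state["done"]


if __name__ == "__main__":
    trace = [
        {"temp_c": 40, "done": False},
        {"temp_c": 85, "done": False},
        {"temp_c": 60, "done": True},
    ]
    print(always(lambda s: not bad(s), trace))  # safety: bad never happens
    print(eventually(good, trace))              # liveness: good eventually happens
```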
Let's make ONNX deterministic and fully verifiable...
module COPFlatten
  use OPFlatten
  use tensor.Tensor
  use list.List
  use list.Length
  use int.Int
  use libtensor.CTensor
  use libvector.CIndex
  use std.Clib
  use mach.int.Int32
  use std.Cfloat

  let cflatten (x r : ctensor) (axis: int32)
    requires { valid_tensor x }
    requires { valid_tensor r }
    requires { r.t_rank = 2 }
    requires { let axis_normalized = normalize_axis (to_int axis) (length (tensor x).dims) in
               (ivector r.t_dims r.t_rank) = flat_dims (tensor x) axis_normalized }
    requires { vdim x.t_dims x.t_rank = vdim r.t_dims r.t_rank }
    requires { -length (tensor x).dims <= (to_int axis) <= length (tensor x).dims }
    ensures  { tensor r = flatten (tensor x) (to_int axis) }
  =
    let m = cdim_size r.t_dims r.t_rank in
    for i = 0 to m - 1 do
      invariant { forall k. 0 <= k < i -> value_at r.t_data k = value_at x.t_data k }
      r.t_data[i] <- x.t_data[i]
    done;
    assert { tensor r == flatten (tensor x) (to_int axis) }
end

void cflatten(struct ctensor x, struct ctensor r, int32_t axis) {
  int32_t m, i, o;
  m = cdim_size(r.t_dims, r.t_rank);
  o = m - 1;
  if (0 <= o) {
    for (i = 0; ; ++i) {
      r.t_data[i] = x.t_data[i];
      if (i == o) {
        break;
      }
    }
  }
}
who verifies it?
Most people think of verification as a cost, a tax on development, justified only for safety-critical systems. That framing is outdated. When AI can generate verified software as easily as unverified software, verification is no longer a cost. It is a catalyst.
― Leonardo de Moura, creator of Lean

Write the code
-- Natural language description of the coding problem
-- Remove an element from a given array of integers at a specified index...
-- Pre-condition (defined first so the implementation can refer to it)
def removeElement_pre (s : Array Int) (k : Nat) : Prop :=
  k < s.size -- the index must be smaller than the array size
-- Code implementation
def removeElement (s : Array Int) (k : Nat) (h_precond : removeElement_pre s k) : Array Int :=
  s.eraseIdx! k
-- Post-condition
def removeElement_post (s : Array Int) (k : Nat) (result: Array Int)
    (h_precond : removeElement_pre s k) : Prop :=
  result.size = s.size - 1 ∧ -- Only one element is removed
  (∀ i, i < k → result[i]! = s[i]!) ∧ -- Elements before index k remain unchanged
  (∀ i, i < result.size → i ≥ k → result[i]! = s[i + 1]!) -- Elements after are shifted
Prove the code
-- Formal proof (establishing code-specification alignment)
theorem removeElement_spec (s: Array Int) (k: Nat) (h_precond : removeElement_pre s k) :
    removeElement_post s k (removeElement s k h_precond) h_precond := by
  unfold removeElement removeElement_post
  unfold removeElement_pre at h_precond
  simp_all
  unfold Array.eraseIdx!
  simp [h_precond]
  constructor
  · intro i hi
    have hi' : i < s.size - 1 := by
      have hk := Nat.le_sub_one_of_lt h_precond
      exact Nat.lt_of_lt_of_le hi hk
    rw [Array.getElem!_eq_getD, Array.getElem!_eq_getD]
    unfold Array.getD
    simp [hi', Nat.lt_trans hi h_precond, Array.getElem_eraseIdx, hi]
  · intro i hi hi'
    rw [Array.getElem!_eq_getD, Array.getElem!_eq_getD]
    unfold Array.getD
    have hi'' : i + 1 < s.size := by exact Nat.add_lt_of_lt_sub hi
    simp [hi, hi'']
    have : ¬ i < k := by simp [hi']
    simp [Array.getElem_eraseIdx, this]
Test the code
-- Positive test with valid inputs and output
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[1, 2, 4, 5])
-- Negative test: inputs violate the pre-condition (k < s.size)
(s : #[1, 2, 3, 4, 5]) (k : 5)
-- Negative test: output violates the first clause of the post-condition (size)
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[1, 2, 4])
-- Negative test: output violates the second clause of the post-condition (elements before index k)
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[2, 2, 4, 5])
-- Negative test: output violates the third clause of the post-condition (elements after index k)
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[1, 2, 4, 4])
And now for a word from our sponsors!
Building systems that last is
HARD
The real challenge isn't model accuracy.
It's system reliability under UNCERTAINTY.
We need to move from models to systems!
This thread reveals 3 things:
Expect failure
Design for recovery
Monitor everything
Keep humans around
Because good intentions are not enough!
Garbage in, garbage out
If the data is wrong, incomplete, or drifting,
the system will fail.
Focus on:
your data does
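Drift can be watched with very little machinery. A minimal sketch, assuming one numeric feature and a stored reference sample; `mean_shift_score`, `has_drifted` and the threshold of 3 are hypothetical choices. Production systems often use richer statistics than a mean shift.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift_score(reference: Sequence[float], live: Sequence[float]) -> float:
    """Shift of the live mean, in units of the reference standard deviation."""
    spread = stdev(reference)  # needs at least two reference points
    shift = abs(mean(live) - mean(reference))
    if spread == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def has_drifted(
    reference: Sequence[float], live: Sequence[float], threshold: float = 3.0
) -> bool:
    """Flag the live window when its mean moved more than `threshold` sigmas."""
    return mean_shift_score(reference, live) > threshold


if __name__ == "__main__":
    reference = [10.0, 11.0, 9.0, 10.5, 9.5]
    print(has_drifted(reference, [10.2, 9.8, 10.1]))   # close to the reference
    print(has_drifted(reference, [15.0, 16.0, 14.0]))  # clearly shifted
```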
Accuracy isn't reliability
A high benchmark score does not guarantee
safe real-world behavior
Evaluate for:
not just the average case.
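Looking past the average case can start with slicing the evaluation set. A minimal sketch, assuming each record carries a slice name alongside the true and predicted labels; the slices "day" and "night" and the records are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int, int]  # (slice name, true label, predicted label)


def accuracy_by_slice(records: List[Record]) -> Dict[str, float]:
    """Accuracy computed separately for every slice in the records."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for slice_name, truth, prediction in records:
        totals[slice_name] += 1
        hits[slice_name] += int(truth == prediction)
    return {name: hits[name] / totals[name] for name in totals}


def worst_slice(records: List[Record]) -> Tuple[str, float]:
    """Name and accuracy of the slice where the model does worst."""
    scores = accuracy_by_slice(records)
    name = min(scores, key=lambda key: scores[key])
    return name, scores[name]


# Made-up records: the model looks fine overall and weak at night.
RECORDS: List[Record] = [("day", 1, 1)] * 8 + [("night", 1, 1), ("night", 1, 0)]

if __name__ == "__main__":
    overall = sum(t == p for _, t, p in RECORDS) / len(RECORDS)
    print(f"overall accuracy: {overall:.2f}")
    print("worst slice:", worst_slice(RECORDS))
```

In this toy data the headline number hides a slice that is right only half the time.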
If you can’t see it, you can’t trust it.
Track:
Silent failures are the most dangerous failures.
Expect failure. Design for safety.
Systems must handle that safely.
Common patterns:
They fail gracefully.
AI works best when we are around
Humans provide:
Design systems that allow:
# Predict: AI takes a shot...
result, confidence = model.predict(input_data)
# Check: Too unsure? Don't guess!
if confidence < threshold:
    result = route_to_fallback() or route_to_human()
# Log: Always leave a trail
log_decision(input_data, result)
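The predict / check / log pattern above can be made runnable with a few stand-ins. A minimal sketch: `Decision`, `decide`, `toy_predict` and `toy_fallback` are hypothetical names, and the threshold of 0.8 is an arbitrary placeholder.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Tuple

logger = logging.getLogger("decisions")


@dataclass(frozen=True)
class Decision:
    result: str
    confidence: float
    source: str  # "model" or "fallback"


def decide(
    predict: Callable[[str], Tuple[str, float]],
    fallback: Callable[[str], str],
    input_data: str,
    threshold: float = 0.8,
) -> Decision:
    """Predict, check the confidence, and always leave a trail."""
    result, confidence = predict(input_data)  # Predict
    if confidence < threshold:                # Check
        decision = Decision(fallback(input_data), confidence, "fallback")
    else:
        decision = Decision(result, confidence, "model")
    logger.info("input=%r decision=%r", input_data, decision)  # Log
    return decision


def toy_predict(text: str) -> Tuple[str, float]:
    """Stand-in model: confident only on inputs it calls routine."""
    return ("approve", 0.95) if "routine" in text else ("approve", 0.40)


def toy_fallback(text: str) -> str:
    """Stand-in fallback: hand the case to a person."""
    return "needs human review"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    decide(toy_predict, toy_fallback, "routine renewal")
    decide(toy_predict, toy_fallback, "unusual claim")
```

Logging the confidence and the source next to the result makes it possible to audit later how often the fallback fired.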
AI acts only when a
human approves each decision.
AI acts autonomously, but humans
monitor and can intervene.
AI operates independently, while humans
set goals and review outcomes.
Who will watch the watchmen?
We are part of the safety system.
It's engineering discipline.
Build smarter AI
Build trustworthy systems that safely amplify our capabilities.
model accuracy $\rightarrow$ system reliability
benchmarks $\rightarrow$ real-world impact
research $\rightarrow$ engineering
real problems for real people
it begins at it lasts
AI first, human always!
A new talk titled 'Build AI That Matters'
introduces a practical framework for designing
dependable AI systems that deliver
real-world impact.
Production failures rarely originate
from the model itself.
$$\cdots$$
Dependability requires addressing
the entire system.
No, it makes deployments sustainable.
$$\cdots$$
Without it, repeated failures erode trust
and slow adoption.
Humans are not replaced by AI.
$$\cdots$$
They are part of the system that ensures
safety and accountability.
AI creates value only when it is
reliable enough to be trusted.
$$\cdots$$
The future of AI will be shaped
not just by better models, but by better
engineering of the systems around them.
🙏