# Build
AI
that matters Dependable AI systems for real-world impact
[João Galego](https://jgalego.github.io) $$\left|\text{🧠}\right>$$ Head of AI @ CSW Invited Professor @ ISEG
--- # `$ whoami` -- ## Academic Background MSc Physics
PgDip Forensics
*
PhD Cognitive Science / ABD
**
* **Not-so-fun fact:** I once performed an autopsy
** Dropped out to live life and have fun doing it
-- ## Professional Experience Lead ML Engineer
Solutions Architect
Head of AI
--
-- ## TL;DR Break things at scale Build things faster Make brains
*
go brrr
* **all** brain types welcome!
--- # Agenda 📋 -- ## Mind the
gap
great demos, fragile products -- ## Why AI
fails
and why models aren't the problem -- ##
Dependable
AI models $\rightarrow$ systems $\rightarrow$ society -- ## AI that (actually)
matters
building systems people can trust -- ## PR
FAQ
what you might be wondering,
but were afraid to ask -- ## This talk was inspired by... [Machine Learning that matters](https://arxiv.org/abs/1206.4656) by Kiri Wagstaff
-- ### In my first month at Critical... a colleague pulled me aside and said > "what you do is
not
engineering" -- ### My first reaction? Offense ### My second? Denial -- ### One year later... I owe them an apology
They were right
-- ### This talk is my attempt to set the record straight -- ## Want to dive deeper? [awesome.critical-ai.dev](https://awesome.critical-ai.dev)
--- # Mind the
gap
-- ## The AI revolution is
accelerating
... -- ### [Increased Spending](https://www.idc.com/getdoc.jsp?containerId=prUS49670322)
This year, global spending on AI
will reach $300B, growing 4.2x faster
than average IT spend. -- ### [Widespread Adoption](https://www.gartner.com/document/4839631)
34% of enterprises have deployed
AI in production and 22% will
deploy in the next 12 months. -- ### [Generative AI Impact](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#introduction)
Generative AI will increase
the impact of all AI by 15 to 40%
across all industries. -- ## ... but
reality
tells
a different story -- ### [No Roadmap, No Results](https://finance.yahoo.com/news/organizations-accelerating-ai-investments-early-110000212.html)
When it comes to AI adoption,
64% of companies lack a clear roadmap
with measurable goals. -- ### [Spending Big, Delivering Small](https://finance.yahoo.com/news/organizations-accelerating-ai-investments-early-110000212.html)
67% of organizations expect
to maintain or increase AI spending, yet
only 21% report any proven outcomes. -- ### [From Prototype To Nowhere](https://www.infoworld.com/article/2270692/why-ai-investments-fail-to-deliver.html)
86% of all AI projects
fail
to deliver,
while 50% **never** make it to production. -- ## The AI
production gap
is real and growing... --
--
-- ## Why is it so
hard
to *productionize* ML? -- ### The State of
Production
ML in 2025
**Source:** [The Institute for Ethical AI & Machine Learning](https://ethical.institute/state-of-ml-2025)
-- ###
Not-So-Hidden
technical debt in ML systems
**Source:** Adapted from Sculley *et al.* (2015)
-- ## ML is just
one among many
components... --
--- # Why AI
fails
-- ### Here's an uncomfortable truth... At any AI conference, you'll hear about: - better models - bigger models - more data - higher scores -- ## The Main Problem Real-world impact isn't about **intelligence**. It's about
**RELIABILITY**
. -- ## NOT > Can we build AI? -- ## BUT > Can we **trust** it when it matters? -- ### Meet [US6883201B2](https://patents.google.com/patent/US6883201B2/en) AKA Roomba
-- ### Vacuum cleaning is *simple*... right?
-- ### Vacuum cleaning is *simple*... right?
--
-- ### Let's play a game...
-- ### There are 3 main reasons
why ML systems are removed
from
`prod`
... Who wants to take a guess? -- ### 🥉 Cost ### 🥈 Security ### 🥇
**RELIABILITY**
-- ### Measuring Agents in
Production
**Source:** [Pan *et al.* (2025)](https://arxiv.org/abs/2512.04123)
--
-- ## Why does this matter? Because AI is already
everywhere
that matters most -- ### [AI is saving lives in the ICU...](https://link.springer.com/article/10.1007/s00134-023-07102-y)
-- ### [... making life-or-death decisions](https://link.springer.com/article/10.1007/s00134-023-07102-y)
-- ### [AI is flying drones...](https://news.mit.edu/2025/ai-enabled-control-system-helps-autonomous-drones-uncertain-environments-0609)
-- ### [... and directing air traffic](https://interactive.aviationtoday.com/avionicsmagazine/november-december-2022/how-ai-makes-air-traffic-management-more-predictable-and-more-efficient/)
-- ### [AI is in space...](https://www.esa.int/Applications/Observing_the_Earth/Phsat-2/New_satellite_demonstrates_the_power_of_AI_for_Earth_observation)
-- #### [ESA's Φsat-2](https://www.esa.int/Applications/Observing_the_Earth/Phsat-2/New_satellite_to_show_how_AI_advances_Earth_observation)
-- ##### Maritime Vessel Detection
-- ##### Wildfire Detection
-- #### [Autonomous in-space assembly](https://parolaanalytics.com/parolanews/ai-nasa-autonomous-in-space-assembly-tech/)
-- ##### "(...) a convergence of modern control theory,
and machine learning" (Patent: [US11989009B2](https://patents.google.com/patent/US11989009B2/en))
-- ### [Datacenters in space](https://taranis.ie/datacenters-in-space-are-a-terrible-horrible-no-good-idea/) // Taranis Why it's a terrible, horrible, no good idea
-- ### [AI is inside nuclear reactors](https://www.anl.gov/ntns/article/nuclear-energy-becomes-smarter-and-safer-with-ai)
-- #### [The Atom and the Algorithm](https://www.iaea.org/newscenter/statements/the-atom-and-the-algorithm-nuclear-energy-and-ai-are-converging-to-shape-the-future) Nuclear energy and AI are converging
to shape the future
-- #### AI is already improving
nuclear
in many ways... - Operations / predictive maintenance - Design / reactor modelling - Safety / accident simulation - Safeguards / surveillance footage analysis -- > "Reassuringly, despite its brilliance, **AI still needs a human** to make sure it is right and impartial, and to understand the politics behind a safeguards footnote"
**Source:** [IAEA Director General Rafael Mariano Grossi](https://www.iaea.org/newscenter/statements/the-atom-and-the-algorithm-nuclear-energy-and-ai-are-converging-to-shape-the-future)
-- #### Nuclear at Argonne / PRO-AID
-- ### [Vibe nuclear](https://pivot-to-ai.com/2025/11/18/vibe-nuclear-lets-use-ai-shortcuts-on-reactor-safety/) // Pivot-to-AI What it is & why it's a bad idea
-- ## AI is in our
critical
services... quietly running in the background until something goes ʷrₒnᵍ -- ## What is a
critical
system? -- A system whose failure may cause - injury or loss of life 😵 - infrastructure damage 💥 - environmental harm 🚱 - mission failure 🚀 - major financial loss 📉 -- ## When these systems
fail
... real accidents happen! -- ### [Mars Climate Orbiter](https://science.nasa.gov/mission/mars-climate-orbiter/) Lost a spacecraft because one team
used metric and the other used imperial 📏
-- ### [Patriot Missile Failure](https://cs.nyu.edu/~exact/resource/mirror/patriot.htm) Killed 28 soldiers due to a cumulative
rounding error in the system’s software 🎯
-- ### [Knight Capital Trading Glitch](https://www.cio.com/article/286790/software-testing-lessons-learned-from-knight-capital-fiasco.html) Lost $440M in 30 minutes
after deploying buggy code 💸
-- ### [Toyota Unintended Acceleration](https://www.transportation.gov/briefing-room/us-department-transportation-releases-results-nhtsa-nasa-study-unintended-acceleration) Spaghetti code broke the brakes 🚗
-- ### Key Point Complex systems fail in ways we can't predict. -- ### Good enough is
not
good enough At least, not in critical systems -- > “Do you code with your
loved ones in mind?”
― Emily Durie-Johnson, [Strategies for Developing Safety-Critical Software in C++](https://www.youtube.com/watch?v=VJ6HrRtrbr8)
-- ## When the stakes are this high... Where does that leave AI in
critical
systems? Is it really a good idea? -- ### Traditional software ```mermaid flowchart LR Input --> Code Code --> Output style Input fill:green,color:#fff style Output fill:red,color:#fff ``` -- #### It does exactly what you tell it to do... - Same input, same output... always - Rules are explicit and readable - Bugs have clear causes and fixes -- You write the rules You know what it will do You know why it broke -- #### What is
determinism
?
**Source:** [Andersson *et al.* (2024)](https://ieeexplore.ieee.org/document/10748739)
-- #### [Defeating Nondeterminism in LLM Inference](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)
**Source:** He *et al.* (2025)
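--
#### Nondeterminism, in one line of Python

One root cause highlighted in that post is that floating-point addition is not associative, so the same numbers reduced in a different order can produce different bits. A minimal, self-contained illustration (plain Python, no ML stack needed):

```python
# Floating-point addition is not associative:
# the same three numbers, grouped differently, give different results.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False
```

Scale this up to billions of parallel additions whose order depends on batch size and kernel scheduling, and "same input, same output" quietly stops holding.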
-- ## ML Systems ```mermaid flowchart LR Data --> Training Training --> Model Input --> Inference Model --> Inference Inference --> Output style Data fill:blue,color:#fff style Model fill:orange,color:#fff style Input fill:green,color:#fff style Output fill:red,color:#fff ``` -- You shift the agency to **data**: - The data wrote the rules - Change the data, change the behavior - Garbage in, garbage out -- You
didn't
write the rules You
don't
always know what it will do You
don't
always know why it broke -- ## AI amplifies complexity... and complexity
breaks
things. -- ## S*** happens! Models *will* make
mistakes
-- ### [Just stick something to it...](https://spectrum.ieee.org/slight-street-sign-modifications-can-fool-machine-learning-algorithms) or when is a stop sign not like a stop sign?
-- ### [Nissan's Emergency Braking](https://incidentdatabase.ai/cite/341/) False positives posed traffic risks to drivers
-- ### [Waymo School Bus Problem](https://philkoopman.substack.com/p/the-waymo-school-bus-problem) Polite software that 'moved out of the way'
by passing illegally. 🚌
-- ### Even great models *eventually* fail... often in **strange** and **unpredictable** ways -- ## How can we fight this? Let's turn to the [ECSS ML handbook](https://ecss.nl/home/ecss-e-hb-40-02a-15-november-2024/)... --
-- **Golden Rule #1** > Do
**NOT**
build AI
just because you have data. -- **Golden Rule #2** > Do
**NOT**
use AI
just because you can. -- ### Safety Cage Architecture
-- #### Key Idea
Don't
try to prove that ML is safe. Instead, **constrain** it so it can't be
un
safe. --
**Source:** [Delseny *et al.* (2021)](https://arxiv.org/abs/2310.06506) / DEEL
-- ### Safety Envelope > Doer/Checker
-- ### Safety Envelope > Doer/Checker The **doer** optimizes for performance. The **checker** handles
**safety**
. -- ### Doer/Checker > Automotive The **doer** can be low SIL ⬇️ The **checker**
*must*
be **high** SIL 🚨 -- #### Automotive > ISO26262 Safety Integrity Levels (SIL)
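--
#### Doer/Checker in ~15 lines

A minimal sketch of the pattern, with hypothetical names and limits chosen purely for illustration: the doer can be any model, while the checker is simple enough to review, test, and certify independently.

```python
def doer(speed_kmh: float) -> float:
    """Performance-oriented component (e.g. an ML model).
    May be arbitrarily clever... or arbitrarily wrong."""
    return 0.02 * speed_kmh ** 1.3  # proposed braking force (made-up policy)

def checker(speed_kmh: float, proposed: float) -> float:
    """Safety-oriented component: a few auditable rules that
    clamp the doer's output into a safe envelope."""
    MAX_FORCE = 10.0                              # actuator limit (illustrative)
    min_force = 1.0 if speed_kmh > 130 else 0.0   # must brake when overspeeding
    return min(max(proposed, min_force), MAX_FORCE)

for speed in (30.0, 150.0, 500.0):
    print(speed, checker(speed, doer(speed)))
```

Whatever the doer returns, the checker guarantees the command stays inside the envelope: the property you certify is the envelope, not the model.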
-- #### Aerospace > DO-178C Development Assurance Levels (DAL)
-- ##### [Runway Sign Classifier](https://www.mathworks.com/help/deeplearning/ug/verify-an-airborne-deep-learning-system.html) Is this application DAL-C or DAL-D?
**Source:** Adapted from [Dimitriev *et al.* (2023)](https://arxiv.org/abs/2310.06506)
-- ##### [NASA on using LLMs for Assurance](https://ntrs.nasa.gov/citations/20250001849)
-- #### (Neural) Simplex Architecture
**Source:** [Phan *et al.* (2019)](https://arxiv.org/abs/1908.00528)
-- #### Simplex Architecture > Automotive
-- ##### Patent: [US10962972B2](https://patents.google.com/patent/US10962972B2/en) Safety Architecture for Autonomous Vehicles
-- #### Saab / Helsing Collaboration
> "While all of Helsing’s work primarily focused on software model training, integration with Gripen E APIs and testing, Saab actually set the groundwork for operating a software-defined aircraft several years ago with an overhaul to the Gripen’s avionics."
-- ### Saab's [Split Avionics](https://www.mobilityengineeringtech.com/component/content/article/53597-are-military-avionics-systems-ready-for-artificial-intelligence)
-- #### Tactical vs Flight Critical
> "Gripen’s avionics system separates 10% of the aircraft's flight critical management codebase from 90% of its tactical management code, resulting in avionics that are 'hardware agnostic'."
-- #### [Software-Defined Assurance](https://helsing.ai/newsroom/helsing-white-paper-software-defined-assurance) / Helsing
> "**Many of the well-known approaches used to ensure the reliability of software are difficult or impossible to apply to AI-based software**, where models are created from data rather than hand-coded by software developers. This creates friction in the commissioning and development of AI-based software, because it is unclear what criteria will be used to assure it. The potential worst case is that assurance of systems involving AI are subject to a matrix of both poorly-fitting existing requirements and new but underspecified AI-related requirements."
-- ### Airborne AI/ML Assurance Lifecycle
-- ### Testing AI is part of the system. Test it like it is. --
-- The [ECSS ML handbook](https://ecss.nl/home/ecss-e-hb-40-02a-15-november-2024/) suggests checking: - Known cases (the expected) - Coverage (the internals) - Edge cases (the unknown) - Adversarial cases (the hostile) -- ### V-Cycle $\rightarrow$ W-Cycle
**Source:** [EASA / Daedalean (2024)](https://www.easa.europa.eu/en/document-library/general-publications/concepts-design-assurance-neural-networks-codann)
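--
#### The four checks, as code

A toy sketch of what the four ECSS check categories can look like in practice, for a hypothetical `predict` function (names and thresholds are illustrative, not from the handbook):

```python
import math

def predict(x: float) -> float:
    """Stand-in for a model under test: a clipped linear score."""
    if math.isnan(x):
        raise ValueError("refusing to score NaN input")
    return max(0.0, min(1.0, 0.5 + 0.1 * x))

# 1. Known cases (the expected)
assert predict(0.0) == 0.5

# 2. Coverage (the internals): exercise both clipping branches
assert predict(100.0) == 1.0 and predict(-100.0) == 0.0

# 3. Edge cases (the unknown): extreme but legal inputs
assert 0.0 <= predict(1e308) <= 1.0

# 4. Adversarial cases (the hostile): malformed input must be rejected, not scored
try:
    predict(float("nan"))
    raise AssertionError("NaN was silently scored")
except ValueError:
    pass
```

Average-case testing covers only the first category; the other three are where production surprises live.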
-- ### Formal Verification *Mathematically* prove that
certain behaviors
cannot
happen. -- Here's a crash course on **formal methods**
for software verification... --
\* Oldie, but goodie! -- #### Reactive System Systems that maintain an ongoing interaction
with the environment, as opposed to computing
some final value on termination. -- ##### Concurrent programs
-- ##### Embedded and process control programs
-- ##### Perpetually ongoing processes
-- ##### Operating systems
-- ### These systems are not
defined by
**what**
they do but
**when**
they do it. -- There's a saying at Google... > "Software engineering is programming integrated over **time**."
― Winters, Manshreck & Wright (2020)
-- If you take this *literally*... $$\texttt{SWE} = \int \texttt{Programming} ~dt$$ -- Then engineering itself is just... $$f \mapsto \texttt{E}[f] = \int^{\min[\text{EOL}, ~+\infty]}_{\max[-\infty, ~\text{idea}]} f ~dt$$ -- ##### Our Mission Ensure that certain properties hold **at all times**. -- #####
Safety property
> bad thing never happens $$\square ~\neg \texttt{bad}$$ -- ####
Liveness property
> good thing eventually happens $$\diamond ~\texttt{good}$$ -- ### Formal Methods $\rightarrow$ AI --
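#### Checking □ and ◇ on finite traces

Both operators are easy to state in code. A minimal sketch over finite execution traces (real model checkers handle infinite behaviors, but the intuition is the same):

```python
def always(pred, trace):
    """Safety, □ pred: the predicate holds in every state."""
    return all(pred(s) for s in trace)

def eventually(pred, trace):
    """Liveness, ◇ pred: the predicate holds in at least one state."""
    return any(pred(s) for s in trace)

# Toy trace: vehicle speed over time
trace = [120, 90, 60, 30, 0]

print(always(lambda s: s <= 130, trace))    # safety: never overspeed -> True
print(eventually(lambda s: s == 0, trace))  # liveness: eventually stops -> True
print(always(lambda s: s <= 100, trace))    # violated by the first state -> False
```

A safety violation shows up on a finite prefix (one bad state is enough); a liveness violation can only be refuted by reasoning about infinite continuations, which is why the two classes are verified differently.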
--
--
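--
#### A taste: interval bound propagation

One practical bridge between formal methods and neural networks is to push input *intervals* instead of points through the network, yielding a sound bound on every reachable output. A hand-rolled scalar sketch (real tools do this over full tensors):

```python
def affine_bounds(lo, hi, w, b):
    """Bounds of w*x + b when x is anywhere in [lo, hi]."""
    y0, y1 = w * lo + b, w * hi + b
    return min(y0, y1), max(y0, y1)

def relu_bounds(lo, hi):
    """Bounds of relu(x) when x is anywhere in [lo, hi]."""
    return max(lo, 0.0), max(hi, 0.0)

# Tiny "network": relu(2x + 3) for any input x in [-1, 1]
lo, hi = affine_bounds(-1.0, 1.0, w=2.0, b=3.0)  # -> (1.0, 5.0)
lo, hi = relu_bounds(lo, hi)                     # -> (1.0, 5.0)

print(lo, hi)  # the output provably stays in [1.0, 5.0] for *all* inputs
```

Unlike testing, this is a proof over the whole input range: no point in [-1, 1] can drive the output below 1.0 or above 5.0.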
-- #### ONNX $\rightarrow$
Safe
ONNX
When AI fails, lives are at stake... So let's fix the language first! -- ##### Docs $\rightarrow \cdots \rightarrow$ [Why3](https://why3.org/)
-- ##### Property-Based Testing / [Hypothesis](https://hypothesis.readthedocs.io/en/latest/)

```python
"""
This file uses the Hypothesis library to generate a wide range of test cases
for the Flatten operation in ONNX.
"""
import os
import json

import numpy as np
import ml_dtypes
import tensorflow as tf
from hypothesis import given, settings
import hypothesis.extra.numpy as hnp
from hypothesis import strategies as st
from onnx import helper
import onnx.checker
import onnx.reference
from onnxruntime import InferenceSession

if os.path.exists("generated_data.json"):
    os.remove("generated_data.json")

# Inputs/attributes for Flatten operation
inputs_attributes = {
    "min_rank_input": 1,      # Adjust as needed
    "max_rank_input": 10,     # Adjust as needed
    "min_dim_size_input": 1,  # Adjust as needed
    "max_dim_size_input": 5,  # Adjust as needed
    # Available providers: CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider
    "ONNXRuntime_Provider": "CPUExecutionProvider"
}

# Flatten supported types, organized by ONNXRuntime_Provider
flatten_types = {
    "CPUExecutionProvider": {
        "INT8": np.int8, "INT16": np.int16, "INT32": np.int32, "INT64": np.int64,
        "UINT8": np.uint8, "UINT16": np.uint16, "UINT32": np.uint32, "UINT64": np.uint64,
        "FP16": np.float16, "FP32": np.float32, "FP64": np.float64,
        "STRING": np.str_, "BOOL": np.bool_, "BFLOAT16": ml_dtypes.bfloat16
    },
    "CUDAExecutionProvider": {
        "INT8": np.int8, "INT16": np.int16, "INT32": np.int32, "INT64": np.int64,
        "UINT8": np.uint8, "UINT16": np.uint16, "UINT32": np.uint32, "UINT64": np.uint64,
        "FP16": np.float16, "FP32": np.float32, "FP64": np.float64,
        "BOOL": np.bool_, "BFLOAT16": ml_dtypes.bfloat16
    },
    "DmlExecutionProvider": {
        "INT8": np.int8, "INT16": np.int16, "INT32": np.int32, "INT64": np.int64,
        "UINT8": np.uint8, "UINT16": np.uint16, "UINT32": np.uint32, "UINT64": np.uint64,
        "FP16": np.float16, "FP32": np.float32, "FP64": np.float64, "BOOL": np.bool_
    }
}

dtype_to_key = {v: k for k, v in flatten_types.get(inputs_attributes["ONNXRuntime_Provider"]).items()}

# Store generated data
generated_data = {
    "rank_input_tensor": [],
    "shape_input_tensor": [],
    "x_type": [],
    "axis": []
}


@st.composite
def valid_slice_args(draw):
    """Generate valid Flatten arguments."""
    # ---------------------------------------------------
    # Restrictions
    # ---------------------------------------------------
    # X [C2] - Input/Output Types Consistency
    all_valid_types = list(flatten_types.get(inputs_attributes["ONNXRuntime_Provider"]).keys())
    input_type = draw(st.sampled_from(all_valid_types))
    input_dtype = flatten_types.get(inputs_attributes["ONNXRuntime_Provider"])[input_type]
    if np.issubdtype(input_dtype, np.integer):
        min_val = np.iinfo(input_dtype).min
        max_val = np.iinfo(input_dtype).max
        input_strategy = st.integers(min_value=min_val, max_value=max_val)
    elif np.issubdtype(input_dtype, np.floating):
        min_val = np.finfo(input_dtype).min
        max_val = np.finfo(input_dtype).max
        input_strategy = st.floats(min_value=min_val, max_value=max_val)
    elif np.issubdtype(input_dtype, np.bool_):
        input_strategy = st.booleans()
    elif np.issubdtype(input_dtype, np.str_):
        input_strategy = st.text(
            alphabet=st.characters(codec="utf-8", blacklist_characters='\x00')
        )
    elif input_type == "BFLOAT16":
        min_bfloat16 = float(ml_dtypes.finfo(flatten_types.get(inputs_attributes["ONNXRuntime_Provider"])["BFLOAT16"]).min)
        max_bfloat16 = float(ml_dtypes.finfo(flatten_types.get(inputs_attributes["ONNXRuntime_Provider"])["BFLOAT16"]).max)
        input_strategy = st.floats(min_value=min_bfloat16, max_value=max_bfloat16)

    # ---------------------------------------------------
    # Input X
    # ---------------------------------------------------
    rank_input_tensor = draw(st.integers(
        min_value=inputs_attributes["min_rank_input"],
        max_value=inputs_attributes["max_rank_input"]
    ))
    shape_input_tensor = []
    for _ in range(rank_input_tensor):
        dim_size = draw(st.integers(
            min_value=inputs_attributes["min_dim_size_input"],
            max_value=inputs_attributes["max_dim_size_input"]
        ))
        shape_input_tensor.append(dim_size)

    if input_type == "BFLOAT16":
        temp_tensor = draw(hnp.arrays(dtype=np.float32, shape=shape_input_tensor, elements=input_strategy))
        tf_tensor = tf.cast(tf.constant(temp_tensor), tf.bfloat16)
        x = tf_tensor.numpy()
    else:
        x = draw(hnp.arrays(dtype=input_dtype, shape=shape_input_tensor, elements=input_strategy))

    # ---------------------------------------------------
    # Attribute axis
    # ---------------------------------------------------
    # axis [C1] -> X [C1], axis [C2] - Axis Range
    axis = draw(st.integers(
        min_value=-(rank_input_tensor),
        max_value=rank_input_tensor
    ))

    # ---------------------------------------------------
    # Output y
    # ---------------------------------------------------
    # Y [C1]
    dy0 = np.prod(shape_input_tensor[:axis])
    dy1 = np.prod(shape_input_tensor[axis:])
    y_shape = [int(dy0), int(dy1)]

    return x, axis, y_shape


@settings(max_examples=10000, deadline=None)
@given(valid_slice_args())
def test_flatten(args):
    """Run the test."""
    x, axis, y_shape = args
    generated_data["rank_input_tensor"].append(len(x.shape))
    generated_data["shape_input_tensor"].append(list(x.shape))
    x_type_key = dtype_to_key.get(x.dtype.type, str(x.dtype))
    generated_data["x_type"].append(x_type_key)
    generated_data["axis"].append(axis)
    y = run_onnx_flatten_test(x, axis, y_shape, inputs_attributes["ONNXRuntime_Provider"])
    if axis < 0:
        axis += len(x.shape)
    check_constraints(y_shape, y, x, axis)


def teardown_module():
    """Write generated data to a JSON file."""
    data = {
        "title": "Data generated by Hypothesis for Flatten operation tests",
        "min_rank_input": inputs_attributes["min_rank_input"],
        "max_rank_input": inputs_attributes["max_rank_input"],
        "rank_input_tensor": generated_data["rank_input_tensor"],
        "min_dim_size_input": inputs_attributes["min_dim_size_input"],
        "max_dim_size_input": inputs_attributes["max_dim_size_input"],
        "shape_input_tensor": generated_data["shape_input_tensor"],
        "x_type": generated_data["x_type"],
        "axis": generated_data["axis"],
        "ONNXRuntime_Provider": inputs_attributes["ONNXRuntime_Provider"]
    }
    with open("generated_data.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


def run_onnx_flatten_test(x, axis, y_shape, provider):
    """Run the ONNX Flatten operation."""
    x_onnx = helper.make_tensor_value_info('x', helper.np_dtype_to_tensor_dtype(x.dtype), x.shape)
    # Y [C3] -> X [C2] - Input/Output Types Consistency
    y_onnx = helper.make_tensor_value_info('y', helper.np_dtype_to_tensor_dtype(x.dtype), y_shape)
    node_def = helper.make_node(
        'Flatten',
        inputs=['x'],
        outputs=['y'],
        axis=axis
    )

    # Create the graph
    graph_def = helper.make_graph(
        [node_def],
        'test_flatten',
        [x_onnx],
        [y_onnx],
    )
    onnx_model = helper.make_model(graph_def)

    # Let's freeze the opset
    del onnx_model.opset_import[:]
    opset = onnx_model.opset_import.add()
    opset.domain = ''
    opset.version = 22
    onnx_model.ir_version = 10

    # Verify the model
    onnx.checker.check_model(onnx_model)

    if str(x.dtype) == "bfloat16":
        # Use the ONNX reference implementation for bfloat16:
        # BFLOAT16 is not supported by ONNX Runtime when using NumPy inputs.
        # An alternative is to use torch tensors and the CUDA provider.
        sess = onnx.reference.ReferenceEvaluator(onnx_model)
    else:
        # Use ONNX Runtime for other types
        sess = InferenceSession(onnx_model.SerializeToString(), providers=[provider])

    y = sess.run(None, {'x': x})[0]
    print("y shape:", y.shape)
    print("y dtype:", y.dtype)
    print("y:", y)
    return y


def check_constraints(y_shape, y, x, axis):
    """Check constraints for generated data."""
    # X[C1]
    assert axis <= len(x.shape)
    # X[C2]
    x_is_string = np.issubdtype(x.dtype, np.str_) or np.issubdtype(x.dtype, np.object_)
    y_is_string = np.issubdtype(y.dtype, np.str_) or np.issubdtype(y.dtype, np.object_)
    if x_is_string and y_is_string:
        pass
    else:
        assert x.dtype == y.dtype
    # axis[C1] -> X[C1]
    # axis[C2]
    assert -len(x.shape) <= axis <= len(x.shape)
    # Y
    assert len(y_shape) == 2
    # Y[C1]
    assert y_shape == list(y.shape)
    # Y[C2]
    assert check_coords_value(x, y, axis)
    # Y[C3] -> X[C2]


def calculate_y_coords(x_coords, x_shape, axis):
    """Calculate the corresponding coordinates in Y given coordinates in X."""
    n = len(x_shape)
    a = 0
    for z in range(0, axis):
        prod = 1
        for k in range(z + 1, axis):
            prod *= x_shape[k]
        a += x_coords[z] * prod
    b = 0
    for z in range(axis, n):
        prod = 1
        for k in range(z + 1, n):
            prod *= x_shape[k]
        b += x_coords[z] * prod
    return (a, b)


def check_coords_value(x, y, axis):
    """Check if there is a valid correspondence between input and output values."""
    result = []
    it = np.nditer(x, flags=['multi_index'])
    for x_value in it:
        coords = it.multi_index
        y_coords = calculate_y_coords(coords, list(x.shape), axis)
        y_value = y[y_coords]
        result.append(x_value == y_value)
    return all(result)
```

-- ##### Why3 $\rightarrow \cdots \rightarrow$ C
-- ### AI $\rightarrow$ Formal Methods -- #### [Natural Language
$\downarrow$
Temporal Logic Formulas](https://conformalnl2ltl.github.io/)
-- #### [Minimize Hallucinations
with Automated Reasoning](https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-and-deliver-up-to-99-verification-accuracy-with-automated-reasoning-checks-now-available/)
-- ### When AI writes
most of the software
in the world... who
verifies
it? -- > Most people think of verification as a cost, a tax on development, justified only for safety-critical systems. **That framing is outdated.** When AI can generate verified software as easily as unverified software, verification is no longer a cost. It is a catalyst.
― Leonardo de Moura, creator of [Lean](https://lean-lang.org/)
-- #### [Verina](https://verina.io/): Benchmarking Verifiable Code Generation
-- **Write** the code
-- **Prove** the code
-- **Test** the code
-- ### 🔴 Breakpoint And now for a word from our sponsors! --- #
Dependable
AI -- ### Intelligent or not... Building systems that *last* is
**HARD**
-- ### When it comes to AI... The
real
challenge isn't model accuracy. It's system reliability under **UNCERTAINTY**. -- ### Typical ML focuses on ```mermaid flowchart LR Data --> Model Model --> Metrics ``` -- ### But the model is only the
beginning
We need to move from models to systems! --
-- This thread reveals 3 things: - Engineers don't know their history - Tool creators have massive egos - The importance of **modelling the model** -- ##
Dependable
AI Mindset 1. Expect failure 2. Design for recovery 3. Monitor everything 4. Keep humans around -- ## Engineering Best Practices Because good intentions are not enough! -- ### Data > Garbage in, garbage out -- #### AI systems learn from data If the data is wrong, incomplete, or drifting, the system will
fail
. -- #### Your model is only as good as your data Focus on: - data validation - dataset versioning - distribution monitoring - label quality checks -- #### You don’t control your model
**your data does**
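--
#### A drift check you can start with today

As one concrete starting point, here is a small Population Stability Index (PSI) sketch, a common drift score; thresholds like 0.2 are conventional rules of thumb, not universal constants:

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population Stability Index between a reference sample and live data."""
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    edges[0] = min(edges[0], observed.min()) - 1e-9   # make outer bins catch-all
    edges[-1] = max(edges[-1], observed.max()) + 1e-9
    e = np.histogram(expected, edges)[0] / len(expected)
    o = np.histogram(observed, edges)[0] / len(observed)
    e, o = np.clip(e, 1e-6, None), np.clip(o, 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)    # data the model was trained on
same = rng.normal(0.0, 1.0, 10_000)     # production data, no drift
shifted = rng.normal(1.0, 1.0, 10_000)  # production data, mean has drifted

print(psi(train, same))     # small (< 0.1): distribution looks stable
print(psi(train, shifted))  # large (> 0.2): raise the alarm
```

The point is not PSI specifically: any monitored, thresholded comparison between training-time and serving-time distributions beats finding out from your users.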
-- ## Model > Accuracy isn't reliability -- A high benchmark score does not guarantee
**safe real-world behavior** -- #### Good numbers are not enough Evaluate for: - robustness - edge cases - distribution shift - calibration -- #### Test the failure modes **not** just the average case. -- ## Observability > If you can’t see it, you can’t trust it. -- #### Watch everything, don't fly blind Track: - data drift - prediction drift - system health - anomaly signals -- #### Dogs not barking? Silent failures are the most dangerous failures. -- ## Guardrails > Expect failure. Design for safety. -- #### Models will eventually fail. Systems must handle that *safely*. -- #### Build the safety net Common patterns: - confidence thresholds - fallback logic - human escalation - policy checks -- #### Reliable systems don't fail silently... They fail *gracefully*. -- ## Humans > AI works best when we are around -- #### What machines can't replace (yet!) Humans provide: - context - judgment - accountability -- Design systems that allow: - review - intervention - override -- ```python # Predict: AI takes a shot... result, confidence = model.predict(input_data) # Check: Too unsure? Don't guess! if confidence < threshold: result = route_to_fallback() or route_to_human() # Log: Always leave a trail log_decision(input_data, result) ``` -- ### Human
in
the loop AI acts only when a
human approves each decision. -- ### Human
on
the loop AI acts autonomously, but humans
monitor and can intervene. -- ### Human
over
the loop AI operates independently, while humans
set goals and review outcomes. -- ### Humans are
not
the weakness. We are part of the safety system. -- ## Dependability is
not
a feature It's engineering discipline. --- # AI that (actually)
matters
-- ## AI
where
it matters most
-- ### NOT > Build smarter AI -- ### BUT > Build trustworthy systems > that safely amplify our capabilities. -- ## AI needs to
pivot
model accuracy $\rightarrow$ system reliability benchmarks $\rightarrow$ real-world impact research $\rightarrow$ engineering -- #### Engineering is about solving
real problems for real people
$$\cdots$$ #### Engineering does not stop at *it works*
it begins at
**it lasts**
-- ## Build AI that
matters
AI first, human always! --- # PR
FAQ
--
-- ### LISBON – (Mar 2026) A new talk titled 'Build AI That Matters'
introduces a practical framework for designing
dependable AI systems that deliver
real-world impact. -- ### Why isn't model accuracy enough? -- Production failures rarely originate
from the model itself. $$\cdots$$ Dependability requires addressing
the **entire system**. -- ### Doesn't adding reliability slow innovation? -- No, it makes deployments **sustainable**. $$\cdots$$ Without it, repeated failures erode trust
and slow adoption. -- ### What role do humans play? -- Humans are not replaced by AI. $$\cdots$$ They are part of the system that ensures
**safety and accountability**. -- ### What is the key takeaway? -- AI creates value only when it is
**reliable** enough to be **trusted**. $$\cdots$$ The future of AI will be shaped
not just by better models, but by better
**engineering** of the systems around them. --- ## Thank you! 🙏