
About the Series
One AI model reportedly scored in the top 10 percent on the Uniform Bar Exam. Other models boast impressive scores on medical licensing exams, graduate-level science questions, and coding competitions. But if you're a state agency evaluating whether AI can help residents apply for benefits, assist caseworkers, review permits, answer questions accurately, or improve service delivery, those scores tell you almost nothing.
As government agencies, we need to measure how an AI tool helps us perform a task effectively, reliably, and safely over time.
Measuring success is harder than it sounds. Many organizations focus on simple metrics such as time saved, but measuring efficiency tells us too little about effectiveness: Are residents getting better outcomes? Are we working in ways that enhance democracy? Improve governance? Make the lives of those we serve and our staff better?
Complicating matters further, AI systems are not static. Performance can change both as models evolve and as staff learn new ways of using tools. A pilot that appears successful may perform very differently months later when deployed at scale.
This five-part webinar series introduces practical approaches to AI benchmarking and performance monitoring in government.
Designed for state and local government program managers, analysts, and operational leaders, the series will provide practical tools for evaluating whether AI is delivering public value.
By the end of this series, participants will be able to:
- Explain what AI benchmarking is, why it matters, and why it is hard.
- Describe how agencies are identifying meaningful measures of effectiveness.
- Compare performance on real government tasks and workflows with and without AI.
- Use evidence to guide decisions about adoption, implementation, scaling, redesign, or retirement.
- Distinguish between pre-adoption testing, pilot evaluation, and ongoing post-deployment monitoring.
- Identify risks, limitations, and unacceptable errors to assess when evaluating AI systems.