Ben Murdoch

Senior Research Engineer at Google Labs. Working on agent architecture, evaluation, and the responsible-AI launch path for language applications.

San Francisco, CA

i. About

I'm a researcher and engineer working at the intersection of large language models, agentic systems, and AI safety. At Google Labs I work on agent evaluation, safety, and architecture for language applications: designing benchmarks and autoraters, finding launch-blocking exploits, and building agents that reason their way through real, multi-hop problems.

I've spent the last eleven years working across robotics, reinforcement learning, and language models. Along the way I've helped ship some of the largest GPT-2 deployments into production; built the first product to do a billion inferences on GPT-3 at Latitude, where I was a founding-team member; built the first deep-learning architecture in Microsoft Cortana to meet its 5 ms CPU budget; and led the safety launch path for code-assist features used inside Colab, Android Studio, and CodeTips on google.com.

I want to spend the next decade making sure the fourth industrial revolution goes well for life on Earth. In practice that means caring as much about red-teaming and rater agreement as I do about model architecture, and talking to users early and often, because the systems we build only matter if real people can rely on them. The era of self-improving products is here, and I want to help point them in the right direction.

ii. Experience

  1. Google

    Google Labs

    1. Senior Research Engineer

      Mar 2025 — Present

      Project CC · Language Applications

      • Founding member of Project CC, a proactive, helpful AI assistant. Built the initial user-modeling layer: preference extraction, to-do detection, and the day-to-day prioritization that drives the flagship Your Day Ahead feature.
      • Enabled Google Workspace function calling for the agent and made tool-calling reliable across the Google surface area.
      • Currently working on agent evaluation and quality hill-climbing, including using our autorater for recursive self-improvement of agent architectures and prompts.
      • Validating agent reliability inside the sandboxed runtime environment.
    2. Research SDE, Search Agent

      Jan 2024 — Mar 2025

      Language Applications

      • Received an Outstanding Impact rating (top 22% of Googlers) for work on agent evaluation and architectures.
      • Designed novel agent architectures for multi-hop question answering, plus the autoraters used to evaluate them. The autoraters were upstreamed to One Recipe as tracking evals for Gemini checkpoint quality on multi-hop QA and web search.
      • Built a multi-modal hierarchical planning agent that achieved a new SOTA (+4.7%) on internal search benchmarks.
      • Set internal SOTA for social-engineering attack classification using a self-attention JAX DNN discovered via architecture search.
      • Built the first multi-hop RAG agent for NotebookLM and exposed Notebook's internal APIs as MCP servers for agent loops to use.
      • Collaborated with Google DeepMind to enable RL hill-climbing for multi-hop search in base tool-use Gemini.
    3. Research SDE, AIDA Code Generation

      Jan 2023 — Jan 2024

      AIDA · Code Generation · Magi

      • Trained foundation models for code completion shipped in Colab and Android Studio.
      • Led safety for the 2023 Google I/O launch of CodeTips on google.com. Identified launch-blocking exploits, partnered with RAI to retrain code-domain classifiers, and tuned precision-recall thresholds for production.
      • Worked with the Google Search Magi team on prompt improvements and on distilling large transformer models into deployable form.
      • Redesigned the team's annotation process, raising inter-rater agreement on human evaluation.
  2. Machine Learning Engineer

    Armorblox

    • Built classifiers for generative email attacks: extortion, romance fraud, and other social-engineering scams.
    • Pioneered the use of attention-activation visualization inside the company. The system surfaced why a classifier fired by highlighting the spans it attended to, so customer security teams could review and give feedback.
    • Built infrastructure for cluster-level hyperparameter tuning of RoBERTa with Ray Tune, Kubernetes, and Comet ML.
    • Improved production highlighting accuracy by 25% via a RoBERTa-based SQuAD2.0 extractive summarization model.
    • Designed an end-to-end text-classification pipeline that improved architecture-search speed by 4× across RoBERTa, DeBERTa, ALBERT, and GPT-2.
    • Sped up maximum inner-product search over MPNet document embeddings by over 4000× using FAISS.
    • First in the company to adopt JAX for production neural networks.
    • Spoke at RSAC 2022 on “Generative Email Attacks and How to Defend Against Them.”
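The inner-product search mentioned above boils down to finding, for each query embedding, the corpus rows with the largest dot product. A minimal NumPy sketch of that baseline follows; the corpus and queries are random stand-in data, and the dimensions are illustrative, not the production values. FAISS's `IndexFlatIP` computes the same top-k exactly, and its approximate indexes are where the large speedups come from.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                              # embedding dim (stand-in)
corpus = rng.normal(size=(10_000, d)).astype("float32")
queries = rng.normal(size=(5, d)).astype("float32")

def mips(corpus, queries, k=3):
    """Brute-force maximum inner-product search: top-k corpus rows per query."""
    scores = queries @ corpus.T                     # (n_queries, n_docs)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]   # unordered top-k
    # Sort the k candidates per query by descending score.
    order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)

ids = mips(corpus, queries)
# In production the same query is served by an index instead, e.g.:
#   index = faiss.IndexFlatIP(d); index.add(corpus); _, ids = index.search(queries, k)
```

The brute-force matmul is exact but scales linearly with corpus size per query; FAISS trades a one-time indexing cost for sublinear search.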
  3. Research Scientist · Founding Team

    Latitude (AI Dungeon)

    • Built one of the world's largest production GPT-2 deployments, with over 600 million model calls per month on GPT-2 XL.
    • One of the first companies in OpenAI's GPT-3 API beta cohort, and likely the first to cross one billion GPT-3 API calls.
    • Led safety research, exploring methods to reduce harassment and unwanted sexual encounters in-game.
    • Reduced model-serving costs by 65% (~$200K/year) by migrating the production GPU cluster from AWS to CoreWeave.
    • Designed an end-to-end AutoML pipeline for T5 used across the product for tasks like quest detection and fine-grained classification.
    • Built a community-driven Mechanical-Turk equivalent inside the product, collecting 35K+ free labels from users.
  4. Applied Science Intern

    Amazon Alexa AI

    • Trained a billion-parameter BERT-style domain-specific conversational language model for Alexa.
    • Studied the drawbacks of repeated transfer learning on large language models.
  5. Intern, Substrate

    Microsoft

    • Used Bing query data to learn the relationship between documents and free-form natural-language queries.
    • Built a reverse query generator for Teams personalized search using LSTM, GRU, and NAS-discovered RNNs.
  6. Cortana Core Science Intern

    Microsoft AI & Research

    • Built the first deep-learning architecture in Microsoft Cortana to meet the production speed constraint of 5 ms on CPU.
    • Designed a novel end-to-end log-scaling embeddings + CNN architecture for domain, intent, and slot recognition.
  7. Research Scientist

    BYU Perception, Control & Cognition Lab

    • Won the $250K Amazon Alexa Prize Challenge grant (1-in-9 spot) for our team's research on conversational AI; designed the NLG pipeline for our 2018 entry, EVE.
    • Led safety work on Alexa Prize: developed early embedding-based filters for cleaning unsafe content (NSFW, hate speech) out of pre-training corpora. An early version of what's now standard practice.
    • Won 1st place at the IEEE CIG Artificial Text Player Competition in both 2016 and 2017, plus 2nd place in 2017 with a second model generation.
    • First in the lab to train a deep RL model at scale on the school's supercomputer: pre-PyTorch, all original Lua Torch, on CPU.
    • Awarded a Google Mentored Research Grant (2017) for work on text-based RL.
  8. Undergraduate Researcher · Underwater Robotics

    BYU Office of Research and Creative Activities

    • Wrote the grant proposal, won a $1,500 ORCA hardware grant, and assembled a multidisciplinary team of seven mechanical, electrical, and computer engineering students.
    • Built a functioning robotic finger for underwater use with advisor Mark Killpack.
    • Designed a novel deep-learning method for approximate inverse kinematics on low-cost robotics, learning control from motion-capture data and point clouds.

iii. Publications

  1. What can you do with a rock? Affordance extraction via word embeddings

    N. Fulda, D. Ricks, B. Murdoch, D. Wingate · IJCAI · 2017

    Text-based adventure games are a brutal sparse-reward setting: random search over the noun space is computationally intractable for an RL agent. We scoped the agent's action space using cosine similarity over analogy-derived relationships in a Word2Vec embedding space. The method was general enough to find a relation for any object, and became the basis of our SOTA on text-based RL. Cited by authors at Microsoft Research and Google DeepMind.

    arXiv PDF DOI
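    The core move can be sketched in a few lines: take an analogy offset (a song affords singing), add it to the noun's vector, and rank candidate verbs by cosine similarity to the result. The toy embeddings below are hypothetical hand-picked vectors standing in for a real Word2Vec model, chosen only so the example is self-contained.

    ```python
    import numpy as np

    # Toy 3-d embeddings standing in for Word2Vec vectors (hypothetical values).
    E = {
        "song":  np.array([1.0, 0.0, 0.0]),
        "sing":  np.array([1.0, 0.0, 1.0]),
        "rock":  np.array([0.0, 1.0, 0.0]),
        "throw": np.array([0.0, 1.0, 1.0]),
        "eat":   np.array([1.0, 0.0, 1.0]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def affordance_verb(noun, verbs=("throw", "sing", "eat")):
        # Analogy offset: "a song affords singing" -> what does `noun` afford?
        offset = E["sing"] - E["song"]
        target = E[noun] + offset
        return max(verbs, key=lambda v: cosine(E[v], target))

    print(affordance_verb("rock"))  # -> "throw" with these toy vectors
    ```

    In the paper the candidate set is the full verb vocabulary of a trained embedding model rather than a hand-picked list, which is what makes the method general enough to find a relation for any object.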

  2. Informing action primitives through free-form text

    N. Fulda, B. Murdoch, D. Ricks, D. Wingate · ViGIL Workshop, NeurIPS · 2017 · Spotlight

    A method for adjusting the scope of natural-language action spaces based on free-form human input.

    PDF

  3. Semantically-informed algorithm for extracting interaction modes

    B. Murdoch et al. · KEG Workshop, AAAI · 2018

    A zero-shot classification algorithm for threat detection in games using high-dimensional word embeddings.

  4. Mixed-initiative dialog via structured knowledge graph traversal

    B. Murdoch et al. · Alexa Prize Proceedings · 2018

    EVE, our ensemble of generators built for the Amazon Alexa Prize Challenge.

  5. Getting it right the fourth time: Goal-driven behavior using vector space models

    N. Fulda, B. Murdoch, D. Ricks · IEEE SAMI · 2020

    A framework for communication between humans and reinforcement-learning agents.

    PDF

iv. Projects

  1. EVE

    2018 · Amazon Alexa Prize Challenge

    A social conversational agent built with the BYU PCC Lab for the Alexa Prize Challenge. Our team won the $250K grant and a 1-in-9 spot to compete. I designed the NLG pipeline and led safety work on the dialogue corpus. Coverage in KSL, Fox 13, and BYU News.

  2. CARL

    2017 · Artificial Text Adventurer

    Our text-based RL agent that won 1st place at the IEEE CIG Artificial Text Player Competition in 2016 and 2017. CARL used affordance-based action selection over Word2Vec embeddings to navigate sparse-reward text worlds. Coverage in TechCrunch.

  3. Underwater Robotic Hand

    2016 · BYU ORCA Grant

    Wrote the grant, hired the team, and built a robotic finger for underwater use. The novel piece: a deep-learning method for approximate inverse kinematics on low-cost robotics, trained on motion-capture and point-cloud data.

v. Teaching & Community

  1. Prompt Engineering for Creative Story Writing

    Christa McAuliffe Space Center · 2024

    A workshop on prompt chaining for creative writing and an introduction to LLM-as-judge evaluation.
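    The LLM-as-judge idea from the workshop reduces to a structured grading prompt. A minimal sketch of assembling one is below; the rubric criteria are invented examples, and the actual call to a judge model is deliberately left out.

    ```python
    RUBRIC = ["coherent plot", "vivid imagery", "consistent voice"]  # example criteria

    def build_judge_prompt(story: str, rubric=RUBRIC) -> str:
        """Assemble an LLM-as-judge prompt asking for a 1-5 score per criterion."""
        criteria = "\n".join(f"- {c}" for c in rubric)
        return (
            "You are grading a short story. Score each criterion from 1 to 5 "
            "and end with a line 'TOTAL: <sum>'.\n\n"
            f"Criteria:\n{criteria}\n\nStory:\n{story}\n"
        )

    prompt = build_judge_prompt("Once upon a time...")
    # The prompt is then sent to the judge model; parsing its 'TOTAL:' line
    # yields a numeric score for comparing prompt-chain variants.
    ```

    Fixing the rubric and output format is what makes scores comparable across runs, which is the whole point of using a model as a judge.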

  2. Armorblox Deep Learning Mentorship Program

    Armorblox · 2022

    Designed and led the program. Taught deep-learning fundamentals and PyTorch to software engineers and data analysts, retraining the cohort to take on junior ML-engineering responsibilities.

  3. Generative Email Attacks and How to Defend Against Them

    RSA Conference (RSAC) · 2022

    Invited talk on the AI & ML track. Demonstrated how the GPT-3 API could enable large-scale whaling attacks, and presented mitigations.

  4. Deep Learning Reading Group

    Armorblox · 2021–2022

    Organized and ran a weekly reading group covering transformer architectures, retrieval, and training infrastructure.

  5. Mechatronics & C++ for Kids

    Three Utah Valley schools · 2016

    Founded and ran robotics clubs at three schools: Walden School of Liberal Arts, Renaissance Academy at the Christa McAuliffe Space Education Center, and Greenwood Elementary in American Fork. Taught Arduinos, mechatronics, and basic C++. The favorite memory: a Mario Kart competition where the kids built and drove their own cars, with balloons on the back to pop.

  6. Flight Director

    Christa McAuliffe Space Education Center · 2008–2013

    Long before the AI work: ran the stealth shuttle craft simulator, teaching kids to coordinate under simulated mission pressure. I came back in 2024 to teach prompt engineering at the same place (one of those rare full-circle moments).

vi. Contact

I'm always open to thoughtful conversations about AI safety, agent evaluation, and how to build language systems people can actually trust.