Ben Murdoch

Senior Research Engineer at Google Labs. Working on agent architecture, evaluation, and the responsible-AI launch path for language applications.

San Francisco, CA

i. About

I'm a researcher and engineer working at the intersection of large language models, agentic systems, and AI safety. At Google Labs I work on agent evaluation, safety, and architecture for language applications: designing benchmarks and autoraters, finding launch-blocking exploits, and building agents that reason their way through real, multi-hop problems.

I've spent the last eleven years working across robotics, reinforcement learning, and language models. Along the way I've helped ship some of the largest GPT-2 deployments into production; built the first product to do a billion inferences on GPT-3 at Latitude, where I was a founding-team member; built the first deep-learning architecture in Microsoft Cortana to meet its 5 ms CPU budget; and led the safety launch path for code-assist features used inside Colab, Android Studio, and CodeTips on google.com.

I want to spend the next decade making sure the fourth industrial revolution goes well for life on Earth. In practice that means caring as much about red-teaming and rater agreement as I do about model architecture, and talking to users early and often, because the systems we build only matter if real people can rely on them. The era of self-improving products is here, and I want to help point them in the right direction.

ii. Experience

  1. Google

    Google Labs

    1. Senior Research Engineer

      Mar 2025 — Present

      Project CC · Language Applications

      • Founding member of Project CC, a proactive, helpful AI assistant. Built the initial user-modeling layer: preference extraction, to-do detection, and the day-to-day prioritization that drives the flagship Your Day Ahead feature.
      • Enabled Google Workspace function calling for the agent and made tool-calling reliable across the Google surface area.
      • Currently working on agent evaluation and quality hill-climbing, including using our autorater for recursive self-improvement of agent architectures and prompts.
      • Validating agent reliability inside the sandboxed runtime environment.
    2. Research SDE, Search Agent

      Jan 2024 — Mar 2025

      Language Applications

      • Received an Outstanding Impact rating (top 22% of Googlers) for work on agent evaluation and architectures.
      • Designed novel agent architectures for multi-hop question answering, plus the autoraters used to evaluate them. The autoraters were upstreamed to One Recipe as tracking evals for Gemini checkpoint quality on multi-hop QA and web search.
      • Built a multi-modal hierarchical planning agent that achieved a new SOTA (+4.7%) on internal search benchmarks.
      • Set internal SOTA for social-engineering attack classification using a self-attention JAX DNN discovered via architecture search.
      • Built the first multi-hop RAG agent for NotebookLM and exposed Notebook's internal APIs as MCP servers for agent loops to use.
      • Collaborated with Google DeepMind to enable RL hill-climbing for multi-hop search in base tool-use Gemini.
    3. Research SDE, AIDA Code Generation

      Jan 2023 — Jan 2024

      AIDA · Code Generation · Magi

      • Trained foundation models for code completion shipped in Colab and Android Studio.
      • Led safety for the 2023 Google I/O launch of CodeTips on google.com. Identified launch-blocking exploits, partnered with RAI to retrain code-domain classifiers, and tuned precision-recall thresholds for production.
      • Worked with the Google Search Magi team on prompt improvements and on distilling large transformer models into deployable form.
      • Redesigned the team's annotation process, raising inter-rater agreement on human evaluation.
  2. Machine Learning Engineer

    Armorblox

    • Built classifiers for generative email attacks: extortion, romance fraud, and other social-engineering scams.
    • Pioneered the use of attention-activation visualization inside the company. The system surfaced why a classifier fired by highlighting the spans it attended to, so customer security teams could review and give feedback.
    • Built infrastructure for cluster-level hyperparameter tuning of RoBERTa with Ray Tune, Kubernetes, and Comet ML.
    • Improved production highlighting accuracy by 25% via a RoBERTa-based SQuAD2.0 extractive summarization model.
    • Designed an end-to-end text-classification pipeline that improved architecture-search speed by 4× across RoBERTa, DeBERTa, ALBERT, and GPT-2.
    • Sped up maximum inner-product search over MPNet document embeddings by over 4000× using FAISS.
    • First in the company to adopt JAX for production neural networks.
    • Spoke at RSAC 2022 on “Generative Email Attacks and How to Defend Against Them.”
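The inner-product search mentioned above boils down to finding, for each query embedding, the corpus rows with the largest dot product. A minimal NumPy sketch of that baseline follows; the corpus and queries are random stand-in data, and the dimensions are illustrative, not the production values. FAISS's `IndexFlatIP` computes the same top-k exactly, and its approximate indexes are where the large speedups come from.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                              # embedding dim (stand-in)
corpus = rng.normal(size=(10_000, d)).astype("float32")
queries = rng.normal(size=(5, d)).astype("float32")

def mips(corpus, queries, k=3):
    """Brute-force maximum inner-product search: top-k corpus rows per query."""
    scores = queries @ corpus.T                     # (n_queries, n_docs)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]   # unordered top-k
    # Sort the k candidates per query by descending score.
    order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)

ids = mips(corpus, queries)
# In production the same query is served by an index instead, e.g.:
#   index = faiss.IndexFlatIP(d); index.add(corpus); _, ids = index.search(queries, k)
```

The brute-force matmul is exact but scales linearly with corpus size per query; FAISS trades a one-time indexing cost for sublinear search.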
  3. Research Scientist · Founding Team

    Latitude (AI Dungeon)

    • Built one of the world's largest production GPT-2 deployments, with over 600 million model calls per month on GPT-2 XL.
    • One of the first companies in OpenAI's GPT-3 API beta cohort, and likely the first to cross one billion GPT-3 API calls.
    • Led safety research, exploring methods to reduce harassment and unwanted sexual encounters in-game.
    • Reduced model-serving costs by 65% (~$200K/year) by migrating the production GPU cluster from AWS to CoreWeave.
    • Designed an end-to-end AutoML pipeline for T5 used across the product for tasks like quest detection and fine-grained classification.
    • Built a community-driven Mechanical-Turk equivalent inside the product, collecting 35K+ free labels from users.
  4. Applied Science Intern

    Amazon Alexa AI

    • Trained a billion-parameter BERT-style domain-specific conversational language model for Alexa.
    • Studied the drawbacks of repeated transfer learning on large language models.
  5. Intern, Substrate

    Microsoft

    • Used Bing query data to learn the relationship between documents and free-form natural-language queries.
    • Built a reverse query generator for Teams personalized search using LSTM, GRU, and NAS-discovered RNNs.
  6. Cortana Core Science Intern

    Microsoft AI & Research

    • Built the first deep-learning architecture in Microsoft Cortana to meet the production speed constraint of 5 ms on CPU.
    • Designed a novel end-to-end log-scaling embeddings + CNN architecture for domain, intent, and slot recognition.
  7. Research Scientist

    BYU Perception, Control & Cognition Lab

    • Won the $250K Amazon Alexa Prize Challenge grant (1-in-9 spot) for our team's research on conversational AI; designed the NLG pipeline for our 2018 entry, EVE.
    • Led safety work on Alexa Prize: developed early embedding-based filters for cleaning unsafe content (NSFW, hate speech) out of pre-training corpora. An early version of what's now standard practice.
    • Won 1st place at the IEEE CIG Artificial Text Player Competition in both 2016 and 2017, plus 2nd place in 2017 with a second model generation.
    • First in the lab to train a deep RL model at scale on the school's supercomputer: pre-PyTorch, all original Lua Torch, on CPU.
    • Awarded a Google Mentored Research Grant (2017) for work on text-based RL.
  8. Undergraduate Researcher · Underwater Robotics

    BYU Office of Research and Creative Activities

    • Wrote the grant proposal, won a $1,500 ORCA hardware grant, and assembled a multidisciplinary team of seven mechanical, electrical, and computer engineering students.
    • Built a functioning robotic finger for underwater use with advisor Mark Killpack.
    • Designed a novel deep-learning method for approximate inverse kinematics on low-cost robotics, learning control from motion-capture data and point clouds.

iii. Publications

  1. What can you do with a rock? Affordance extraction via word embeddings

    N. Fulda, D. Ricks, B. Murdoch, D. Wingate · IJCAI · 2017

    Text-based adventure games are a brutal sparse-reward setting: random search over the noun space is computationally intractable for an RL agent. We scoped the agent's action space using cosine similarity over analogy-derived relationships in a Word2Vec embedding space. The method was general enough to find a relation for any object, and became the basis of our SOTA on text-based RL. Cited by authors at Microsoft Research and Google DeepMind.

    arXiv PDF DOI
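    The core move can be sketched in a few lines: take an analogy offset (a song affords singing), add it to the noun's vector, and rank candidate verbs by cosine similarity to the result. The toy embeddings below are hypothetical hand-picked vectors standing in for a real Word2Vec model, chosen only so the example is self-contained.

    ```python
    import numpy as np

    # Toy 3-d embeddings standing in for Word2Vec vectors (hypothetical values).
    E = {
        "song":  np.array([1.0, 0.0, 0.0]),
        "sing":  np.array([1.0, 0.0, 1.0]),
        "rock":  np.array([0.0, 1.0, 0.0]),
        "throw": np.array([0.0, 1.0, 1.0]),
        "eat":   np.array([1.0, 0.0, 1.0]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def affordance_verb(noun, verbs=("throw", "sing", "eat")):
        # Analogy offset: "a song affords singing" -> what does `noun` afford?
        offset = E["sing"] - E["song"]
        target = E[noun] + offset
        return max(verbs, key=lambda v: cosine(E[v], target))

    print(affordance_verb("rock"))  # -> "throw" with these toy vectors
    ```

    In the paper the candidate set is the full verb vocabulary of a trained embedding model rather than a hand-picked list, which is what makes the method general enough to find a relation for any object.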

  2. Informing action primitives through free-form text

    N. Fulda, B. Murdoch, D. Ricks, D. Wingate · ViGIL Workshop, NeurIPS · 2017 · Spotlight

    A method for adjusting the scope of natural-language action spaces based on free-form human input.

    PDF

  3. Semantically-informed algorithm for extracting interaction modes

    B. Murdoch et al. · KEG Workshop, AAAI · 2018

    A zero-shot classification algorithm for threat detection in games using high-dimensional word embeddings.

  4. Mixed-initiative dialog via structured knowledge graph traversal

    B. Murdoch et al. · Alexa Prize Proceedings · 2018

    EVE, our ensemble of generators built for the Amazon Alexa Prize Challenge.

  5. Getting it right the fourth time: Goal-driven behavior using vector space models

    N. Fulda, B. Murdoch, D. Ricks · IEEE SAMI · 2020

    A framework for communication between humans and reinforcement-learning agents.

    PDF

iv. Projects

  1. EVE

    2018 · Amazon Alexa Prize Challenge

    A social conversational agent built with the BYU PCC Lab for the Alexa Prize Challenge. Our team won the $250K grant and a 1-in-9 spot to compete. I designed the NLG pipeline and led safety work on the dialogue corpus. Coverage in KSL, Fox 13, and BYU News.

  2. CARL

    2017 · Artificial Text Adventurer

    Our text-based RL agent that won 1st place at the IEEE CIG Artificial Text Player Competition in 2016 and 2017. CARL used affordance-based action selection over Word2Vec embeddings to navigate sparse-reward text worlds. Coverage in TechCrunch.

  3. Underwater Robotic Hand

    2016 · BYU ORCA Grant

    Wrote the grant, hired the team, and built a robotic finger for underwater use. The novel piece: a deep-learning method for approximate inverse kinematics on low-cost robotics, trained on motion-capture and point-cloud data.

v. Teaching & Community

  1. Prompt Engineering for Creative Story Writing

    Christa McAuliffe Space Center · 2024

    A workshop on prompt chaining for creative writing and an introduction to LLM-as-judge evaluation.
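    The LLM-as-judge idea from the workshop reduces to a structured grading prompt. A minimal sketch of assembling one is below; the rubric criteria are invented examples, and the actual call to a judge model is deliberately left out.

    ```python
    RUBRIC = ["coherent plot", "vivid imagery", "consistent voice"]  # example criteria

    def build_judge_prompt(story: str, rubric=RUBRIC) -> str:
        """Assemble an LLM-as-judge prompt asking for a 1-5 score per criterion."""
        criteria = "\n".join(f"- {c}" for c in rubric)
        return (
            "You are grading a short story. Score each criterion from 1 to 5 "
            "and end with a line 'TOTAL: <sum>'.\n\n"
            f"Criteria:\n{criteria}\n\nStory:\n{story}\n"
        )

    prompt = build_judge_prompt("Once upon a time...")
    # The prompt is then sent to the judge model; parsing its 'TOTAL:' line
    # yields a numeric score for comparing prompt-chain variants.
    ```

    Fixing the rubric and output format is what makes scores comparable across runs, which is the whole point of using a model as a judge.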

  2. Armorblox Deep Learning Mentorship Program

    Armorblox · 2022

    Designed and led the program. Taught deep-learning fundamentals and PyTorch to software engineers and data analysts, retraining the cohort to take on junior ML-engineering responsibilities.

  3. Generative Email Attacks and How to Defend Against Them

    RSA Conference (RSAC) · 2022

    Invited talk on the AI & ML track. Demonstrated how the GPT-3 API could enable large-scale whaling attacks, and presented mitigations.

  4. Deep Learning Reading Group

    Armorblox · 2021–2022

    Organized and ran a weekly reading group covering transformer architectures, retrieval, and training infrastructure.

  5. Mechatronics & C++ for Kids

    Three Utah Valley schools · 2016

    Founded and ran robotics clubs at three schools: Walden School of Liberal Arts, Renaissance Academy at the Christa McAuliffe Space Education Center, and Greenwood Elementary in American Fork. Taught Arduinos, mechatronics, and basic C++. The favorite memory: a Mario Kart competition where the kids built and drove their own cars, with balloons on the back to pop.

  6. Flight Director

    Christa McAuliffe Space Education Center · 2008–2013

    Long before the AI work: ran the stealth shuttle craft simulator, teaching kids to coordinate under simulated mission pressure. I came back in 2024 to teach prompt engineering at the same place (one of those rare full-circle moments).

vi. Contact

I'm always open to thoughtful conversations about AI safety, agent evaluation, and how to build language systems people can actually trust.