Case Study

How Laude Institute Scales AI Agent Benchmarking With Daytona

37k

Sandboxes created in one week with Daytona

100x

Experiment throughput compared to Docker

30%

Faster task completion with optimized cloud orchestration

Laude Institute supports AI researchers with funding and operational help. Co-founded by Andy Konwinski, it’s guided by AI leaders including Jeff Dean and David Patterson. Its investment arm, Laude Ventures, invests $150M to turn research into scalable companies.

Headquarters

San Francisco, CA

Industry

Computer Science

Department

Research and Development, Engineering and Operations

Key Features

Sandbox Creation Speed, Sandbox Statefulness, Long-Running Sandboxes

Laude.org

Learn how this leading AI research institute partnered with Daytona to replace their local sandbox infrastructure with a managed runtime that provisions thousands of environments on demand.

“Daytona enabled us to scale our AI evaluation framework from running 4-6 containers locally to thousands in parallel across the cloud. It completely transformed how we support AI labs benchmarking their agents.”

Alex Shaw

Founding Member of Technical Staff at Laude Institute

01 -- CHALLENGE

Scaling AI Agent Experiments Beyond Local Machine Capacity

When Laude Institute launched Terminal Bench, a joint project with Stanford University that benchmarks AI agents' terminal mastery, it quickly became a key performance metric for frontier labs like Anthropic, OpenAI, and Google. But as adoption surged ahead of infrastructure capacity, running thousands of parallel agent-model experiments created compute and orchestration demands that Laude wasn’t yet equipped to support.

Initially, Laude’s Founding Member of Technical Staff, Alex Shaw, conducted experiments using Docker sandboxes on Laude’s local machines. This setup supported only four to six containers at a time. CPU, memory, and I/O constraints made it impractical for running the thousands of tests that Terminal Bench tasks require.

Sandbox volume was only part of the equation. Each task had unique dependencies and specifications, such as custom frameworks and CLIs, calling for clean sandboxes provisioned on demand. Experiments ran for hours or days at a time, so the sandboxes needed reliable uptime. At this scale, provisioning speed was also essential to keep testing cycles on track and increase throughput.

None of these capabilities could come at the expense of benchmark accuracy. To give researchers valid results and keep unpredictable agent behavior contained, each environment had to run securely and behave the same way it did on Laude’s local machines.

Building and maintaining infrastructure that met all these demands was a non-starter. Managing containerization and cloud provider intricacies would divert resources away from key research initiatives. Plus, Terminal Bench’s async Python architecture required careful integration to avoid pre-configuration overhead. So Alex set out to find a managed cloud runtime that could create thousands of isolated, parallel environments.

While several solutions appeared promising on the surface, each had critical limitations. Some installed pre-configured dependencies that compromised unique task execution, others required manual container configuration, and all offered minimal customer support.

That’s when Alex discovered Daytona. Their agent-native runtime platform was exactly what he needed to support AI research teams relying on Terminal Bench.

“We were bottlenecked by the resource constraints of our local computers. We could only run four to six Docker containers simultaneously. For research teams running thousands of experiments, that wasn’t scalable.”

Alex Shaw

Founding Member of Technical Staff at Laude Institute

02 -- SOLUTION

A Managed Runtime Platform That Provisions Thousands of Parallel Containers On-Demand

Daytona provided clear documentation and an async Python SDK with type hints, enabling smooth integration into Terminal Bench’s existing workflows.
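Terminal Bench's integration code is not shown in this case study. As a rough sketch of the pattern an async SDK makes possible, the snippet below fans out many tasks while capping how many run at once, using only Python's standard asyncio library. The run_task_in_sandbox coroutine is a hypothetical stand-in for creating a sandbox, running an agent in it, and tearing it down. It is not a Daytona SDK call, and the task names and concurrency limit are assumptions.

```python
import asyncio


async def run_task_in_sandbox(task_id: str) -> dict:
    """Hypothetical stand-in: a real version would create a sandbox,
    run the agent inside it, and delete the sandbox afterwards."""
    await asyncio.sleep(0.01)  # simulates remote work
    return {"task_id": task_id, "passed": True}


async def run_all(task_ids: list[str], max_parallel: int = 100) -> list[dict]:
    """Run every task, keeping at most max_parallel in flight at once."""
    limit = asyncio.Semaphore(max_parallel)

    async def guarded(task_id: str) -> dict:
        async with limit:
            try:
                return await run_task_in_sandbox(task_id)
            except Exception as exc:
                # One failed sandbox should not abort the whole batch.
                return {"task_id": task_id, "passed": False, "error": str(exc)}

    # gather preserves input order, so results line up with task_ids.
    return await asyncio.gather(*(guarded(t) for t in task_ids))


if __name__ == "__main__":
    results = asyncio.run(run_all([f"task-{i}" for i in range(500)]))
    print(len(results), "tasks finished")
```

The semaphore is the piece that changes between environments: on a laptop the cap might be four to six, while a managed runtime lets it rise into the thousands without touching the rest of the code.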

With Daytona’s managed runtime, Alex and other leading AI lab teams now provision long-running, reproducible sandboxes at scale. For tasks with unique specifications, research teams submit Dockerfiles to Daytona's Declarative Image Builder, which builds and executes the container on demand. If the same task needs to run repeatedly for agent-model combination testing or model reinforcement learning, Daytona's Snapshots feature enables faster sandbox setup through pre-configured templates.
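One way to picture the split between on-demand builds and reusable templates: build an environment the first time a task's Dockerfile is seen, then reuse the prepared result on later runs. The sketch below models that idea with a content hash as the cache key. The TemplateCache class and its _build method are hypothetical placeholders for illustration. They are not Daytona's API and do not describe how Daytona implements Snapshots.

```python
import hashlib


class TemplateCache:
    """Maps Dockerfile contents to a prepared template, building once per unique file."""

    def __init__(self) -> None:
        self._templates: dict[str, str] = {}
        self.builds = 0  # how many slow builds were actually triggered

    def _build(self, key: str) -> str:
        # Hypothetical placeholder for a slow, on-demand image build.
        self.builds += 1
        return f"template-{key[:12]}"

    def template_for(self, dockerfile: str) -> str:
        """Return the template for this Dockerfile, building only on first use."""
        key = hashlib.sha256(dockerfile.encode("utf-8")).hexdigest()
        if key not in self._templates:
            self._templates[key] = self._build(key)
        return self._templates[key]
```

Under this model, running one task against many agent-model combinations triggers a single build, and every later run starts from the cached template.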

Isolation is foundational to this workflow. Each sandbox ensures code runs securely without risk to the underlying infrastructure. As a result, Terminal Bench users safely test how agents execute bash commands, run shell scripts, and manipulate files. Supported by Daytona’s parallelization and optimized container orchestration, these isolated tests also run at unprecedented speed, enabling rapid iteration on agent development.

Beyond these technical benefits, Alex discovered an unexpected workflow advantage: Daytona's directory upload and download capabilities. Through the platform’s filesystem API, he exports all experiment outputs and logs, evaluating results at scale to discover deeper patterns across agents’ terminal performance.
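Once experiment outputs are downloaded to a local directory, evaluating them at scale can be as simple as walking the tree and aggregating. The sketch below assumes a hypothetical layout in which each run directory contains a result.json file with "agent" and "passed" fields. The case study does not describe Terminal Bench's actual output format, so treat the file name and field names as placeholders.

```python
import json
from collections import defaultdict
from pathlib import Path


def pass_rates(output_root: Path) -> dict[str, float]:
    """Compute the fraction of passed runs per agent from result.json files."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for result_file in output_root.rglob("result.json"):
        record = json.loads(result_file.read_text(encoding="utf-8"))
        agent = record["agent"]  # assumed field name
        total[agent] += 1
        if record["passed"]:  # assumed boolean field
            passed[agent] += 1
    return {agent: passed[agent] / total[agent] for agent in total}
```

Calling pass_rates(Path("outputs")) on such a directory returns a mapping from agent name to pass rate, which is a starting point for comparing agents across tasks.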

And while the platform exceeded expectations, Daytona's customer support sealed the partnership. With direct Slack access to Daytona's engineers, Alex receives quick, hands-on assistance whenever he has a question or needs to troubleshoot a networking issue. The team even made SDK and infrastructure adjustments tailored to Terminal Bench's specific use cases, ensuring high experiment success rates with minimal lift from Alex and his team.

“Daytona was the only provider that could instantly build unique environments at scale. We ran nearly 40,000 sandboxes in a week.”

Alex Shaw

Founding Member of Technical Staff at Laude Institute

03 -- RESULT

Terminal Bench Runs 100x More Experiments With Daytona

With Daytona, Laude scaled Terminal Bench's infrastructure to match its rapid industry adoption. AI research teams worldwide now run thousands of experiments in parallel, validating their agents faster to accelerate breakthroughs and drive innovation across industries.

37k sandboxes created in one week with Daytona
100x experiment throughput compared to Docker
30% faster task completion with optimized cloud orchestration

With Terminal Bench 2.0 on the horizon, Daytona will be a first-class integration from day one. Alex is particularly excited to see how the platform will help drive adoption for researchers focused on training models through reinforcement learning and prompt optimization.