Testing code you didn't write: the trust framework for agentic development

When an agent generates a non-trivial feature quickly, how do you know it's correct? Here's the review and testing framework I use to make generated code production-worthy.


Here’s a true, illustrative story. I asked an agent to write a caching layer for user profiles. The specification was clear: cache profiles in memory with a 5-minute TTL, evict on memory warnings, thread-safe via actor isolation. The agent produced 120 lines of clean Swift. The code compiled. The tests passed — all fourteen of them.

Imagine shipping this. Four days later, users report that profile changes aren’t showing up. Not a stale-data bug in the traditional sense — the cache is working too well. The agent had implemented the TTL check correctly, but it cached the profile immediately after a successful edit. The user would update their name, the edit would succeed, the response would be cached, and then the profile screen would show the cached old profile because the cache key was userId, and the cache already had an entry for that key with time remaining.

The bug was architectural: the cache should have been invalidated on a successful edit, or the edit response should have been used to update the cache entry. The agent didn’t implement either because it wasn’t specified. The tests didn’t catch it because the tests validated the cache’s behaviour in isolation.
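The missing piece is small enough to sketch. Everything below (ProfileCache, Profile, didSaveEdit) is a hypothetical reconstruction for illustration, not the actual feature code; it shows both fixes side by side:

```swift
import Foundation

// Hypothetical reconstruction of the caching layer; ProfileCache, Profile, and
// didSaveEdit are illustrative names, not the production code.
struct Profile: Equatable {
    let id: String
    var name: String
}

actor ProfileCache {
    private var entries: [String: (profile: Profile, expiry: Date)] = [:]
    private let ttl: TimeInterval = 300 // the spec's 5-minute TTL

    func get(_ id: String, now: Date = .now) -> Profile? {
        guard let entry = entries[id], entry.expiry > now else { return nil }
        return entry.profile
    }

    func set(_ profile: Profile, now: Date = .now) {
        entries[profile.id] = (profile, now.addingTimeInterval(ttl))
    }

    func invalidate(_ id: String) {
        entries[id] = nil
    }
}

// The unspecified requirement: after a successful edit, either invalidate the
// stale entry (option A) or overwrite it with the edit response (option B).
// Doing both is cheap and makes the ordering explicit.
func didSaveEdit(_ updated: Profile, cache: ProfileCache) async {
    await cache.invalidate(updated.id) // option A: never serve the pre-edit entry
    await cache.set(updated)           // option B: the response becomes the new entry
}
```

With this in place, the profile screen can never observe the pre-edit entry, TTL or no TTL.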

A bug like this takes two minutes to fix, but four days to discover. It teaches the most important lesson of agentic development: the testing problem isn’t code coverage. It’s interaction coverage.

The Trust Spectrum

Not all agent-generated code requires the same level of scrutiny. I’ve developed a mental framework with three tiers:

Tier 1: Mechanical Code (Trust After Glance)

Code where correctness is deterministic and verification is instant:

  • Codable conformances
  • Equatable / Hashable implementations
  • Boilerplate reducers with no business logic
  • Standard SwiftUI view layouts from design specs
  • Unit test setup/teardown

I scan this code for about 10 seconds. If the structure matches what I expected, I move on. Agents are consistently strong at mechanical code because the solution space is narrow.
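For calibration, here is the kind of Tier 1 artifact I mean — a made-up UserBadge type whose conformances are fully determined by its stored properties:

```swift
import Foundation

// Tier 1: mechanical code. The conformances below are synthesized entirely from
// the stored properties, so a 10-second structural scan is enough.
// UserBadge is an invented example type.
struct UserBadge: Codable, Equatable, Hashable {
    let id: String
    let title: String
    let earnedAt: Date
}
```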

Tier 2: Logic Code (Trust After Review)

Code where correctness depends on business rules:

  • Reducer logic (state transitions based on actions)
  • Validation functions
  • Data transformation and mapping
  • API request construction
  • Error handling and recovery flows

I review this line by line. I check edge cases. I read the tests to verify they cover the cases I’d worry about. This takes 5-15 minutes per feature, which is still dramatically faster than writing it from scratch.

Tier 3: System-Interaction Code (Trust After Integration Testing)

Code where correctness depends on behaviour in context:

  • Caching strategies
  • Concurrency patterns (actor isolation, task cancellation)
  • Navigation flows that span multiple features
  • Data persistence and migration
  • Anything involving timing, ordering, or resource lifecycle

This is where agents burn you. The code looks right. The unit tests pass. But the interactions with the rest of the system produce bugs that only surface under real usage patterns. I do not ship Tier 3 agent-generated code without dedicated integration tests.

My Review Checklist

I’ve formalized my review process into a checklist that I go through for every agent-generated PR. Not because I enjoy process — because the bugs I’ve missed all fell into the same categories:

1. Dependency Verification

Does the code use the correct existing dependencies, or did it create new ones?

The most common agent failure: creating a new NetworkClient when AuthenticatedNetworkClient already exists three modules away. Or instantiating JSONDecoder() inline instead of using the project’s configured decoder that handles date formatting and snake_case key mapping.

// 🔍 Check: is this the right decoder for this endpoint?
let decoder = JSONDecoder() // Agent default

// ✅ Should be:
let decoder = APIDecoder.shared // Project convention with dateDecodingStrategy

This class of bug is invisible to tests because both decoders parse the test fixtures correctly — the test fixtures use ISO 8601 dates, which both handle. Production responses use a different date format that only the configured decoder handles.
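A configured decoder like APIDecoder.shared takes only a few lines to centralise. The concrete strategies below are assumptions for illustration — your project's date format will differ:

```swift
import Foundation

// Sketch of a project-convention decoder: configured once, shared everywhere.
// The snake_case mapping and date format are illustrative assumptions.
enum APIDecoder {
    static let shared: JSONDecoder = {
        let decoder = JSONDecoder()
        decoder.keyDecodingStrategy = .convertFromSnakeCase
        // Assume production responses use a non-ISO-8601 format like this one.
        let formatter = DateFormatter()
        formatter.dateFormat = "yyyy-MM-dd HH:mm:ss"
        formatter.locale = Locale(identifier: "en_US_POSIX")
        formatter.timeZone = TimeZone(identifier: "UTC")
        decoder.dateDecodingStrategy = .formatted(formatter)
        return decoder
    }()
}
```

An inline JSONDecoder() parses ISO 8601 test fixtures happily and then fails on this production format — exactly the gap that unit tests can't see.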

2. Concurrency Correctness

Does the code respect the project’s concurrency patterns?

Agents love Task { }. Unstructured, fire-and-forget, no cancellation handling. I’ve covered this in depth in my concurrency post, but it bears repeating in the agent context: every Task { } in agent-generated code is suspect until proven otherwise.

I check:

  • Is there a reason this needs to be unstructured? (Usually not — .task { } modifier is almost always better)
  • Is cancellation handled? (try Task.checkCancellation() in loops)
  • Does the task capture self or dependencies with appropriate lifetime semantics?
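Putting those three checks together, this is the shape I want to see. ItemListView and loadItems are made-up names for the sketch:

```swift
import SwiftUI

// Cancellation-aware work: the shape I look for in review.
// loadItems is a made-up loader; the paging is a stand-in for real work.
func loadItems() async throws -> [String] {
    var result: [String] = []
    for page in 0..<5 {
        try Task.checkCancellation() // cooperative cancellation inside the loop
        result.append("page-\(page)")
    }
    return result
}

struct ItemListView: View {
    @State private var items: [String] = []

    var body: some View {
        List(items, id: \.self) { Text($0) }
            // .task ties the work to the view's lifetime: it is cancelled
            // automatically on disappear, with no unstructured Task { } and
            // no captured self to manage.
            .task {
                do {
                    items = try await loadItems()
                } catch is CancellationError {
                    // The view went away mid-load; nothing to show.
                } catch {
                    items = []
                }
            }
    }
}
```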

3. State Ownership

Who owns this state, and is it the right owner?

Agents create new @State properties liberally. Sometimes the state should live in the parent, or in a shared store. The agent doesn’t know your state hierarchy — it knows the file it’s editing.

// 🔍 Agent created local state for something that should be shared
struct ProfileView: View {
    @State private var isFollowing = false  // Agent default
    
    // ✅ Should be derived from the shared social graph store
    // var isFollowing: Bool { socialGraph.isFollowing(userId) }
}

4. Edge Case Enumeration

Did the agent handle the edges that matter for this specific feature?

Agents handle textbook edge cases well: empty arrays, nil optionals, network errors. They handle domain-specific edge cases poorly: what happens when the user’s subscription expires mid-session? What happens when the server returns a 200 with a soft-error in the response body? What happens when the user has 10,000 items instead of 10?

An experienced engineer maintains a list of domain-specific edge cases and checks every one of them when reviewing agent code. The agent has no intuition for which edge cases matter, but an engineer does, because they understand the users who hit every edge case imaginable.
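One of those domain edges — the 200-with-soft-error response — is worth making concrete. The envelope shape below is an assumed example; real APIs vary:

```swift
import Foundation

// Hypothetical response envelope: HTTP 200, but the body can still carry a
// failure. Agents typically decode the success shape and never check the
// error fields.
struct Envelope<Payload: Decodable>: Decodable {
    let data: Payload?
    let errorCode: String?
    let errorMessage: String?
}

enum SoftError: Error, Equatable {
    case server(code: String, message: String)
    case missingData
}

// Force every call site through the soft-error check before touching the data.
func unwrap<Payload>(_ envelope: Envelope<Payload>) throws -> Payload {
    if let code = envelope.errorCode {
        throw SoftError.server(code: code, message: envelope.errorMessage ?? "")
    }
    guard let data = envelope.data else { throw SoftError.missingData }
    return data
}
```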

The Testing Strategy That Works

After extended agent-assisted development, here’s the testing approach I’ve landed on:

Let the Agent Write Unit Tests (Then Add Yours)

Agent-generated unit tests are surprisingly good at covering the happy path and standard error cases. They’re bad at covering interaction points and temporal edge cases. I let the agent write the initial test suite, then I add:

  1. Interaction tests: What happens when Feature A’s output feeds into Feature B?
  2. Temporal tests: What happens when operations complete in an unexpected order?
  3. Resource lifecycle tests: What happens when the object is deallocated during an async operation?

@Test
func profileEdit_invalidatesCache_beforeFetchingUpdatedProfile() async throws {
    // The agent wrote tests for cache read/write/evict independently.
    // This test covers the INTERACTION between edit and cache:
    
    let cache = TestProfileCache()
    let store = TestStore(
        initialState: ProfileFeature.State(profile: .mock)
    ) {
        ProfileFeature()
    } withDependencies: {
        $0.profileCache = cache
        $0.profileClient.updateProfile = { _ in .updatedMock }
    }
    // Ordering is asserted via the cache log below, not via received actions.
    store.exhaustivity = .off
    
    // Pre-populate cache
    await cache.set(.mock, for: "user-1")
    
    // Edit profile
    await store.send(.editProfile(.nameChanged("New Name"))) {
        $0.isEditing = true
    }
    await store.send(.editProfile(.saveTapped))
    // Let the save effect run to completion before inspecting the log.
    await store.finish()
    
    // Verify cache was invalidated BEFORE the new profile was cached
    let cacheLog = await cache.operationLog
    let invalidateIndex = try #require(cacheLog.firstIndex(of: .remove("user-1")))
    let setCacheIndex = try #require(cacheLog.firstIndex(of: .set("user-1")))
    #expect(invalidateIndex < setCacheIndex)
}

This test catches the exact bug from my opening story. It’s the kind of test an agent won’t write because it requires understanding how two features interact — knowledge that lives in the system design, not in any single file.

Snapshot Testing for UI Regressions

Agent-generated SwiftUI views are generally correct, but they often have subtle layout differences from what I’d build: different padding values, missing alignment modifiers, defaulting to .body font where the design system specifies .callout. These aren’t bugs — they’re taste differences that compound into a visually inconsistent app.

I use snapshot tests for every agent-generated view. The first render becomes the reference snapshot (after I’ve approved the layout), and any future changes that alter the visual output trigger a test failure with a diff image. This catches drift.
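The whole practice is a few lines per view. This sketch assumes Point-Free's swift-snapshot-testing package (the article's actual tool is an assumption), and BadgeRow is a stand-in for any agent-generated view:

```swift
import SnapshotTesting
import SwiftUI
import XCTest

// Stand-in for an agent-generated view under snapshot test.
struct BadgeRow: View {
    var body: some View {
        Label("Early Adopter", systemImage: "star.fill")
            .font(.callout) // the design-system choice an agent might miss
            .padding(12)
    }
}

final class BadgeRowSnapshotTests: XCTestCase {
    func testBadgeRow_defaultLayout() {
        // First run records the reference image (approve the layout manually);
        // later runs fail with a diff image if the rendering drifts.
        assertSnapshot(
            of: BadgeRow(),
            as: .image(layout: .fixed(width: 320, height: 64))
        )
    }
}
```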

Linting as Architecture Enforcement

SwiftLint custom rules are an underappreciated tool for agent-generated code. I’ve added rules for the specific anti-patterns I keep seeing:

custom_rules:
  no_inline_json_decoder:
    regex: 'JSONDecoder\(\)'
    message: "Use APIDecoder.shared instead of inline JSONDecoder()"
    severity: error
    
  no_unstructured_task:
    regex: 'Task\s*\{'
    message: "Prefer .task modifier over unstructured Task. Disable with swiftlint:disable:next if intentional."
    severity: warning
    
  no_direct_userdefaults:
    regex: 'UserDefaults\.standard'
    message: "Use AppSettings dependency instead of direct UserDefaults access"
    severity: error

A note on the no_unstructured_task rule: SwiftLint custom rules are line-based regex matches, so this flags all Task { usage, not just the ones in views. That’s intentional — every Task { should be a deliberate decision, and the // swiftlint:disable:next annotation forces the developer to acknowledge it. The false positives are a feature, not a bug.

These rules catch the most frequent agent mistakes at build time. The agent learns nothing between sessions — it’ll make the same mistakes next time. The linter catches them every time, instantly, before I read a single line.

The Uncomfortable Truth About Quality

Here’s what I’ve found after tracking this over time: agent-generated code that passes my full review process has been comparable in quality to code I write myself. The bugs are different — agents make integration mistakes where I make logic typos — but the overall bar is similar once review discipline is high.

The uncomfortable corollary: code that I review casually — where I skip the checklist because the feature seems simple — has a meaningfully higher defect rate. The agent’s mistakes are predictable and systematic. My review laziness is random and hard to predict.

This means the quality bottleneck isn’t the agent. It’s my review discipline. The tool produces output of consistent quality. I’m the variable.

What I’ve Stopped Doing

A few habits I’ve consciously dropped:

Reading every line of agent-generated test code. If the test names describe meaningful scenarios, and the test count is reasonable (I look at the number, not the content), I trust them and move on. If a test suite has 8 well-named tests for a simple reducer, the agent almost certainly covered the cases that matter. I add my interaction tests on top.

Rewriting for style. The agent’s naming conventions are 90% aligned with mine. The 10% that differs — it might name something fetchUserProfileData where I’d say loadProfile — isn’t worth changing. Consistency within agent-generated code is fine. Consistency with the codebase matters more, and that’s what the architecture context file handles.

Explaining agent-generated code in PRs. I used to add comments justifying the agent’s decisions. Now I treat the code as mine. If I reviewed it, approved it, and shipped it, it’s my code. The origin doesn’t matter. The quality does.

The Framework, Summarised

  1. Classify the code into mechanical, logic, or system-interaction tiers
  2. Escalate review intensity with each tier
  3. Let the agent write unit tests, then add interaction and temporal tests yourself
  4. Automate invariant checking with linting rules targeting known agent anti-patterns
  5. Snapshot test UI to catch visual drift
  6. Don’t trust the happy path — agents nail it every time, which makes it easy to forget that the edge cases are where production breaks

The tools will get better. The agents will learn project context more deeply. But the fundamental dynamic won’t change: generated code requires judgement that can’t itself be generated. The reviewer is not disposable. The reviewer is the product.