Over the past year, I conducted a deliberate experiment with large language models (LLMs) in my development workflow. I explored two methods: agentic editing, where the model iteratively modifies code with limited human input, and LLM-assisted coding, where the model generates new code from prompts.

This is a self case study, grounded in real experience. The conclusion is clear: LLMs are not suitable for writing production code in security-sensitive systems, but they are highly effective for documentation under strict human oversight.

To see how I used LLMs outside of work, refer to this link.

Experiment Scope

I allowed LLMs to:

  • Generate new code from prompts
  • Modify existing code iteratively
  • Refactor logic across multiple files
  • Produce documentation from notes and repository context

The goal was to test whether the promised productivity gains could hold up under real engineering and security constraints. I tracked not only whether code “worked” superficially but also whether it met security and maintainability standards.

Observed Pain Points and Patterns

Over the course of the year, the following patterns emerged across multiple tasks and iterations.

1. Repeated False Information

LLMs repeatedly produced incorrect code with high confidence, giving the impression that the code was safe or correct. The false information was often subtle and difficult to detect without careful review.

Example: Role check function

Iteration 1

func CheckUserRole(user User, role string) bool {
    if user.Role == role || user.Role == "admin" {
        return true
    }
    return false
}

Problem: Assumes user.Role is sanitized. No input validation. Could be exploited if role comes from untrusted sources.

Iteration 2

func CheckUserRole(user User, role string) bool {
    return user.Role == role || user.Role == "superuser"
}

Problem: Introduces inconsistent logic. Role hierarchy is now undefined. Still no input validation.

Iteration 3

func CheckUserRole(user User, role string) bool {
    if role == "" {
        return false
    }
    return user.Role == role
}

Problem: The admin and superuser overrides are now ignored entirely. The model “corrects” one flaw while introducing another.

Iteration 4

func CheckUserRole(user User, role string) bool {
    validRoles := []string{"user", "admin", "superuser"}
    if !contains(validRoles, role) {
        return false
    }
    return user.Role == role
}

Problem: Model now hardcodes valid roles instead of using dynamic configuration. Introduces maintenance risk if role definitions change. Each iteration fixes one problem while adding another layer of assumptions.

Insight: LLMs do not recognize the concept of trust boundaries or proper input validation. They simply suggest patterns, which is why repeated false information accumulates.
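For contrast, a version that takes the trust boundary seriously might look like the sketch below. The injected validRoles set and the changed signature are my additions, not model output; the point is that valid roles arrive from configuration rather than a hardcoded literal.

```go
package main

import "fmt"

// User mirrors the struct assumed in the iterations above.
type User struct {
	Role string
}

// CheckUserRole validates the requested role against an injected set of
// valid roles before comparing, making the trust boundary explicit.
// The caller supplies validRoles from configuration, not a literal.
func CheckUserRole(user User, role string, validRoles map[string]bool) bool {
	if role == "" || !validRoles[role] {
		return false // reject empty or unknown roles from untrusted input
	}
	return user.Role == role
}

func main() {
	roles := map[string]bool{"user": true, "admin": true, "superuser": true}
	fmt.Println(CheckUserRole(User{Role: "admin"}, "admin", roles))
	fmt.Println(CheckUserRole(User{Role: "admin"}, "root", roles))
}
```

Whether admin should also satisfy a check for lesser roles is a policy decision; pushing it into configuration keeps that decision out of the comparison logic.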

2. Hidden Assumptions

Generated code frequently embedded assumptions that were invisible until runtime. These included input types, initialization states, and system behavior.

Example: Session token storage

Iteration 1

func StoreSession(token string) {
    sessionMemory[token] = true
}

Problem: Tokens are stored in plain memory. Volatile and insecure.

Iteration 2

func StoreSession(token string) error {
    sessionMemory[token] = struct{}{}
    return nil
}

Problem: No improvement in security; still assumes memory store is sufficient for all environments.

Iteration 3

func StoreSession(token string) {
    encrypted := Encrypt(token)
    sessionMemory[encrypted] = true
}

Problem: Assumes Encrypt exists and is secure. The model does not define it. Hidden dependency is introduced.

Iteration 4

func StoreSession(token string, key []byte) error {
    cipher, err := aes.NewCipher(key)
    if err != nil {
        return err
    }
    encrypted := make([]byte, len(token))
    cipher.Encrypt(encrypted, []byte(token))
    sessionMemory[string(encrypted)] = true
    return nil
}

Problem: Improper use of AES. A raw block cipher encrypts exactly one 16-byte block, so this call panics for tokens shorter than a block and silently leaves everything beyond the first block unencrypted for longer ones; repeating it per block would amount to ECB mode. The model handles neither IVs nor authentication. Hidden assumptions escalate into real vulnerabilities.

Insight: Each iteration shows how the model produces confident, plausible-looking code that silently violates security best practices.
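For contrast, one conventional alternative is to avoid encryption entirely and store only a hash of the token, since the server never needs to recover it. This is a minimal sketch under that assumption; keying sessionMemory by a SHA-256 digest is my design choice, not something the model suggested.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// sessionMemory is keyed by a hash of the token, so a memory dump or a
// log of the map never reveals a usable session token.
var sessionMemory = map[[32]byte]bool{}

// StoreSession hashes the token before storing it. Unlike the AES
// attempt above, hashing needs no key management: the server only ever
// compares tokens, it never needs to decrypt them.
func StoreSession(token string) {
	sessionMemory[sha256.Sum256([]byte(token))] = true
}

// ValidateSession checks a presented token against the hashed store.
func ValidateSession(token string) bool {
	return sessionMemory[sha256.Sum256([]byte(token))]
}

func main() {
	StoreSession("example-token")
	fmt.Println(ValidateSession("example-token"))
	fmt.Println(ValidateSession("wrong-token"))
}
```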

3. Logical Flaws

Even compilable code frequently contained subtle logical errors.

Example: JWT signing

Iteration 1

func GenerateJWT(userID string) string {
    token := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{
        "sub": userID,
        "exp": time.Now().Add(time.Hour * 24),
    })
    tokenString, _ := token.SignedString([]byte("default-secret"))
    return tokenString
}

Problem: Hardcoded secret, no key rotation.

Iteration 2

func GenerateJWT(userID string, secret string) string {
    token := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{
        "sub": userID,
        "exp": time.Now().Add(time.Hour * 24),
    })
    tokenString, _ := token.SignedString([]byte(secret))
    return tokenString
}

Problem: Parameterized secret but no validation or algorithm negotiation. Still unsafe for production.

Iteration 3

func GenerateJWT(userID string, secret string) (string, error) {
    token := jwt.NewWithClaims(jwt.SigningMethodHS512, jwt.MapClaims{
        "sub": userID,
        "exp": time.Now().Add(time.Minute * 30),
    })
    return token.SignedString([]byte(secret))
}

Problem: Algorithm changed without reasoning. Shortened expiration inconsistent with other services. Subtle logic issues persist.

Iteration 4

func GenerateJWT(userID string, secret []byte) (string, error) {
    key := hmac.New(sha256.New, secret)
    token := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{
        "sub": userID,
        "exp": time.Now().Add(time.Hour),
    })
    return token.SignedString(key.Sum(nil))
}

Problem: Model still assumes hashing will solve key management issues. Introduces dependency on correct HMAC usage without explanation.

Insight: LLMs fix one problem but create new logical assumptions that could lead to vulnerabilities.


4. Debugging Becomes Inefficient

LLMs cannot reason about why code fails. Suggested fixes are pattern-based, not analytical. Debugging often requires fully re-deriving logic manually.

Example: USB device reading

Iteration 1

func ReadDevice(device *USBDevice) []byte {
    buffer := make([]byte, 1024)
    device.Read(buffer)
    return buffer
}

Iteration 2

func ReadDevice(device *USBDevice) []byte {
    buffer := make([]byte, 2048)
    n, _ := device.Read(buffer)
    return buffer[:n]
}

Iteration 3

func ReadDevice(device *USBDevice) ([]byte, error) {
    buffer := make([]byte, device.MaxPacketSize())
    n, err := device.Read(buffer)
    if err != nil {
        return nil, err
    }
    return buffer[:n], nil
}

Iteration 4

func ReadDevice(device *USBDevice) ([]byte, error) {
    buffer := make([]byte, device.MaxPacketSize())
    total := 0
    for total < len(buffer) {
        n, err := device.Read(buffer[total:])
        if err != nil {
            return nil, err
        }
        total += n
    }
    return buffer[:total], nil
}

Problem: Each iteration fixes one crash but introduces new assumptions about device state. Iteration 4 blocks until the buffer is completely full, so it spins forever on a device that has no more data, and it discards partial data whenever Read returns an error such as EOF. Debugging requires reconstructing the correct logic manually; the LLM cannot explain its reasoning.

5. Increased Review Overhead

When refactoring across multiple files, LLMs introduced inconsistent assumptions and hidden dependencies.

Example: Multi-file session management

// File 1
func ValidateSession(token string) bool {
    return sessionStore[token]
}

// File 2
func ExpireSession(token string) {
    delete(sessionMemory, token)
}

// File 3
func RefreshSession(token string) {
    sessionStore[token] = true
}

Problem: sessionStore and sessionMemory are inconsistent: validation and refresh use one map while expiry deletes from the other, so expired sessions remain valid. Tracing hidden dependencies like this across files often takes longer than writing the code manually.

Insight: Each iteration compounds risk. LLMs do not understand global invariants or system design, so repeated hallucinations accumulate into maintainability and security hazards.
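One way to restore the invariant is to put all session state behind a single type, so the three operations cannot drift onto different global maps. This sketch, including the mutex for concurrent access, is my suggestion rather than anything the model produced.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionStore owns all session state, so validate, expire, and refresh
// necessarily operate on the same map.
type SessionStore struct {
	mu       sync.Mutex
	sessions map[string]bool
}

func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: map[string]bool{}}
}

func (s *SessionStore) Validate(token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.sessions[token]
}

func (s *SessionStore) Expire(token string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.sessions, token)
}

func (s *SessionStore) Refresh(token string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[token] = true
}

func main() {
	store := NewSessionStore()
	store.Refresh("t")
	fmt.Println(store.Validate("t"))
	store.Expire("t")
	fmt.Println(store.Validate("t"))
}
```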

Core Findings

LLM-assisted coding and agentic editing create the illusion of productivity. Code appears quickly, but momentum collapses under scrutiny. The main issues are:

  • Repeated false information
  • Hidden assumptions about input and system state
  • Logical flaws even in compilable code
  • Debugging inefficiency due to lack of reasoning
  • Increased overhead for verification and review

From a security perspective, even one hallucinated assumption can introduce a critical vulnerability.

Documentation Benefits

Unlike executable code, documentation does not directly impact security. Using LLMs to generate documentation from notes, context, and repository insights saved hours of work. I could focus on:

  • Editing and verifying accuracy
  • Structuring sections logically
  • Maintaining consistent terminology

Even when initial output contained errors, reviewing and correcting was faster than writing from scratch. Documentation became more maintainable and readable.

Freeing Time for Core Engineering

Offloading the mechanical drafting of documentation allowed me to:

  • Write production code manually and securely
  • Review security-sensitive paths thoroughly
  • Perform threat modeling
  • Reason deeply about system behavior

Without LLM assistance, this time would have been consumed by formatting and structuring text, and documentation would have been delayed or skipped entirely.

Human Oversight Remains Mandatory

At no point did I treat the LLM as a source of truth. Every document required verification, and every assertion had to be checked. Exaggeration, ambiguity, and omissions had to be corrected.

Reviewing and editing a generated draft is faster than writing one from scratch, but human oversight is non-negotiable.

Conclusion

This self case study reshaped how I use LLMs. I no longer allow them to write production code. LLM-assisted coding and agentic editing repeatedly produced false information, hidden assumptions, and logical flaws. Debugging generated code is extremely inefficient because the model cannot reason about why the code fails. From a security engineering perspective, LLM-generated code is a vulnerability waiting to happen. One hallucination can introduce a critical failure.

At the same time, I actively use LLMs for documentation. With strict human oversight, they enforce a minimum quality bar, save hours of effort, and prevent documentation from being deprioritized.

The takeaway is not that LLMs are useless. It is that they must be placed in the correct role: they can describe systems, but they must never be trusted to build, enforce, or debug them.