Introduction

The emergence of ChatGPT and other Large Language Models (LLMs) in 2022-2023 created a new wave of hype around AI in testing. The promises are tempting: AI generates tests and test data, finds bugs, and writes documentation. But what actually works now, and what remains marketing?

In this article, we’ll explore practical application of LLMs in QA: where they genuinely help, how to use them effectively, and — critically important — where they can cause harm. After a year of actively using GPT-4, Claude, and other models in production QA processes, we have enough data for an honest analysis.

To expand your AI testing knowledge, see our guide on AI-powered test generation tools, explore Claude and GPT-4 integration cases, and learn about AI copilot strategies for test automation.

What are LLMs and Why They Matter for QA

Large Language Models Explained

An LLM (Large Language Model) is a neural network trained on massive amounts of text to predict the next word in a sequence.

Key models in 2025:

  • GPT-4 (OpenAI): Most powerful, best reasoning
  • Claude 3 (Anthropic): Large context window (200K tokens), good for code
  • Gemini Pro (Google): Multimodal, Google Workspace integration
  • Llama 3 (Meta): Open-source, can run locally

Why LLMs Are a Game Changer for QA

1. Natural Language Understanding:

Human: "Create tests for login flow with email validation"

LLM: ✓ Understands context ("login flow")
     ✓ Knows what email validation is
     ✓ Can generate code in any language

2. Code Generation:

  • Writes tests in Selenium, Cypress, Playwright
  • Generates test data
  • Creates mock responses

3. Context-aware Analysis:

  • Analyzes code and suggests edge cases
  • Finds potential bugs in logic
  • Generates documentation

But remember: LLMs don’t “think”, they statistically predict the most likely next token.

Test Data Generation

Realistic Test Data Generation

Problem: Creating realistic test data manually takes hours.

Solution with LLM:

# Prompt for ChatGPT
"""
Generate 50 realistic test users for an e-commerce platform.
Include: name, email, age, country, payment_method, loyalty_tier.
Make data diverse (different countries, age groups).
Output as JSON array.
"""

# GPT-4 response:
[
  {
    "name": "Maria Garcia",
    "email": "m.garcia82@gmail.com",
    "age": 34,
    "country": "Spain",
    "payment_method": "credit_card",
    "loyalty_tier": "gold"
  },
  {
    "name": "Kenji Tanaka",
    "email": "kenji.t@yahoo.co.jp",
    "age": 28,
    "country": "Japan",
    "payment_method": "paypal",
    "loyalty_tier": "silver"
  },
  // ... 48 more realistic users
]

Benefits:

  • Data is diverse and realistic
  • Considers cultural specifics (Japanese names, Spanish surnames)
  • Generation in seconds instead of hours

Advanced: Contextual test data

# Using LangChain for programmatic generation
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4", temperature=0.7)

prompt = PromptTemplate(
    input_variables=["scenario", "count"],
    template="""
    Generate {count} test data entries for scenario: {scenario}

    Requirements:
    - Data must be realistic and diverse
    - Include edge cases (empty strings, special chars, very long values)
    - Cover positive and negative test cases
    - Output as JSON
    """
)

# Generate data for specific scenario
test_data = llm.predict(
    prompt.format(
        scenario="User registration with international phone numbers",
        count=30
    )
)

Result:

[
  {"phone": "+1-555-0123", "country": "US", "valid": true},
  {"phone": "+44 20 7946 0958", "country": "UK", "valid": true},
  {"phone": "+81-3-1234-5678", "country": "Japan", "valid": true},
  {"phone": "not-a-phone", "country": "US", "valid": false},
  {"phone": "", "country": "US", "valid": false},
  {"phone": "+1" + "5"*100, "country": "US", "valid": false},  // Edge case
  // ...
]
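Generated data like this still needs a sanity check, because the model can produce phone numbers that only look plausible. A minimal sketch of cross-checking the generated "valid" flags with the phonenumbers library before the data enters a test suite (the library and the helper name are assumptions, not part of the original workflow):

import json
import phonenumbers  # pip install phonenumbers

def find_mislabeled_phones(raw_json: str) -> list:
    """Cross-check LLM-generated 'valid' flags against a real validator."""
    mismatches = []
    for entry in json.loads(raw_json):
        try:
            parsed = phonenumbers.parse(entry["phone"], None)  # expects +<country code> format
            actually_valid = phonenumbers.is_valid_number(parsed)
        except phonenumbers.NumberParseException:
            actually_valid = False
        if actually_valid != entry["valid"]:
            mismatches.append(entry)
    return mismatches

# Any mismatches mean the LLM mislabeled a case and the dataset needs review:
# print(find_mislabeled_phones(test_data))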

Domain-specific Data

E-commerce product data:

Prompt: Generate 20 product listings for outdoor gear store.
Include: name, description, price, category, stock, ratings.
Make descriptions SEO-friendly and realistic.

Financial test data:

"""
Generate 15 bank transaction records for fraud detection testing.
Include:
- 10 legitimate transactions
- 5 suspicious transactions (unusual amounts, locations, patterns)
Mark each with 'is_fraud' flag and reason.
"""

LLM understands what “suspicious pattern” means and generates:

  • $10,000 transaction at 3am from new location
  • Multiple small consecutive transactions (card testing)
  • Purchases in two different countries in short time
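One practical detail that applies to every domain: the model returns free text, so the JSON or CSV you asked for may arrive wrapped in a markdown fence or surrounded by commentary. A minimal, hedged sketch of a defensive parser (the fence-stripping regex is an assumption about typical model output, not a guarantee):

import json
import re

def parse_llm_json(response_text: str):
    """Extract a JSON payload from an LLM response that may include fences or prose."""
    # Strip ```json ... ``` fences if the model added them
    fenced = re.search(r"```(?:json)?\s*(.*?)```", response_text, re.DOTALL)
    payload = fenced.group(1) if fenced else response_text
    try:
        return json.loads(payload)
    except json.JSONDecodeError as err:
        # Fail loudly: silently accepting malformed data defeats the point of generated test data
        raise ValueError(f"LLM returned invalid JSON: {err}") from err

# records = parse_llm_json(llm_response)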

Creating Test Cases with AI

From Requirements to Tests

Traditional process:

  1. Read requirements (30 min)
  2. Think up test cases (1-2 hours)
  3. Write tests (2-3 hours)

With LLM:

Prompt:
---
User Story:
"As a user, I want to reset my password via email so that I can regain access if I forget it."

Acceptance Criteria:
- User enters email address
- System sends reset link to email
- Link expires after 24 hours
- User sets new password (min 8 chars, 1 uppercase, 1 number)
- Old password no longer works

Generate comprehensive test cases covering positive, negative, and edge cases.
---

GPT-4 Output:

Test Cases:

POSITIVE SCENARIOS:
TC1: Valid password reset flow
  Steps:
    1. Navigate to "Forgot Password" page
    2. Enter valid registered email
    3. Click "Send Reset Link"
    4. Check email inbox
    5. Click reset link
    6. Enter new valid password: "NewPass123"
    7. Confirm password
    8. Submit
  Expected: Password reset successful, can login with new password

TC2: Reset link works within 24h window
  Steps: [Similar to TC1, but verify link works at 23:59 mark]
  Expected: Link still functional

NEGATIVE SCENARIOS:
TC3: Unregistered email
  Steps: Enter email not in system
  Expected: Generic message "If email exists, reset link sent" (security)

TC4: Expired reset link (>24h)
  Steps: Use reset link after 24 hour period
  Expected: Error "Link expired. Request new reset link"

TC5: Weak password validation
  Test data:
    - "pass" (too short)
    - "password" (no uppercase/number)
    - "PASSWORD" (no lowercase/number)
    - "12345678" (no letters)
  Expected: Error message for each

EDGE CASES:
TC6: Multiple reset requests
  Steps: Request reset 5 times for same email
  Expected: Only latest link works, previous invalidated

TC7: Reset link reuse attempt
  Steps: Use same link twice
  Expected: Second attempt fails "Link already used"

SECURITY:
TC8: Brute force protection
  Steps: Attempt 10 password resets in 1 minute
  Expected: Rate limiting applied

Value proposition:

  • 8 comprehensive test cases in about 30 seconds
  • Covers scenarios easy to miss
  • Can be used as a base and expanded (see the sketch below)
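As a sketch of that expansion, TC1 maps onto a runnable skeleton fairly directly. The example below assumes pytest-playwright; the URL, the data-testid selectors, and the reset_link_from_email fixture are hypothetical placeholders for your application:

# test_password_reset.py -- skeleton derived from the generated TC1
from playwright.sync_api import Page, expect

def test_valid_password_reset_flow(page: Page, reset_link_from_email):
    # Steps 1-3: request the reset link
    page.goto("https://example.com/forgot-password")  # placeholder URL
    page.get_by_test_id("email-input").fill("user@example.com")
    page.get_by_test_id("send-reset-link").click()

    # Steps 4-5: reset_link_from_email is a hypothetical fixture that polls the test inbox
    page.goto(reset_link_from_email)

    # Steps 6-8: set and confirm the new password
    page.get_by_test_id("new-password").fill("NewPass123")
    page.get_by_test_id("confirm-password").fill("NewPass123")
    page.get_by_test_id("submit").click()

    expect(page.get_by_text("Password reset successful")).to_be_visible()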

Code Review with AI

Automated Code Review

Use case: Pull request with new tests

# Prompt for GPT-4
"""
Review this Selenium test code for issues:

def test_login():
    driver.get("http://app.com/login")
    driver.find_element(By.ID, "user").send_keys("admin")
    driver.find_element(By.ID, "pass").send_keys("admin123")
    driver.find_element(By.ID, "btn").click()
    time.sleep(5)
    assert "Dashboard" in driver.page_source

Focus on:
- Test reliability (flakiness potential)
- Best practices violations
- Maintainability issues
"""

GPT-4 Review:

Code Review Issues

🔴 Critical Issues

  1. Hard-coded sleep (line 5)

    • Problem: time.sleep(5) causes flaky tests
    • Impact: Test may fail on slow systems or pass on fast ones
    • Fix:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "dashboard"))
    )

  2. Fragile assertion (line 6)

    • Problem: Checking entire page_source is unreliable
    • Impact: May break with whitespace/formatting changes
    • Fix:

    dashboard_element = driver.find_element(By.ID, "dashboard")
    assert dashboard_element.is_displayed()

🟡 Medium Issues

  1. Hard-coded URL

    • Problem: URL embedded in test
    • Fix: Use config/environment variables

    driver.get(f"{config.BASE_URL}/login")

  2. No error handling

    • Missing: What if login fails?
    • Add: Explicit wait for success/error message

  3. Weak selectors

    • By.ID, "btn" is very generic
    • Better: By.CSS_SELECTOR, "[data-testid='login-button']"

Value:

  • Finds issues humans easily miss
  • Suggests specific fixes
  • Teaches best practices
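Reviews like this can also be triggered automatically on every pull request. A minimal sketch using the openai Python package (v1.x); the diff-gathering command, the tests/ path filter, and the model name are assumptions to adapt to your pipeline:

# review_diff.py -- hedged sketch of an automated review step in CI
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_test_changes(base_branch: str = "main") -> str:
    # Collect only changed test files; adjust the path filter to your repo layout
    diff = subprocess.check_output(
        ["git", "diff", base_branch, "--", "tests/"], text=True
    )
    if not diff.strip():
        return "No test changes to review."
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: use whatever model your team has access to
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You review automated test code for flakiness, weak selectors, "
                        "hard-coded waits, and maintainability issues."},
            {"role": "user", "content": f"Review this diff:\n\n{diff}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(review_test_changes())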

Risks and Limitations

Hallucinations

Problem #1: LLM "invents" non-existent APIs

# Prompt: "Generate Cypress test for file upload"

# GPT might generate:
cy.uploadFile('test.pdf')  // ❌ This method doesn't exist!

# Correct:
cy.get('input[type="file"]').selectFile('test.pdf')  // ✅

Why it’s dangerous:

  • Code looks plausible
  • Junior QA might not notice
  • Waste of time debugging

Mitigation:

  • Always verify generated code (a cheap automated pre-check is sketched below)
  • Use IDE with autocomplete for validation
  • Code review is mandatory
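One cheap automated pre-check, as a minimal sketch: parse generated Python before it ever lands in the repo. This catches syntax errors only; a call to a non-existent method still parses fine, so human review stays mandatory:

import ast

def generated_code_parses(code: str) -> bool:
    """Pre-check for LLM-generated Python test code: catches syntax errors only."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

# if not generated_code_parses(llm_output):
#     reject the suggestion before anyone spends time debugging it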

Outdated Knowledge

Problem #2: Knowledge cutoff date

GPT-4's training data has a cutoff (around 2023, depending on the model version), so it doesn't know frameworks and library releases that came after it.

# Prompt: "Generate Playwright test"

# GPT might use old syntax:
await page.click('#button')  // Legacy element-handle API, discouraged in current docs

# Recommended locator-based syntax:
await page.locator('#button').click()  // Current best practice

Mitigation:

  • Specify versions in prompt: “Generate Playwright 1.40 test”
  • Verify against current documentation
  • Use plugins with current data (if available)

Security Risks

Problem #3: Sensitive data leakage

# ❌ NEVER do this:
prompt = f"""
Review this code:
{code_with_api_keys}  # Sending secrets to OpenAI!
"""

Best practices:

  • Sanitize code before sending to LLM
  • Use local LLM for sensitive code (Llama 3)
  • Redact credentials/secrets
# ✅ Correct:
import re

def sanitize_code(code):
    # Remove API keys
    code = re.sub(r'api_key\s*=\s*["\'][^"\']+["\']', 'api_key="REDACTED"', code)
    # Remove passwords
    code = re.sub(r'password\s*=\s*["\'][^"\']+["\']', 'password="REDACTED"', code)
    return code

clean_code = sanitize_code(original_code)
# Now safe to send to LLM

Quality Consistency

Problem #4: Quality varies

Same prompt → different results due to temperature parameter.

# Temperature = 0.0 → Deterministic (same output)
# Temperature = 0.7 → Creative (varied output)
# Temperature = 1.0+ → Chaotic

For tests:

  • Use temperature=0 for consistency (see the sketch below)
  • Verify results multiple times
  • Don’t trust blindly
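In code this means pinning the temperature in the client rather than relying on a chat UI. A minimal sketch reusing the LangChain-style setup from earlier (the model name and package version are assumptions):

from langchain.chat_models import ChatOpenAI

# temperature=0 makes output as repeatable as the model allows (not a hard guarantee)
deterministic_llm = ChatOpenAI(model="gpt-4", temperature=0)
creative_llm = ChatOpenAI(model="gpt-4", temperature=0.7)  # brainstorming only

# Anything that ends up in the test suite should come from the deterministic client
test_cases = deterministic_llm.predict(
    "Generate boundary-value test cases for a quantity field that accepts 1-100"
)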

Over-reliance Danger

Problem #5: “The AI wrote the test, so it must be correct”

# AI generated test
def test_user_registration():
    response = api.register(email="test@test.com", password="pass")
    assert response.status_code == 200  # ❌ Not enough!

What’s missing (see the fuller sketch after this list):

  • Verify user created in DB
  • Email verification sent
  • Password properly hashed
  • No duplicates
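A fuller version might look like the sketch below; the db and mail fixtures are hypothetical helpers standing in for however your project inspects the database and the outgoing mail queue:

def test_user_registration_complete(api, db, mail):
    email = "test@test.com"
    response = api.register(email=email, password="Str0ngPass!")
    assert response.status_code == 200

    # Verify the user actually exists in the database (hypothetical helper)
    user = db.get_user_by_email(email)
    assert user is not None

    # Password must be stored hashed, never in plain text
    assert user.password_hash != "Str0ngPass!"

    # A verification email should have been queued (hypothetical helper)
    assert mail.last_message_to(email) is not None

    # Registering the same email again must be rejected
    duplicate = api.register(email=email, password="Str0ngPass!")
    assert duplicate.status_code == 409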

Rule: AI is an assistant, not a replacement for critical thinking.

Best Practices for Using LLMs in QA

1. Effective Prompting

Bad prompt:

Generate tests for login

Good prompt:

Generate Cypress tests for login functionality.

Context:
- App: E-commerce site
- Framework: Cypress 13.x
- Pattern: Page Object Model
- Authentication: JWT tokens

Requirements:
- Cover positive and negative scenarios
- Include edge cases (special chars in password, etc)
- Add proper waits (no hard-coded sleeps)
- Use data-testid selectors
- Add clear comments

Output: Complete test file with imports and fixtures

Result: Significantly better code quality

2. Iterative Refinement

User: Generate API test for user registration

GPT: [generates basic test]

User: Add validation for:
- Email format
- Password strength requirements (8+ chars, 1 uppercase, 1 number)
- Duplicate email handling

GPT: [refines test]

User: Convert to pytest with fixtures and parametrize for multiple test data

GPT: [final version]

Each iteration improves result.

3. Use LLM as Learning Tool

Prompt: Explain what this test code does, line by line:

[paste complex test]

Then suggest improvements and explain why they're better.

Value: Learning + code review in one

4. Human-in-the-loop

Workflow:
1. LLM generates test → Draft
2. QA reviewer → Adjusts & approves
3. CI/CD runs test → Validates
4. Feedback loop → Improves prompts

Never fully automated — always human review.

Real-world Use Cases

Case 1: Regression Test Suite Generation

Company: E-commerce SaaS (500K LOC)

Challenge: Legacy code without tests, need 80% coverage

Solution:

  1. Extracted list of all API endpoints
  2. For each endpoint → GPT-4 prompt (a rough automation sketch follows this list):
    Generate comprehensive API tests for:
    POST /api/orders
    
    [Include Swagger spec]
    
    Cover: CRUD operations, validation, auth, edge cases
    
  3. Generated 2,300 tests in 2 days
  4. Human review + fixes → 1 week
  5. Final: 1,800 working tests (78% auto-generated)
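In code, step 2 of that pipeline might look roughly like the sketch below. The generate() wrapper, the swagger.json location, and the output layout are hypothetical; the OpenAPI "paths" structure is standard:

import json
from pathlib import Path

from llm_client import generate  # hypothetical wrapper around your LLM client of choice

spec = json.loads(Path("swagger.json").read_text())

for path, methods in spec["paths"].items():
    for method, definition in methods.items():
        prompt = (
            f"Generate comprehensive pytest API tests for:\n"
            f"{method.upper()} {path}\n\n"
            f"Endpoint spec:\n{json.dumps(definition, indent=2)}\n\n"
            f"Cover: CRUD operations, validation, auth, edge cases."
        )
        test_code = generate(prompt)
        safe_name = path.strip("/").replace("/", "_").replace("{", "").replace("}", "")
        out_file = Path("generated_tests") / f"test_{method}_{safe_name}.py"
        out_file.parent.mkdir(exist_ok=True)
        out_file.write_text(test_code)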

ROI:

  • Manual writing: ~6 months
  • With GPT-4: 2 weeks
  • Savings: ~$120K

Case 2: Test Data Generation for ML

Company: Fintech (fraud detection ML)

Challenge: Need realistic fraudulent transaction patterns

Solution:

prompt = """
Generate 100 realistic credit card transaction records.
Include 20 fraudulent patterns:
- Card testing (multiple small charges)
- Account takeover (sudden large purchases)
- Geographic anomalies (purchases in 2 countries within hours)
- Unusual merchant categories

Make legitimate transactions realistic too.
Output: CSV
"""

Result:

  • ML model learned to recognize more subtle patterns
  • Precision increased 12%
  • False positives decreased 8%

The Future of LLMs in Testing

1. Specialized QA LLMs:

  • Models trained specifically on QA data
  • Better understanding of test patterns
  • Fewer hallucinations for testing tasks

2. Agentic workflows:

# AI agent autonomously:
1. Analyze requirements
2. Generate tests
3. Run tests
4. Analyze failures
5. Fix flaky tests
6. Report results

# Human only approves/guides

3. Multi-modal testing:

  • LLM + Computer Vision for UI testing
  • “Look at screenshot and tell me what’s broken”

4. Real-time test generation:

# During exploratory testing:
QA action → LLM generates test → Auto-adds to suite

# Converts manual testing to automated

Conclusion

ChatGPT and LLMs are powerful tools for QA, but not a silver bullet.

Where LLMs are genuinely useful:

✅ Test data generation (90% time saved)

✅ Creating basic test cases (70% faster)

✅ Code review (finds 60-70% of obvious issues)

✅ Documentation generation (95% automation)

✅ Learning & upskilling (endless mentor)

Where LLMs DON’T replace humans:

❌ Critical thinking (edge cases require domain knowledge)

❌ Test strategy (what to test and why)

❌ Bug investigation (root cause analysis)

❌ Context understanding (business specifics)

Golden Rule:

An LLM is like a very capable junior QA: it generates quickly but requires supervision. Don’t trust it blindly. Always verify.

Practical recommendations:

  1. Start small: Use for test data generation
  2. Build a prompts library: Save successful prompts (a minimal sketch follows below)
  3. Set up guardrails: Sanitization, review process
  4. Measure impact: Track time saved, quality metrics
  5. Train team: Not everyone knows how to prompt effectively
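For recommendation 2, the prompt library does not need to be fancy; a minimal sketch of version-controlled templates kept next to the test code (names and placeholders are illustrative):

# prompts.py -- tiny, version-controlled prompt library
PROMPTS = {
    "api_tests": (
        "Generate {framework} API tests for {endpoint}.\n"
        "Cover: CRUD operations, validation, auth, edge cases.\n"
        "Use data-driven parametrization and no hard-coded waits."
    ),
    "test_data": (
        "Generate {count} realistic test records for: {scenario}.\n"
        "Include edge cases and output as a JSON array."
    ),
}

def render(name: str, **kwargs) -> str:
    return PROMPTS[name].format(**kwargs)

# render("api_tests", framework="pytest + requests", endpoint="POST /api/orders")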

LLMs in testing are a future that has already arrived. The question isn’t whether to use them, but how to use them effectively and safely.

See Also