Introduction

The emergence of ChatGPT and other Large Language Models (LLMs) in 2022-2023 created a new wave of hype around AI in testing. The promises are tempting: AI generates tests and test data, finds bugs, and writes documentation (as discussed in AI Copilot for Test Automation: GitHub Copilot, Amazon CodeWhisperer and the Future of QA). But what actually works today, and what remains marketing (as discussed in Prompt Engineering for QA: Mastering Effective AI Queries)?

In this article, we’ll explore the practical application of LLMs (as discussed in Test Automation with Claude and GPT-4: Real Integration Cases and Practical Implementation) in QA: where they genuinely help, how to use them effectively, and, critically, where they can cause harm. After a year of actively using GPT-4, Claude, and other models in production QA processes, we have enough data for an honest analysis.

What are LLMs and Why They Matter for QA

Large Language Models Explained

An LLM (Large Language Model) is a neural network trained on massive amounts of text to predict the next token (roughly, the next word) in a sequence.

Key models in 2025:

  • GPT-4 (OpenAI): Most powerful, best reasoning
  • Claude 3 (Anthropic): Large context window (200K tokens), good for code
  • Gemini Pro (Google): Multimodal, Google Workspace integration
  • Llama 3 (Meta): Open-source, can run locally

Why LLMs = Game Changer for QA

1. Natural Language Understanding:

Human: "Create tests for login flow with email validation"

LLM: ✓ Understands context ("login flow")
     ✓ Knows what email validation is
     ✓ Can generate code in any language

2. Code Generation:

  • Writes tests in Selenium, Cypress, Playwright
  • Generates test data
  • Creates mock responses

3. Context-aware Analysis:

  • Analyzes code and suggests edge cases
  • Finds potential bugs in logic
  • Generates documentation

But remember: LLMs don’t “think”; they statistically predict the most likely continuation. Keep that in mind throughout.

Test Data Generation

Realistic Test Data Generation

Problem: Creating realistic test data manually takes hours.

Solution with LLM:

# Prompt for ChatGPT
"""
Generate 50 realistic test users for an e-commerce platform.
Include: name, email, age, country, payment_method, loyalty_tier.
Make data diverse (different countries, age groups).
Output as JSON array.
"""

# GPT-4 response:
[
  {
    "name": "Maria Garcia",
    "email": "m.garcia82@gmail.com",
    "age": 34,
    "country": "Spain",
    "payment_method": "credit_card",
    "loyalty_tier": "gold"
  },
  {
    "name": "Kenji Tanaka",
    "email": "kenji.t@yahoo.co.jp",
    "age": 28,
    "country": "Japan",
    "payment_method": "paypal",
    "loyalty_tier": "silver"
  },
  // ... 48 more realistic users
]

Benefits:

  • Data is diverse and realistic
  • Considers cultural specifics (Japanese names, Spanish surnames)
  • Generation in seconds instead of hours

Advanced: Contextual test data

# Using LangChain for programmatic generation
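# (API shown for older LangChain 0.x releases; newer versions use
#  langchain_openai.ChatOpenAI and .invoke() instead of .predict())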
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4", temperature=0.7)

prompt = PromptTemplate(
    input_variables=["scenario", "count"],
    template="""
    Generate {count} test data entries for scenario: {scenario}

    Requirements:
    - Data must be realistic and diverse
    - Include edge cases (empty strings, special chars, very long values)
    - Cover positive and negative test cases
    - Output as JSON
    """
)

# Generate data for specific scenario
test_data = llm.predict(
    prompt.format(
        scenario="User registration with international phone numbers",
        count=30
    )
)

Result:

[
  {"phone": "+1-555-0123", "country": "US", "valid": true},
  {"phone": "+44 20 7946 0958", "country": "UK", "valid": true},
  {"phone": "+81-3-1234-5678", "country": "Japan", "valid": true},
  {"phone": "not-a-phone", "country": "US", "valid": false},
  {"phone": "", "country": "US", "valid": false},
  {"phone": "+1" + "5"*100, "country": "US", "valid": false},  // Edge case
  // ...
]
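
Data like this becomes useful once it feeds a parametrized test. Below is a minimal sketch, assuming the JSON above is saved as phone_numbers.json and that the application exposes a validate_phone() function (the myapp.validation module and the function name are hypothetical):

import json

import pytest

from myapp.validation import validate_phone  # hypothetical module under test

with open("phone_numbers.json") as f:
    PHONE_CASES = json.load(f)

@pytest.mark.parametrize("case", PHONE_CASES, ids=lambda c: c["phone"] or "<empty>")
def test_phone_validation(case):
    # Each record carries the expected outcome in its "valid" flag,
    # so the test simply checks the implementation against that label.
    assert validate_phone(case["phone"], case["country"]) == case["valid"]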

Domain-specific Data

E-commerce product data:

Prompt: Generate 20 product listings for outdoor gear store.
Include: name, description, price, category, stock, ratings.
Make descriptions SEO-friendly and realistic.

Financial test data:

"""
Generate 15 bank transaction records for fraud detection testing.
Include:
- 10 legitimate transactions
- 5 suspicious transactions (unusual amounts, locations, patterns)
Mark each with 'is_fraud' flag and reason.
"""

LLM understands what “suspicious pattern” means and generates:

  • $10,000 transaction at 3am from new location
  • Multiple small consecutive transactions (card testing)
  • Purchases in two different countries in short time
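
The same prompt can also be run programmatically so the dataset is regenerated on demand. A minimal sketch, assuming the openai v1 Python SDK with an OPENAI_API_KEY in the environment; the model name and the sanity check at the end are illustrative:

import csv
import io
from openai import OpenAI

client = OpenAI()

PROMPT = """
Generate 15 bank transaction records for fraud detection testing.
Include 10 legitimate and 5 suspicious transactions.
Columns: transaction_id, amount, currency, country, timestamp, is_fraud, reason.
Output as CSV with a header row and no commentary.
"""

response = client.chat.completions.create(
    model="gpt-4o",    # any capable model; adjust to what you actually use
    temperature=0,     # reproducible data if the fixture is committed to the repo
    messages=[{"role": "user", "content": PROMPT}],
)

rows = list(csv.DictReader(io.StringIO(response.choices[0].message.content)))

# Never trust the output blindly: check the requested fraud/legit split before use.
fraud_count = sum(r["is_fraud"].strip().lower() in ("true", "1", "yes") for r in rows)
assert fraud_count == 5, f"Expected 5 fraudulent records, got {fraud_count}"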

Creating Test Cases with AI

From Requirements to Tests

Traditional process:

  1. Read requirements (30 min)
  2. Think up test cases (1-2 hours)
  3. Write tests (2-3 hours)

With LLM:

Prompt:
---
User Story:
"As a user, I want to reset my password via email so that I can regain access if I forget it."

Acceptance Criteria:
- User enters email address
- System sends reset link to email
- Link expires after 24 hours
- User sets new password (min 8 chars, 1 uppercase, 1 number)
- Old password no longer works

Generate comprehensive test cases covering positive, negative, and edge cases.
---

GPT-4 Output:

Test Cases:

POSITIVE SCENARIOS:
TC1: Valid password reset flow
  Steps:
    1. Navigate to "Forgot Password" page
    2. Enter valid registered email
    3. Click "Send Reset Link"
    4. Check email inbox
    5. Click reset link
    6. Enter new valid password: "NewPass123"
    7. Confirm password
    8. Submit
  Expected: Password reset successful, can login with new password

TC2: Reset link works within 24h window
  Steps: [Similar to TC1, but verify link works at 23:59 mark]
  Expected: Link still functional

NEGATIVE SCENARIOS:
TC3: Unregistered email
  Steps: Enter email not in system
  Expected: Generic message "If email exists, reset link sent" (security)

TC4: Expired reset link (>24h)
  Steps: Use reset link after 24 hour period
  Expected: Error "Link expired. Request new reset link"

TC5: Weak password validation
  Test data:
    - "pass" (too short)
    - "password" (no uppercase/number)
    - "PASSWORD" (no lowercase/number)
    - "12345678" (no letters)
  Expected: Error message for each

EDGE CASES:
TC6: Multiple reset requests
  Steps: Request reset 5 times for same email
  Expected: Only latest link works, previous invalidated

TC7: Reset link reuse attempt
  Steps: Use same link twice
  Expected: Second attempt fails "Link already used"

SECURITY:
TC8: Brute force protection
  Steps: Attempt 10 password resets in 1 minute
  Expected: Rate limiting applied

Value proposition:

  • 8 comprehensive test cases in about 30 seconds
  • Covers scenarios easy to miss
  • Can use as base and expand
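
If you turn user stories into test cases regularly, it is worth wrapping the prompt into a small helper so the structure stays consistent. A minimal sketch, assuming the openai v1 Python SDK; the model name is a placeholder:

from openai import OpenAI

client = OpenAI()

TEMPLATE = """User Story:
{story}

Acceptance Criteria:
{criteria}

Generate comprehensive test cases covering positive, negative, and edge cases.
Number them TC1, TC2, ... and include steps and expected results.
"""

def generate_test_cases(story: str, criteria: str, model: str = "gpt-4o") -> str:
    """Return the model's test-case draft as plain text for human review."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep drafts reproducible
        messages=[{"role": "user", "content": TEMPLATE.format(story=story, criteria=criteria)}],
    )
    return response.choices[0].message.content

# Usage: the output is a draft for human review, not a finished artifact.
draft = generate_test_cases(
    story="As a user, I want to reset my password via email so that I can regain access if I forget it.",
    criteria="- Link expires after 24 hours\n- New password: min 8 chars, 1 uppercase, 1 number",
)
print(draft)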

Code Review with AI

Automated Code Review

Use case: Pull request with new tests

# Prompt for GPT-4
"""
Review this Selenium test code for issues:

def test_login():
    driver.get("http://app.com/login")
    driver.find_element(By.ID, "user").send_keys("admin")
    driver.find_element(By.ID, "pass").send_keys("admin123")
    driver.find_element(By.ID, "btn").click()
    time.sleep(5)
    assert "Dashboard" in driver.page_source

Focus on:
- Test reliability (flakiness potential)
- Best practices violations
- Maintainability issues
"""

GPT-4 Review:

Code Review Issues

🔴 Critical Issues

  1. Hard-coded sleep (line 5)
    • Problem: time.sleep(5) causes flaky tests
    • Impact: Test may fail on slow systems or pass on fast ones
    • Fix:
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC

      WebDriverWait(driver, 10).until(
          EC.presence_of_element_located((By.ID, "dashboard"))
      )

  2. Fragile assertion (line 6)
    • Problem: Checking the entire page_source is unreliable
    • Impact: May break with whitespace/formatting changes
    • Fix:
      dashboard_element = driver.find_element(By.ID, "dashboard")
      assert dashboard_element.is_displayed()

🟡 Medium Issues

  1. Hard-coded URL
    • Problem: URL embedded in the test
    • Fix: Use config/environment variables:
      driver.get(f"{config.BASE_URL}/login")

  2. No error handling
    • Missing: What if login fails?
    • Add: Explicit wait for success/error message

  3. Weak selectors
    • By.ID, "btn" is very generic
    • Better: By.CSS_SELECTOR, "[data-testid='login-button']"

Value:

  • Finds issues humans easily miss
  • Suggests specific fixes
  • Teaches best practices
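
To get this kind of review on every pull request, the diff can be piped into a small script from CI. A minimal sketch, assuming the openai v1 Python SDK and an OPENAI_API_KEY in the CI environment; the model name and CI wiring are illustrative, and the output should be treated as advisory comments, not a merge gate:

# llm_review.py - e.g.: git diff origin/main...HEAD -- 'tests/**' | python llm_review.py
import sys
from openai import OpenAI

client = OpenAI()

diff = sys.stdin.read()
if not diff.strip():
    sys.exit(0)  # nothing to review

prompt = f"""Review this test-code diff for flakiness risks, best-practice violations,
and maintainability issues. Be specific and suggest concrete fixes.

{diff}
"""

response = client.chat.completions.create(
    model="gpt-4o",   # any capable model
    temperature=0,    # keep review output stable between runs
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # post as a PR comment from your CI job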

Risks and Limitations

Hallucinations

Problem #1: LLM "invents" non-existent APIs

# Prompt: "Generate Cypress test for file upload"

# GPT might generate:
cy.uploadFile('test.pdf')  // ❌ This method doesn't exist!

# Correct:
cy.get('input[type="file"]').selectFile('test.pdf')  // ✅

Why it’s dangerous:

  • Code looks plausible
  • Junior QA might not notice
  • Waste of time debugging

Mitigation:

  • Always verify generated code
  • Use IDE with autocomplete for validation
  • Code review is mandatory

Outdated Knowledge

Problem #2: Knowledge cutoff date

GPT-4 was trained on data up to roughly April 2023, so it doesn’t know about newer frameworks and library versions.

# Prompt: "Generate Playwright test"

# GPT might use old syntax:
await page.click('#button')  // Deprecated

# New syntax (2024):
await page.locator('#button').click()  // Current best practice

Mitigation:

  • Specify versions in prompt: “Generate Playwright 1.40 test”
  • Verify against current documentation
  • Use plugins with current data (if available)

Security Risks

Problem #3: Sensitive data leakage

# ❌ NEVER do this:
prompt = f"""
Review this code:
{code_with_api_keys}  # Sending secrets to OpenAI!
"""

Best practices:

  • Sanitize code before sending it to an LLM
  • Use a local LLM (e.g., Llama 3) for sensitive code
  • Redact credentials and secrets
# ✅ Correct:
import re

def sanitize_code(code):
    # Remove API keys
    code = re.sub(r'api_key\s*=\s*["\'][^"\']+["\']', 'api_key="REDACTED"', code)
    # Remove passwords
    code = re.sub(r'password\s*=\s*["\'][^"\']+["\']', 'password="REDACTED"', code)
    return code

clean_code = sanitize_code(original_code)
# Now safe to send to LLM

Quality Consistency

Problem #4: Quality varies

Same prompt → different results due to temperature parameter.

# Temperature = 0.0 → Deterministic (same output)
# Temperature = 0.7 → Creative (varied output)
# Temperature = 1.0+ → Chaotic
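
In OpenAI-style APIs this is a single request parameter. A quick sketch (openai v1 Python SDK assumed, model name is a placeholder) that checks how repeatable a prompt is at temperature=0:

from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

prompt = "Generate 3 boundary values for a field that accepts integers 1-100."
first = ask(prompt, temperature=0)
second = ask(prompt, temperature=0)
# At temperature=0 the outputs are usually identical (not strictly guaranteed),
# which is what you want when generated artifacts are committed to the repo.
print(first == second)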

For tests:

  • Use temperature=0 for consistency
  • Verify results multiple times
  • Don’t trust blindly

Over-reliance Danger

Problem #5: “The AI wrote the test, so it must be correct”

# AI generated test
def test_user_registration():
    response = api.register(email="test@test.com", password="pass")
    assert response.status_code == 200  # ❌ Not enough!

What’s missing:

  • Verify user created in DB
  • Email verification sent
  • Password properly hashed
  • No duplicates
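
A stronger version of the same test, with those missing checks added, might look like the sketch below; the api, db, and mailbox fixtures are hypothetical stand-ins for whatever your project provides:

# Sketch of the same test with the missing checks added (fixture names are hypothetical).
def test_user_registration(api, db, mailbox):
    email = "test@test.com"
    response = api.register(email=email, password="Str0ngPass!")
    assert response.status_code == 201  # created, not just "200 OK"

    # User actually persisted
    user = db.find_user(email=email)
    assert user is not None

    # Password stored as a hash, never in plain text
    assert user.password_hash != "Str0ngPass!"

    # Verification email went out
    assert mailbox.has_message_to(email, subject_contains="Verify")

    # Duplicate registration is rejected
    duplicate = api.register(email=email, password="Str0ngPass!")
    assert duplicate.status_code in (400, 409)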

Rule: AI is an assistant, not a replacement for critical thinking.

Best Practices for Using LLMs in QA

1. Effective Prompting

Bad prompt:

Generate tests for login

Good prompt:

Generate Cypress tests for login functionality.

Context:
- App: E-commerce site
- Framework: Cypress 13.x
- Pattern: Page Object Model
- Authentication: JWT tokens

Requirements:
- Cover positive and negative scenarios
- Include edge cases (special chars in password, etc)
- Add proper waits (no hard-coded sleeps)
- Use data-testid selectors
- Add clear comments

Output: Complete test file with imports and fixtures

Result: Significantly better code quality

2. Iterative Refinement

User: Generate API test for user registration

GPT: [generates basic test]

User: Add validation for:
- Email format
- Password strength requirements (8+ chars, 1 uppercase, 1 number)
- Duplicate email handling

GPT: [refines test]

User: Convert to pytest with fixtures and parametrize for multiple test data

GPT: [final version]

Each iteration improves the result.

3. Use LLM as Learning Tool

Prompt: Explain what this test code does, line by line:

[paste complex test]

Then suggest improvements and explain why they're better.

Value: Learning + code review in one

4. Human-in-the-loop

Workflow:
1. LLM generates test → Draft
2. QA reviewer → Adjusts & approves
3. CI/CD runs test → Validates
4. Feedback loop → Improves prompts

Never fully automated — always human review.

Real-world Use Cases

Case 1: Regression Test Suite Generation

Company: E-commerce SaaS (500K LOC)

Challenge: Legacy code without tests, need 80% coverage

Solution:

  1. Extracted list of all API endpoints
  2. For each endpoint → GPT-4 prompt (see the sketch after this list):
    Generate comprehensive API tests for:
    POST /api/orders
    
    [Include Swagger spec]
    
    Cover: CRUD operations, validation, auth, edge cases
    
  3. Generated 2,300 tests in 2 days
  4. Human review + fixes → 1 week
  5. Final: 1,800 working tests (78% auto-generated)
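
A simplified sketch of step 2, assuming an exported Swagger/OpenAPI JSON file and the openai v1 Python SDK; file names, the model, and the output layout are illustrative:

import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()
spec = json.loads(Path("swagger.json").read_text())   # exported Swagger/OpenAPI spec

out_dir = Path("generated_tests")
out_dir.mkdir(exist_ok=True)

for path, operations in spec["paths"].items():
    for method, operation in operations.items():
        if method.lower() not in {"get", "post", "put", "patch", "delete"}:
            continue  # skip path-level keys such as "parameters"
        prompt = f"""Generate comprehensive pytest API tests for:
{method.upper()} {path}

Relevant spec excerpt:
{json.dumps(operation, indent=2)[:4000]}

Cover: CRUD operations, validation, auth, edge cases."""
        response = client.chat.completions.create(
            model="gpt-4o", temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        safe_name = path.strip("/").replace("/", "_").replace("{", "").replace("}", "")
        (out_dir / f"test_{method}_{safe_name}.py").write_text(
            response.choices[0].message.content
        )
        # Every generated file still goes through human review before merging (step 4).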

ROI:

  • Manual writing: ~6 months
  • With GPT-4: 2 weeks
  • Savings: ~$120K

Case 2: Test Data Generation for ML

Company: Fintech (fraud detection ML)

Challenge: Need realistic fraudulent transaction patterns

Solution:

prompt = """
Generate 100 realistic credit card transaction records.
Include 20 fraudulent patterns:
- Card testing (multiple small charges)
- Account takeover (sudden large purchases)
- Geographic anomalies (purchases in 2 countries within hours)
- Unusual merchant categories

Make legitimate transactions realistic too.
Output: CSV
"""

Result:

  • ML model learned to recognize more subtle patterns
  • Precision increased 12%
  • False positives decreased 8%

The Future of LLMs in Testing

1. Specialized QA LLMs:

  • Models trained specifically on QA data
  • Better understanding of test patterns
  • Fewer hallucinations for testing tasks

2. Agentic workflows:

# AI agent autonomously:
1. Analyze requirements
2. Generate tests
3. Run tests
4. Analyze failures
5. Fix flaky tests
6. Report results

# Human only approves/guides

3. Multi-modal testing:

  • LLM + Computer Vision for UI testing
  • “Look at screenshot and tell me what’s broken”

4. Real-time test generation:

# During exploratory testing:
QA action → LLM generates test → Auto-added to suite

# Converts manual testing into automated tests

Conclusion

ChatGPT and LLMs are powerful tools for QA, but not a silver bullet.

Where LLMs are genuinely useful:

✅ Test data generation (90% time saved)

✅ Creating basic test cases (70% faster)

✅ Code review (finds 60-70% of obvious issues)

✅ Documentation generation (95% automation)

✅ Learning & upskilling (endless mentor)

Where LLMs DON’T replace humans:

❌ Critical thinking (edge cases require domain knowledge)

❌ Test strategy (what to test and why)

❌ Bug investigation (root cause analysis)

❌ Context understanding (business specifics)

Golden Rule:

An LLM is a super-smart junior QA: it generates quickly but needs supervision. Don’t trust it blindly. Always verify.

Practical recommendations:

  1. Start small: Use for test data generation
  2. Build prompts library: Save successful prompts
  3. Set up guardrails: Sanitization, review process
  4. Measure impact: Track time saved, quality metrics
  5. Train team: Not everyone knows how to prompt effectively

LLMs in testing are a future that has already arrived. The question isn’t whether to use them, but how to use them effectively and safely.