Introduction
The emergence of ChatGPT and other Large Language Models (LLMs) in 2022-2023 created a new wave of hype around AI in testing. The promises are tempting: AI that generates tests and test data, finds bugs, and writes documentation. But what actually works today, and what is still just marketing?
In this article, we'll explore the practical application of LLMs in QA: where they genuinely help, how to use them effectively, and, critically, where they can cause harm. After a year of actively using GPT-4, Claude, and other models in production QA processes, we have enough data for an honest analysis.
To expand your AI testing knowledge, see our guide on AI-powered test generation tools, explore Claude and GPT-4 integration cases, and learn about AI copilot strategies for test automation.
What are LLMs and Why They Matter for QA
Large Language Models Explained
LLM (Large Language Model) is a neural network trained on massive amounts of text to predict the next word.
Key models in 2025:
- GPT-4 (OpenAI): Most powerful, best reasoning
- Claude 3 (Anthropic): Large context window (200K tokens), good for code
- Gemini Pro (Google): Multimodal, Google Workspace integration
- Llama 3 (Meta): Open-source, can run locally
Why LLMs = Game Changer for QA
1. Natural Language Understanding:
Human: "Create tests for login flow with email validation"
LLM: ✓ Understands context ("login flow")
✓ Knows what email validation is
✓ Can generate code in any language
2. Code Generation:
- Writes tests in Selenium, Cypress, Playwright
- Generates test data
- Creates mock responses
3. Context-aware Analysis:
- Analyzes code and suggests edge cases
- Finds potential bugs in logic
- Generates documentation
But remember: LLMs don't "think"; they statistically predict the next most likely token. Keep that in mind when reading their output.
Test Data Generation
Realistic Test Data Generation
Problem: Creating realistic test data manually takes hours.
Solution with LLM:
# Prompt for ChatGPT
"""
Generate 50 realistic test users for an e-commerce platform.
Include: name, email, age, country, payment_method, loyalty_tier.
Make data diverse (different countries, age groups).
Output as JSON array.
"""
# GPT-4 response:
[
{
"name": "Maria Garcia",
"email": "m.garcia82@gmail.com",
"age": 34,
"country": "Spain",
"payment_method": "credit_card",
"loyalty_tier": "gold"
},
{
"name": "Kenji Tanaka",
"email": "kenji.t@yahoo.co.jp",
"age": 28,
"country": "Japan",
"payment_method": "paypal",
"loyalty_tier": "silver"
},
// ... 48 more realistic users
]
Benefits:
- Data is diverse and realistic
- Considers cultural specifics (Japanese names, Spanish surnames)
- Generation in seconds instead of hours
Advanced: Contextual test data
# Using LangChain for programmatic generation
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4", temperature=0.7)

prompt = PromptTemplate(
    input_variables=["scenario", "count"],
    template="""
Generate {count} test data entries for scenario: {scenario}
Requirements:
- Data must be realistic and diverse
- Include edge cases (empty strings, special chars, very long values)
- Cover positive and negative test cases
- Output as JSON
"""
)

# Generate data for a specific scenario
test_data = llm.predict(
    prompt.format(
        scenario="User registration with international phone numbers",
        count=30
    )
)
Result:
[
{"phone": "+1-555-0123", "country": "US", "valid": true},
{"phone": "+44 20 7946 0958", "country": "UK", "valid": true},
{"phone": "+81-3-1234-5678", "country": "Japan", "valid": true},
{"phone": "not-a-phone", "country": "US", "valid": false},
{"phone": "", "country": "US", "valid": false},
{"phone": "+1" + "5"*100, "country": "US", "valid": false}, // Edge case
// ...
]
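Once reviewed, data like this can feed straight into a parametrized test. Below is a minimal sketch, assuming the JSON above was saved to a file named phone_test_data.json and that your app exposes a validate_phone function; both the file name and the import path are hypothetical placeholders:

```python
import json
import pytest

# Hypothetical validator under test; replace with your app's real function
from myapp.validation import validate_phone

# Load the LLM-generated dataset saved from the previous step
with open("phone_test_data.json") as f:
    PHONE_CASES = json.load(f)

@pytest.mark.parametrize(
    "case",
    PHONE_CASES,
    ids=lambda c: f"{c['country']}-{c['phone'][:15] or 'empty'}",
)
def test_phone_validation(case):
    # The 'valid' flag generated by the LLM is a draft expectation,
    # not ground truth; review it manually before trusting it
    assert validate_phone(case["phone"], case["country"]) == case["valid"]
```

The key point: the LLM provides the raw material quickly, but a human still confirms the expected outcomes before the test suite relies on them.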
Domain-specific Data
E-commerce product data:
Prompt: Generate 20 product listings for outdoor gear store.
Include: name, description, price, category, stock, ratings.
Make descriptions SEO-friendly and realistic.
Financial test data:
"""
Generate 15 bank transaction records for fraud detection testing.
Include:
- 10 legitimate transactions
- 5 suspicious transactions (unusual amounts, locations, patterns)
Mark each with 'is_fraud' flag and reason.
"""
The LLM understands what a "suspicious pattern" means and generates, for example:
- $10,000 transaction at 3am from new location
- Multiple small consecutive transactions (card testing)
- Purchases in two different countries in short time
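A minimal sketch of running such a prompt programmatically, assuming the openai Python SDK (v1.x) and an OPENAI_API_KEY in the environment; the output file name is arbitrary, and the raw CSV should always be reviewed before it is used in tests:

```python
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = """
Generate 15 bank transaction records for fraud detection testing.
Include 10 legitimate transactions and 5 suspicious transactions
(unusual amounts, locations, patterns).
Mark each with an 'is_fraud' flag and a 'reason' field.
Output as CSV with a header row, no commentary.
"""

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.7,
    messages=[{"role": "user", "content": prompt}],
)

# Save the raw CSV for manual review before it enters any test suite
with open("fraud_test_data.csv", "w") as f:
    f.write(response.choices[0].message.content)
```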
Creating Test Cases with AI
From Requirements to Tests
Traditional process:
- Read requirements (30 min)
- Think up test cases (1-2 hours)
- Write tests (2-3 hours)
With LLM:
Prompt:
---
User Story:
"As a user, I want to reset my password via email so that I can regain access if I forget it."
Acceptance Criteria:
- User enters email address
- System sends reset link to email
- Link expires after 24 hours
- User sets new password (min 8 chars, 1 uppercase, 1 number)
- Old password no longer works
Generate comprehensive test cases covering positive, negative, and edge cases.
---
GPT-4 Output:
Test Cases:
POSITIVE SCENARIOS:
TC1: Valid password reset flow
Steps:
1. Navigate to "Forgot Password" page
2. Enter valid registered email
3. Click "Send Reset Link"
4. Check email inbox
5. Click reset link
6. Enter new valid password: "NewPass123"
7. Confirm password
8. Submit
Expected: Password reset successful, can login with new password
TC2: Reset link works within 24h window
Steps: [Similar to TC1, but verify link works at 23:59 mark]
Expected: Link still functional
NEGATIVE SCENARIOS:
TC3: Unregistered email
Steps: Enter email not in system
Expected: Generic message "If email exists, reset link sent" (security)
TC4: Expired reset link (>24h)
Steps: Use reset link after 24 hour period
Expected: Error "Link expired. Request new reset link"
TC5: Weak password validation
Test data:
- "pass" (too short)
- "password" (no uppercase/number)
- "PASSWORD" (no lowercase/number)
- "12345678" (no letters)
Expected: Error message for each
EDGE CASES:
TC6: Multiple reset requests
Steps: Request reset 5 times for same email
Expected: Only latest link works, previous invalidated
TC7: Reset link reuse attempt
Steps: Use same link twice
Expected: Second attempt fails "Link already used"
SECURITY:
TC9: Brute force protection
Steps: Attempt 10 password resets in 1 minute
Expected: Rate limiting applied
Value proposition:
- 10 comprehensive test cases in 30 seconds
- Covers scenarios that are easy to miss
- Can be used as a base and expanded
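As a next step, generated cases can be turned into executable tests. Here's a hedged sketch converting TC5's password data into a parametrized pytest; reset_password and the reset_token fixture are hypothetical stand-ins for your project's actual API client and fixtures:

```python
import pytest

# Hypothetical API client; adapt to your project's helpers and fixtures
from myapp.client import reset_password

WEAK_PASSWORDS = [
    ("pass", "too short"),
    ("password", "no uppercase/number"),
    ("PASSWORD", "no lowercase/number"),
    ("12345678", "no letters"),
]

@pytest.mark.parametrize("password,reason", WEAK_PASSWORDS)
def test_weak_password_rejected(reset_token, password, reason):
    # reset_token is assumed to be a fixture providing a valid reset link token
    response = reset_password(token=reset_token, new_password=password)
    assert response.status_code == 400, f"Expected rejection: {reason}"
```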
Code Review with AI
Automated Code Review
Use case: Pull request with new tests
# Prompt for GPT-4
"""
Review this Selenium test code for issues:

def test_login():
    driver.get("http://app.com/login")
    driver.find_element(By.ID, "user").send_keys("admin")
    driver.find_element(By.ID, "pass").send_keys("admin123")
    driver.find_element(By.ID, "btn").click()
    time.sleep(5)
    assert "Dashboard" in driver.page_source

Focus on:
- Test reliability (flakiness potential)
- Best practices violations
- Maintainability issues
"""
GPT-4 Review:

Code Review Issues

🔴 Critical Issues

1. Hard-coded sleep (line 5)
   - Problem: `time.sleep(5)` causes flaky tests
   - Impact: Test may fail on slow systems or pass on fast ones
   - Fix:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     WebDriverWait(driver, 10).until(
         EC.presence_of_element_located((By.ID, "dashboard"))
     )

2. Fragile assertion (line 6)
   - Problem: Checking the entire page_source is unreliable
   - Impact: May break with whitespace/formatting changes
   - Fix:

     dashboard_element = driver.find_element(By.ID, "dashboard")
     assert dashboard_element.is_displayed()

🟡 Medium Issues

3. Hard-coded URL
   - Problem: URL embedded in the test
   - Fix: Use config/environment variables, e.g. driver.get(f"{config.BASE_URL}/login")

4. No error handling
   - Missing: What if login fails?
   - Add: Explicit wait for success/error message

5. Weak selectors
   - Problem: By.ID, "btn" is very generic
   - Better: By.CSS_SELECTOR, "[data-testid='login-button']"
Value:
- Finds issues humans easily miss
- Suggests specific fixes
- Teaches best practices
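To make this repeatable, the same review prompt can be scripted against a pull request's diff. A minimal sketch, assuming the openai Python SDK (v1.x), git available on the runner, and test files living under tests/ (adjust paths, prompt, and model to your setup):

```python
import subprocess
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()

REVIEW_PROMPT = """Review this test code diff for flakiness potential,
best-practice violations, and maintainability issues.
Report issues as a markdown list with suggested fixes.

{diff}
"""

def review_current_branch(base: str = "main") -> str:
    # Collect only test files changed against the base branch
    diff = subprocess.run(
        ["git", "diff", base, "--", "tests/"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not diff:
        return "No test changes to review."
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(review_current_branch())
```

Posting the result as a PR comment is left to your CI tooling; the benefit is that the prompt and model settings live in version control instead of someone's chat history.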
Risks and Limitations
Hallucinations
Problem #1: LLM "invents" non-existent APIs
// Prompt: "Generate Cypress test for file upload"
// GPT might generate:
cy.uploadFile('test.pdf') // ❌ This method doesn't exist!
// Correct:
cy.get('input[type="file"]').selectFile('test.pdf') // ✅
Why it’s dangerous:
- Code looks plausible
- Junior QA might not notice
- Waste of time debugging
Mitigation:
- Always verify generated code
- Use IDE with autocomplete for validation
- Code review is mandatory
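One cheap guardrail that complements human review is running generated Python test code through a quick static check before anyone spends time on it. A minimal sketch using the standard-library ast module; it only catches syntax errors and one known flakiness pattern, not hallucinated APIs:

```python
import ast

def basic_sanity_check(generated_code: str) -> list[str]:
    """Cheap first-pass checks on LLM-generated Python test code.

    Catches syntax errors and an obviously suspicious pattern;
    it does NOT prove the code is correct or that the APIs exist.
    """
    problems = []
    try:
        tree = ast.parse(generated_code)
    except SyntaxError as exc:
        return [f"Syntax error: {exc}"]

    for node in ast.walk(tree):
        # Hard-coded sleeps are a common flakiness source in generated tests
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "sleep"):
            problems.append(f"time.sleep call at line {node.lineno}")
    return problems
```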
Outdated Knowledge
Problem #2: Knowledge cutoff date
GPT-4 was trained on data only up to its knowledge cutoff (around April 2023 for the original release), so it doesn't know newer frameworks or library versions.
// Prompt: "Generate Playwright test"
// GPT might use old syntax:
await page.click('#button') // Deprecated
// New syntax (2024):
await page.locator('#button').click() // Current best practice
Mitigation:
- Specify versions in prompt: “Generate Playwright 1.40 test”
- Verify against current documentation
- Use plugins with current data (if available)
Security Risks
Problem #3: Sensitive data leakage
# ❌ NEVER do this:
prompt = f"""
Review this code:
{code_with_api_keys} # Sending secrets to OpenAI!
"""
Best practices:
- Sanitize code before sending to LLM
- Use local LLM for sensitive code (Llama 3)
- Redact credentials/secrets
# ✅ Correct:
import re

def sanitize_code(code):
    # Remove API keys
    code = re.sub(r'api_key\s*=\s*["\'][^"\']+["\']', 'api_key="REDACTED"', code)
    # Remove passwords
    code = re.sub(r'password\s*=\s*["\'][^"\']+["\']', 'password="REDACTED"', code)
    return code

clean_code = sanitize_code(original_code)
# Now safe to send to LLM
Quality Consistency
Problem #4: Quality varies
Same prompt → different results due to temperature parameter.
# Temperature = 0.0 → Deterministic (same output)
# Temperature = 0.7 → Creative (varied output)
# Temperature = 1.0+ → Chaotic
For tests:
- Use temperature=0 for consistency
- Verify results multiple times
- Don’t trust blindly
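A small sketch of what "use temperature=0 and verify" can look like in practice, again assuming the openai Python SDK (v1.x); note that even at temperature 0, outputs are not guaranteed to be byte-identical across runs:

```python
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()

def generate(prompt: str, temperature: float = 0.0) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,  # 0.0 keeps output as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Sanity check: run the same prompt twice and compare the results
prompt = "Generate a pytest test for GET /api/health returning 200"
first, second = generate(prompt), generate(prompt)
if first != second:
    print("Warning: outputs differ even at temperature=0; review both versions")
```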
Over-reliance Danger
Problem #5: “AI wrote test, so it’s correct”
# AI generated test
def test_user_registration():
    response = api.register(email="test@test.com", password="pass")
    assert response.status_code == 200  # ❌ Not enough!
What’s missing:
- Verify user created in DB
- Email verification sent
- Password properly hashed
- No duplicates
Rule: AI is an assistant, not a replacement for critical thinking.
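For illustration, here's a hedged sketch of what a more complete registration test might look like; api, db, and mail_stub are hypothetical fixtures standing in for your project's real API client, database access, and mail capture:

```python
def test_user_registration_complete(api, db, mail_stub):
    # api, db, and mail_stub are assumed fixtures; adapt to your project
    email = "test@test.com"
    response = api.register(email=email, password="Str0ngPass!")
    assert response.status_code == 200

    # Verify the user actually exists in the database
    user = db.get_user_by_email(email)
    assert user is not None

    # Password must be hashed, never stored in plain text
    assert user.password_hash != "Str0ngPass!"

    # A verification email should have been sent
    assert mail_stub.sent_to(email)

    # Registering the same email again must be rejected
    duplicate = api.register(email=email, password="Str0ngPass!")
    assert duplicate.status_code == 409
```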
Best Practices for Using LLMs in QA
1. Effective Prompting
Bad prompt:
Generate tests for login
Good prompt:
Generate Cypress tests for login functionality.
Context:
- App: E-commerce site
- Framework: Cypress 13.x
- Pattern: Page Object Model
- Authentication: JWT tokens
Requirements:
- Cover positive and negative scenarios
- Include edge cases (special chars in password, etc)
- Add proper waits (no hard-coded sleeps)
- Use data-testid selectors
- Add clear comments
Output: Complete test file with imports and fixtures
Result: Significantly better code quality
2. Iterative Refinement
User: Generate API test for user registration
GPT: [generates basic test]
User: Add validation for:
- Email format
- Password strength requirements (8+ chars, 1 uppercase, 1 number)
- Duplicate email handling
GPT: [refines test]
User: Convert to pytest with fixtures and parametrize for multiple test data
GPT: [final version]
Each iteration improves the result.
3. Use LLM as Learning Tool
Prompt: Explain what this test code does, line by line:
[paste complex test]
Then suggest improvements and explain why they're better.
Value: Learning + code review in one
4. Human-in-the-loop
Workflow:
1. LLM generates test → Draft
2. QA reviewer → Adjusts & approves
3. CI/CD runs test → Validates
4. Feedback loop → Improves prompts
Never fully automated — always human review.
Real-world Use Cases
Case 1: Regression Test Suite Generation
Company: E-commerce SaaS (500K LOC)
Challenge: Legacy code without tests, need 80% coverage
Solution:
- Extracted list of all API endpoints
- For each endpoint → GPT-4 prompt:
  Generate comprehensive API tests for: POST /api/orders
  [Include Swagger spec]
  Cover: CRUD operations, validation, auth, edge cases
- Generated 2,300 tests in 2 days
- Human review + fixes → 1 week
- Final: 1,800 working tests (78% auto-generated)
ROI:
- Manual writing: ~6 months
- With GPT-4: 2 weeks
- Savings: ~$120K
Case 2: Test Data Generation for ML
Company: Fintech (fraud detection ML)
Challenge: Need realistic fraudulent transaction patterns
Solution:
prompt = """
Generate 100 realistic credit card transaction records.
Include 20 fraudulent patterns:
- Card testing (multiple small charges)
- Account takeover (sudden large purchases)
- Geographic anomalies (purchases in 2 countries within hours)
- Unusual merchant categories
Make legitimate transactions realistic too.
Output: CSV
"""
Result:
- ML model learned to recognize more subtle patterns
- Precision increased 12%
- False positives decreased 8%
The Future of LLMs in Testing
Trends 2025-2027
1. Specialized QA LLMs:
- Models trained specifically on QA data
- Better understanding of test patterns
- Fewer hallucinations for testing tasks
2. Agentic workflows:
# AI agent autonomously:
1. Analyze requirements
2. Generate tests
3. Run tests
4. Analyze failures
5. Fix flaky tests
6. Report results
# Human only approves/guides
3. Multi-modal testing:
- LLM + Computer Vision for UI testing
- “Look at screenshot and tell me what’s broken”
4. Real-time test generation:
# During exploratory testing:
QA action → LLM generates test → Auto-adds to suite
# Converts manual testing to automated
Conclusion
ChatGPT and LLMs are powerful tools for QA, but not a silver bullet.
Where LLMs are genuinely useful:
✅ Test data generation (90% time saved)
✅ Creating basic test cases (70% faster)
✅ Code review (finds 60-70% of obvious issues)
✅ Documentation generation (95% automation)
✅ Learning & upskilling (endless mentor)
Where LLMs DON’T replace humans:
❌ Critical thinking (edge cases require domain knowledge)
❌ Test strategy (what to test and why)
❌ Bug investigation (root cause analysis)
❌ Context understanding (business specifics)
Golden Rule:
An LLM is a super-smart junior QA: it generates quickly but requires supervision. Don't trust it blindly. Always verify.
Practical recommendations:
- Start small: Use for test data generation
- Build prompts library: Save successful prompts
- Set up guardrails: Sanitization, review process
- Measure impact: Track time saved, quality metrics
- Train team: Not everyone knows how to prompt effectively
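A prompts library doesn't need to be fancy. Here's a minimal sketch of the idea: templates kept as plain Python strings and versioned in the repo alongside the tests (names and wording are just examples):

```python
# Minimal prompt library sketch: plain templates, versioned with the test code.
TEST_DATA_PROMPT = """\
Generate {count} test data entries for: {scenario}.
Include edge cases and both positive and negative cases.
Output as a JSON array, no commentary."""

API_TEST_PROMPT = """\
Generate {framework} API tests for {endpoint}.
Cover CRUD operations, validation, auth, and edge cases.
Use explicit waits, no hard-coded sleeps."""

def render(template: str, **params) -> str:
    """Fill a template; raises KeyError if a placeholder is missing."""
    return template.format(**params)

# Usage
prompt = render(API_TEST_PROMPT,
                framework="pytest + requests",
                endpoint="POST /api/orders")
```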
LLMs in testing are a future that has already arrived. The question isn't "whether to use them", but "how to use them effectively and safely".
See Also
- AI-powered Test Generation: The Future Is Already Here - Review of Testim, Applitools, and Functionize
- Test Automation with Claude and GPT-4 - Real integration cases and practical implementation
- AI Copilot for Test Automation - GitHub Copilot and CodeWhisperer productivity gains
- AI Test Data Generation - Synthetic data creation strategies
- AI Security Testing - Finding vulnerabilities with AI-powered tools