Introduction
The emergence of ChatGPT and other Large Language Models (LLMs) in 2022-2023 created a new wave of hype around AI in testing. The promises are tempting: AI generates tests and test data, finds bugs, and writes documentation (as discussed in AI Copilot for Test Automation: GitHub Copilot, Amazon CodeWhisperer and the Future of QA). But what actually works today, and what remains marketing (as discussed in Prompt Engineering for QA: Mastering Effective AI Queries)?
In this article, we’ll explore the practical application of LLMs in QA (as discussed in Test Automation with Claude and GPT-4: Real Integration Cases and Practical Implementation): where they genuinely help, how to use them effectively, and — critically important — where they can cause harm. After a year of actively using GPT-4, Claude, and other models in production QA processes, we have enough data for an honest analysis.
What are LLMs and Why They Matter for QA
Large Language Models Explained
An LLM (Large Language Model) is a neural network trained on massive amounts of text to predict the next token in a sequence.
Key models in 2025:
- GPT-4 (OpenAI): Most powerful, best reasoning
- Claude 3 (Anthropic): Large context window (200K tokens), good for code
- Gemini Pro (Google): Multimodal, Google Workspace integration
- Llama 3 (Meta): Open-source, can run locally
Why LLMs = Game Changer for QA
1. Natural Language Understanding:
Human: "Create tests for login flow with email validation"
LLM: ✓ Understands context ("login flow")
✓ Knows what email validation is
✓ Can generate code in any language
2. Code Generation:
- Writes tests in Selenium, Cypress, Playwright
- Generates test data
- Creates mock responses
3. Context-aware Analysis:
- Analyzes code and suggests edge cases
- Finds potential bugs in logic
- Generates documentation
But keep in mind: LLMs don’t “think”, they statistically predict the most likely next token. This matters for everything that follows.
Test Data Generation
Realistic Test Data Generation
Problem: Creating realistic test data manually takes hours.
Solution with LLM:
# Prompt for ChatGPT
"""
Generate 50 realistic test users for an e-commerce platform.
Include: name, email, age, country, payment_method, loyalty_tier.
Make data diverse (different countries, age groups).
Output as JSON array.
"""
# GPT-4 response:
[
{
"name": "Maria Garcia",
"email": "m.garcia82@gmail.com",
"age": 34,
"country": "Spain",
"payment_method": "credit_card",
"loyalty_tier": "gold"
},
{
"name": "Kenji Tanaka",
"email": "kenji.t@yahoo.co.jp",
"age": 28,
"country": "Japan",
"payment_method": "paypal",
"loyalty_tier": "silver"
},
// ... 48 more realistic users
]
Benefits:
- Data is diverse and realistic
- Considers cultural specifics (Japanese names, Spanish surnames)
- Generation in seconds instead of hours
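The same prompt works programmatically when you need data generation in a pipeline rather than a chat window. A minimal sketch, assuming the official `openai` Python package (v1.x), an `OPENAI_API_KEY` in the environment, and a model your account can access:
```python
# Sketch: generate test users via the OpenAI API (package, key, and model name are assumptions).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """
Generate 50 realistic test users for an e-commerce platform.
Include: name, email, age, country, payment_method, loyalty_tier.
Make data diverse (different countries, age groups).
Output as a JSON array only, no extra commentary.
"""

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.7,
    messages=[{"role": "user", "content": PROMPT}],
)

users = json.loads(response.choices[0].message.content)  # fails fast if output isn't valid JSON
print(len(users), users[0]["email"])
```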
Advanced: Contextual test data
# Using LangChain for programmatic generation
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4", temperature=0.7)

prompt = PromptTemplate(
    input_variables=["scenario", "count"],
    template="""
Generate {count} test data entries for scenario: {scenario}
Requirements:
- Data must be realistic and diverse
- Include edge cases (empty strings, special chars, very long values)
- Cover positive and negative test cases
- Output as JSON
""",
)

# Generate data for a specific scenario
test_data = llm.predict(
    prompt.format(
        scenario="User registration with international phone numbers",
        count=30,
    )
)
Result:
[
{"phone": "+1-555-0123", "country": "US", "valid": true},
{"phone": "+44 20 7946 0958", "country": "UK", "valid": true},
{"phone": "+81-3-1234-5678", "country": "Japan", "valid": true},
{"phone": "not-a-phone", "country": "US", "valid": false},
{"phone": "", "country": "US", "valid": false},
{"phone": "+1" + "5"*100, "country": "US", "valid": false}, // Edge case
// ...
]
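Data like this plugs straight into parametrized tests. A sketch under the assumption that the generated entries were saved to `phone_numbers.json` and that the application exposes a `validate_phone` function (both names are hypothetical):
```python
# Sketch: drive a parametrized test from LLM-generated data.
# phone_numbers.json and validate_phone are hypothetical placeholders.
import json
import pytest

from myapp.validation import validate_phone  # hypothetical function under test

with open("phone_numbers.json") as f:
    PHONE_CASES = json.load(f)

@pytest.mark.parametrize("case", PHONE_CASES, ids=lambda c: c["phone"] or "<empty>")
def test_phone_validation(case):
    assert validate_phone(case["phone"], country=case["country"]) == case["valid"]
```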
Domain-specific Data
E-commerce product data:
Prompt: Generate 20 product listings for outdoor gear store.
Include: name, description, price, category, stock, ratings.
Make descriptions SEO-friendly and realistic.
Financial test data:
"""
Generate 15 bank transaction records for fraud detection testing.
Include:
- 10 legitimate transactions
- 5 suspicious transactions (unusual amounts, locations, patterns)
Mark each with 'is_fraud' flag and reason.
"""
LLM understands what “suspicious pattern” means and generates:
- $10,000 transaction at 3am from new location
- Multiple small consecutive transactions (card testing)
- Purchases in two different countries in short time
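Before trusting such a dataset, it is worth a quick sanity check that the model actually delivered what the prompt asked for. A small sketch, assuming the output was saved as `transactions.csv` with the fields requested above:
```python
# Sketch: sanity-check an LLM-generated transactions dataset before using it in tests.
# File name and column values follow the prompt above and are assumptions.
import csv

with open("transactions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

fraud = [r for r in rows if r["is_fraud"].strip().lower() == "true"]

assert len(rows) == 15, f"expected 15 records, got {len(rows)}"
assert len(fraud) == 5, f"expected 5 suspicious records, got {len(fraud)}"
assert all(r.get("reason") for r in fraud), "every suspicious record needs a reason"
```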
Creating Test Cases with AI
From Requirements to Tests
Traditional process:
- Read requirements (30 min)
- Think up test cases (1-2 hours)
- Write tests (2-3 hours)
With LLM:
Prompt:
---
User Story:
"As a user, I want to reset my password via email so that I can regain access if I forget it."
Acceptance Criteria:
- User enters email address
- System sends reset link to email
- Link expires after 24 hours
- User sets new password (min 8 chars, 1 uppercase, 1 number)
- Old password no longer works
Generate comprehensive test cases covering positive, negative, and edge cases.
---
GPT-4 Output:
Test Cases:
POSITIVE SCENARIOS:
TC1: Valid password reset flow
Steps:
1. Navigate to "Forgot Password" page
2. Enter valid registered email
3. Click "Send Reset Link"
4. Check email inbox
5. Click reset link
6. Enter new valid password: "NewPass123"
7. Confirm password
8. Submit
Expected: Password reset successful, can login with new password
TC2: Reset link works within 24h window
Steps: [Similar to TC1, but verify link works at 23:59 mark]
Expected: Link still functional
NEGATIVE SCENARIOS:
TC3: Unregistered email
Steps: Enter email not in system
Expected: Generic message "If email exists, reset link sent" (security)
TC4: Expired reset link (>24h)
Steps: Use reset link after 24 hour period
Expected: Error "Link expired. Request new reset link"
TC5: Weak password validation
Test data:
- "pass" (too short)
- "password" (no uppercase/number)
- "PASSWORD" (no lowercase/number)
- "12345678" (no letters)
Expected: Error message for each
EDGE CASES:
TC6: Multiple reset requests
Steps: Request reset 5 times for same email
Expected: Only latest link works, previous invalidated
TC7: Reset link reuse attempt
Steps: Use same link twice
Expected: Second attempt fails "Link already used"
SECURITY:
TC9: Brute force protection
Steps: Attempt 10 password resets in 1 minute
Expected: Rate limiting applied
Value proposition:
- 10 comprehensive test cases in 30 seconds
- Covers scenarios easy to miss
- Can use as base and expand
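Expanding usually means turning the cases into automation. A sketch of TC1 as an API-level pytest test; the endpoints, `BASE_URL`, and the `test_user` / `mail_sandbox` fixtures are hypothetical placeholders for whatever the application under test provides:
```python
# Sketch: TC1 (valid password reset flow) as an API-level test.
# BASE_URL, endpoints, and the test_user / mail_sandbox fixtures are hypothetical.
import requests

BASE_URL = "https://staging.example.com/api"

def test_valid_password_reset_flow(test_user, mail_sandbox):
    # Steps 1-3: request a reset link for a registered email
    resp = requests.post(f"{BASE_URL}/password-reset", json={"email": test_user.email})
    assert resp.status_code == 200

    # Steps 4-5: pull the reset token out of the intercepted email
    token = mail_sandbox.latest_reset_token(test_user.email)

    # Steps 6-8: set a new password that satisfies the policy
    resp = requests.post(
        f"{BASE_URL}/password-reset/confirm",
        json={"token": token, "password": "NewPass123"},
    )
    assert resp.status_code == 200

    # Expected: new password works, old one is rejected
    def login(password):
        return requests.post(f"{BASE_URL}/login",
                             json={"email": test_user.email, "password": password})

    assert login("NewPass123").status_code == 200
    assert login(test_user.old_password).status_code == 401
```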
Code Review with AI
Automated Code Review
Use case: Pull request with new tests
# Prompt for GPT-4
"""
Review this Selenium test code for issues:

    def test_login():
        driver.get("http://app.com/login")
        driver.find_element(By.ID, "user").send_keys("admin")
        driver.find_element(By.ID, "pass").send_keys("admin123")
        driver.find_element(By.ID, "btn").click()
        time.sleep(5)
        assert "Dashboard" in driver.page_source

Focus on:
- Test reliability (flakiness potential)
- Best practices violations
- Maintainability issues
"""
GPT-4 Review:
```markdown
## Code Review Issues

### 🔴 Critical Issues

1. **Hard-coded sleep (line 5)**
   - Problem: `time.sleep(5)` causes flaky tests
   - Impact: Test may fail on slow systems or pass on fast ones
   - Fix:

         from selenium.webdriver.support.ui import WebDriverWait
         from selenium.webdriver.support import expected_conditions as EC

         WebDriverWait(driver, 10).until(
             EC.presence_of_element_located((By.ID, "dashboard"))
         )

2. **Fragile assertion (line 6)**
   - Problem: Checking the entire page_source is unreliable
   - Impact: May break with whitespace/formatting changes
   - Fix:

         dashboard_element = driver.find_element(By.ID, "dashboard")
         assert dashboard_element.is_displayed()

### 🟡 Medium Issues

3. **Hard-coded URL**
   - Problem: URL embedded in the test
   - Fix: Use config/environment variables, e.g. `driver.get(f"{config.BASE_URL}/login")`

4. **No error handling**
   - Missing: What if login fails?
   - Add: Explicit wait for success/error message

5. **Weak selectors**
   - Problem: `By.ID, "btn"` is very generic
   - Better: `By.CSS_SELECTOR, "[data-testid='login-button']"`
```
Value:
- Finds issues humans easily miss
- Suggests specific fixes
- Teaches best practices
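The same review can run automatically on every pull request. A sketch of a CI step that sends the test diff to GPT-4 and prints the answer; it assumes the `openai` package, an `OPENAI_API_KEY` secret, and that your CI job writes the diff to a file:
```python
# Sketch: first-pass AI review of a test diff in CI.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment; model name is an assumption.
import sys
from openai import OpenAI

client = OpenAI()

REVIEW_PROMPT = (
    "Review this test code diff. Focus on flakiness potential, best-practice violations, "
    "and maintainability issues. Be specific and suggest concrete fixes:\n\n"
)

def review_diff(diff_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": REVIEW_PROMPT + diff_text}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Usage: python ai_review.py pr.diff  (diff file produced by the CI job)
    with open(sys.argv[1]) as f:
        print(review_diff(f.read()))
```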
Risks and Limitations
Hallucinations
Problem #1: LLM "invents" non-existent APIs
```js
// Prompt: "Generate Cypress test for file upload"
// GPT might generate:
cy.uploadFile('test.pdf')                             // ❌ This method doesn't exist!

// Correct:
cy.get('input[type="file"]').selectFile('test.pdf')   // ✅
```
Why it’s dangerous:
- Code looks plausible
- Junior QA might not notice
- Waste of time debugging
Mitigation:
- Always verify generated code
- Use IDE with autocomplete for validation
- Code review is mandatory
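Part of that verification can be mechanical. A guardrail sketch for generated Selenium tests: it checks that the code at least parses and flags `driver.<method>` calls that Selenium's `WebDriver` class does not define. It won't catch every hallucination, but it catches the cheap ones:
```python
# Sketch: cheap guardrails for LLM-generated Selenium test code.
# Catches syntax errors and driver.<method> calls that WebDriver doesn't define.
import ast
from selenium.webdriver.remote.webdriver import WebDriver

def syntax_ok(generated_code: str) -> bool:
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError as err:
        print(f"Generated code does not parse: {err}")
        return False

def unknown_driver_calls(generated_code: str) -> list[str]:
    known = set(dir(WebDriver))
    tree = ast.parse(generated_code)
    return [
        node.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute)
        and isinstance(node.value, ast.Name)
        and node.value.id == "driver"
        and node.attr not in known
    ]

# Example: driver.click_button() would be flagged, driver.find_element() would not.
```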
Outdated Knowledge
Problem #2: Knowledge cutoff date
GPT-4 was trained on data up to April 2023; it simply doesn't know newer frameworks, libraries, or API changes.
# Prompt: "Generate Playwright test"
# GPT might use old syntax:
await page.click('#button') // Deprecated
# New syntax (2024):
await page.locator('#button').click() // Current best practice
Mitigation:
- Specify versions in prompt: “Generate Playwright 1.40 test”
- Verify against current documentation
- Use plugins with current data (if available)
Security Risks
Problem #3: Sensitive data leakage
# ❌ NEVER do this:
prompt = f"""
Review this code:
{code_with_api_keys} # Sending secrets to OpenAI!
"""
Best practices:
- Sanitize code before sending to LLM
- Use local LLM for sensitive code (Llama 3)
- Redact credentials/secrets
# ✅ Correct:
import re

def sanitize_code(code):
    # Remove API keys
    code = re.sub(r'api_key\s*=\s*["\'][^"\']+["\']', 'api_key="REDACTED"', code)
    # Remove passwords
    code = re.sub(r'password\s*=\s*["\'][^"\']+["\']', 'password="REDACTED"', code)
    return code

clean_code = sanitize_code(original_code)
# Now safe to send to LLM
Quality Consistency
Problem #4: Quality varies
Same prompt → different results due to temperature parameter.
# Temperature = 0.0 → Deterministic (same output)
# Temperature = 0.7 → Creative (varied output)
# Temperature = 1.0+ → Chaotic
For tests:
- Use temperature=0 for consistency
- Verify results multiple times
- Don’t trust blindly
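In practice that means pinning the temperature in whatever client you use; some APIs also expose a `seed` parameter for extra repeatability. A minimal sketch with the `openai` client (model name is an assumption):
```python
# Sketch: pin temperature to 0 when generating test artifacts for repeatability.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",   # model name is an assumption
    temperature=0,   # same prompt -> much more stable output (still not a hard guarantee)
    messages=[{"role": "user", "content": "Generate boundary values for an age field (0-120)."}],
)
print(response.choices[0].message.content)
```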
Over-reliance Danger
Problem #5: “The AI wrote the test, so it must be correct”
# AI-generated test
def test_user_registration():
    response = api.register(email="test@test.com", password="pass")
    assert response.status_code == 200  # ❌ Not enough!
What’s missing:
- Verify user created in DB
- Email verification sent
- Password properly hashed
- No duplicates
Rule: AI is an assistant, not a replacement for critical thinking.
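As a sketch, here is the same test with the missing checks added; the `api`, `db`, and `mail` fixtures are hypothetical stand-ins for your project's own helpers:
```python
# Sketch: the AI-generated test hardened with the missing checks.
# `api`, `db`, and `mail` are hypothetical fixtures/clients specific to your project.
def test_user_registration(api, db, mail):
    response = api.register(email="test@test.com", password="Str0ngPass!")
    assert response.status_code == 200

    user = db.get_user(email="test@test.com")
    assert user is not None                                        # user actually created in the DB
    assert user.password_hash != "Str0ngPass!"                     # password stored hashed, not plain text
    assert mail.was_sent(to="test@test.com", kind="verification")  # verification email went out

    # Registering the same email again must not create a duplicate
    duplicate = api.register(email="test@test.com", password="Str0ngPass!")
    assert duplicate.status_code in (400, 409)
    assert db.count_users(email="test@test.com") == 1
```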
Best Practices for Using LLMs in QA
1. Effective Prompting
Bad prompt:
Generate tests for login
Good prompt:
Generate Cypress tests for login functionality.
Context:
- App: E-commerce site
- Framework: Cypress 13.x
- Pattern: Page Object Model
- Authentication: JWT tokens
Requirements:
- Cover positive and negative scenarios
- Include edge cases (special chars in password, etc)
- Add proper waits (no hard-coded sleeps)
- Use data-testid selectors
- Add clear comments
Output: Complete test file with imports and fixtures
Result: Significantly better code quality
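Prompts this structured are worth templating rather than retyping. A small sketch of a prompt builder that could seed a team prompts library (function and field names are arbitrary):
```python
# Sketch: a reusable prompt builder, a starting point for a team prompts library.
def build_test_prompt(feature: str, framework: str, context: dict, requirements: list[str]) -> str:
    ctx = "\n".join(f"- {key}: {value}" for key, value in context.items())
    reqs = "\n".join(f"- {req}" for req in requirements)
    return (
        f"Generate {framework} tests for {feature}.\n\n"
        f"Context:\n{ctx}\n\n"
        f"Requirements:\n{reqs}\n\n"
        "Output: Complete test file with imports and fixtures"
    )

prompt = build_test_prompt(
    feature="login functionality",
    framework="Cypress 13.x",
    context={"App": "E-commerce site", "Pattern": "Page Object Model", "Auth": "JWT tokens"},
    requirements=[
        "Cover positive and negative scenarios",
        "Add proper waits (no hard-coded sleeps)",
        "Use data-testid selectors",
    ],
)
```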
2. Iterative Refinement
User: Generate API test for user registration
GPT: [generates basic test]
User: Add validation for:
- Email format
- Password strength requirements (8+ chars, 1 uppercase, 1 number)
- Duplicate email handling
GPT: [refines test]
User: Convert to pytest with fixtures and parametrize for multiple test data
GPT: [final version]
Each iteration improves result.
3. Use LLM as Learning Tool
Prompt: Explain what this test code does, line by line:
[paste complex test]
Then suggest improvements and explain why they're better.
Value: Learning + code review in one
4. Human-in-the-loop
Workflow:
1. LLM generates test → Draft
2. QA reviewer → Adjusts & approves
3. CI/CD runs test → Validates
4. Feedback loop → Improves prompts
Never fully automated — always human review.
Real-world Use Cases
Case 1: Regression Test Suite Generation
Company: E-commerce SaaS (500K LOC)
Challenge: Legacy code without tests, need 80% coverage
Solution:
- Extracted list of all API endpoints
- For each endpoint → GPT-4 prompt (see the sketch after this list):
  Generate comprehensive API tests for: POST /api/orders
  [Include Swagger spec]
  Cover: CRUD operations, validation, auth, edge cases
- Generated 2,300 tests in 2 days
- Human review + fixes → 1 week
- Final: 1,800 working tests (78% auto-generated)
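The per-endpoint loop from that solution looks roughly like this; `swagger.json`, the output directory, and the model name are assumptions, and every generated file still went through human review:
```python
# Sketch: generate API tests endpoint-by-endpoint from a Swagger/OpenAPI spec.
# swagger.json, generated_tests/, and the model name are assumptions.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()
spec = json.loads(Path("swagger.json").read_text())

for path, operations in spec["paths"].items():
    for method, operation in operations.items():
        prompt = (
            f"Generate comprehensive API tests for: {method.upper()} {path}\n"
            f"Endpoint spec: {json.dumps(operation)}\n"
            "Cover: CRUD operations, validation, auth, edge cases.\n"
            "Framework: pytest + requests. Output a complete test file."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        out = Path("generated_tests") / f"test_{method}_{path.strip('/').replace('/', '_')}.py"
        out.parent.mkdir(exist_ok=True)
        out.write_text(response.choices[0].message.content)
```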
ROI:
- Manual writing: ~6 months
- With GPT-4: 2 weeks
- Savings: ~$120K
Case 2: Test Data Generation for ML
Company: Fintech (fraud detection ML)
Challenge: Need realistic fraudulent transaction patterns
Solution:
prompt = """
Generate 100 realistic credit card transaction records.
Include 20 fraudulent patterns:
- Card testing (multiple small charges)
- Account takeover (sudden large purchases)
- Geographic anomalies (purchases in 2 countries within hours)
- Unusual merchant categories
Make legitimate transactions realistic too.
Output: CSV
"""
Result:
- ML model learned to recognize more subtle patterns
- Precision increased 12%
- False positives decreased 8%
The Future of LLMs in Testing
Trends 2025-2027
1. Specialized QA LLMs:
- Models trained specifically on QA data
- Better understanding of test patterns
- Fewer hallucinations for testing tasks
2. Agentic workflows:
# AI agent autonomously:
1. Analyze requirements
2. Generate tests
3. Run tests
4. Analyze failures
5. Fix flaky tests
6. Report results
# Human only approves/guides
3. Multi-modal testing:
- LLM + Computer Vision for UI testing
- “Look at screenshot and tell me what’s broken”
4. Real-time test generation:
# During exploratory testing:
QA action → LLM generates test → Auto-adds to suite
# Converts manual testing to automated
Conclusion
ChatGPT and LLMs are powerful tools for QA, but not a silver bullet.
Where LLMs are genuinely useful:
✅ Test data generation (90% time saved)
✅ Creating basic test cases (70% faster)
✅ Code review (finds 60-70% of obvious issues)
✅ Documentation generation (95% automation)
✅ Learning & upskilling (endless mentor)
Where LLMs DON’T replace humans:
❌ Critical thinking (edge cases require domain knowledge)
❌ Test strategy (what to test and why)
❌ Bug investigation (root cause analysis)
❌ Context understanding (business specifics)
Golden Rule:
An LLM is a super-smart junior QA: it generates quickly but requires supervision. Don’t trust it blindly; always verify.
Practical recommendations:
- Start small: Use for test data generation
- Build prompts library: Save successful prompts
- Set up guardrails: Sanitization, review process
- Measure impact: Track time saved, quality metrics
- Train team: Not everyone knows how to prompt effectively
LLMs in testing are not some distant future; they are already here. The question isn’t whether to use them, but how to use them effectively and safely.