5xx Server Error Troubleshooting Guide
A systematic approach to diagnosing and resolving server-side HTTP errors. From unhandled exceptions to gateway timeouts, this guide covers each 5xx code with real-world solutions.
Understanding 5xx Errors
When a client receives a 5xx status code, it means the server acknowledged that it has a problem and cannot fulfill the request. Unlike 4xx errors, which indicate something wrong with the client's request, 5xx errors point squarely at the server infrastructure, application code, or the communication between services. The server knows the request was valid; it simply could not handle it.
The challenge with 5xx errors is that a single status code can have dozens of root causes. A 500 Internal Server Error might be caused by an unhandled null pointer exception, a database connection pool exhaustion, a misconfigured environment variable, a corrupted file upload, or any number of other issues. Effective troubleshooting requires a systematic approach that narrows down the layer where the failure occurs: infrastructure, network, application, or data.
This guide walks through each 5xx code with a structured diagnostic process. For a quick lookup of any HTTP status code, use the HTTP Status Code Reference tool.
500 Internal Server Error
The 500 Internal Server Error is the catch-all server error. It tells the client "something went wrong on our end, but we are not going to be more specific about what." This is both the most common and the most frustrating 5xx code to debug because it covers any unhandled server-side failure.
Common Causes
- Unhandled exceptions: The most frequent cause. An application throws an error that is not caught by any error handler, and the framework returns a generic 500. This includes null reference errors, type errors, division by zero, and assertion failures.
- Database errors: Connection pool exhaustion, query timeouts, deadlocks, schema mismatches after a migration, or the database server itself being unreachable.
- Configuration errors: Missing or incorrect environment variables, invalid configuration files, or misconfigured service credentials that only manifest at runtime.
- Dependency failures: A third-party API or internal microservice that the application depends on returns an unexpected response or is unreachable, and the error is not handled gracefully.
- File system issues: Permission errors when trying to read or write files, full disk preventing log writes, or missing template files.
- Memory exhaustion: The application runs out of memory (OOM) while processing a request, particularly with large file uploads, unbounded data processing, or memory leaks over time.
Diagnostic Steps
- Check application logs first. The stack trace in your application logs is the single most valuable piece of information. Look at the error message, the file and line number, and the call stack. In most cases, the logs tell you exactly what went wrong.
- Reproduce the error. Try to trigger the exact same request. If the error is intermittent, check whether it correlates with specific input data, time of day (traffic load), or a recent deployment.
- Check recent deployments. If the error started suddenly, the most likely cause is a recent code change. Review the last few commits or releases for the root cause.
- Inspect database connectivity. Run a health check query against the database. Check connection pool metrics for exhaustion. Review slow query logs for queries that might be timing out.
- Verify environment variables. A missing or incorrect environment variable is a surprisingly common cause. Verify that all required variables are set in the running environment (not just your local setup).
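The environment-variable step above can be enforced in code: validate required variables at startup so a misconfiguration fails fast and visibly, rather than surfacing later as a 500 on some request path. This is a minimal sketch; the variable names are illustrative, not from any particular application.

```javascript
// Fail fast at startup if required environment variables are missing.
// The names below are example placeholders; list your app's real ones.
const REQUIRED_ENV = ['DATABASE_URL', 'SESSION_SECRET'];

function checkEnv(env = process.env) {
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

// Call before the server starts listening, e.g.:
// checkEnv();
// app.listen(3000);
```

Crashing at boot with a clear message is far easier to diagnose than an intermittent 500 deep inside a request handler.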
Example: Debugging a 500 in Node.js
```javascript
// Add a global error handler to catch unhandled errors
app.use((err, req, res, next) => {
  // Log the full error with stack trace
  console.error('Unhandled error:', {
    message: err.message,
    stack: err.stack,
    path: req.path,
    method: req.method,
    query: req.query,
    timestamp: new Date().toISOString(),
  });
  // Return a structured error response
  res.status(500).json({
    error: 'Internal Server Error',
    requestId: req.id, // Include for support reference
  });
});

// Handle unhandled promise rejections
process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection:', reason);
});
```

502 Bad Gateway
A 502 Bad Gateway means that a server acting as a gateway or proxy received an invalid response from an upstream server. This is a network-layer error that occurs between servers, not within your application code.
Common Causes
- Upstream server crashed: The application server (Node.js, Python, Java, etc.) behind Nginx or a load balancer has crashed and is not responding.
- Upstream server is not running: The application process was not started, failed to bind to the expected port, or was killed by the OS (OOM killer on Linux).
- Proxy misconfiguration: The reverse proxy (Nginx, Apache, HAProxy) is configured to forward requests to the wrong host, port, or socket.
- SSL/TLS mismatch: The proxy expects HTTP but the upstream speaks HTTPS, or vice versa. Protocol mismatches produce invalid responses.
- DNS resolution failure: The proxy cannot resolve the hostname of the upstream server, common in containerized environments where service names change during deployments.
Diagnostic Steps
- Check if the upstream server is running. SSH into the application server and verify the process is alive: `systemctl status your-app` or `docker ps`.
- Test the upstream directly. Bypass the proxy and make a request directly to the application server: `curl http://localhost:3000/health`.
- Check proxy error logs. Nginx logs to `/var/log/nginx/error.log` by default. Look for "upstream prematurely closed connection" or "connect() failed" messages.
- Verify proxy configuration. Confirm the upstream block in your Nginx config points to the correct host and port where your application is actually listening.
- Check network connectivity between servers. In cloud environments, security groups, network ACLs, or firewall rules may block traffic between the proxy and application servers.
503 Service Unavailable
A 503 Service Unavailable indicates that the server is temporarily unable to handle the request. Unlike 500 (which implies something broke unexpectedly), 503 communicates a known, usually temporary condition such as overload or scheduled maintenance.
Common Causes
- Server overload: The server is receiving more requests than it can handle. This can be caused by traffic spikes, DDoS attacks, or insufficient scaling.
- Scheduled maintenance: The server is intentionally taken offline for updates, migrations, or other maintenance tasks.
- Resource exhaustion: CPU, memory, or file descriptor limits have been reached, preventing the server from accepting new connections.
- Dependency unavailability: A critical dependency (database, cache, external API) is down, and the server cannot function without it.
- Deployment in progress: During rolling deployments, old instances are being shut down and new ones have not yet passed health checks.
Diagnostic Steps
- Check server resource utilization. Use `top`, `htop`, or cloud monitoring dashboards to check CPU, memory, and disk usage.
- Check for recent scaling events. If you use auto-scaling, check whether the scaling policy has been triggered and whether new instances are being provisioned.
- Look for the Retry-After header. A well-configured 503 response includes a `Retry-After` header that tells the client when to try again. Check if your application sets this header.
- Review load balancer health checks. Instances failing health checks are removed from the load balancer, reducing capacity and potentially causing more 503s in a cascading failure.
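On the client side, the Retry-After header is only useful if callers honor it. A sketch of a retrying client, assuming Node 18+ for the global `fetch` and handling only the delay-in-seconds form of the header (it can also be an HTTP date):

```javascript
// Retry a request when the server answers 503, waiting as long as
// the Retry-After header asks (seconds form only, for brevity).
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const res = await fetch(url);
    if (res.status !== 503 || attempt === maxRetries) return res;
    const retryAfter = parseInt(res.headers.get('Retry-After') ?? '1', 10);
    await new Promise((resolve) => setTimeout(resolve, retryAfter * 1000));
  }
}
```

A production client would also cap the total wait and add jitter so many clients do not retry in lockstep.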
Best Practice: Graceful Degradation
```javascript
// Return 503 with Retry-After during maintenance or overload
app.use((req, res, next) => {
  if (isMaintenanceMode()) {
    res.set('Retry-After', '300'); // Retry after 5 minutes
    return res.status(503).json({
      error: 'Service Unavailable',
      message: 'We are performing scheduled maintenance.',
      retryAfter: 300,
    });
  }
  next();
});
```

504 Gateway Timeout
A 504 Gateway Timeout occurs when a server acting as a gateway or proxy does not receive a timely response from the upstream server. The key distinction from 502 is that with 504, the upstream server is reachable but did not respond within the configured timeout period.
Common Causes
- Slow backend processing: The application is taking too long to generate the response. This is often caused by expensive database queries, external API calls, or complex computations.
- Database query timeout: A query is locked waiting for another transaction, is scanning too many rows without an index, or the database server is overloaded.
- External API latency: The application makes a synchronous call to an external API that is slow or unresponsive, blocking the response.
- Proxy timeout too short: The proxy's timeout is configured shorter than the expected response time for certain endpoints (e.g., report generation, file processing).
- Network latency: High latency between the proxy and application servers, particularly in multi-region or cross-cloud deployments.
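One common mitigation for the slow-external-API cause is to bound the call with your own timeout, shorter than the proxy's, so the application fails fast with a clear error instead of hanging until the gateway gives up. A sketch assuming Node 18+ (`fetch` and `AbortSignal.timeout` are built in); the URL and timeout value are illustrative:

```javascript
// Bound an external API call so a slow dependency produces a clear,
// fast failure rather than holding the request open into a 504.
async function callExternalApi(url, timeoutMs = 2000) {
  try {
    return await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  } catch (err) {
    if (err.name === 'TimeoutError' || err.name === 'AbortError') {
      throw new Error(`Upstream ${url} timed out after ${timeoutMs}ms`);
    }
    throw err;
  }
}
```

The caller can then map this error to a meaningful response (for example 502 or 504 with details) instead of letting the proxy time out silently.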
Diagnostic Steps
- Identify which requests are timing out. Check access logs for requests with long response times. Look for patterns: specific endpoints, certain users, or particular input sizes.
- Check slow query logs. Enable and review database slow query logs to find queries that take longer than the proxy timeout.
- Review proxy timeout configuration. In Nginx, check `proxy_read_timeout`, `proxy_connect_timeout`, and `proxy_send_timeout`. Increase them for endpoints that legitimately need more time.
- Profile the application. Use profiling tools (Node.js: `clinic.js`; Python: `cProfile`; Java: `async-profiler`) to identify slow code paths.
- Consider asynchronous processing. For long-running operations, return a 202 Accepted immediately and process the work in the background, allowing the client to poll for results.
General Troubleshooting Checklist
When you encounter any 5xx error, work through this checklist in order. The goal is to quickly narrow down the layer where the failure occurs:
- Check application logs for stack traces and error messages
- Check infrastructure metrics (CPU, memory, disk, network)
- Check recent deployments and configuration changes
- Check dependency health (database, cache, external APIs)
- Check proxy/load balancer logs for upstream connection errors
- Check DNS resolution between services
- Check TLS certificates for expiration or misconfiguration
- Reproduce with a minimal request to isolate the trigger
For a complete reference of all HTTP status codes including the full 5xx range, see our HTTP Status Code Cheat Sheet. To look up any individual code with code snippets and RFC references, use the HTTP Status Code Reference tool.
Further Reading
- RFC 9110 — HTTP Semantics (Server Error)
The IETF specification defining 5xx server error status codes.
- MDN HTTP status codes
Complete reference for all HTTP status codes with descriptions.
- Cloudflare 5xx errors guide
Troubleshooting guide for 5xx errors behind Cloudflare's reverse proxy.