Silent Failures in WordPress Production: 15 Annotated Incidents

Production WordPress doesn’t fail loudly. The sites that have taught me the most about infrastructure didn’t crash with red errors; they degraded quietly. Action Scheduler queues ballooned to 50,000 items before anyone checked the queue depth. Redis crashed at 2am and PHP-FPM served stale data for six hours before a client called. A Cloudflare ruleset got wiped in under ten seconds with a single PUT request. The pattern across every incident: no alert, no log line, no obvious symptom until the damage was already done.

What follows are 15 annotated incidents from production WordPress environments: multi-vendor stores, multisites, agency-managed sites, and plugin SaaS. Each one is structured the same way: what the symptom looked like, what was actually wrong, how to fix it, and what to set up so you catch it before users do. None of these are hypothetical. All of them could be running right now on a site you manage.

Incident 1: WC Analytics “Immediately” Mode Floods Action Scheduler

Symptom

Admin panel is sluggish. Action Scheduler shows 2,000-plus pending wc-admin_import_orders jobs. Redis memory usage climbs steadily until the server OOMs or the cache daemon restarts. On a Dokan multi-vendor store, this can happen within a few hours of enabling the WC Analytics module.

Root Cause

The WC Analytics import feature has two modes: “Immediately” and “Scheduled.” The “Immediately” setting fires wc-admin_import_orders on every order lifecycle hook, including every Dokan, LearnDash, and Automator status transition. On an active multi-vendor site, a single order can trigger four or five status hooks. Each one queues a fresh import job. The queue doesn’t drain because new jobs arrive faster than WP-Cron can process them.

Fix

Set the analytics data import to “Scheduled” via the Store admin: Settings > Advanced > Data Import. Select the Scheduled option, then clear the pending wc-admin_import_orders jobs from the existing queue with WP-CLI.
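The exact clearing command isn't shown here, so the following is only one possible sketch. It assumes the default wp_ table prefix and Action Scheduler's custom tables (actionscheduler_actions with hook and status columns), and it deletes rows directly through wp db query. The function names are illustrative, not part of WP-CLI or Action Scheduler. Take a database backup first.

```shell
#!/usr/bin/env bash
# Sketch: delete pending actions for one hook via WP-CLI's "wp db query".
# Assumptions: default "wp_" table prefix; Action Scheduler custom tables in use.

# Build the DELETE statement for a given hook and (optional) table prefix.
build_clear_pending_sql() {
  local hook="$1" prefix="${2:-wp_}"
  printf "DELETE FROM %sactionscheduler_actions WHERE hook = '%s' AND status = 'pending';" "$prefix" "$hook"
}

# Execute the statement against the site's database.
clear_pending_actions() {
  wp db query "$(build_clear_pending_sql "$1" "${2:-wp_}")"
}

# Usage (run from the WordPress root):
#   clear_pending_actions wc-admin_import_orders
```

Printing the statement with build_clear_pending_sql before running it is a cheap way to confirm the prefix and hook name are what you expect.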

Then restart Redis and verify the queue stays below 100 items after 30 minutes of normal traffic.

Early Detection

Add an Action Scheduler depth alert. If the wc-admin_import_orders pending count crosses 500, send a Slack notification. The WP-CLI command is wp action-scheduler status; schedule it via system cron every 15 minutes on busy stores.
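A depth alert along these lines could look like the sketch below. It counts pending rows with a direct query rather than parsing wp action-scheduler status output, which is a substitution on my part. The wp_ prefix, the SLACK_WEBHOOK_URL environment variable, and the function names are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch: alert when pending actions for one hook exceed a threshold.
# Assumptions: "wp_" prefix; SLACK_WEBHOOK_URL is a hypothetical env var
# holding a Slack incoming-webhook URL.

THRESHOLD="${THRESHOLD:-500}"

# Print "ALERT" when the pending count is above the threshold, else "OK".
queue_state() {
  local pending="$1" threshold="${2:-$THRESHOLD}"
  if [ "$pending" -gt "$threshold" ]; then echo "ALERT"; else echo "OK"; fi
}

# Count pending actions for one hook straight from the Action Scheduler table.
pending_count() {
  wp db query "SELECT COUNT(*) FROM wp_actionscheduler_actions WHERE hook = '$1' AND status = 'pending';" --skip-column-names
}

# Post a plain-text message to Slack (the message is not JSON-escaped here).
notify_slack() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL"
}

# Usage from system cron every 15 minutes:
#   n="$(pending_count wc-admin_import_orders)"
#   [ "$(queue_state "$n")" = "ALERT" ] && notify_slack "wc-admin_import_orders pending: $n"
```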

Incident 2: Cloudways Smart Cron Stacking on Large Multisites

Symptom

CPU spikes every five minutes on the app server, even during off-peak hours. top shows multiple php-fpm processes running WP-Cron jobs in parallel. On a 50-plus subsite network, these stacks can hold 30-plus concurrent PHP processes for minutes at a time.

Root Cause

Cloudways Smart Cron runs /var/cw/scripts/bash/wp_cron_smart.sh, which iterates every subsite in the network and fires wp-cron.php for each one on a 5-minute interval. If the full iteration takes longer than 5 minutes, which it will on a 50-plus site network, the next Smart Cron run starts before the first finishes. Instances stack.

Fix

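Switch from Smart Cron to Normal Cron in the Cloudways application settings. Then set a single system cron entry, every 5 minutes, that walks the network one subsite at a time: list the sites with wp site list --field=url and call wp cron event run --due-now --url=<site> --allow-root for each. (wp cron event run has no --network flag, so the per-site loop is required.) Wrap the entry in a lock such as flock so a slow pass cannot overlap the next one. This serializes execution and prevents the stack.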
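A serialized network run might be sketched as follows. The lock file path and function names are hypothetical, and flock(1) from util-linux is assumed to be installed. This is not the Cloudways script, only an illustration of "one subsite at a time, never two passes at once".

```shell
#!/usr/bin/env bash
# Sketch: run due cron events for each subsite sequentially, under a lock.
# Assumptions: LOCK_FILE path is hypothetical; flock(1) is available.

LOCK_FILE="${LOCK_FILE:-/tmp/wp-network-cron.lock}"

# Read site URLs on stdin and call the given command once per URL, in order.
run_for_each_site() {
  local url
  while IFS= read -r url; do
    if [ -n "$url" ]; then
      # Detach stdin so the command cannot consume the remaining URL list.
      "$@" "$url" </dev/null
    fi
  done
}

# Process due events for a single subsite.
run_due_events() {
  wp cron event run --due-now --url="$1" --allow-root
}

# Entry point: skip this pass entirely if the previous one still holds the lock.
network_cron() {
  (
    flock -n 9 || { echo "previous run still active, skipping"; exit 0; }
    wp site list --field=url --allow-root | run_for_each_site run_due_events
  ) 9>"$LOCK_FILE"
}

# System cron entry, assuming this file is saved as network-cron.sh:
#   */5 * * * * cd /path/to/wordpress && bash -c 'source network-cron.sh && network_cron'
```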

Early Detection

Check with ps aux | grep '[w]p-cron' | wc -l (the bracket keeps grep from counting its own process). More than 3 concurrent processes on a single WordPress install is a red flag. Add this check to your monitoring rotation, especially on networks that grow past 30 subsites.

Incident 3: Cloudflare Cache Serving Laravel @csrf Tokens to the Wrong Users

Symptom

Guest users hit a checkout or contact form and get a 419 error: “Page Expired.” The error appears intermittently: some users hit it, others don’t. Support tickets start coming in, but the issue is impossible to reproduce in a logged-in browser session.

Root Cause

A Laravel @csrf directive generates a session-specific token on every page render. If that page is edge-cached by Cloudflare, the first user’s CSRF token gets served to subsequent users. When those users submit the form, their token doesn’t match any active session, and Laravel returns 419. The cache was set intentionally for performance, but nobody added a cache bypass rule for pages with CSRF forms.

Fix

Add a Cloudflare Page Rule or Cache Rule that bypasses cache for any page containing a form with a CSRF token. The safest pattern is to bypass the cache whenever a session cookie is present or the URL matches one of your form pages. Also send Cache-Control: no-store, private server-side on any route that renders @csrf.

Early Detection

Set up a synthetic monitor that submits your checkout form as an anonymous user every 15 minutes. A 419 response fires an alert. Cloudflare Analytics’ Cache Rate metric will also show suspiciously high cache-hit rates on pages that should never be cached.

Incident 4: CF Bot Management Silently Blocking a Custom WAF Allow Rule

Symptom

A specific IP or user agent that should be allowed through your custom WAF rules keeps getting blocked. You added the allow rule, confirmed it in the dashboard, but the blocked requests still appear in Firewall Events. The logs show a “Bot Management” block, not your custom rule.

Root Cause

Cloudflare Bot Management (and the simpler Bot Fight Mode) runs at a higher priority than custom WAF rules. When ai_bots_protection=block or the general Bot Management feature is active, it fires before your custom rules execute. Your allow rule never gets evaluated for requests that Bot Management already decided to block.

Fix

If you need to allow specific bots, user agents, or IPs, you must either disable the conflicting Bot Management setting for that scope or create a WAF Exception that explicitly skips Bot Management. Go to Security > Bots and set the specific managed bot setting to “Allow” for the user agent you need to pass. The simpler fix: disable ai_bots_protection=block if your allow list needs to include AI crawlers like ClaudeBot or GPTBot.

Early Detection

Check Firewall Events filtered by “Bot Management” and compare against your allow list. If you see any allowed IP or UA appearing in Bot Management blocks, your rule ordering is wrong.

Incident 5: EDD-SL License Activation POSTs Blocked by Cloudflare WAF

Symptom

Customers report that their license key activation fails silently. The plugin shows “Connection error” or just spins. In Cloudflare Firewall Events, you see blocked POST requests to the site root (/) with a WAF-matched rule as the reason.

Root Cause

Easy Digital Downloads Software Licensing (EDD-SL) sends activation requests as POST to the site root URL, with the action (edd_action=activate_license) in the POST body, not in the query string. Cloudflare’s Pro plan and above can inspect request bodies. The WAF often flags these as suspicious POST-to-root requests, especially if the rule matches on body content that looks like parameter injection. A query-string-based bypass rule (if URI query contains edd_action) won’t catch it because the parameter isn’t in the URL.

Fix

Create a WAF skip rule that matches on the WordPress logged-in cookie or, more broadly, on the User-Agent: WordPress/ string. EDD-SL’s activation requests always come from WP’s wp_remote_post(), which sends a WordPress user agent. The rule: if UA matches WordPress/, skip all WAF managed rules for that request.

Early Detection

Monitor Firewall Events for POST requests to / that are blocked. Run a test activation from a fresh WP install on a staging site and watch Cloudflare logs in real time. Any block here is almost always the EDD-SL pattern.

Incident 6: Cloudways Deletes wp-salt.php During Core Reinstall

Symptom

You run a WordPress core reinstall via the Cloudways dashboard or WP-CLI to clean up a suspected hack. The reinstall completes successfully. Then the site goes white-screen. WP_DEBUG shows a fatal: wp-salt.php not found or “Cannot redeclare” errors on constants that Cloudways defines there.

Root Cause

Cloudways injects its own wp-salt.php in the WordPress root to manage salts without touching wp-config.php directly. During a core reinstall, WordPress deletes all files in the root that don’t belong to core, which includes Cloudways’ custom wp-salt.php. The reinstall treats it as a foreign injection and removes it. The site can’t boot without those salt constants.

Fix

Before any core reinstall: copy wp-salt.php to a safe location outside the WordPress root. After reinstall: copy it back. On the Cloudways console, you can also regenerate it via the application’s “Reset Salt Keys” option, but that invalidates all active sessions.

Early Detection

Add a post-deploy check that verifies wp-salt.php exists and is readable after any maintenance operation. A simple [ -r /path/to/wp-salt.php ] && echo "OK" || echo "MISSING" in your runbook catches this before the site goes live (-r tests both existence and readability, where -f only tests existence).

Incident 7: Cloudflare Managed Rule Blocking .zip Uploads via wp-admin

Symptom

Plugin or theme uploads via wp-admin fail silently or return a 403. The upload UI shows a progress bar that gets to 100% then resets with no error message. Cloudflare Firewall Events shows a blocked request to wp-admin/async-upload.php.

Root Cause

Cloudflare’s OWASP Core Rule Set includes rules that flag file uploads containing certain binary patterns or MIME types. The specific managed rule 0f2da91cec674eb58006929e824b817c blocks .zip uploads through async-upload.php when the request body matches its pattern criteria. This fires for any admin uploading a plugin .zip via the standard WP interface.

Fix

Create a WAF Exception that skips the specific managed rule for authenticated admins. The cleanest approach: skip the rule when the request URL is wp-admin/async-upload.php AND the request contains a wordpress_logged_in_* cookie. This limits the bypass to authenticated admin sessions only, not the general public.

Early Detection

After any Cloudflare plan change or rule update, test a .zip upload on a staging copy with the same WAF config. It takes 30 seconds and catches this class of block before it hits a client trying to update a plugin in production.

Incident 8: Astro Worker on Apex Route Swallows All POST Requests

Symptom

License activation, webhook delivery, or any POST endpoint at the root domain (https://domain.com/) fails with a 405 Method Not Allowed. GET requests to the same URL work fine. The issue appears only on the apex; subdomains and subdirectory paths are unaffected.

Root Cause

A Cloudflare Worker (in this case, an Astro-based Worker handling the marketing site) owns the apex route and handles all incoming requests. The Worker is configured to serve GET requests and return a static response, but it has no POST handler. Any non-GET method that the Worker doesn’t explicitly handle gets a 405 from the Worker itself, before the request ever reaches the origin WordPress or PHP application.

Fix

Either add a POST passthrough in the Worker (if (request.method === 'POST') return fetch(request)) to forward POST requests to the origin, or move license/webhook endpoints off the apex to a subpath that the Worker doesn’t intercept. The second option is safer for EDD-SL because you control the endpoint without touching the Worker routing logic.

Early Detection

Any time you deploy or update a Worker that handles the apex route, run a POST test from a separate server: curl -X POST https://domain.com/ -d "test=1". If it returns 405, your Worker is swallowing POSTs. Don’t wait for a license activation failure report.
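To make that probe scriptable, a small wrapper can reduce the curl call to a status code and a verdict. This is a sketch: the function names and verdict strings are my own, and https://domain.com/ is the placeholder used above.

```shell
#!/usr/bin/env bash
# Sketch: probe a URL with a POST and flag the 405 signature described above.

# Map an HTTP status code to a verdict. Only 405 is treated as "swallowed";
# any other code just means this particular failure mode was not observed.
post_verdict() {
  case "$1" in
    405) echo "SWALLOWED" ;;
    *)   echo "NOT_405" ;;
  esac
}

# Send a small POST body and print only the HTTP status code.
post_status() {
  curl -s -o /dev/null -w '%{http_code}' -X POST -d 'test=1' "$1"
}

# Usage after each Worker deploy:
#   post_verdict "$(post_status https://domain.com/)"
```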

Incident 9: EDD Checkout 503 Loop from PHP-FPM Exhaustion

Symptom

Checkout intermittently fails with a “Service Unavailable” error or the page loops back to the email address field. AJAX calls to wp-admin/admin-ajax.php during checkout return 503. The issue appears under moderate traffic, not just during spikes.

Root Cause

Two factors combine. First, the PHP-FPM pool is undersized: there are not enough child processes to handle concurrent checkout AJAX calls, especially when each call involves payment gateway API requests that take 2-4 seconds. Second, Cloudflare’s managed bot protection is flagging some checkout requests as bot traffic and rate-limiting or challenging them, which causes the front-end JS to retry, further multiplying the AJAX load. The 503s come from PHP-FPM worker exhaustion, but the retry storm from CF bot challenges is what tips it over.

Fix

Increase PHP-FPM pm.max_children to match your expected concurrent checkout capacity. Then add a CF Firewall Exception for wp-admin/admin-ajax.php requests that originate from non-bot user agents. On Cloudways: Application Management > PHP-FPM Settings > increase max_children. Start at 20 for mid-traffic stores and tune upward.
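As a rough way to reason about "expected concurrent checkout capacity", the number of busy workers is approximately the request arrival rate multiplied by how long each request holds a worker. The sketch below does that arithmetic. The traffic figures in the usage note are hypothetical, and the result ignores memory limits and non-checkout traffic, so treat it as a floor rather than a tuned value.

```shell
#!/usr/bin/env bash
# Sketch: estimate PHP-FPM workers needed for checkout AJAX alone.
# workers ~= (requests per minute * seconds per request) / 60, rounded up.

required_workers() {
  local per_minute="$1" seconds="$2"
  echo "$(( (per_minute * seconds + 59) / 60 ))"
}

# Usage (hypothetical load: 300 checkout calls per minute, 4 seconds each):
#   required_workers 300 4
```

With those hypothetical numbers the estimate comes out at 20, which lines up with the suggested starting point. Retry storms raise the effective request rate, so the estimate should be made with the retries included.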

Early Detection

Set up uptime monitoring for your checkout AJAX endpoint specifically, not just the homepage. A synthetic POST to admin-ajax.php?action=edd_get_cart_details every 5 minutes catches FPM exhaustion before real users hit it during a sale.

Incident 10: Cloudflare Ruleset PUT Wipes Everything

Symptom

After running a script or API call that updates a Cloudflare WAF ruleset, all existing rules are gone. The dashboard shows an empty ruleset. Traffic that was previously blocked is now passing through. Security posture drops to zero without any error or warning during the API call.

Root Cause

The Cloudflare Rulesets API uses PUT as a full-replace operation, not a patch. A PUT to /rulesets/phases/{phase}/entrypoint with an empty rules: [] array replaces the entire ruleset with nothing. This is not a bug; it’s documented behavior. But scripts that are built to “update” a ruleset by constructing a full body and sending PUT will silently wipe anything they don’t include, including rules added manually in the dashboard.

Fix

Before any PUT: snapshot the current ruleset with a GET and save it locally. Prefer POST to /entrypoint/rules for adding individual rules; it appends rather than replaces. If you must use PUT, always build your request body from the current state (GET first, mutate, PUT back). Add a rule count assertion after every write: if the count drops by more than you expected, roll back immediately.
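The snapshot-and-assert step could be sketched like this. ZONE_ID and CF_API_TOKEN are assumed environment variables, the phase name http_request_firewall_custom is an example, jq is assumed to be installed, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: snapshot a ruleset before a write and assert the rule count afterwards.
# Assumptions: ZONE_ID and CF_API_TOKEN env vars; jq installed; example phase name.

PHASE="${PHASE:-http_request_firewall_custom}"

# Save the current entrypoint ruleset to a file (GET is read-only).
snapshot_ruleset() {
  curl -fsS -H "Authorization: Bearer $CF_API_TOKEN" \
    "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/phases/$PHASE/entrypoint" > "$1"
}

# Count rules in a saved snapshot (a missing rules array counts as 0).
count_rules() {
  jq '.result.rules | length' "$1"
}

# Succeed only if the count did not fall below before + expected_delta.
# expected_delta may be negative when rules are removed on purpose.
rule_count_ok() {
  local before="$1" after="$2" expected_delta="$3"
  [ "$after" -ge "$(( before + expected_delta ))" ]
}

# Usage around a write that should add one rule:
#   snapshot_ruleset before.json; before="$(count_rules before.json)"
#   ... perform the write ...
#   snapshot_ruleset after.json;  after="$(count_rules after.json)"
#   rule_count_ok "$before" "$after" 1 || echo "rule count dropped, roll back from before.json"
```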

Early Detection

Run a rule count check after every Cloudflare API operation in your deployment scripts. A drop from 15 rules to 0 should fire an alert, not pass silently. Version-control your ruleset state alongside your infrastructure code.

Incident 11: Stale Object Cache Serving Pre-Update Data After Plugin Change

Symptom

You update a plugin that changes options or post meta schema. The admin panel shows the new settings. But the front-end keeps serving the old behavior: wrong prices, wrong menu items, old feature flags. Hard refreshes don’t help. The problem resolves itself after a few hours, or after the cache server restarts.

Root Cause

WordPress’s object cache (Redis or Memcached) stores option values and post meta with TTLs. When a plugin update changes the stored value, the cache still holds the old version. Because wp_cache_set() writes are only invalidated by explicit flush calls or TTL expiry, stale data persists until the cache key expires naturally. Most plugins don’t call wp_cache_delete() on their options keys during updates.

Fix

After any plugin update that touches site options or feature flags, run wp cache flush immediately. Make this part of your deployment runbook, not optional. If you’re on a persistent object cache (Redis), also confirm the flush actually cleared the relevant key groups; some Redis configurations partition cache by site and require group-specific flushes. This is particularly relevant when you’re managing media-heavy WordPress installs, a topic covered in detail in the guide on shrinking uploads and moving media to S3.

Early Detection

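Compare the option value stored in the database (a direct SELECT on wp_options) with what the object cache holds (wp cache get alloptions options) after any update. get_option() reads through the cache, so it cannot reveal the mismatch on its own. A mismatch means stale cache is in play. Automate this check in your deployment pipeline.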

Incident 12: WP-Cron Failing Silently on Shared Hosting

Symptom

Scheduled emails stop sending. Subscription renewals don’t fire. Scheduled posts sit in “Scheduled” status past their publish time. The site works fine otherwise. No error log entries. Users eventually notice that time-based features just stopped working.

Root Cause

WP-Cron is not a real cron system; it piggybacks on site traffic to trigger scheduled events. On shared hosting with low traffic, the cron hook never fires because no page requests arrive to trigger it. On some shared hosts, PHP execution time limits also abort long-running cron jobs mid-execution, leaving tasks in a half-run state. The silence is by design: WP-Cron failures produce no log output unless WP_DEBUG is on.

Fix

Disable the default WP-Cron in wp-config.php with define( 'DISABLE_WP_CRON', true ); and replace it with a real system cron entry that calls wp cron event run --due-now --allow-root every 5 minutes. On shared hosting where you can’t add system crons, use a free external service like EasyCron or an UptimeRobot HTTP monitor that hits wp-cron.php on a schedule.

Early Detection

Run wp cron event list and check the “next run” timestamps. If any event is more than 2x its scheduled interval past due, cron is falling behind. Add this to a weekly health check script. It’s one of the ops patterns that production-grade agencies have systematized, covered in the context of what running a WordPress agency looks like in 2027.
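The "more than 2x its interval past due" rule can be turned into a small check. In this sketch the time and interval fields are what I understand wp cron event list to expose (a Unix timestamp and a recurrence in seconds); confirm with wp help cron event list before relying on them. Function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: flag recurring cron events that are more than 2x their interval overdue.
# Assumption: "time" is a Unix timestamp and "interval" is in seconds.

# Print "OVERDUE" when (now - next_run) exceeds twice the interval, else "OK".
event_state() {
  local now="$1" next_run="$2" interval="$3"
  local late="$(( now - next_run ))"
  if [ "$interval" -gt 0 ] && [ "$late" -gt "$(( 2 * interval ))" ]; then
    echo "OVERDUE"
  else
    echo "OK"
  fi
}

# List events as CSV, skip the header row, and report a state per recurring hook.
check_events() {
  local now hook time interval
  now="$(date +%s)"
  wp cron event list --fields=hook,time,interval --format=csv | tail -n +2 |
  while IFS=, read -r hook time interval; do
    # Skip single (non-recurring) events and anything without a numeric interval.
    case "$interval" in ''|*[!0-9]*) continue ;; esac
    echo "$hook $(event_state "$now" "$time" "$interval")"
  done
}

# Usage in a weekly health check:
#   check_events | grep OVERDUE
```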

Incident 13: Autoload Table Bloat Slowing Every Admin Page

Symptom

The admin dashboard is slow: 4-6 second load times on every page. Front-end is fine. Query Monitor shows a single database query taking 800ms or more. The slow query is a SELECT on wp_options filtered by autoload = 'yes'.

Root Cause

WordPress loads all autoloaded options on every request via a single query. Plugins that store transients in wp_options with autoload = 'yes' contribute to this payload. Over months or years of plugin churn (installs, uninstalls, updates), the autoload set grows. When it exceeds a few megabytes, every admin page load fetches and deserializes that entire dataset before doing anything else.

Fix

Identify heavy hitters by listing the largest autoloaded rows in wp_options, sorted by LENGTH(option_value) in descending order.

Delete expired transients with wp transient delete --expired --allow-root. For persistent non-transient bloat, set autoload = 'no' on options that plugins check only occasionally. A clean autoload set should be under 800KB total.

Early Detection

Run the autoload size query monthly: SELECT SUM(LENGTH(option_value)) FROM wp_options WHERE autoload = 'yes'. If it crosses 1MB, investigate. If it crosses 3MB, it’s already affecting performance and needs immediate cleanup.
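Both queries and the 1MB / 3MB thresholds can be wrapped in a short script. This sketch assumes the wp_ prefix and the autoload = 'yes' filter used above. As far as I know, WordPress 6.6 and later can also store values such as 'on' and 'auto' in that column, so the filter may need widening on newer installs. Function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: autoload size queries plus a threshold classifier.
# Assumptions: "wp_" prefix; autoload = 'yes' filter as in the text.

# SQL for the total autoloaded payload in bytes.
autoload_total_sql() {
  printf "SELECT SUM(LENGTH(option_value)) FROM %soptions WHERE autoload = 'yes';" "${1:-wp_}"
}

# SQL for the 20 largest autoloaded options (the heavy hitters).
autoload_top_sql() {
  printf "SELECT option_name, LENGTH(option_value) AS bytes FROM %soptions WHERE autoload = 'yes' ORDER BY bytes DESC LIMIT 20;" "${1:-wp_}"
}

# Classify a byte total against the 1MB and 3MB thresholds.
classify_autoload() {
  local bytes="$1"
  if [ "$bytes" -gt "$(( 3 * 1024 * 1024 ))" ]; then echo "CRITICAL"
  elif [ "$bytes" -gt "$(( 1024 * 1024 ))" ]; then echo "INVESTIGATE"
  else echo "OK"
  fi
}

# Usage (monthly):
#   total="$(wp db query "$(autoload_total_sql)" --skip-column-names)"
#   classify_autoload "$total"
#   wp db query "$(autoload_top_sql)"
```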

Incident 14: try-catch in Plugin Code Swallowing Form Submission Errors

Symptom

A contact form, registration form, or checkout step stops working for users. Submissions appear to go through, no error shown, but nothing happens on the backend. No email sent, no record created, no confirmation. Support tickets trickle in. PHP logs are clean. JavaScript console is clean.

Root Cause

A plugin wraps its form-handling AJAX callback in a broad try { ... } catch (Exception $e) {} block (PHP) or a try { ... } catch(e) {} block (JavaScript) that catches exceptions and silently returns a success response instead of re-throwing or logging. The actual error (a missing table, a changed API signature after a dependency update, a null-pointer from a third-party integration) is caught and swallowed, and the user sees a fake success. This pattern is especially common in form plugin integrations and payment gateway wrappers.

Fix

In PHP: replace bare catch (Exception $e) {} with at minimum catch (Exception $e) { error_log($e->getMessage()); wp_send_json_error(['message' => 'Internal error']); }. In JavaScript: always re-throw or console.error(e) in catch blocks that handle user-facing operations. In the short term, turn on WP_DEBUG_LOG and watch debug.log while reproducing the issue.

Early Detection

Add a synthetic end-to-end test for every form that submits to an AJAX endpoint. Check that the response includes expected fields (confirmation ID, redirect URL), not just an HTTP 200. A 200 with an empty body is a silent failure in disguise. The broader pattern of automated quality gates for WordPress code is covered in the WPCS + PHPStan pre-commit stack for 2026.
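The "not just an HTTP 200" check can be as small as the sketch below. It uses a plain grep for the field name rather than a JSON parser, which is crude but dependency-free. The field name confirmation_id and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: treat a form response as successful only if it is a 200 with a
# non-empty body that mentions the expected field.
# Assumption: "confirmation_id" is a hypothetical field name.

response_ok() {
  local status="$1" body="$2" field="$3"
  [ "$status" = "200" ] || return 1   # wrong status
  [ -n "$body" ] || return 1          # 200 with an empty body is a silent failure
  printf '%s' "$body" | grep -q "\"$field\""
}

# Usage in a synthetic test, with $status and $body captured from curl:
#   response_ok "$status" "$body" confirmation_id || echo "form submission failed silently"
```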

Incident 15: Edge Caching Authenticated Pages Leaks User-Specific Data

Symptom

Logged-in users occasionally see another user’s name, cart contents, order history, or profile data on the front-end. The data appears correct on page refresh. Reports are intermittent and hard to reproduce. The issue gets blamed on browser cache, but clearing cookies doesn’t stop it.

Root Cause

A page that renders user-specific data (cart, account dashboard, member area) is being cached at the CDN edge. The first logged-in user’s request warms the cache. Subsequent requests, even from different authenticated users, get served the cached version, which contains the first user’s personalized data. This happens when Cache-Control headers are misconfigured (missing private), when the CDN cache rule doesn’t vary on the session cookie, or when a caching plugin aggressively caches pages without checking authentication state.

Fix

Every page that renders user-specific content must send Cache-Control: no-store, private. On Cloudflare, add a Cache Rule that explicitly sets “Bypass Cache” for requests containing a wordpress_logged_in_* cookie. Audit every caching plugin setting to confirm logged-in users are excluded from page cache. Test by creating two accounts, logging into each on separate browsers, and loading authenticated pages; the responses should always differ.

Early Detection

Check your CDN’s cache-hit rate for authenticated URLs. Any cache hit percentage above 0% on a member-only or cart page is a data leak risk. Tools like Cloudflare Analytics’ Cache Analytics, filtered to your /account/ or /cart/ paths, surface this in seconds.
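For a spot check between dashboard reviews, the cf-cache-status response header shows whether Cloudflare served a given request from cache. The sketch below reads response headers and reports that value. AUTH_COOKIE is a hypothetical variable holding a logged-in session cookie, and the function names are my own.

```shell
#!/usr/bin/env bash
# Sketch: report the cf-cache-status header for an authenticated request.
# Assumption: AUTH_COOKIE is a hypothetical env var with a logged-in cookie.

# Read response headers on stdin; print the cf-cache-status value in upper
# case, or NONE when the header is absent.
cache_status() {
  local value
  value="$(tr -d '\r' | awk -F': *' 'tolower($1) == "cf-cache-status" { print toupper($2); exit }')"
  echo "${value:-NONE}"
}

# A HIT on a user-specific page means the edge served a stored copy.
is_leak_risk() {
  [ "$1" = "HIT" ]
}

# Usage:
#   status="$(curl -s -o /dev/null -D - -H "Cookie: $AUTH_COOKIE" https://domain.com/account/ | cache_status)"
#   is_leak_risk "$status" && echo "authenticated page served from edge cache"
```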

The Pattern Across All 15

Reading through these incidents together, a few root causes repeat. The most common: a system that returns success when it should return an error. WP-Cron doesn’t log missed schedules. The CF ruleset PUT returns 200 even after wiping your security config. The form plugin AJAX handler catches the exception and sends back a fake success JSON. Systems that fail silently are harder to operate than systems that fail loudly, because the impact accumulates undetected until it’s customer-visible.

The second pattern: defaults that work at small scale become footguns at production scale. WC Analytics “Immediately” mode is fine for a 10-order/day store. WP-Cron is fine for a low-traffic site. Smart Cron works until your network crosses 30 subsites. These aren’t bugs in the software; they’re settings that require different defaults as sites grow, and nobody tells you when you’ve crossed the threshold.

The third pattern: infrastructure layers that interact in undocumented ways. Cloudflare Bot Management fires before WAF rules. Cloudways Smart Cron iterates subsites serially. An Astro Worker owns the apex route. Each layer works exactly as documented. The failures happen at the intersections.

The takeaway for agencies and plugin developers running production WordPress at scale: invest in detection before you need it. Monitoring an Action Scheduler queue depth costs nothing. A weekly autoload size check is a 2-minute script. A synthetic checkout monitor catches FPM exhaustion before a sale. The incidents above were all preventable, not with better software, but with earlier visibility into what was already going wrong.

At Wbcom Designs, the infrastructure patterns behind these incidents directly shaped how we approach plugin architecture, hosting configuration, and site operations for client projects. If you’re running WordPress at scale and want to pressure-test your stack before it fails, reach out. This is the kind of audit that prevents 2am calls.
