The fully managed WordPress sites on the cluster went down because of excessive server load. CPU and disk usage both reached 100%, which caused a temporary disruption of service.
The number of visitors reaching the sites in our cluster was not particularly high. However, we noticed a large amount of repetitive database and file system activity. To address this, we implemented object caching for the sites to reduce the number of database calls. This has reduced CPU and disk usage substantially, and the cluster is now operating at acceptable levels again.
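To illustrate why object caching cuts down on repetitive database work, here is a minimal sketch of the general cache-aside pattern. It is not our production setup, and this post does not say which cache backend was used. `ObjectCache`, `FakeDatabase`, `render_page`, and the `option:siteurl` key are all hypothetical names for illustration. The first request for a value goes to the database and stores the result, and repeat requests within the time-to-live are answered from memory.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ObjectCache:
    """Hypothetical in-memory object cache with a per-entry time-to-live (TTL)."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl_seconds = ttl_seconds
        # Maps key -> (expiry timestamp, cached value).
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling loader() only on a miss or after expiry."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no database call
        value = loader()  # cache miss or expired entry: query the database
        self._store[key] = (now + self.ttl_seconds, value)
        return value


class FakeDatabase:
    """Hypothetical stand-in for a site database that counts the queries it receives."""

    def __init__(self) -> None:
        self.query_count = 0

    def fetch_option(self, name: str) -> str:
        self.query_count += 1
        return f"value-of-{name}"


def render_page(db: FakeDatabase, cache: ObjectCache) -> str:
    # Every page view needs the same setting; without a cache, that is one query per view.
    return cache.get_or_load("option:siteurl", lambda: db.fetch_option("siteurl"))


if __name__ == "__main__":
    db = FakeDatabase()
    cache = ObjectCache(ttl_seconds=60)
    for _ in range(1000):
        render_page(db, cache)
    print(f"page views: 1000, database queries: {db.query_count}")
```

With the cache in place, 1,000 page views trigger a single database query instead of 1,000. Choosing the TTL is a trade-off: a longer TTL saves more queries, but cached data can be stale for longer.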