Sitecore is a popular digital marketing platform used by marketing organizations across industry sectors including healthcare, education, and the public sector. stackArmor’s engineers have been hosting Sitecore on AWS since 2013, given the strong support for Windows, SQL Server, and related Microsoft technologies on the AWS cloud. Running a Sitecore platform at scale requires a strong understanding of systems engineering and management principles, as well as the ability to use the suite of monitoring and management services on AWS.
How We Identified the Sitecore Availability Issues
A large healthcare customer running a highly available, mission-critical Sitecore installation was experiencing server unavailability. The stackArmor Platform Engineering and Operations team investigated the issue.
We analyzed all the logs and metrics to identify the root cause. Log analysis was complicated because no errors were immediately visible. We did notice, however, repeated log entries that said “HttpModule is being initialized”.
The “HttpModule is being initialized” message is reported each time a new HTTP module is created. A pool of HTTP modules is created when the web application starts. When all modules in the pool are busy handling current requests, a new one is created and reported in the log.
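As a rough illustration of how such entries could be counted, the sketch below tallies “HttpModule is being initialized” messages per minute. The log line layout (a thread identifier, an HH:MM:SS timestamp, a level, then the message) and the sample lines are assumptions made for the example, not taken from the customer environment.

```python
import re
from collections import Counter

# Message text quoted in the article.
MARKER = "HttpModule is being initialized"
# Assumption: each log line carries an HH:MM:SS timestamp somewhere before the message.
TIME_RE = re.compile(r"\b(\d{2}:\d{2}):\d{2}\b")


def module_inits_per_minute(lines):
    """Count marker messages per HH:MM bucket."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_log = [
    "4160 10:33:52 INFO  HttpModule is being initialized",
    "4160 10:33:53 INFO  HttpModule is being initialized",
    "5012 10:33:58 WARN  Some unrelated warning",
    "5012 10:34:01 INFO  HttpModule is being initialized",
]

inits = module_inits_per_minute(sample_log)
for minute, count in sorted(inits.items()):
    print(minute, count)
```

A sustained run of high per-minute counts would suggest the module pool is repeatedly exhausted, which is the pattern described above.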
Many such entries reported in sequence indicate heavy load on the solution. Sitecore does not create requests on its own, so we assumed the requests originated externally. On further analysis, a review of the AWS Application Load Balancer (ALB) logs revealed their source: the Nexgate Ruby Client (a crawler agent) was making over 2,000 requests a minute. This loaded both Sitecore Content Delivery servers and drove CPU utilization to 100%, leaving them unable to handle new requests.
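The kind of ALB log review described above can be sketched as follows. This is a minimal example, assuming the standard space-delimited ALB access log layout in which the second field is the timestamp and the fourteenth field is the quoted user agent. The helper names, the sample entries, and the "ExampleCrawler/1.0" agent string are hypothetical.

```python
import shlex
from collections import Counter


def requests_per_minute_by_agent(lines):
    """Count requests per (minute, user agent) from ALB access log lines."""
    counts = Counter()
    for line in lines:
        try:
            fields = shlex.split(line)  # honors the double-quoted fields
        except ValueError:
            continue  # skip malformed lines
        if len(fields) < 14:
            continue
        minute = fields[1][:16]   # e.g. "2019-01-01T10:00"
        user_agent = fields[13]
        counts[(minute, user_agent)] += 1
    return counts


def _sample(ts, agent):
    # Hypothetical, shortened log entry used only to exercise the parser.
    return (f'https {ts} app/example-alb/abc123 203.0.113.7:50000 10.0.0.5:443 '
            f'0.001 0.250 0.000 200 200 120 5300 '
            f'"GET https://www.example.com:443/ HTTP/1.1" "{agent}" '
            f'- - arn:example "Root=1-0" "-" "-"')


sample_lines = [
    _sample("2019-01-01T10:00:01.000000Z", "ExampleCrawler/1.0"),
    _sample("2019-01-01T10:00:30.000000Z", "ExampleCrawler/1.0"),
    _sample("2019-01-01T10:00:45.000000Z", "Mozilla/5.0"),
    _sample("2019-01-01T10:01:02.000000Z", "ExampleCrawler/1.0"),
]

counts = requests_per_minute_by_agent(sample_lines)
for (minute, agent), n in counts.most_common():
    print(minute, agent, n)
```

Sorting by count surfaces the busiest agent per minute first, which is how a single crawler issuing thousands of requests a minute would stand out from ordinary browser traffic.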
Root Cause Analysis
The Amazon Web Services (AWS) environment hosting Sitecore was sized to handle more than 2,000 requests per minute, so the crawler traffic alone should not have overwhelmed it. There were clearly underlying optimization issues within the system that needed to be addressed. A root cause analysis identified a set of findings that were contributing to the performance and availability issues. The fishbone diagram below shows the causes that led to the gateway timeout errors observed in the logs.
After reviewing and analyzing the various system components, parameters, and settings, we recommended a comprehensive optimization plan. Key findings included:
Cache settings and memory monitors were not optimized for a 64-bit operating system; most settings were left at their defaults.
Indexes on several Sitecore databases were fragmented, and index housekeeping was not working properly.
The SQL database recovery model was set to FULL. Full recovery increases transaction log writes and can reduce performance.
Unhandled exceptions were generated frequently. These add system overhead and should be handled within the application.
To avoid adversely impacting the site during regular business hours, the Nexgate crawler agent should be scheduled for off-peak hours.
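For the index fragmentation finding, a housekeeping job typically decides per index whether to reorganize or rebuild. The sketch below applies a commonly cited rule of thumb (reorganize between roughly 5% and 30% fragmentation, rebuild above 30%, and skip very small indexes). The thresholds, the function name, and the sample index names are assumptions for illustration and are not the settings used in this engagement.

```python
def maintenance_action(fragmentation_pct, page_count, min_pages=1000):
    """Suggest an index maintenance action from fragmentation statistics.

    Thresholds are an assumed rule of thumb, not a definitive policy.
    """
    if page_count < min_pages or fragmentation_pct < 5.0:
        return "none"
    if fragmentation_pct <= 30.0:
        return "reorganize"
    return "rebuild"


# Hypothetical statistics: (index name, fragmentation percent, page count).
sample_indexes = [
    ("IX_Example_Items", 3.2, 50000),
    ("IX_Example_Fields", 18.5, 120000),
    ("IX_Example_History", 72.0, 800000),
    ("IX_Example_Small", 90.0, 40),
]

plan = {name: maintenance_action(frag, pages)
        for name, frag, pages in sample_indexes}
for name, action in plan.items():
    print(name, action)
```

In practice the input figures would come from the database's own fragmentation statistics, and the thresholds should be tuned to the workload.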
Platform Engineering and Operations Best Practices
The Amazon Web Services (AWS) environment provides a number of built-in services for logging and monitoring critical services. Amazon CloudWatch, CloudWatch Logs, and centralized logging allow for efficient analysis and troubleshooting. stackArmor has developed a systems operations solution called stackArmor ThreatAlert™. ThreatAlert™ tracks and monitors key services and systems operations parameters, with integrated log analysis and correlation capabilities built on industry-standard technologies like Splunk. Forwarding system logs from various services into a centralized cybersecurity and operations data lake is critical for ensuring robust system operations.