Service Desk

MyPlace : MYPLACE-XXXX

Incident Date: [Insert Date]

Prepared By: [Your Name]


Incident Summary: Over the weekend, one of our MyPlace apps experienced downtime. The root cause was identified as Docker logs filling up, which locked memory and prevented the application from running. Increased user activity and connection loss with Bepoz drove excessive log growth and made the problem worse.


Timeline of Events:

  1. Incident Detection:

    • [Time] - Monitoring systems alerted the operations team about the app's downtime.

    • [Time] - Initial investigation began to identify the root cause of the issue.

  2. Incident Response:

    • [Time] - The operations team determined that the Docker logs were full and had locked memory.

    • [Time] - Logs were manually cleared, but the application still failed to run due to locked memory.

  3. Mitigation:

    • [Time] - Further investigation revealed excessive log growth due to increased user activity and connection loss with Bepoz.

    • [Time] - Immediate measures were taken to free up memory and restore application functionality.


Root Cause Analysis: The primary cause of the incident was Docker logs filling up and locking memory, which prevented the app from running. Two factors contributed: increased user activity, which accelerated log growth, and connection loss with Bepoz, which flooded the logs.
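
To illustrate how log flooding from repeated connection failures could be bounded at the source, the sketch below throttles identical log messages. It assumes, purely for illustration, a service that logs through Python's standard logging module. The ThrottleFilter class, the "bepoz.client" logger name, and the 60-second interval are hypothetical and are not taken from the MyPlace codebase.

```python
import logging
import time


class ThrottleFilter(logging.Filter):
    """Let each distinct message template through at most once per interval."""

    def __init__(self, interval_seconds: float = 60.0, clock=time.monotonic):
        super().__init__()
        self.interval = interval_seconds
        self.clock = clock  # injectable so the filter can be tested without sleeping
        self._last_emitted: dict[tuple[str, int, str], float] = {}

    def filter(self, record: logging.LogRecord) -> bool:
        # Key on the unformatted template so that "Connection to %s lost"
        # with different arguments still counts as the same message.
        key = (record.name, record.levelno, str(record.msg))
        now = self.clock()
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.interval:
            return False  # a repeat inside the interval: drop it
        self._last_emitted[key] = now
        return True


# Hypothetical wiring: attach the filter to the logger used by the Bepoz client.
logger = logging.getLogger("bepoz.client")
logger.addFilter(ThrottleFilter(interval_seconds=60))
```

With this filter attached, a reconnect loop that fails many times per second would write one line per minute for each distinct message instead of one line per attempt. The trade-off is that suppressed repeats are not counted, so a real implementation might also record how many messages were dropped.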


Impact:

  • User Impact: Users could not access the MyPlace app during the outage, which caused frustration and a potential loss of trust.

  • Business Impact: Downtime affected multiple clients, causing disruptions in service and potential revenue loss.


Action Items and Preventive Measures:

  1. Automated Docker Log Cleanup:

    • Task: Develop and implement an automated process for regular Docker log cleanup.

    • Status: To Do

    • Owner: DevOps Team (Murali and Saeed)

    • Details: Write a log cleaner script, integrate it into the CI/CD pipeline, and schedule it to run weekly.

  2. Enhanced Offline Mode User Experience:

    • Task: Improve offline mode to provide a more robust user experience when the backend is down.

    • Status: To Do

    • Owner: Backend and Frontend Teams

    • Details: Ensure the offline screen is consistently shown, prevent automatic logout, and preserve access to the member card.

  3. Review and Optimize AWS Services Load:

    • Task: Review and optimize the load on AWS services such as Compute and DB Engine.

    • Status: To Do

    • Owner: DevOps Team

    • Details: Ensure the log cleaning process does not adversely affect the performance of AWS services.
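
As a starting point for the log cleaner script in action item 1, here is a minimal sketch. It assumes the containers use Docker's default json-file logging driver and the default data root (/var/lib/docker), and that the script runs with permission to modify those files. The 100 MiB threshold and all function names are hypothetical placeholders to be agreed by the DevOps team.

```python
from pathlib import Path

# Assumption: default Docker data root and json-file logging driver.
DOCKER_CONTAINERS_DIR = Path("/var/lib/docker/containers")
MAX_LOG_BYTES = 100 * 1024 * 1024  # hypothetical 100 MiB threshold


def find_oversized_logs(root: Path, max_bytes: int) -> list[Path]:
    """Return container log files under root that are larger than max_bytes."""
    return sorted(
        p for p in root.glob("*/*-json.log") if p.stat().st_size > max_bytes
    )


def truncate_logs(paths: list[Path]) -> int:
    """Truncate each file in place and return the number of bytes freed."""
    freed = 0
    for path in paths:
        freed += path.stat().st_size
        # Truncate rather than delete: the file keeps its identity, so the
        # Docker daemon's open handle continues to point at the same file.
        with open(path, "w"):
            pass
    return freed


def clean(root: Path = DOCKER_CONTAINERS_DIR, max_bytes: int = MAX_LOG_BYTES) -> int:
    """Truncate every oversized container log under root."""
    return truncate_logs(find_oversized_logs(root, max_bytes))
```

Calling clean() from a weekly scheduled job would match the cadence proposed above. Truncation discards log history, so anything needed for later investigation should be shipped or archived first. Docker's json-file driver also supports max-size and max-file log options, which may be worth evaluating alongside a scheduled script.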


Conclusion: This incident highlighted the need for better log management and a more robust offline mode. By implementing the action items outlined above, we aim to prevent similar incidents and improve the user experience when the backend is unavailable. We will review and adjust these measures regularly to confirm they remain effective.

Next Steps:

  • Immediate implementation of automated log cleanup.

  • Collaborative efforts between backend and frontend teams to improve offline mode.

  • Continuous monitoring and optimization of AWS services.

Attachments:

  • Incident Logs

  • Action Item Task Details

Acknowledgements: Thank you to the operations and development teams for their swift response and efforts to mitigate the impact of this incident. Your dedication ensures we can continue to provide reliable and high-quality service to our users.