Problems with MT (Outgoing) SMS
Incident Report for GatewayAPI
Postmortem

Incident Postmortem: Extraordinarily Long Queues

Issue Description: During the weekend and yesterday morning, gatewayapi.com experienced extraordinarily long queues, causing significant delays in message delivery.

Root Cause: The root cause of the issue was identified as the internal number lookup service used for routing. This service encountered difficulties running efficiently, creating a bottleneck for all messages and severely slowing down delivery times.

Actions Taken to Resolve:

  1. Increased Service Instances: To address the issue promptly, we increased the number of always-running service instances. This distributed the load more evenly across our infrastructure, relieving the bottleneck and improving message delivery times.
  2. Timeout Limit Adjustment: We lowered the timeout limits for lookups in the internal number lookup service, so that a slow lookup holds up a message for less time. A similar issue in the future should therefore have a less severe impact on message delivery times.
  3. Database Upgrade: In addition to the above measures, we performed a database upgrade to further enhance the overall system performance and reliability.
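
To illustrate the timeout adjustment in step 2, the sketch below bounds how long a single lookup may block routing. It is a minimal illustration, not our production code: the function names, the fallback route, and the timeout values are all hypothetical, and falling back to a default route on timeout is an assumption made for the example.

```python
import concurrent.futures
import time

# Hypothetical fallback used when a lookup does not answer in time.
DEFAULT_ROUTE = "default-route"


def lookup_route(msisdn, lookup_fn, timeout_s, pool):
    """Run lookup_fn(msisdn) on the pool, waiting at most timeout_s seconds."""
    future = pool.submit(lookup_fn, msisdn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the task has not started yet; a lookup that
        # is already running keeps its worker thread until it finishes.
        future.cancel()
        return DEFAULT_ROUTE


def slow_lookup(msisdn):
    """Stand-in for a degraded lookup service (hypothetical)."""
    time.sleep(0.3)
    return "route-b"


def fast_lookup(msisdn):
    """Stand-in for a healthy lookup service (hypothetical)."""
    return "route-a"
```

With a lower `timeout_s`, a degraded lookup delays each message by at most the timeout instead of by the full duration of the slow call, which is why the queue drains faster even while the lookup service is struggling.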

Preventative Measures: To prevent similar incidents in the future, we will undertake the following measures:

  • Continuous Monitoring: We will implement more robust monitoring to promptly detect any anomalies in the internal number lookup service or related components.
  • Redundancy and Scaling: We will explore redundancy options and scaling strategies to ensure the system can handle peak loads without disruptions.
  • Automated Testing: Regular automated testing of critical services will be conducted to identify potential issues before they impact the production environment.
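
As a rough illustration of the monitoring measure, the sketch below raises an alert when the oldest queued message has been waiting too long for several consecutive samples. The data structure, the 60-second threshold, and the three-sample window are hypothetical values chosen for the example, not our actual alerting configuration.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    depth: int           # number of messages waiting in the queue
    oldest_age_s: float  # age in seconds of the oldest waiting message


def should_alert(samples, max_age_s=60.0, consecutive=3):
    """Return True if the last `consecutive` samples all exceed max_age_s.

    Requiring several samples in a row avoids alerting on a single spike.
    """
    if len(samples) < consecutive:
        return False
    return all(s.oldest_age_s > max_age_s for s in samples[-consecutive:])
```

Alerting on message age rather than queue depth alone targets what customers actually notice: a deep queue that drains quickly is harmless, while a shallow queue with old messages indicates a stalled component.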

Timeline: The issue began at 10:05 and was resolved at 10:45 CET on 5 February 2024.

We apologize for any inconvenience this incident may have caused our customers. Our team remains committed to ensuring the reliability and performance of gatewayapi.com. If you have any further questions or concerns, please do not hesitate to reach out to us.

Posted Feb 06, 2024 - 13:03 CET

Resolved
The reported issues have been fully resolved. Our systems are now operating normally, and all services are functioning as expected.

We sincerely apologize for any inconvenience you may have experienced during this time and appreciate your patience and understanding.

Should you have any questions or require further assistance, please feel free to contact our support team.
Posted Feb 05, 2024 - 10:52 CET
Monitoring
We are pleased to inform you that the issue has been successfully resolved by our technical team.

Currently, there is a minor backlog in processing, but we anticipate this will be completely cleared within the next 10-15 minutes. We appreciate your patience and understanding during this time and apologize for any inconvenience caused.
Posted Feb 05, 2024 - 10:45 CET
Investigating
We have identified an MT traffic issue.

Our developers are currently working on the issue.

Users may experience degraded delivery of MT traffic.

We will keep you updated.
Posted Feb 05, 2024 - 10:22 CET
This incident affected: Message Services - Commercial (MT (Outgoing) SMS - Commercial).