During the incident window, the DHCP service did not issue any IP addresses. Devices that did not already hold an IP address could not obtain one.
Incident time window
Start: April 30th 2023, at 00:00 UTC
- Problem definition
- Operation to identify the root cause
- Workaround and development of the patch
End: April 30th 2023, at 07:00 UTC
The main DHCP database (EMEA) was unavailable at the time of the incident. Without access to the database, the DHCP service could not distribute IP leases.
On April 26th, our Operation team performed a task to prepare for the Database architecture change planned for May 9th. This operation consisted of connecting the new architecture (AMER, APAC) to the existing one in order to synchronize data in real time and reduce service downtime during the final migration. This action had no impact on the service when it was performed.
On April 30th at 00:00 UTC, a scheduled maintenance task ran automatically as part of the weekly log cleanup. This task reinitialized the secured layer service and the database. The secured layer service is responsible for securing communication between DHCP nodes, including Database channels.
As a result of the task on April 26th, a configuration error had been introduced in the secured layer service configuration. The error had no visible effect until the service was restarted, at which point it prevented the service from starting properly. Because the secured layer service was not initialized properly, no other node could connect to the database.
New devices connecting to the Guest network (or guests renegotiating an IP lease) could not obtain an IP address during the incident window.
Action 1 - System - Done (30/04/2023, 08:55 UTC)
The secured layer service configuration was reviewed and fixed to make sure this issue does not recur.
Action 2 - System - Done (30/04/2023, 08:55 UTC)
The Operation team reviewed all configurations and added golden configurations to make sure these configurations are not changed without proper review.
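As an illustration of how a golden configuration check can work, the sketch below compares the deployed configuration files against a reviewed baseline and reports any file that is missing or differs. It is a minimal example under stated assumptions, not the tooling used by the Operation team: the function names `sha256_of` and `find_drift`, and the layout of one golden directory mirrored by one live directory, are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(golden_dir: Path, live_dir: Path) -> list[str]:
    """List files (relative to golden_dir) whose live copy is missing or differs.

    Hypothetical layout: golden_dir holds the reviewed baseline and
    live_dir holds the deployed configuration with the same file names.
    """
    drifted = []
    for golden_file in sorted(golden_dir.rglob("*")):
        if not golden_file.is_file():
            continue
        relative = golden_file.relative_to(golden_dir)
        live_file = live_dir / relative
        # A missing live file counts as drift, as does any content difference.
        if not live_file.is_file() or sha256_of(live_file) != sha256_of(golden_file):
            drifted.append(relative.as_posix())
    return drifted
```

Run before any scheduled task that restarts a service, a check of this kind could flag an unreviewed change while the service is still up, rather than at restart time.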