Resolved 16:40 UTC - The outage is resolved. Several issues were involved: incorrect DNS settings, code that was sensitive to DNS resolution failures, and podman not using host networking.
Update 14:40 UTC - We are testing a workaround that should let us re-enable the service. Fingers crossed.
Update 13:40 UTC - We seem to have found a setup that makes DNS resolution more reliable (removing aardvark-dns), as well as a bug in the latest release that prevented this error from being retried. As for why it worked before ….
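The retry behaviour mentioned in the 13:40 update can be illustrated with a minimal sketch. This is not the service's actual code: the function name `resolve_with_retry`, its parameters, and the backoff values are hypothetical, and it only assumes that a transient resolution failure surfaces as Python's `socket.gaierror` (the exception behind "[Errno -2] Name or service not known").

```python
import socket
import time


def resolve_with_retry(host, port=443, attempts=3, delay=0.5,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying when DNS resolution fails.

    Hypothetical helper for illustration only. `resolver` and `sleep`
    are injectable so the retry logic can be exercised without a network.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            # Resolution failures such as [Errno -2] land here.
            last_error = exc
            if attempt < attempts:
                # Exponential backoff: delay, 2*delay, 4*delay, ...
                sleep(delay * 2 ** (attempt - 1))
    # All attempts failed: surface the last resolution error.
    raise last_error
```

Retrying only on `socket.gaierror` keeps other failures (for example, a refused connection) visible immediately instead of masking them behind retries.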
Reopened 11:45 UTC - The IT DNS was a false lead. We seem to be hitting a podman DNS bug. The investigation continues.
Resolved 11:45 UTC - As it turned out, the VPC we had used for years had a DNS setup that is prone to problems. After migrating to another VPC, the problem is gone.
Update 10:30 UTC - The problem has been identified: it currently appears that IT DNS problems are causing random resolution errors. We are testing a workaround and working with IT to resolve it.
Outage 8:00 UTC - We are investigating DNS problems on the workers. All tests might fail with 'Failed to establish a new connection: [Errno -2] Name or service not known'.
Last updated: May 20, 2024 at 1:11 PM UTC