⚓ T373243 DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least)
DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least)
Closed, Resolved | Public | BUG REPORT
Steps to replicate the issue (include links if applicable):
- Go to: https://author-disambiguator.toolforge.org/work_item_oauth.php
- If it works, repeat half a dozen times until it fails
Note - this is a PHP app running on Kubernetes - see /data/project/author-disambiguator etc.
What happens?:
Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.svc.eqiad.wmflabs failed: Temporary failure in name resolution in /data/project/author-disambiguator/public_html/lib/database_tools.php:15
What should have happened instead?:
You should have seen the default page for the application (after OAuth login)
Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):
Other information (browser name/version, screenshots, etc.):
Event Timeline
Same for my tool (pod spacemedia-6fdcc8d798-8sncn). Started to fail at 2024-08-25T17:38:18.469Z with error message "java.net.UnknownHostException: tools.db.svc.wikimedia.cloud"
I don't see name resolution problems on the bastion or on my Cloud VPS instances.
Failed on first try:
Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.svc.wikimedia.cloud failed: Temporary failure in name resolution in /data/project/author-disambiguator/public_html/lib/database_tools.php:15
Stack trace:
#0 /data/project/author-disambiguator/public_html/lib/database_tools.php(15): mysqli->__construct()
#1 /data/project/author-disambiguator/public_html/work_item_oauth.php(7): DatabaseTools->openToolDB()
#2 {main}
  thrown in /data/project/author-disambiguator/public_html/lib/database_tools.php on line 15
Getting this for AntiCompositeBot's nolicense task as well (Pod/anticompositebot.nolicense-cron-28743485-x7fqt on tools-k8s-worker-nfs-38):
2024-08-25 18:06:37 nolicense ERROR: (2003, "Can't connect to MySQL server on 'commonswiki.analytics.db.svc.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")
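For tools hit by the "Temporary failure in name resolution" error above, a client-side retry can soften intermittent failures while the cluster DNS is being fixed. The following is a minimal sketch, not something proposed in this task: `resolve_with_retry`, its parameters, and the backoff values are all hypothetical. It retries only on `EAI_AGAIN` (the temporary-failure code) and re-raises anything else, such as an unknown host.

```python
import socket
import time


def resolve_with_retry(host, port=3306, attempts=4, base_delay=0.5,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying only on temporary DNS failures.

    `resolver` and `sleep` are injectable so the logic can be exercised
    without real DNS traffic; the defaults use the system resolver.
    """
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            # EAI_AGAIN is "Temporary failure in name resolution".
            # Permanent errors and the final attempt are re-raised as-is.
            if exc.errno != socket.EAI_AGAIN or attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

A tool would call `resolve_with_retry("tools.db.svc.wikimedia.cloud")` before opening its database connection, or wrap the connection call itself in the same loop. This only masks the symptom; it does not replace fixing the resolver.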
I think this is related:
ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes. ERROR: Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu
tools.krinklebot is hitting "Could not resolve host: commons.wikimedia.org" for production hostnames as well. This runs as a scheduled Toolforge job:
[2024-08-24T15:40:46+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-24T15:41:17+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-24T20:31:19+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:10:55+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:11:27+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:11:58+00:00] ERROR: Skipping [[Project:Auto-protected files/wikinews/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:12:29+00:00] ERROR: Skipping [[Project:Auto-protected files/wiktionary/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:13:00+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/fa]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:13:42+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/fr]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
Noting here that I'm unable to use Build Service, probably due to the same issue. Related log line:
[step-clone] 2024-08-25T22:59:56.754700588Z {"level":"error","ts":1724626796.754072,"caller":"git/git.go:55","msg":"Error running git [fetch --recurse-submodules=yes --depth=1 origin --update-head-ok --force ]: exit status 128\nfatal: unable to access 'https://gitlab.wikimedia.org/toolforge-repos/techcontribs/': Could not resolve host: gitlab.wikimedia.org\n","stacktrace":"github.com/tektoncd/pipeline/pkg/git.run\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:55\ngithub.com/tektoncd/pipeline/pkg/git.Fetch\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:150\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/git-init/main.go:53\nruntime.main\n\truntime/proc.go:255"}
Are people still seeing this issue? I'm unable to produce the specific failure mentioned in the task description.
The last one I got was 2024-08-25 22:07:47Z. But it's been intermittent the whole time.
By 'intermittent', do you mean that it's always failing a little bit, or that every few hours it fails a lot for a few minutes?
For me the errors are gone (toolforge job service works, I was able to build and deploy my tool. No more DNS errors, everything looks fine).
hmm... from a webservice shell, we sometimes get a non-authoritative answer:
I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org
Server: 10.96.0.10
Address: 10.96.0.10#53

Name: tools-harbor.wmcloud.org
Address: 172.16.5.140

I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org
Server: 10.96.0.10
Address: 10.96.0.10#53

Non-authoritative answer:
Name: tools-harbor.wmcloud.org
Address: 172.16.5.140
Just manually scaled up the number of replicas for the coredns deployment from 2 to 4, and things seem to be improving, is anyone still seeing issues?
Querying from a webservice shell fails pretty frequently, even for internal names (and without domain searching, i.e. with a trailing dot):
I have no name!@shell-1724670591:~$ time nslookup api.svc.tools.eqiad1.wikimedia.cloud.
Server: 10.96.0.10
Address: 10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113

real 0m0.041s
user 0m0.013s
sys 0m0.017s
########################################################################
I have no name!@shell-1724670591:~$ time nslookup api.svc.tools.eqiad1.wikimedia.cloud.
Server: 10.96.0.10
Address: 10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113
;; communications error to 10.96.0.10#53: timed out

real 0m5.050s
user 0m0.018s
sys 0m0.014s
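To put a number on "fails pretty frequently", the same lookup can be repeated and the failures counted. The sketch below is an illustration under assumptions rather than the procedure used in this task: `sample_lookups` is a hypothetical helper, and it goes through `socket.getaddrinfo`, which uses whatever resolver the container is configured with (inside a pod that is presumably the cluster DNS at 10.96.0.10) instead of querying that address explicitly as `nslookup` does.

```python
import socket
import time


def sample_lookups(host, tries=20, resolver=socket.getaddrinfo,
                   clock=time.monotonic):
    """Resolve `host` repeatedly and summarise failures and worst latency.

    A trailing dot on `host` avoids search-domain expansion, as in the
    nslookup runs above. `resolver` and `clock` are injectable for testing.
    """
    failures = 0
    slowest = 0.0
    for _ in range(tries):
        start = clock()
        try:
            resolver(host, None)
        except socket.gaierror:
            failures += 1
        # Track the slowest attempt, whether it succeeded or not.
        slowest = max(slowest, clock() - start)
    return {
        "tries": tries,
        "failures": failures,
        "failure_rate": failures / tries,
        "slowest_s": slowest,
    }
```

For example, `sample_lookups("api.svc.tools.eqiad1.wikimedia.cloud.", tries=50)` run from a webservice shell would give a rough failure rate to compare between workers.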
It's running on worker-104
tools.wm-lol@tools-bastion-13:~$ kubectl get pods shell-1724670591 -o yaml | grep worker
  nodeName: tools-k8s-worker-104
From the coredns pod it's way more reliable:
root@tools-k8s-control-7:~# time nsenter -n -t 1775910 nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0.10
Server: 10.96.0.10
Address: 10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113

real 0m0.049s
user 0m0.010s
sys 0m0.030s
Trying with nsenter from a few other containers/workers
I can reproduce with nsenter on the worker:
root@tools-k8s-worker-104:~# time nsenter -t 578510 -n nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0.10
;; communications error to 10.96.0.10#53: timed out
Server: 10.96.0.10
Address: 10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113
;; communications error to 10.96.0.10#53: timed out

real 0m2.043s
user 0m0.021s
sys 0m0.020s
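Note that the lookup above still prints an answer even though two of its queries timed out, so the exit status alone would hide the problem. A small parser over captured `nslookup` output can count those timeouts instead. This is only a sketch: `count_timeouts` is a hypothetical helper, and the pattern is taken from the error lines quoted in this task, not from any documented output format.

```python
import re

# Matches lines such as:
#   ;; communications error to 10.96.0.10#53: timed out
TIMEOUT_RE = re.compile(r";; communications error to (\S+?)#(\d+): timed out")


def count_timeouts(output):
    """Return a {server_address: timeout_count} dict for nslookup output."""
    counts = {}
    for server, _port in TIMEOUT_RE.findall(output):
        counts[server] = counts.get(server, 0) + 1
    return counts
```

Feeding it the worker-104 transcript above yields two timeouts for 10.96.0.10, while the transcript from the coredns pod yields none.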
When I'm trying to build an image from my github repo, I got this strange issue:
unable to access 'https://github.com/Saisengen/wikibots/': Could not resolve host: github.com\n"
Could it be related to this issue?
Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T13:05:06Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T13:12:41Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)
So going around with cumin, we found some workers that fail often:
tools-k8s-worker-{nfs-{4,15,18,25,51,52},104}
# running this many times to get all the failures
root@cloudcumin1001:~# cumin --force 'O{project:tools name:.*worker.*}' 'nsenter -n -t $(pgrep calico| head -n1) dig +tries=1 tools-harbor.wmcloud.org @10.96.0.10'
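Because the failures are intermittent, a single cumin run can miss a bad worker, which is why the command is run many times. One way to combine the runs is to tally how often each host failed. The sketch below assumes the failed hosts of each run have already been collected into sets by hand or by some other script; `tally_failures`, `flaky_hosts`, and the sample data are hypothetical and not taken from the actual runs.

```python
from collections import Counter


def tally_failures(runs):
    """Count, per host, in how many runs it failed.

    `runs` is an iterable of sets, one set of failed hostnames per run.
    """
    counts = Counter()
    for failed in runs:
        counts.update(failed)
    return counts


def flaky_hosts(runs, min_failures=1):
    """Return the sorted hosts that failed in at least `min_failures` runs."""
    tally = tally_failures(runs)
    return sorted(host for host, n in tally.items() if n >= min_failures)


# Hypothetical example: three runs with different subsets failing.
example_runs = [
    {"tools-k8s-worker-104", "tools-k8s-worker-nfs-4"},
    {"tools-k8s-worker-104"},
    {"tools-k8s-worker-nfs-18", "tools-k8s-worker-104"},
]
print(flaky_hosts(example_runs))
```

With `min_failures=1` this is the union of all failures; raising the threshold filters out hosts that failed only once.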
The rest of the workers do not seem to fail. The failing ones are restarting right now, though a reboot did not help with worker-104 :/, so we might have to find something else
The reboot did not help xd, and the VMs are spread over six different cloudvirts (only nfs-25 and nfs-52 share one):
root@cloudcontrol1007:~# for node in tools-k8s-worker-{nfs-{4,15,18,25,51,52},104}; do echo "$node -> $(OS_PROJECT_ID=tools openstack server show $node | grep hypervisor_hostname)"; done
tools-k8s-worker-nfs-4 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1048.eqiad.wmnet |
tools-k8s-worker-nfs-15 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1034.eqiad.wmnet |
tools-k8s-worker-nfs-18 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1060.eqiad.wmnet |
tools-k8s-worker-nfs-25 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-nfs-51 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1057.eqiad.wmnet |
tools-k8s-worker-nfs-52 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-104 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1054.eqiad.wmnet |
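Grouping the placement above by hypervisor makes it easy to check whether the failing workers share a host. The sketch below copies the worker-to-cloudvirt pairs from the output above; the helper name `workers_by_hypervisor` is hypothetical.

```python
from collections import defaultdict

# Worker -> hypervisor pairs, copied from the openstack output above.
PLACEMENT = {
    "tools-k8s-worker-nfs-4": "cloudvirt1048.eqiad.wmnet",
    "tools-k8s-worker-nfs-15": "cloudvirt1034.eqiad.wmnet",
    "tools-k8s-worker-nfs-18": "cloudvirt1060.eqiad.wmnet",
    "tools-k8s-worker-nfs-25": "cloudvirt1032.eqiad.wmnet",
    "tools-k8s-worker-nfs-51": "cloudvirt1057.eqiad.wmnet",
    "tools-k8s-worker-nfs-52": "cloudvirt1032.eqiad.wmnet",
    "tools-k8s-worker-104": "cloudvirt1054.eqiad.wmnet",
}


def workers_by_hypervisor(placement):
    """Invert a worker -> hypervisor mapping into hypervisor -> [workers]."""
    groups = defaultdict(list)
    for worker, hypervisor in placement.items():
        groups[hypervisor].append(worker)
    return dict(groups)


groups = workers_by_hypervisor(PLACEMENT)
# Keep only hypervisors that host more than one failing worker.
shared = {hv: ws for hv, ws in groups.items() if len(ws) > 1}
print(len(groups), "hypervisors;", "shared:", shared)
```

The seven failing workers sit on six hypervisors, and only cloudvirt1032 hosts two of them (nfs-25 and nfs-52), which is consistent with the conclusion that the problem is not tied to one cloudvirt.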
I have cordoned all the misbehaving workers, so users should stop seeing issues right now. I will try to debug in more detail and add new nodes if I can't find anything
Just to confirm I've done a few dozen actions that would have triggered this problem a few days ago, and everything is working. Thanks!
New nodes seem to not have the issue, so will continue adding new ones (added worker-nfs-57)
dcaro lowered the priority of this task from Unbreak Now! to Medium. (Tue, Aug 27, 7:01 AM)
Currently cleaning up the old nodes, but everything seems stable
When I'm trying to build an image from my github repo, I got this strange issue:
unable to access 'https://github.com/Saisengen/wikibots/': Could not resolve host: github.com\n"
Could it be related to this issue?
Yes, that was caused by this issue, it should be gone now (if not please report otherwise)
dcaro closed this task as Resolved. (Tue, Aug 27, 9:52 AM)
dcaro claimed this task.
I'll close this as it's been stable for a while and all the misbehaving nodes have been deleted :)
The issues I was seeing previously appear to have all resolved themselves, thank you.
@dcaro My tool reads data from a DB replica. Less than an hour ago the tool was working correctly, but now it returns this error (on 100% of attempts): Unable to connect to any of the specified MySQL hosts. ---> System.ArgumentException: The host name or IP address is invalid.
The host name is ruwiki.
Which tool is it?
Do you have the snippet of code that does the call?
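One thing worth ruling out while waiting for that snippet: other reports in this task connect to fully qualified replica names such as commonswiki.analytics.db.svc.wikimedia.cloud, whereas the host reported here is the bare name "ruwiki". Whether that is the actual cause in this tool is an assumption. The sketch below shows a hypothetical helper, `replica_host`, that builds the qualified name following the pattern seen in this task.

```python
def replica_host(dbname, cluster="analytics"):
    """Build a wiki replica hostname such as ruwiki.analytics.db.svc.wikimedia.cloud.

    The naming pattern is inferred from hostnames quoted in this task;
    names that already contain a dot are returned unchanged.
    """
    if "." in dbname:
        return dbname
    return f"{dbname}.{cluster}.db.svc.wikimedia.cloud"


print(replica_host("ruwiki"))
```

If the tool passes the result of such a helper to its MySQL client instead of the bare database name, a remaining failure would point back at name resolution and not at the connection string.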
All the workers seem to be responding ok (might be flaky, but no errors so far):
root@cloudcumin1001:~# cumin --force 'O{project:tools name:.*worker.*}' 'nsenter -n -t $(pgrep calico| head -n1) dig +tries=1 +short ruwiki.analytics.db.svc.wikimedia.cloud @10.96.0.10'
63 hosts will be targeted:
tools-k8s-worker-[102-103,105-108].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[1-3,5-14,16-17,19-24,26-50,53-58,60-64].tools.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(63) tools-k8s-worker-[102-103,105-108].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[1-3,5-14,16-17,19-24,26-50,53-58,60-64].tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'nsenter -n -t $(...loud @10.96.0.10' -----
s6.analytics.db.svc.wikimedia.cloud.
172.20.255.7
================
PASS |████████████████████| 100% (63/63) [00:05<00:00, 12.16hosts/s]
FAIL |                    | 0% (0/63) [00:05<?, ?hosts/s]
100.0% (63/63) success ratio (>= 100.0% threshold) for command: 'nsenter -n -t $(...loud @10.96.0.10'.
100.0% (63/63) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
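The bracketed host ranges in the cumin output can be expanded to confirm the host count and to check that the workers found misbehaving earlier are no longer targeted. This is a minimal sketch that handles only the single-bracket form seen above; `expand_nodeset` is a hypothetical helper and not part of cumin.

```python
import re


def expand_nodeset(expr):
    """Expand 'prefix[1-3,5]suffix' into a list of hostnames.

    Supports one bracket group holding comma-separated numbers or ranges.
    """
    match = re.fullmatch(r"(.*?)\[([\d,\-]+)\](.*)", expr)
    if not match:
        return [expr]
    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        # A bare number such as "5" is treated as the range 5-5.
        for n in range(int(low), int(high or low) + 1):
            names.append(f"{prefix}{n}{suffix}")
    return names


TARGETS = (
    expand_nodeset("tools-k8s-worker-[102-103,105-108].tools.eqiad1.wikimedia.cloud")
    + expand_nodeset(
        "tools-k8s-worker-nfs-[1-3,5-14,16-17,19-24,26-50,53-58,60-64]"
        ".tools.eqiad1.wikimedia.cloud"
    )
)
print(len(TARGETS))
```

The expansion gives 63 hosts, matching the count cumin reports, and none of worker-104 or worker-nfs-4, 15, 18, 25, 51, 52 appear in it.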
Thanks. I already used string indexation in other tools, but not this tool, because it's very old code.