[ironic]: Timeout reached while waiting for callback for node
Thanks Julia. In addition to what you mentioned, this particular issue seems to have cropped up when we added 100 more baremetal nodes.
I've also narrowed the issue (TFTP timeouts) down to when 3-4 baremetal nodes are in the "deploy" state and downloading the OS via iSCSI. Each iSCSI transfer takes about 6 Gbps, so with four transfers we exceed the 20 Gbps capacity of our leaf-spine links. We are slowly migrating to iPXE, so that should help.
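As a rough back-of-the-envelope check of those numbers, here is a small sketch. It assumes each iSCSI transfer holds a steady 6 Gbps and that all transfers share the same 20 Gbps of leaf-spine capacity with no other traffic; the function names are made up for illustration only.

```python
# Figures quoted in the message above; both are approximate.
PER_TRANSFER_GBPS = 6    # observed rate of one iSCSI image transfer
LINK_CAPACITY_GBPS = 20  # total leaf-spine capacity

def aggregate_load_gbps(transfers, per_transfer=PER_TRANSFER_GBPS):
    """Total bandwidth demanded by `transfers` simultaneous deployments."""
    return transfers * per_transfer

def is_oversubscribed(transfers, capacity=LINK_CAPACITY_GBPS):
    """True when the simultaneous transfers demand more than the links carry."""
    return aggregate_load_gbps(transfers) > capacity

def max_concurrent_transfers(capacity=LINK_CAPACITY_GBPS,
                             per_transfer=PER_TRANSFER_GBPS):
    """Largest number of transfers that still fits within the capacity."""
    return capacity // per_transfer

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, aggregate_load_gbps(n), is_oversubscribed(n))
```

Under those assumptions three transfers still fit (18 Gbps) and the fourth pushes demand to 24 Gbps, which matches the point where the TFTP timeouts start showing up.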
That being said, is there a document on large-scale ironic design architectures? We are looking into a DC design (primarily for baremetal) for up to 2500 nodes.
On Wednesday, October 23, 2019, 03:19:41 PM PDT, Julia Kreger <juliaashleykreger at gmail.com> wrote:
On Tue, Oct 22, 2019 at 12:47 PM fsbiz at yahoo.com <fsbiz at yahoo.com> wrote:
TFTP logs: shows TFTP client timed out (weird). Any pointers here?
Sadly, this is one of those things that comes with using TFTP. Issues like this are why the community tends to recommend using ipxe.efi to chainload, as you can perform transport over TCP as opposed to UDP, wherein something might happen mid-transport.
tftpd shows ramdisk_deployed completed. Then, it reports that the client timed out.
Grub does tend to be very abrupt and not wrap up its final actions. I suspect it may just never be sending the ack back, and the transfer may actually be completing. I'm afraid this is one of those things where you really need to watch the console to see what is going on. My guess would be that your deploy_ramdisk lost a packet in transfer or that it was corrupted in transport. It would be interesting to know if the network card stack is performing checksum validation, but for IPv4 it is optional. [trim]
This has me stumped here. This exact failure seems to be happening 3 to 4 times a week on different nodes. Any pointers appreciated.