[Sciserver-users] Failed compute node
Jonas Haase
jhaase at mpe.mpg.de
Fri Feb 2 12:16:02 CET 2024
Dear all
I managed to restart sciserver-comp7 without issues this time and the disk hosting the Docker containers remained intact.
I have reinstated all containers on the compute nodes that have been accessed within the last year, and you should be able to restart them now (they might take a little while to start up again).
I am still looking into what caused the compute nodes to fail, but a reasonable guess is excessive resource use.
Please run all processing which uses a lot of CPU, memory or time as a compute job instead of in an interactive container so you don’t disturb other users.
sciserver-comp7 had 31 containers running, so even though it may look as if you are alone in your container, other people are using the server at all times as well.
cheers
Jonas
> On 1. Feb 2024, at 16:50, Jonas Haase <jhaase at mpe.mpg.de> wrote:
>
> Dear All
>
> Unfortunately we have another similar case with the other large compute node.
> The VM sciserver-comp7 is currently completely unreachable and is using all of its memory.
>
> I am trying to avoid inducing another failed disk by doing a hard reset on the machine, so for now I have isolated the compute node from the system.
> You will likely not be able to reach any existing container on sciserver-comp7 (giving an unhelpful 504 Gateway error), but new large containers can be started and should automatically run on sciserver-comp5 instead.
>
> I have also reduced the total memory a container is allowed to use to 100 GB to reduce the risk of heavy processing in a few containers gobbling up all the memory of a compute node.
> If you are running tasks which need a lot of memory or processors, please run them as a Compute Job instead.
> (see p.14 of the getting started document)
> https://datashare.mpcdf.mpg.de/s/1e0CF3yRNcgDL4V
>
> I will let you know as soon as I have made progress with the failed node.
>
> cheers
> Jonas
>
>> On 22. Jan 2024, at 10:24, Jonas Haase <jhaase at mpe.mpg.de> wrote:
>>
>> Dear All
>>
>> Unfortunately there was no way to recover the failed drive, so I had to reinitialize it (with higher safety settings this time, knock on wood).
>> That means the containers previously running on sciserver-comp5 were lost; they should have disappeared from your lists in Compute already.
>> I hope this has not caused any undue trouble.
>>
>> The compute node is back online.
>>
>> cheers
>> Jonas
>>
>>> On 17. Jan 2024, at 12:33, Jonas Haase <jhaase at mpe.mpg.de> wrote:
>>>
>>> Dear Sciserver users
>>>
>>> Unfortunately we had an issue with the compute node sciserver-comp5, where Docker and the individual container processes had become unresponsive and refused to shut down cleanly.
>>> As a last resort I rebooted the machine. It has come back up, but unfortunately the virtual disk which holds the container information has become corrupted in the process.
>>>
>>> I will attempt to see if I can fix the disk, but if that does not work out I will have to replace it, which will lead to the loss of the containers which have been running on that machine.
>>> Your data stored on the Storage and Temporary volumes remains unaffected.
>>>
>>> I have turned the node off for the moment. You can still start new containers in the SciServerMPE-Large domain, which should then run on sciserver-comp7 instead.
>>>
>>> My apologies for the inconvenience
>>> Jonas
>>>
>>> —
>>> Jonas Haase
>>> Max Planck Institute for Extraterrestrial Physics (MPE)
>>> Giessenbachstr. 1, 85748 Garching, Germany
>>> X2 366
>>> +49 89 30000 3706
>>>
>>>
>>> --
>>> Sciserver-users mailing list
>>> Sciserver-users at lists.mpe.mpg.de
>>> https://lists.mpe.mpg.de/cgi-bin/mailman/listinfo/sciserver-users
>>
>
—
Jonas Haase
Max Planck Institute for Extraterrestrial Physics (MPE)
Giessenbachstr. 1, 85748 Garching, Germany
X2 366
+49 89 30000 3706