[Sciserver-users] Failed compute node
Jonas Haase
jhaase at mpe.mpg.de
Thu Feb 1 16:50:28 CET 2024
Dear All
Unfortunately we have another similar case with the other large compute node.
The VM sciserver-comp7 is currently completely unreachable and is using all of its memory.
I want to avoid a hard reset of the machine, since that could cause another failed disk, so for now I have isolated the compute node from the system.
You will likely not be able to reach any existing container on sciserver-comp7 (you will get an unhelpful 504 Gateway error), but new large containers can be started and should automatically run on sciserver-comp5 instead.
I have also reduced the total memory a container is allowed to use to 100GB, to reduce the risk of heavy processing in a few containers gobbling up all the memory of a compute node.
If you are running tasks which need a lot of memory or processors please run them as a Compute Job instead.
(see p.14 of the getting started document)
https://datashare.mpcdf.mpg.de/s/1e0CF3yRNcgDL4V
I will let you know as soon as I have made progress with the failed node.
cheers
Jonas
> On 22. Jan 2024, at 10:24, Jonas Haase <jhaase at mpe.mpg.de> wrote:
>
> Dear All
>
> Unfortunately there was no way to recover the failed drive, so I had to reinitialize it (with higher safety settings this time, knock on wood).
> That means the containers previously running on sciserver-comp5 were lost - they should have disappeared from your lists in compute already.
> I hope this has not caused any undue trouble.
>
> The compute node is back online
>
> cheers
> Jonas
>
>> On 17. Jan 2024, at 12:33, Jonas Haase <jhaase at mpe.mpg.de> wrote:
>>
>> Dear Sciserver users
>>
>> Unfortunately we had an issue with the compute node sciserver-comp5, where docker and the individual container processes had become unresponsive and refused to shut down cleanly.
>> As a last resort I rebooted the machine. It has come back up, but unfortunately the virtual disk which holds the container information has become corrupted in the process.
>>
>> I will try to fix the disk, but if that does not work out I will have to replace it, which will lead to the loss of the containers which have been running on that machine.
>> Your data stored on the Storage and Temporary volumes remains unaffected.
>>
>> I have turned the node off for the moment. You can still start new containers in the SciServerMPE-Large domain, which should then run on sciserver-comp7 instead.
>>
>> My apologies for the inconvenience
>> Jonas
>>
>> —
>> Jonas Haase
>> Max Planck Institute for Extraterrestrial Physics (MPE)
>> Giessenbachstr. 1, 85748 Garching, Germany
>> X2 366
>> +49 89 30000 3706
>>
>>
>> --
>> Sciserver-users mailing list
>> Sciserver-users at lists.mpe.mpg.de
>> https://lists.mpe.mpg.de/cgi-bin/mailman/listinfo/sciserver-users
>
—
Jonas Haase
Max Planck Institute for Extraterrestrial Physics (MPE)
Giessenbachstr. 1, 85748 Garching, Germany
X2 366
+49 89 30000 3706