[Sciserver-users] Failed compute node

Thu Feb 1 16:50:28 CET 2024

Dear All

Unfortunately we have another similar case with the other large compute node.
The VM sciserver-comp7 is currently completely unreachable and is using all of its memory.

I want to avoid a hard reset of the machine, since that could cause another disk failure, so for now I have isolated the compute node from the system.
You will likely not be able to reach any existing container on sciserver-comp7 (you will get an unhelpful 504 Gateway error), but new large containers can be started and should automatically run on sciserver-comp5 instead.

I have also reduced the maximum memory a container is allowed to use to 100GB, to reduce the risk of heavy processing in a few containers gobbling up all the memory of a compute node.
If you are running tasks which need a lot of memory or processors, please run them as a Compute Job instead.
(see p.14 of the getting started document)
https://datashare.mpcdf.mpg.de/s/1e0CF3yRNcgDL4V

I will let you know as soon as I have made progress with the failed node.

cheers
Jonas

> On 22. Jan 2024, at 10:24, Jonas Haase <jhaase at mpe.mpg.de> wrote:
> 
> Dear All
> 
> Unfortunately there was no way to recover the failed drive, so I had to reinitialize it (with higher safety settings this time, knock on wood).
> That means the containers previously running on sciserver-comp5 were lost - they should have disappeared from your lists in Compute already.
> I hope this has not caused any undue trouble. 
> 
> The compute node is back online.
> 
> cheers
> Jonas
> 
>> On 17. Jan 2024, at 12:33, Jonas Haase <jhaase at mpe.mpg.de> wrote:
>> 
>> Dear Sciserver users
>> 
>> Unfortunately we had an issue with the compute node sciserver-comp5, where Docker and the individual container processes had become unresponsive and refused to shut down cleanly.
>> As a last resort I rebooted the machine. It has come back up, but unfortunately the virtual disk which holds the container information has become corrupted in the process.
>> 
>> I will try to fix the disk, but if that does not work out I will have to replace it, which will lead to the loss of the containers which have been running on that machine.
>> Your data stored on the Storage and Temporary volumes remains unaffected. 
>> 
>> I have turned the node off for the moment. You can still start new containers in the SciServerMPE-Large domain, which should then run on sciserver-comp7 instead.
>> 
>> My apologies for the inconvenience
>> Jonas
>> 
>> —
>> Jonas Haase
>> Max Planck Institute for Extraterrestrial Physics (MPE)
>> Giessenbachstr. 1, 85748 Garching, Germany
>> X2 366
>> +49 89 30000 3706
>> 
>> 
>> -- 
>> Sciserver-users mailing list
>> Sciserver-users at lists.mpe.mpg.de
>> https://lists.mpe.mpg.de/cgi-bin/mailman/listinfo/sciserver-users

