SSH Disconnects After Years of Regular Use

Out of the blue I started getting SSH disconnects, both locally and over the internet.  For years before this everything just boomed along, with nary an issue to speak of.  Then all of a sudden the disconnects started, and the tasks running over SSH were dropped, often costing me a lot of extra effort to correct or restart them.

This is annoying.  The disconnects are random, they happen even when something is actively moving down the channel, and they affect every device no matter what it is doing, so I’ve been growing disconcerted.

I did look into this, starting with what had changed recently.  The first thing that came to mind:

I picked up three used gigabit switches: one 16-port, one 5-port, and one 8-port.  I immediately began to remove them and test.  No change to my disconnects.  I then thought maybe I needed to disconnect a couple of older switches and work with the ones I had just received, in case one of the older switches was the problem.  I replaced the two 8-port switches in the back room, where the server is, with the 16-port switch I had received and had just eliminated as the cause.  I then traced all the cables so that I knew what each port on the switch was used for.  For instance, port 16 went to the LAN port of the pfSense router, while port 15 went to the server.  I could observe the lights.  Any cable that wasn’t plugged in on both ends was removed from the picture.

Unfortunately this did not resolve the issue.  The disconnects have been going on for almost a week.  They also appear to be happening on a second computer, though I’m not 100% positive about that; if so, it has only just started.  However, the second computer is connected to the file server where the random disconnects are happening, so it might be a factor of the connection to the first.  Then again, the software telling me there was a problem has nothing to do with SSH.  Anyway, I’ll look at that later.

This morning I decided to look at the server to see how it was doing.  I had decided it would be nice to run a ping test over a long period of time, and this morning I checked its output file for lost packets.  The file actually showed 100% loss.  So I ran the ping command manually to see what the results were.  They were all lost.  I found a method online for doing what I wanted and tested that.  That was fine.  I let it run for a while without any disconnects.  But ping uses ICMP, not the TCP or UDP that SSH and Samba use.
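Since ping only exercises ICMP, a probe that opens real TCP connections to the SSH port would be closer to what SSH itself experiences.  What follows is only a minimal sketch of that idea, not what I actually ran: the function names `tcp_probe`, `summarize`, and `monitor` are made up for illustration, and the host address in the usage note is a placeholder.

```python
import socket
import time


def tcp_probe(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        # create_connection performs a full TCP handshake, unlike ICMP ping
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # covers refused connections, timeouts, and unreachable hosts
        return None


def summarize(results):
    """Count total attempts and failed attempts in a list of probe results."""
    failed = sum(1 for r in results if r is None)
    return {"attempts": len(results), "failed": failed}


def monitor(host, port, count, interval=1.0):
    """Probe host:port `count` times, pausing `interval` seconds between tries."""
    results = []
    for _ in range(count):
        results.append(tcp_probe(host, port))
        time.sleep(interval)
    return summarize(results)
```

Something like `monitor("192.168.1.10", 22, count=3600, interval=1.0)` (the address is hypothetical) would try the SSH port once a second for about an hour and report how many attempts failed.  A connect test like this only shows whether new connections can be opened; it does not prove that an already established session would have survived.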

I looked at the server via htop and noted that smbd, the Samba daemon, was running at nearly 100% CPU.  I recalled that I had recently set the server up to connect to a VLAN on my network to grab some backup files and copy them to the RAID array.  These are the backups from the Proxmox server that operates as my virtualization server, where I run my mail, websites, phone, etc.  I then decided to remove that line from /etc/fstab, reboot, and check whether the smbd CPU usage was back to normal or still showed super high usage.
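Before removing a line like that, it can help to list which /etc/fstab entries are network mounts at all.  The sketch below is only an illustration under assumptions: the helper name `network_mounts` is hypothetical, and the set of filesystem types it looks for (cifs, smb3, nfs, nfs4) is a guess at what a mount like mine might use, since the type is not stated here.

```python
# Filesystem types treated as network mounts (an assumption for this sketch)
NETWORK_FS_TYPES = {"cifs", "smb3", "nfs", "nfs4"}


def network_mounts(fstab_text):
    """Return (device, mount_point, fs_type) for each network entry in fstab text."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] in NETWORK_FS_TYPES:
            entries.append((fields[0], fields[1], fields[2]))
    return entries
```

Calling `network_mounts(open("/etc/fstab").read())` would return the candidate lines to comment out and test one at a time.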

The usage was normal, so I’m now monitoring the server to see if I still get disconnects.  Certainly the high CPU usage could cause a heating issue, and that could cause hiccups that lead to the random disconnects.

We’ll see.