SSH Disconnects (continued, yet again), and other disconnecting services.

This has been an ongoing issue for some time, and I have tried various things to resolve it.  Much of that consisted of swapping out cables and switches, testing hardware, doing a test install, and disconnecting VLAN segments.  I also bought an inexpensive replacement cable modem in case that was the issue.  I didn’t test it, as I hadn’t yet eliminated the pfsense setup as the cause, nor for that matter anything else except specific hardware such as switches and cables.  All told, I had swapped out multiple cables and switches, some a couple of times.  After eliminating those I put everything back the way it was and went on to the next step.

At times I thought I had narrowed it down to pfsense.  The reasoning was that I had done an upgrade to it about the time the disconnects began.

I went to Reddit and asked some questions, and people immediately dismissed pfsense as the cause because, as they said, if the update were at fault many others would be having the same issue, and apparently they weren’t.  Or… they hadn’t hit Reddit with questions about it.

I needed to keep my network up and running all the while, yet I was willing to disconnect certain servers for test purposes.  I needed to run the system without them for long enough to get a proper result.  That meant that I would lose my web, email, and file servers for some length of time.

On Reddit they also asked that I check for duplicate IP addresses as sometimes this can cause the issue. Since I use pfsense I have it use DHCP reservations to ensure that I have a known IP for most computers that provide services.  Those don’t change.  I have a method of numbering IP address ranges so that I know based on the IP address which section of my network it belongs to, such as which room, and what the purpose is.

Since the issue also seems to occur within the LAN (that is, within a single subnet) as well as on connections over the WAN interface, they said it can’t be pfsense:  traffic that stays within a subnet never passes through pfsense, so pfsense should only matter when traffic traverses interfaces.

In my case, I have multiple network interface cards (NICs) that operate as VLANs, so what they said doesn’t necessarily hold:  traffic between those segments does pass through pfsense.  I create the VLANs for very specific purposes and I do so without using a smart switch.  Instead I use separate NICs.  Data is restricted on certain VLANs for security reasons.  I restrict data traversing to the LAN interface from VLAN2 because, if someone breaks into a service on one of the containers, I don’t want them to be able to jump to servers on the LAN or on other VLAN interfaces.

VLAN1 is used to exclude from filtering the machines that I want to have unhindered access to the whole internet.  I use pfblockerng-dev along with DNSBL, which allows me to block ad and tracking sites as well as other things.  Since Windows is in the mix on my network, I set aside a VLAN to handle it:  I want VLAN1 to route any computer that might need Windows 10 updates while keeping the rest of my computers safe from the prying eyes of Microsoft and other providers (Google, Facebook, Twitter, anyone who collects data en masse about people).

On VLAN2 I have my Proxmox server with LXC containers that run my websites.  I do this because if someone were to break into one website they couldn’t get out of the container, nor could they get into the LAN itself.  That’s the goal, anyway.

I also have a VPN running so that I can connect from any device that presents the proper certificates to allow them to be on the network and have access to resources remotely.  Even though I use SSH most of the time a VPN gives me certain other benefits. 

What I found was that when I connected via the VPN I would also get disconnected, but it would reconnect itself.  If I used SSH it would disconnect and not reconnect automatically.  If I used SSHFS with the reconnect parameter it would reconnect, but that would interrupt certain services, and certainly that was not the ultimate solution.

No matter what, the only solution was to fix the disconnects.  That involved all the testing previously mentioned, and more.

For instance, my Proxmox server has IP addresses set for the containers and virtual machines by Proxmox itself (on that VLAN) instead of by the pfsense DHCP service.  When I duplicate a container it carries with it all the settings, including the hard-coded IP address.  When I was told to check for duplicate IP addresses I did check this and found a couple.  Those were not the cause of the disconnects, however.
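
A duplicate check like this can be scripted.  The following is a minimal Python sketch under assumed data:  it takes a list of guest names with their static IP addresses and reports any address assigned more than once.  The guest names and addresses are made up for illustration, and how the list gets collected (by hand or from an export) is left open.

```python
from collections import defaultdict


def find_duplicate_ips(assignments):
    """Return {ip: [names, ...]} for every IP claimed by more than one guest."""
    by_ip = defaultdict(list)
    for name, ip in assignments:
        by_ip[ip.strip()].append(name)
    # Keep only the addresses that appear more than once.
    return {ip: names for ip, names in by_ip.items() if len(names) > 1}


# Hypothetical inventory: a cloned container kept its original address.
guests = [
    ("web1", "192.168.20.10"),
    ("web2", "192.168.20.11"),
    ("web1-clone", "192.168.20.10"),
]

duplicates = find_duplicate_ips(guests)
for ip, names in duplicates.items():
    print(f"{ip} is assigned to: {', '.join(names)}")
```

Any address it prints needs to be changed on one of the listed guests.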

I wanted to actually know what was causing the disconnects.  I could have just purged everything and started over.  That would have involved a lot of setup to re-implement my configuration and there were bound to be issues.

What I did instead was back up the pfsense hard drive by making an image of it (using the Linux ‘dd’ command) and compressing that image so that it didn’t take up a massive amount of space on my file server.
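
As a rough illustration of that image-and-compress step, here is a minimal Python sketch that streams a source file or device through gzip in fixed-size chunks, similar in spirit to piping dd into gzip.  The function name and any paths passed to it are hypothetical.  Reading a raw device generally needs root privileges, and the drive should not be in use while it is imaged.

```python
import gzip
import shutil


def image_to_gzip(source_path, dest_path, chunk_size=1024 * 1024):
    """Copy source_path into a gzip-compressed file at dest_path, chunk by chunk."""
    with open(source_path, "rb") as src, gzip.open(dest_path, "wb") as dst:
        # copyfileobj reads and writes chunk_size bytes at a time, so the
        # whole drive never has to fit in memory.
        shutil.copyfileobj(src, dst, chunk_size)
```

The intended call would be something like image_to_gzip("/dev/sdX", "pfsense.img.gz"), with both placeholder paths replaced by real ones.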

After making the image I set the drive aside, found a low-capacity 160 GB SATA drive, and installed pfsense on it.  I then ran it with no extra configuration beyond the WAN and LAN interfaces, as those are necessary to let my computers talk to the Internet.

No disconnects were encountered.  Unfortunately my services (web sites and email, etc) were unavailable to the Internet.  That told me that this was a pfsense issue:  it was not a cable issue, not a switch issue, not a cable modem issue, not a conflicting IP address, not a machine running some other services on the network.  It was pfsense.

At this point I put the old drive back and decided that the cause had to be either one of the services or something that was corrupted.  I did back up the pfsense configuration to my main workstations just in case.

When I had some time about a week later I decided to start shutting down services.  I had Squid proxy, OpenVPN (server and clients), pfblockerng, and several others. I found that just shutting them down wasn’t enough.  I had to disable them so the changes would survive a reboot.

After disabling them and rebooting the pfsense router I found I was able to continue to work without disconnects of any kind.  I tested this for a 24-hour period, knowing that disconnects typically happened in less than an hour.  Since there were no disconnects on the LAN I went home and connected remotely.  I found that after over 12 hours there were no disconnects.
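
A soak test like this can also be automated rather than watched by hand.  The sketch below is one hypothetical way to do it in Python:  a tiny echo server runs on a machine on one side of the router, and a client on the other side holds a single TCP session open, sends a heartbeat at a fixed interval, and reports how many heartbeats succeeded before the session broke.  The function names, the port, and the interval are assumptions for illustration, not anything from the setup described here.

```python
import socket
import time


def run_echo_server(listener):
    """Accept one client on a listening socket and echo its data until it closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:  # client closed the session
                break
            conn.sendall(data)


def hold_session(host, port, interval=10.0, max_beats=None):
    """Hold one TCP session open and count heartbeats echoed before it breaks.

    Stops early after max_beats heartbeats if max_beats is given.
    """
    beats = 0
    with socket.create_connection((host, port), timeout=5.0) as sock:
        while max_beats is None or beats < max_beats:
            try:
                sock.sendall(b"ping\n")
                if not sock.recv(1024):  # empty read: peer closed the session
                    break
            except OSError:  # reset, timeout, or network unreachable
                break
            beats += 1
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "heartbeat", beats, "ok")
            time.sleep(interval)
    return beats
```

On the far machine the server could be started with run_echo_server(socket.create_server(("0.0.0.0", 5001))), and on the near one hold_session("server-address", 5001) would run until the session drops.  Port 5001 and the address are placeholders.  With a 10 second interval, a session that used to die within the hour would show up as a count well under 360.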

I was very pleased, but I still didn’t know which services were causing the issue.  I put pfblockerng back.  All remained well.  After concluding that it wasn’t pfblockerng I enabled Squid.  That also proved not to be the problem.  The system has now been running for a while with those services enabled.

The services still disabled are iftop, iperf, OpenVPN (client and server), and one other that I uninstalled instead of disabling.  My guess is that it is OpenVPN, specifically the server part, as OpenVPN on pfsense has had numerous issues in the past.  As I said early in this post, the system worked well for a long time and only after an update did this problem start.  I’ll enable the server without the client part to test, and if that pans out I’ll then enable the clients and see if they are causing the issue.

So, not quite done.  Real progress has been made, and made with certainty without leaving me guessing as would have been the case had I just blown away pfsense and started over.