A colleague recently asked me why we configure two PSCs in our management environment, given that a multi-PSC deployment is not fully redundant without a load balancer sitting in front of the PSCs. I explained that the PSCs replicate with each other, and that if the primary PSC fails, a simple CLI command can restore all services on the second PSC. Of course, this was all theory, as I had never had reason to complete that operation. But I was curious how easy the procedure was and in which scenarios the failover could be completed successfully. So I spun up a small vCenter environment in my lab with the intention of simulating the loss of a PSC.
While researching the topic, I found that I had several more questions I needed to answer or clear up for myself:
- How can I tell which PSC is the master?
- How can I tell what the replication relationships between the PSCs are?
- What exactly is replicated between the PSCs besides SSO data?
- What is the interval between PSC-to-PSC replications?
Easy one first:
Which PSC is my master PSC? – It was pretty simple for me to tell which was the master, as I had just deployed the VMs. But if you are unable to tell, you simply need to check the advanced settings of your vCenter.
How can I tell how many PSCs are present, and what is the replication status between my primary PSC and the other PSCs deployed in my environment? – The /usr/lib/vmware-vmdir/bin folder is packed full of utilities to help you out. I used the replication admin tool, vdcrepadmin, to get the detail below.
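If the advanced settings are not handy, the same answer can usually be read from the VCSA shell. This is a minimal sketch, assuming a 6.x appliance where vmafd-cli is installed at the path shown and returns a lookup service URL; psc_host_from_url is just a helper name I made up for the example.

```shell
#!/bin/bash
# Sketch: print the PSC host that this vCenter currently points to.
# Assumption: vmafd-cli lives at this path on the VCSA.
VMAFD_CLI=/usr/lib/vmware-vmafd/bin/vmafd-cli

# Hypothetical helper: pull the host name out of a lookup service URL
# such as https://testpsc01.buildnet.local/lookupservice/sdk
psc_host_from_url() {
    local url="$1"
    url="${url#*://}"            # drop the scheme
    url="${url%%/*}"             # drop the path
    printf '%s\n' "${url%%:*}"   # drop an optional :port
}

# Only query the appliance when the tool is actually present.
if [ -x "$VMAFD_CLI" ]; then
    ls_url="$("$VMAFD_CLI" get-ls-location --server-name localhost)"
    echo "vCenter is pointed at: $(psc_host_from_url "$ls_url")"
fi
```

Run on the vCenter appliance, this should name whichever PSC the vCenter is currently registered against, which is what I am calling the master here.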
/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartnerstatus -h localhost -u administrator
/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showservers -h localhost -u administrator
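To keep an eye on replication without reading the output by hand, the showpartnerstatus text can be boiled down to one line per partner. This is a rough sketch and assumes the output contains a "Partner: <fqdn>" line followed by a "Partner is N changes behind." line, which is what I saw in my lab; partner_lag is a hypothetical function name, not part of any VMware tooling.

```shell
#!/bin/bash
# Sketch: summarise `vdcrepadmin -f showpartnerstatus` output.
# Reads the command output on stdin and prints "<partner> <changes_behind>"
# for every partner found. The line formats matched here are an assumption
# based on the 6.x output in my lab.
partner_lag() {
    awk '
        /^Partner: /      { partner = $2 }        # e.g. "Partner: testpsc02.buildnet.local"
        /changes behind/  { print partner, $3 }   # e.g. "Partner is 0 changes behind."
    '
}

# Example usage on a PSC (vdcrepadmin will prompt for the password):
#   /usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartnerstatus \
#       -h localhost -u administrator | partner_lag
```

A value of 0 means the partner has seen every change from the local PSC. Because the password prompt does not play well with a pipe, I believe the -w option can be used to supply the password on the command line instead.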
What is the interval between PSC-to-PSC replications? – The replication interval between two PSCs is 30 seconds. However, under certain conditions the replication time can increase before all PSCs are fully synchronized. Ref: KB2113115
What exactly is being replicated between the PSCs?
VMware Appliance Management Service (only in Appliance-based PSC)
VMware License Service
VMware Component Manager
VMware Identity Management Service
VMware HTTP Reverse Proxy
VMware Service Control Agent
VMware Security Token Service
VMware Common Logging Service
VMware Syslog Health Service
VMware Authentication Framework
VMware Certificate Service
VMware Directory Service
Now that some of the basics were known, I simulated an outage of the primary PSC. Needless to say, I could no longer log in to vCenter. When I ran the showservers and showpartnerstatus commands on the secondary, they confirmed that the primary was down and that no replication was taking place.
So, to make the secondary PSC the master PSC, I ran this simple command on the VCSA.
cmsso-util repoint --repoint-psc testpsc02.buildnet.local
The whole process took less than 10 minutes. When I logged back in to vCenter and checked the advanced settings, I could see that PSC02 was now the primary.
To get back to a fully redundant solution, I wanted to deploy a new PSC so that I again had a second copy of the SSO, certificate, and licensing information. Before I could do that, I needed to remove the stale PSC01 record from the system. It was still listed in the node section but marked as unknown.
To remove the node, I ran this command from the shell on PSC02.
cmsso-util unregister --node-pnid testpsc01.buildnet.local --username firstname.lastname@example.org
Running showservers and showpartnerstatus now returned empty results, as the PSC was acting alone with no replication partner available. I then reran the install for an external PSC and opted to join an existing SSO domain. Once it was deployed, everything looked good again.
Lastly, I simulated another outage on the primary PSC by disconnecting its NIC (remember, PSC02 was the master PSC at this point). I failed the services over to the secondary node (PSC01) and then brought the failed server back online. Replication started up again with no issues.
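A quick way to confirm that the new PSC has joined is to reduce the showservers output to bare host names. This is only a sketch: it assumes each server is printed as a DN beginning with cn=<fqdn>,cn=Servers,... (as it was in my lab), and list_psc_nodes is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list PSC host names from `vdcrepadmin -f showservers` output.
# Assumption: each server appears as a DN such as
#   cn=testpsc02.buildnet.local,cn=Servers,cn=Default-First-Site,cn=Sites,...
# Lines that do not match that shape are ignored.
list_psc_nodes() {
    sed -n 's/^cn=\([^,]*\),cn=Servers,.*/\1/p'
}

# Example usage on a PSC:
#   /usr/lib/vmware-vmdir/bin/vdcrepadmin -f showservers \
#       -h localhost -u administrator | list_psc_nodes
```

After the rejoin, both the surviving PSC and the freshly deployed one should be listed. If the stale node still shows up, the unregister step did not complete.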