Monday, July 18, 2011

Troubleshooting an NLB Issue

Here is the issue:

Environment:

a) Two AgilePoint Servers in a VMware ESX virtual environment

b) A Cisco NLB device for network load balancing

c) Two SharePoint servers on physical machines in NLB

Symptom:

The AgilePoint task list web part on SharePoint works only intermittently: sometimes it loads, sometimes it fails.

- Error message from the web part: Unable to connect AgilePoint Server. Verify the Impersonator credentials in AgilePoint Configuration List

- The AgilePoint SharePoint log file shows HTTP 401, Unauthorized

Steps to solve the issue:

STEP1:

a) Ran TestWSConn.exe (the AgilePoint test tool for clusters) on an AgilePoint Server machine, connecting to the local AgilePoint Server: worked fine

b) Ran TestWSConn.exe on a SharePoint machine, connecting directly to one of the remote AgilePoint Servers (no cluster name/IP): HTTP 401, Unauthorized

c) Ran setspn.exe to set the principal name. After that, TestWSConn.exe worked from the remote machine

STEP2:

a) Ran TestWSConn.exe on SharePoint machine A, connecting to AgilePoint Server through the cluster name/IP: worked fine

b) Ran TestWSConn.exe on SharePoint machine B, connecting to AgilePoint Server through the cluster name/IP: HTTP 401, Unauthorized

c) Reset IIS on the AgilePoint Servers. The connection then worked from SharePoint B, but no longer from SharePoint A.

d) Switched the Cisco NLB device from non-sticky mode to sticky mode. Both A and B worked fine.
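
Because the failure was intermittent and even moved from one SharePoint machine to the other after the IIS reset, a single test run per machine can be misleading. The following is a minimal Python sketch of repeating the same call and counting the outcomes. The names `tally_statuses` and `is_intermittent` are made up for this illustration, and `call` is a placeholder for whatever performs one authenticated request and returns its HTTP status code; it is not part of AgilePoint or TestWSConn.exe.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def tally_statuses(call: Callable[[], int], attempts: int) -> Counter:
    """Invoke call() `attempts` times and count the status codes it returns."""
    return Counter(call() for _ in range(attempts))


def is_intermittent(counts: Counter) -> bool:
    """True when the same call produced both successes (200) and 401 errors."""
    return counts[200] > 0 and counts[401] > 0


if __name__ == "__main__":
    # Stand-in for a real authenticated call (assumption for the demo):
    # it simply alternates between success and HTTP 401.
    fake_results = cycle([200, 401])
    counts = tally_statuses(lambda: next(fake_results), attempts=10)
    print(dict(counts), "intermittent:", is_intermittent(counts))
```

Running the same tally from SharePoint A and from SharePoint B, before and after a configuration change, gives a result that is easier to compare than one-off runs.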

Challenges from the customer in this case:

Usually, the team we work with does not have permission to configure the machines, devices, or Windows, so they are reluctant to bother the network, domain, or virtual environment administrators. Instead, they try to prove that the issue is caused by the AgilePoint product. Here are the challenges:

a) They show you that the cluster name/IP works in IE by browsing the AgilePoint Server web service, which they take to mean the web service works fine

Answer: Browsing the web service WSDL does not go through authentication, so it does not prove that authentication works.
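
The point can be shown with a toy service. The sketch below is a simulation only, written in Python with the standard library. `FakeService`, `status_of`, `run_demo`, and the `service.asmx` path are invented for the illustration and do not reproduce AgilePoint or IIS behavior. It assumes, as the answer above states, that the WSDL is served without credentials while a method call requires them.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeService(BaseHTTPRequestHandler):
    """Toy service: the description is public, method calls need credentials."""

    def do_GET(self):
        # Serving the WSDL-like description requires no credentials here.
        self._reply(200, b"<definitions/>")

    def do_POST(self):
        # Read the request body, then check for any Authorization header.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        if self.headers.get("Authorization"):
            self._reply(200, b"<HelloResponse/>")
        else:
            self._reply(401, b"Unauthorized")

    def _reply(self, code, body):
        self.send_response(code)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def status_of(request):
    """Return the HTTP status for a URL or Request, without using any proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_demo():
    """Start the toy service locally and return (wsdl_status, call_status)."""
    server = HTTPServer(("127.0.0.1", 0), FakeService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}/service.asmx"
    try:
        wsdl = status_of(base + "?wsdl")
        call = status_of(urllib.request.Request(base, data=b"<Hello/>", method="POST"))
    finally:
        server.shutdown()
        server.server_close()
    return wsdl, call


if __name__ == "__main__":
    print(run_demo())
```

The description request succeeds while the unauthenticated method call is rejected, so a page that loads in the browser says nothing about whether an authenticated call will succeed.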

b) They show that EnterpriseManager works fine, which they take to mean the configuration is OK

Answer: EnterpriseManager uses forms authentication and connects to the local AgilePoint Server. It does not prove that authentication from a remote machine is OK.

c) They tell you the administrator has already set the principal name by running setspn.exe

Answer: Deploy a generic test.asmx to two web applications, one running as LocalSystem and the other running under the AgilePoint Server AppPool (service account).

If the one running as LocalSystem works but the one under the AgilePoint Server AppPool does not, that proves the principal name is not set correctly.

*Note: LocalSystem does not need a principal name to be set

*test.asmx is a simple file containing a generic web service that exposes a Hello() method
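
The reasoning behind the two-application test can be written down as a small decision table. This is a sketch in Python. The function name `interpret_spn_test` is hypothetical. Only the first branch comes directly from the test described above; the other two are labeled as assumptions about how to read the remaining outcomes.

```python
def interpret_spn_test(local_system_ok: bool, service_account_ok: bool) -> str:
    """Interpret remote calls to test.asmx hosted under two identities."""
    if local_system_ok and not service_account_ok:
        # The case described in the post: only the service account fails.
        return "principal name for the service account is not set correctly"
    if local_system_ok and service_account_ok:
        # Assumption: both identities authenticate, so look elsewhere.
        return "principal name looks fine; check the NLB and network instead"
    # Assumption: if even LocalSystem fails, the cause is not specific
    # to the service account.
    return "failure is not specific to the service account"


if __name__ == "__main__":
    print(interpret_spn_test(local_system_ok=True, service_account_ok=False))
```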

d) They say the testing environment works fine in the VMware virtual environment, which they take to mean the virtual environment is not the problem

Answer: In the testing environment, SharePoint and AgilePoint are on one single machine, so the connection is local. Running TestWSConn.exe from a workstation and connecting to the testing AgilePoint Server did not work. That proves the VMware testing environment has the same problem.

Strategy for troubleshooting:

In general, NLB troubleshooting is hard because the root cause may be hidden or there may be several causes at once. In this case there were three: the VMware network driver, setspn, and the Cisco NLB configuration. Setting expectations at the beginning is very important:

a) Troubleshooting may take a long time. In this case it took a week from beginning to end, although the total working time was only 7 hours, because the team needed help from other teams and that sometimes takes days.

b) We are not DBAs, hardware engineers, domain/network administrators, or Cisco NLB experts. What we can do is narrow down the problem and give the customer useful information and direction.

A proof built in a generic way makes it easier to convince the customer that the issue is not in AgilePoint. For example, deploy a simple, generic web service to prove that the principal name is not configured correctly. The customer understands immediately and is then willing to get the other teams' help.
