In the SM21 system log on hosta, the gateway process (RD) had registered a network connection error (Q0I), "Operating system call connect failed (error no. 10060)":
The architecture of the SAP systems in the landscape is shown in the following diagram: a single-stack SAP ERP ABAP system (hosta) connected to a clustered SAP Enterprise Portal system (hostb & hostc):
Looking in the gateway error log (transaction SMGW), I saw a network interface (NI) timeout error when looking up the hostname for IP address 192.168.1.1:
The business process involved means that an RFC connection is established from the EP server processes (hostb) to the ERP gateway process (hosta).
As you can see in the architecture diagram, the IP 192.168.1.1 is the private IP address of the hostb server in the EP cluster.
The private IP address is only used as a heartbeat address as part of the Microsoft Cluster service.
So the question was, why was the SAP ERP gateway process receiving inbound connections from hostb via the private IP address?
The method of analysis involves understanding how the IP to hostname lookups are performed on the Windows servers:
First, if you open a command prompt and use NSLOOKUP to check the hostname ("C:> nslookup hostb"), it returns the IP address registered for that hostname on the DNS server configured for the Windows server. Note that NSLOOKUP queries DNS directly and does not consult the hosts file, unlike the normal Windows resolver, which usually checks the hosts file first and then DNS. The query is sent through the appropriate network interface according to the network routings ("C:> route print").
In our example, the nslookup correctly returned the public IP address.
Second, you can issue a PING of hostb from hostb (itself) using "C:> ping hostb".
When using PING, it sends a network packet using the appropriate network interface according to the network routings ("C:> route print"). Since we are pinging ourselves, it will use the first network interface in bind order of the Windows network connections.
In our case, it was pinging the *private* IP address.
According to Microsoft, the PING command in Windows uses gethostbyname() to find the IP address of the destination to be pinged (http://support.microsoft.com/kb/822713).
So we have established that the private IP address is seen when we ping our own server name on hostb. But why does this mean that the private IP is seen as the client in an SAP RFC gateway connection?
Well, in the same way that PING uses gethostbyname(), the RFC connection that is established through jlaunch.exe (the host of the Java stack server processes) also uses gethostbyname(). And we already know which IP address a gethostbyname() call returns, because the PING command showed it to us.
(findstr /M gethostbyname jlaunch.exe)
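To see for yourself what a gethostbyname()-style lookup of the local server name returns, something like the following Python sketch can be run on hostb. It is only an illustration: the function names are made up for this example, and it assumes that Python's socket.gethostbyname_ex() goes through the same Windows resolver as PING and jlaunch.exe and returns the addresses in the same order, which I have not verified for every Windows version.

```python
import socket


def local_name_addresses(hostname=None, resolver=socket.gethostbyname_ex):
    """Return the IPv4 addresses the local resolver reports for hostname.

    hostname defaults to this machine's own name, which mirrors
    "ping hostb" being run on hostb itself. The resolver argument is
    only there so the lookup can be replaced in a test.
    """
    if hostname is None:
        hostname = socket.gethostname()
    _canonical_name, _aliases, addresses = resolver(hostname)
    return list(addresses)


def is_heartbeat_first(addresses, heartbeat_ip):
    """True if the first address returned is the private heartbeat IP."""
    return bool(addresses) and addresses[0] == heartbeat_ip
```

On hostb, calling is_heartbeat_first(local_name_addresses(), "192.168.1.1") would be expected to return True while the binding order is wrong, assuming the resolver behaves as described above.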
So, the RFC connection performed by the Java stack is created from the jlaunch.exe binary, which uses gethostbyname() during the bind() call, which returns the private IP.
The RFC connection is established and, as part of the RFC connection process, the jlaunch.exe process passes its calling IP address, which is the private IP address.
The ERP system gateway at the other end of the connection receives the incoming connection and notes that it is coming from 192.168.1.1. Should the ERP system on hosta wish to make an RFC callback (http://help.sap.com/saphelp_nw04/helpdata/en/22/042a77488911d189490000e829fbbd/content.htm, http://help.sap.com/saphelp_nw04/helpdata/en/22/042a91488911d189490000e829fbbd/content.htm), it has no chance of returning the call to the private IP address on hostb. This is where we get the error "NiHsLGetHostByName failed".
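The failing lookup can be approximated from hosta with a small Python sketch like the one below. This is not how the SAP gateway is implemented internally (NiHsLGetHostByName belongs to SAP's own NI layer); it is just a comparable IP to hostname lookup using the standard socket module, and reverse_lookup is a name invented for this example.

```python
import socket


def reverse_lookup(ip_address, resolver=socket.gethostbyaddr):
    """Return the hostname for ip_address, or None if the lookup fails.

    socket.herror and socket.gaierror are both subclasses of OSError,
    so a single except clause covers a failed or timed-out lookup.
    The resolver argument exists only so the lookup can be faked in a test.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip_address)
    except OSError:
        return None
    return hostname
```

Run on hosta, reverse_lookup("192.168.1.1") would be expected to return None in a landscape like this one, because the heartbeat address is not known to the name service that hosta uses.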
So how do we resolve the problem?
We could code the lookup into the hosts file (172.x.x.2 hostb) on hostb, but this might cause other issues.
A quick look into the SAP documentation for installing NetWeaver 7.0 SR3 in a Microsoft clustered system reveals that a simple step was missed during installation on hostb.
The document states: "The card of the public network must be displayed before that of the private network. If necessary, change the order in which the cards are listed by using the Move Up and Move Down arrows".
So I checked the binding order on hostb in Control Panel -> Network Connections -> Advanced Settings:
They are the wrong way around. The "Cluster Heartbeat" (Private IP) should be beneath the Public IP ("Local Area Connection" in our example).
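The rule from the SAP document can be written down as a simple check. The Python sketch below does not read the real binding order from Windows; the adapter list has to be typed in by hand in the order shown in Advanced Settings. The function name, the /24 mask on the heartbeat network and the public address in the example are all assumptions made for illustration.

```python
import ipaddress


def public_adapter_is_first(adapters, private_network):
    """Check that the first adapter in binding order is not on the private network.

    adapters is a list of (name, ip) tuples in binding order, entered by hand.
    private_network is the heartbeat network in CIDR notation.
    """
    if not adapters:
        return False
    _name, first_ip = adapters[0]
    network = ipaddress.ip_network(private_network)
    return ipaddress.ip_address(first_ip) not in network


# Adapter names are taken from this article. The public address 10.0.0.2 is a
# made-up placeholder, and the /24 mask is an assumption.
wrong_order = [
    ("Cluster Heartbeat", "192.168.1.1"),
    ("Local Area Connection", "10.0.0.2"),
]
print(public_adapter_is_first(wrong_order, "192.168.1.0/24"))  # prints False
```

With the two entries swapped, the same call returns True, which is the order the SAP installation document asks for.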
The Microsoft article here (http://support.microsoft.com/kb/894564) lists the process to change it and provides alternative solutions (I would recommend following the SAP document).
On Windows Server 2003, the binding order can be changed without rebooting, but the change will not take effect until the server is rebooted.