Document ID | Synopsis | Date
ID73104 | Troubleshooting loadbalancing on Sun Ray[TM] | 5 Mar 2004
Keyword(s):sunray, Sun Ray, loadbalancing, load balancing, failover
The load is unbalanced between several servers in a Sun Ray[TM] failover group.
Definitions and Abbreviations:
==============================

load balancing: the process of distributing Sun Ray sessions over the Sun Ray servers in a failover group.

group manager: the part of the authentication manager which is responsible for load balancing.

trusted: Sun Ray servers which share a group signature are trusted, and are considered part of the same failover group for load balancing purposes.

active session: a Sun Ray session where a user is logged in.

idle session: a Sun Ray session waiting at the dtgreet screen, or the utselect -L GUI.

SRSS: Sun Ray Server Software.

NSCM: Non Smartcard Mobility (SRSS 1.3 and higher only).

Background:
===========

The load balancing algorithm in SRSS 1.2 and higher works as follows:

When a token is presented, the authentication manager (utauthd) checks whether an active session for the token is available on any of the Sun Ray servers in the failover group. If no active session is available, a load balancing decision is made. Idle sessions are ignored at this stage.

For the load balancing decision, various load-related parameters and the server's total CPU power are combined into a parameter called "desirability". Then, a weighted random selection is made between all "online" servers in the same group, where the token is more likely to be redirected to a server with a higher desirability.

Once the token has been redirected to a Sun Ray server according to the load balancing decision, the authentication manager on this server checks whether an idle session exists for the redirected token on this server. Only if no idle session exists for the redirected token on this server will the authentication manager initiate a new session.

The reason for incorporating a weighted random selection into the load balancing algorithm is to prevent all sessions from ending up on the same server when many users log in simultaneously, say, around 08:30 in the morning when everybody gets into the office.
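The effect of the weighted random selection can be illustrated with a minimal Python sketch. The server names and desirability values are invented, the function names are made up for this example, and weighting each server in direct proportion to its desirability is an assumption; the actual formula used by the group manager is not given in this document.

```python
import random
from collections import Counter


def pick_server(desirability, rng=random):
    """Pick one online server, weighted by its desirability.

    `desirability` maps a (hypothetical) server name to a positive number.
    Weighting proportionally to desirability is an assumption made for
    this sketch; it is not the documented SRSS formula.
    """
    servers = list(desirability)
    weights = [desirability[name] for name in servers]
    return rng.choices(servers, weights=weights, k=1)[0]


def simulate_logins(desirability, logins, seed=0):
    """Count where `logins` simultaneous new sessions would land."""
    rng = random.Random(seed)
    return Counter(pick_server(desirability, rng) for _ in range(logins))


if __name__ == "__main__":
    # Invented values: srv-b is three times as desirable as srv-a.
    counts = simulate_logins({"srv-a": 1.0, "srv-b": 3.0}, logins=1000)
    for name, count in sorted(counts.items()):
        print(name, count)
```

In this sketch the desirability values stay fixed during the simulated burst of logins. A rule that always picked the most desirable server would then send every login to srv-b, whereas the weighted draw still gives srv-a a share of the sessions.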
There has not been a single critical bug in Sun Ray load balancing in SRSS 1.2 or higher. The typical root causes of poor load balancing are:

- A misconfiguration which simply turned off load balancing. See section "Checking configuration".
- A Sun Ray server has been turned "offline", and thus is ignored during load balancing, unless no "online" server is up.
- A Sun Ray server is in a dysfunctional state where it does not accept new sessions, such as utauthd hanging or being unresponsive.
- A network problem or network misconfiguration. See section "Checking configuration".
- Poor initial load balancing, which is likely when "pseudo terminal" sessions rather than NSCM are used. See section "Load balancing limitations".
- A misunderstanding of what load balancing can achieve. See section "Load balancing limitations".
- The EOLed SRSS 1.1 is used. This release provided inferior load balancing.

Checking configuration:
=======================

1) The servers should be running in configured mode, that is, utconfig should have been run. Thus, check whether /etc/opt/SUNWut/utadmin.conf exists.

2) When running utconfig to configure a Sun Ray server for failover, failover should be selected, and the same group signature must be entered for all servers in a Sun Ray failover group.

    [...]
    Configure this server for a failover group? (y/[n])? y
    About to configure the following software products:
    [...]
    Failover group: yes   <----
    [...]
    You have chosen to configure this server for a failover group.
    All servers in a failover group must share a unique signature, which is
    a string of 8 or more characters where at least two characters are
    letters and at least one is not.
    Enter signature:
    Re-enter signature:
    [...]

   utconfig creates a logfile in /var/adm/log. Check this logfile to see whether failover was selected.

3) All servers in the failover group must have the same group signature.
   Use utreplica on the primary server to get the list of Sun Ray servers which are in the same failover group, then check the utgstatus output to see whether the servers are trusted. Also check that every trusted server visible in the utgstatus output is listed as part of the failover group in the utreplica output. If the group signatures do not match, use /opt/SUNWut/sbin/utgroupsig to fix them.

4) The "-g" flag must be set in the policy. Furthermore, the policy must essentially be identical across all servers in a failover group. On 1.x, check the utglpolicy output, and check whether the utpolicy output is identical to the utglpolicy output, except possibly for token reader (-t) options. On 2.0, utglpolicy is obsolete; check the utpolicy output only.

   Note: on a 1.x Sun Ray failover group, either the admin GUI or /opt/SUNWut/sbin/utglpolicy must be used to change the policy.

5) Sun strongly recommends using Non Smartcard Mobility (NSCM) rather than "pseudo terminal" sessions to get good load balancing. NSCM is available in 1.3 and higher, and can be turned on with the "-M" policy flag. NSCM also provides hot desking without the use of smartcards.

6) In /etc/opt/SUNWut/auth.props, ensure that the group manager and load balancing are not disabled. If the following parameters are set, they must have the listed values:

   + enableLoadBalancing = true
   + enableGroupManager = true
   + useLocalPolicy = false (SRSS 1.x only)

   Furthermore, it is strongly recommended that all servers in a Sun Ray failover group have identical auth.props files.

   Note: if these values are wrong, this is a strong indication that the Sun Ray server was not configured for failover when utconfig was run.

7) Check that all Sun Ray servers in the failover group are "online". See 71443 How to check whether a Sun Ray[TM] server is "offline".

8) All Sun Ray interfaces which have Sun Ray appliances connected to them must be up and reachable.
   If a Sun Ray appliance is connected to interface a of Sun Ray server A, and Sun Ray server A cannot contact the group manager of Sun Ray server B through this interface a, then Sun Ray server A will not load balance this Sun Ray appliance to Sun Ray server B, because it does not know whether the Sun Ray appliance can reach server B. Thus, check the utgstatus output to see whether all Sun Ray interfaces are up and reachable. Also check /var/opt/SUNWut/log/auth_log* for "token query timed out" messages, such as this:

    01/15/2004 23:52:03 token query timed out to host labhost2 interface 192.168.128.2

   Here, labhost2 was unreachable on interface 192.168.128.2, and thus this interface was ignored during load balancing.

   Note: such an issue is frequently caused by network components, for example bad firmware or a bad port on a switch.

9) If different network interfaces are connected to the same physical switch, the network interfaces must have different ethernet addresses.

10) If a network issue is likely, check the "/usr/bin/netstat -in" output for errors and collisions, and in 1.3 and higher also collect a few minutes of "/opt/SUNWut/sbin/utcapture" output to check for packet loss.

11) All Sun Ray servers in a failover group must run the same SRSS release.

Load balancing limitations:
===========================

Sun Ray load balancing is strictly limited to Sun Ray session creation. There is no way to move an existing user session to another server. Thus, once a user has logged in, the user's session stays on this server until the session has been exited or terminated.

Furthermore, load balancing is completely unrelated to assigning DHCP addresses to Sun Ray appliances. Load balancing takes place once a Sun Ray appliance which already has a DHCP address successfully connects to the authentication manager, requesting a session for the current token.
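Returning to step 8 of "Checking configuration": when the auth_log files are large, it can help to summarise the "token query timed out" messages per host and interface. The following Python sketch assumes that every such message has the same layout as the single sample line quoted in step 8; other layouts are not handled, and the function name is made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the single sample line quoted in this document;
# the real auth_log may contain variations that this does not match.
TIMEOUT_RE = re.compile(
    r"token query timed out to host (?P<host>\S+) interface (?P<iface>\S+)"
)


def count_token_timeouts(lines):
    """Return a Counter keyed by (host, interface) of timeout messages."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("iface"))] += 1
    return counts


if __name__ == "__main__":
    import glob

    # Log location as given in step 8 of the checklist.
    for path in glob.glob("/var/opt/SUNWut/log/auth_log*"):
        with open(path, errors="replace") as handle:
            for (host, iface), n in count_token_timeouts(handle).most_common():
                print(f"{path}: {host} via {iface}: {n} timeout(s)")
```

A host and interface pair that shows up repeatedly is a candidate for the network checks in steps 9 and 10.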
Example scenarios resulting in a poor distribution of load:
-----------------------------------------------------------

A customer has two Sun Ray servers, and uses "pseudo terminal" sessions exclusively. When one Sun Ray server is rebooted, all Sun Ray appliances connect to the other server, and get sessions there. When both Sun Ray servers are rebooted at the same time, inevitably one will be up first, and most Sun Ray appliances, if not all, will connect to this server, and get sessions there.

TIP: if you use NSCM, the sessions are created when the user actually logs in at the NSCM login GUI, rather than when the appliance initially connects to a Sun Ray server. This late binding of NSCM gives much better load balancing.

Methods to reduce the impact of poor initial distribution of load:
------------------------------------------------------------------

Generally, when using "pseudo terminal" sessions rather than NSCM, immediate action should be taken after rebooting all servers in a failover group to balance the initial load between the Sun Ray servers. The simplest options are:

- run "/opt/SUNWut/sbin/utpolicy -i soft" on all servers, simultaneously

or

- run "/opt/SUNWut/sbin/utfwsync"

Alternatively, with a little more work, the system administrator can use the "enhanced session management" functionality provided with 1.3 and higher to terminate sessions which are waiting at the dtgreet screen. On the server where you want to reduce load, run "/opt/SUNWut/sbin/utsession -p". Sessions which are waiting at the dtgreet screen can be identified by the "I" in the last column. If the administrator terminates these sessions, new sessions are created for the corresponding tokens, and most of these new sessions will be on the server which has the higher desirability.
Once the initial load is unbalanced, the system administrator can also temporarily prevent load balancing from assigning new sessions to Sun Ray servers which are already under a high load, by turning them "offline" with "/opt/SUNWut/sbin/utadm -f". The server can later be switched back into normal "online" mode with "/opt/SUNWut/sbin/utadm -n". A server which is offline can be identified by the existence of the file /var/opt/SUNWut/offline.

Note: a Sun Ray server which is "offline" will still provide the NSCM login GUI if NSCM is turned on. However, if a user then logs in, load balancing is triggered, and the actual user session is created on another server.

References:
===========

16733 Why do all my Ethernet interfaces have the same Ethernet MAC address?
71443 How to check whether a Sun Ray[TM] server is "offline".
utgstatus(1M) manual page
utreplica(1M) manual page
utcapture(1M) manual page (SRSS 1.3 and higher only)