Saturday, March 29, 2008

Implementing an MSCS 2003 server cluster

When implementing a Windows Server 2003 MSCS server cluster there are several common issues that can easily be avoided by extra planning and configuration. I've compiled a list of pre-configuration, installation and post-installation steps to reduce the risk of encountering issues when installing an MSCS cluster in a SAN environment.

This is mostly a summary of MS documents and general best practice, but I've not seen all of these in one place before so I thought I would post them.

Pre-installation steps for each server:

  • Unplug all HBAs from all cluster nodes.
  • Set the network adapter binding order to external (public) first, then internal (private).
  • Manually set speed, duplex and IP address for all NICs; do not configure a gateway, DNS or WINS on the private network.
  • Verify connectivity between each node on public and private adapters
  • Turn off any APM/ACPI power saving features relating to disk drives
  • Create the cluster service account in the domain
  • Ensure the cluster service account is an administrator of the physical cluster nodes (general best practice, and especially important if Kerberos authentication is enabled for virtual servers).
  • Ensure the Windows Firewall is disabled on both adapters
  • Ensure security auditing is enabled on each node
  • Verify correct drivers are installed on each node (HBA, NIC, chassis backplane etc.) and no device manager errors exist.
  • Shut down all nodes. Patch the HBAs on the first node, turn on the first node and check that the storage is visible. Repeat for each node, ensuring that only one node can see the storage at any one time. Verify that all nodes see the same target paths/disks in the same order.
  • Ensure backup agent is installed and functioning on all servers.
  • Ensure anti-virus agent is installed and functioning on all servers.
  • Configure permissions and role-based security on the servers as required
  • Install Access-based Enumeration on each server (if required)
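
The storage check in the list above (all nodes see the same disks in the same order) is easy to get wrong by eye. Here is a minimal sketch in Python, assuming you have already collected an ordered list of disk identifiers from each node by some means (for example, copied from Disk Management). The function name, node names and disk identifiers are all made up for illustration.

```python
def find_disk_mismatches(disks_by_node):
    """Compare each node's ordered disk list against the first node's list.

    disks_by_node maps a node name to the ordered list of disk identifiers
    that node sees. Returns a dict of node name -> list of
    (position, expected, actual) tuples for every position that differs.
    """
    nodes = list(disks_by_node)
    if not nodes:
        return {}
    reference = disks_by_node[nodes[0]]
    mismatches = {}
    for node in nodes[1:]:
        seen = disks_by_node[node]
        diffs = []
        for pos in range(max(len(reference), len(seen))):
            expected = reference[pos] if pos < len(reference) else None
            actual = seen[pos] if pos < len(seen) else None
            if expected != actual:
                diffs.append((pos, expected, actual))
        if diffs:
            mismatches[node] = diffs
    return mismatches


# Hypothetical inventory: NODE2 sees the last two LUNs in swapped order.
inventory = {
    "NODE1": ["LUN0-quorum", "LUN1-data", "LUN2-logs"],
    "NODE2": ["LUN0-quorum", "LUN2-logs", "LUN1-data"],
}
print(find_disk_mismatches(inventory))
```

An empty result means every node reported the same disks in the same order as the first node; anything else is worth resolving before the cluster is created.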

Cluster first node installation:

  • Shut down all but the first node, so that only the first server has visibility of the storage
  • Re-verify that all SAN disks are visible to the OS
  • Partition and format the disks using MBR before adding the first node to the cluster, and disable compression. Q: is the de facto standard for the quorum disk; start the other disks a few letters into the alphabet, leaving the earlier letters free for any removable devices, KVM virtual devices etc. that are auto-mounted
  • Use Cluster Administrator to install the cluster, and use the typical (full) installation when creating a new cluster (there should be no reason not to if the disk is presented the same way to each server). Do not use ‘Manage Your Server’ to configure cluster nodes
  • This is where you'll need the cluster name. Use a naming convention that makes sense, linking the physical nodes in the cluster to the virtual cluster name(s)
  • Ensure that all disks managed by the cluster have associated disk resources before adding the second and subsequent nodes; this ensures disk locking
  • Verify the cluster is functioning, cluster service is running, no event errors, all resources available and functioning etc.
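
The drive letter convention in the list above (Q: for quorum, a few early letters left free, data disks after that) can be written down as a small planning helper. This is only a sketch of one way to apply it: the function name, the default letters treated as already in use (A to D) and the number of spare letters (3) are my assumptions, not anything mandated by MSCS.

```python
import string


def plan_drive_letters(data_disks, in_use=("A", "B", "C", "D"), spare=3,
                       quorum_letter="Q"):
    """Return a dict of drive letter -> disk label.

    The quorum disk always gets quorum_letter. Letters already in use on the
    local server are skipped, the next `spare` free letters are left alone
    for removable or auto-mounted devices, and the data disks take the
    letters after that (never reusing the quorum letter).
    """
    free = [letter for letter in string.ascii_uppercase
            if letter not in in_use and letter != quorum_letter]
    candidates = free[spare:]
    if len(data_disks) > len(candidates):
        raise ValueError("not enough drive letters for the requested disks")
    plan = {quorum_letter: "quorum"}
    plan.update(zip(candidates, data_disks))
    return plan


# Hypothetical disk labels, for illustration only.
print(plan_drive_letters(["data1", "data2"]))
```

With the defaults, E: to G: stay free, so the two example data disks land on H: and I:.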

Second and subsequent nodes:

  • Plug in the HBAs on all other nodes, then turn on the second node
  • Re-verify that all SAN disks are visible to the OS on the second node
  • Add second node to cluster using Cluster Administrator (the first node will lock the disk)
  • Verify the cluster is functioning (cluster service is running, no event errors, all resources available and functioning etc).
  • Add subsequent nodes using Cluster Administrator
  • Verify the cluster is functioning (cluster service is running, no event errors, all resources available and functioning etc).

Post-installation configuration:

  • Set the role of the private network to internal cluster communications only (or to mixed, if the design calls for it as a failover path)
  • Set the role of the public network to public network
  • Place the private network at the top of the priority list for internal node-to-node communication
  • Do not use the default cluster group for any resources
  • Do not use the quorum disk for anything else in the cluster
  • Do not install scripts used by generic script resources on cluster disk (easier to recover if they're on local disk)
  • Enable Kerberos authentication for network name resources (after taking the network name resources offline). Enabling Kerberos ensures a computer account is created for the virtual servers and adds Service Principal Names for Kerberos lookup and authentication
  • For the first node, set the startup and recovery settings to start within 5 seconds; for the other nodes, set them to 30 seconds. If all cluster nodes start at the same time, this reduces the risk of quorum conflict/contention.
  • Create and test all resources, resource groups and virtual servers, dependencies, failover/failback policies, including file shares/ABE and print spooler
  • Configure backups appropriate for all cluster nodes
  • Configure performance and service monitoring
  • Configure quotas and file screening using FSRM if required

Other general thoughts:

  • Access-based Enumeration is useful in some file structures, but is not fully equivalent to the functionality provided in NetWare. The easiest way I can describe ABE is that it hides what you do not have access to, rather than ensuring you can see what you do have permissions for. For example, in the tree A\B\C, if you have permissions to A and C, but not permissions to B, you will not see C. This is because ABE has hidden what you don’t have access to (B), a by-product of which is that (C) won’t be visible in a default explorer navigation.
  • Having a single virtual print spooler still has a single point of failure – the spoolsv.exe process running on the host system. If that dies because of a configuration error, that error will most likely fail over to any other nodes that can host that resource group. Regardless of ensuring you don’t use kernel-mode (version 2), and only use user-mode (version 3) drivers, any number of issues can occur somewhere in the print process, whether it’s a third-party print processor causing issues, a non-standard port type, or just a poorly written unidrv support DLL. Everything is a lot more transparent with 2003 clustering – drivers, processors and ports all follow the virtual spooler, which most of the time is good, except when you have a problem.
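
To make the ABE point above concrete, here is a toy model in Python of the A\B\C example. It is not how Windows evaluates access; it only captures the rule described above, that a folder can be reached by browsing only if every folder above it is also listed. The function name and the set-of-paths representation are mine, and reaching C directly by typing its full path is outside what this models.

```python
def visible_by_browsing(path, readable):
    """Return True if a user browsing down from the top would reach `path`.

    path is a backslash-separated folder path such as A\\B\\C.
    readable is the set of folder paths the user has access to.
    ABE lists a folder only if the user has access to it, so every folder
    along the path must be in `readable` for the last one to be reachable.
    """
    parts = path.split("\\")
    for depth in range(1, len(parts) + 1):
        if "\\".join(parts[:depth]) not in readable:
            return False
    return True


# Permissions on A and C, but not on B (the example from the text).
readable = {"A", "A\\B\\C"}
print(visible_by_browsing("A", readable))        # True
print(visible_by_browsing("A\\B", readable))     # False: B is hidden
print(visible_by_browsing("A\\B\\C", readable))  # False: hidden behind B
```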

Testing, reproduced from the standard Microsoft confclus.doc document:

Test: Start Cluster Administrator, right-click a resource, and then click “Initiate Failure”. The resource should go into a failed state, and then it will be restarted and brought back into an online state on that node.
Expected Result: Resources should come back online on the same node

Test: Conduct the above “Initiate Failure” test three more times on that same resource. On the fourth failure, the resources should all failover to another node in the cluster.
Expected Result: Resources should failover to another node in the cluster
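
The behaviour the first two tests exercise can be sketched as a tiny simulation: failures are answered with a restart on the same node until a threshold is reached, and the next failure moves the group. The threshold of 3 below is taken from the test description above (fourth failure fails over); the function name is mine, and the model deliberately ignores the time period over which the cluster counts failures.

```python
def simulate_failures(failure_count, restart_threshold=3):
    """Return the action taken after each simulated resource failure.

    The first `restart_threshold` failures restart the resource on the same
    node; the next failure fails the group over, and the simulation stops.
    """
    actions = []
    for failure_number in range(1, failure_count + 1):
        if failure_number <= restart_threshold:
            actions.append("restart on same node")
        else:
            actions.append("fail over group to another node")
            break
    return actions


for number, action in enumerate(simulate_failures(4), start=1):
    print(number, action)
```

If a resource fails over sooner or later than this in testing, its restart settings are the first thing I would check.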

Test: Move all resources to one node. Start Computer Management, and then click Services under Services and Applications. Stop the Cluster service. Start Cluster Administrator on another node and verify that all resources failover and come online on another node correctly.
Expected Result: Resources should failover to another node in the cluster

Test: Move all resources to one node. On that node, click Start, and then click Shutdown. This will turn off that node. Start Cluster Administrator on another node, and then verify that all resources failover and come online on another node correctly.
Expected Result: Resources should failover to another node in the cluster

Test: Move all resources to one node, and then press the power button on the front of that server to turn it off. If you have an ACPI compliant server, the server will perform an “Emergency Shutdown” and turn off the server. Start Cluster Administrator on another node and verify that all resources failover and come online on another node correctly. For additional information about an Emergency Shutdown, see the following articles in the Microsoft Knowledge Base:
325343 HOW TO: Perform an Emergency Shutdown in Windows Server 2003
297150 Power Button on ACPI Computer May Force an Emergency Shutdown
Expected Result: Resources should failover to another node in the cluster
Warning: Performing the Emergency Shutdown test may cause data corruption and data loss. Do not conduct this test on a production server

Test: Move all resources to one node, and then pull the power cables from that server to simulate a hard failure. Start Cluster Administrator on another node, and then verify that all resources failover and come online on another node correctly
Expected Result: Resources should failover to another node in the cluster
Warning: Performing the hard failure test may cause data corruption and data loss. This is an extreme test. Make sure you have a backup of all critical data, and then conduct the test at your own risk. Do not conduct this test on a production server

Test: Move all resources to one node, and then remove the public network cable from that node. The IP Address resources should fail, and the groups will all failover to another node in the cluster. For additional information, see the following article in the Microsoft Knowledge Base:
286342 Network Failure Detection and Recovery in Windows Server 2003 Clusters
Expected Result: Resources should failover to another node in the cluster

Test: Remove the network cable for the private heartbeat network. The heartbeat traffic will fail over to the public network, and no failover should occur. If failover occurs, please see the “Configuring the Private Network Adaptor” section of confclus.doc.
Expected Result: There should be no failures or resource failovers


References:


Guide to Creating and Configuring a Server Cluster Under Windows Server 2003

http://www.microsoft.com/downloads/details.aspx?familyid=96F76ED7-9634-4300-9159-89638F4B4EF7&displaylang=en

Best practices for installing and upgrading cluster nodes

http://technet2.microsoft.com/windowsserver/en/library/87f23f24-474b-4dea-bfb5-cfecb3dc5f1d1033.mspx?mfr=true

Best practices for configuring and operating server clusters

http://technet2.microsoft.com/windowsserver/en/library/2798643f-427a-4d26-b510-d7a4a4d3a95c1033.mspx?mfr=true

Before Installing Failover Clustering

http://msdn2.microsoft.com/en-us/library/ms189910.aspx

Cluster Configuration Best Practices for Windows Server 2003

http://www.microsoft.com/downloads/details.aspx?FamilyID=98BC4061-31A1-42FB-9730-4FAB59CF3BF5&displaylang=en

Server Cluster Best Practices

http://technet2.microsoft.com/windowsserver/en/library/8c91dba9-edfc-48b5-8d98-48d6536e0db81033.mspx?mfr=true

Cluster architecture

http://download.microsoft.com/download/0/a/4/0a4db63c-0488-46e3-8add-28a3c0648855/ServerClustersArchitecture.doc

Creating and Configuring a Highly Available Print Server

http://download.microsoft.com/download/2/a/9/2a9c5a6b-472a-40b0-942f-3ba50240ccd9/ConfiguringAHighlyAvailablePrintServer.doc

Disk quotas and clusters

http://technet2.microsoft.com/windowsserver/en/library/1ee8754e-48d6-4472-9b53-29e8d1de09f81033.mspx?mfr=true



Wayne's World of IT (WWoIT), Copyright 2008 Wayne Martin.
