Written by Mike Clement, Summer-Fall 2002
Originally translated into HTML by Kevin Griffin, Fall 2002
Copyright (C) 2002 Michael R. Clement.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.1 or
any later version published by the Free Software Foundation; with
Invariant Sections being "The GNU Free Documentation License" and all
appendices listed in this document, with no
Front-Cover Texts, and with no Back-Cover Texts. A copy of the license
is included in the section entitled "GNU Free Documentation License"
(see GNU Free Documentation License).
--------
Overview
This document is a record of the process of setting up the Seattle University Beowulf Cluster, or SUBC. This is NOT a comprehensive guide to creating, using, and maintaining a Beowulf Cluster; i.e. USE AT YOUR OWN RISK. Think of it as nicely formatted field notes. I will cover the aspects of configuring and using *this* cluster that appeared most relevant to me at the time of this writing.
I assume that the audience has some level of familiarity with Linux or Unix, so that if I say look at such-and-such a manpage, or run such-and-such command, the reader has a vague idea as to what I mean. However, I attempt to keep things simple, and do not assume any high level of knowledge regarding administration of *nix systems or Beowulf clusters. This is partially because I was pretty new to Red Hat when I took on this project, so a lot of my original notes were written so that I would later understand what I had written. If any of what I discuss goes beyond the reader's knowledge of *nix systems or beowulfing, or if this document does not quench the thirst for knowledge (not that I expect it to), I invite the reader to make use of the resources at the end of this document.
-----------
The Cluster
The SUBC, in its original form, consists of five identical desktop computers, each with a single AMD Athlon XP 1800+ on an ASUS A7V266-E motherboard, 512MB DDR-RAM, 30GB hard drive, National Semiconductor DP83815 ethernet interface (the server node has two), and all the other fixin's. All machines have Red Hat linux version 7.1 (default installation), and run the 2.4.2-2 kernel out-of-the-box.
The Cluster uses a server/client configuration. Our master node is called beastmaster, and the client or processing nodes are babybeast1 through babybeast4. Beastmaster acts as a proxy for the babybeasts, with one of its ethernet interfaces to the SU network, and the other to the cluster's main switch, a NetGear FS108 (100-base-t).
-------------
Configuration
When I came into the position of getting this cluster up and running, the hardware had been purchased and the machines installed. The proxy and internal IPs were already configured. My job was to get all needed network services running, make the nodes play nicely with each other, establish a server/client relationship within the cluster, and get all the clustering framework installed and running. The main things that needed to be configured were remote administration, remote execution of services and binaries, passing messages between related processes, and time synchronization between the nodes.
I do not like to make changes that I cannot undo, so I decided to test most of the configuration on babybeast1 before changing all the nodes. This also would later give me a chance to create and test a cloning script, for copying the general configuration from node to node.
-------------------------------------------------------------
Daemons, Daemons, Everywhere - Shutting off unwanted services
My first task was to disable all the services that did not need to be running on the client nodes. All the nodes were running out-of-the-box Red Hat installations including Gnome, so there was a lot to turn off, including gdm (the Gnome Display Manager), gpm (Gnome Print Manager), lpd, sendmail, talk, and anything else not strictly necessary for a beowulf node. The idea is that client nodes are meant to do processing, and should have just enough services running in order to do that job. Anything else running takes up valuable CPU time and memory. The server, on the other hand, serves as a coding and administrative workstation as well, so it needed to run all the regular services of a normal desktop machine, including a nice interface.
I used ntsysv, tksysv (graphical version of ntsysv), and chkconfig to disable services. These are standard utilities included in the Red Hat distribution, and can be found at /usr/sbin/ntsysv, /usr/X11R6/bin/tksysv, and /sbin/chkconfig. Chkconfig takes a --list argument to display all running services. Typing chkconfig alone lists all available commands and options.
Some services run through xinetd, the inetd replacement. Xinetd runs in the background, and calls other services, like ftp, telnet, and rlogin, when a remote machine sends a request for one of them. The easiest way to control the services provided by this is again to use a tool like ntsysv or chkconfig. These will allow the administrator to turn on or off xinetd services identically to turning on or off standalone daemons. For example, to turn off telnet with chkconfig, type:
# /sbin/chkconfig --level 345 telnet off
The --level switch with numeric argument indicates which runlevels a given service should run at. I could just as well have specified all runlevels "0123456"; I just chose those that seemed the most relevant (the machine runs at runlevel 5 when in its multi-user, fully-booted state).
Another way to control xinetd is to edit the xinetd configuration files, located in /etc/xinetd.d/. In this directory, each available service is listed as its own file, each file containing all the needed parameters for that service. Each file should contain a line that reads "disable = yes" or "disable = no". Making this line read "disable = yes" disables that service. Or, deleting the file does the same thing, just in a more permanent fashion. After making changes, it is important to run (as root) "killall -1 xinetd". This will restart the xinetd daemon, which makes it reread all the configuration files, and updates it to the new configuration.
Note that I did not want to turn off xinetd altogether; I will need some of its services running for remote command calls later.
One other item to take care of is gdm, which every machine was running. My problem was that I had no idea where gdm was started from. After much searching through /etc with no luck, I finally found a page in the Red Hat Reference Guide:
http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/
that described the boot process. It turns out that a graphical login system, like gdm, is started at the end of /etc/inittab, a boot script. I commented out the last line in /etc/inittab so it read: "# x:5:respawn:/etc/X11/prefdm -nodaemon", and no more X or gdm on those systems. Reboot, and these machines were wicked fast.
As a reference point, the services that each of my client nodes ended up running were anacron, atd, crond, keytable, kudzu, netfs, network, nfslock, random, rawdevices, sshd, syslog, xinetd (rsh, rlogin). It may be okay to turn off some of these as well, given my limited knowledge of Red Hat, I decided to play it safe.
------------------------------------
Who Are You? - Host name translation
Now that these machines were running smoothly, I wanted to be able to remotely administer them, and remotely call programs on them. Constantly swapping keyboard and monitor cables is a pain; I could get a KVM switch, but that costs money, and why spend money when you have ssh, right? Well, in the process of weaning these machines off their peripherals, I forgot to check if the BIOS errored on a missing keyboard, which it turned out they all did. This means that when I rebooted each client, they all got stuck at the POST screen giving an error message, and the operating system never booted. So once again, plug in the monitor and keyboard, turn off the BIOS checks, and I was set to go.
Typing in IP addresses by hand is also a pain, so I wanted to refer to the nodes by more friendly names. Easy enough to do, with the /etc/hosts file. /etc/hosts is the local equivalent of DNS. It aliases a name with an IP address, so instead of typing 10.0.0.2, I could type babybeast1, or if I was REALLY lazy, bb1, or b1. So for every babybeast, I added:
10.0.0.N babybeastN
to beastmaster's /etc/hosts. I also aliased beastmaster on each node, so they could refer to the server in a friendly way. Note that on beastmaster, make a line like: "10.0.0.1 beastmaster", separate from the localhost line. Some MPI packages do not like a machine referred to by its "localhost" or 127.0.0.1 identity; I'll come back to that later.
-----------------------------------------------------------
Playing Nicely with rsh - Configuring rsh and trusted hosts
Ssh is great for administration in the real world, where you don't want everyone and their kid brother to know your password and what you're doing. However, I was dealing with a closed network, with five trusted machines on it. I wanted to be able to make remote calls from the server to any node without even giving a password! Rsh and rlogin are perfect for such a thing. Rsh and rlogin make up the insecure predecessor of ssh. Rsh allows remote procedure calls, sending a single command to a remote host. Rlogin supplies an interactive user prompt, like telnet. With the help of the /etc/hosts.equiv file, I can make remote calls and logins easy enough for an automated script to do. Which is important, considering that message-passing packages like LAM-MPI require this capacity. Both of these services are selectable with the same daemon/service tools I used before. I enabled them at runlevels 3, 4, and 5.
/etc/hosts.equiv lists "trusted" hosts, remote machines that can log on to the local machine without supplying a password (given that the UID of the user on the local machine matches the UID of a valid user on the remote machine). Initially, just to keep things simple, I kept just one pre-existing user across all nodes with the same UID of 500. There exist scripts that clone user lists from machine to machine, so it is also possible to allow any user to cross over by cloning all machines' password files. For this cluster, since it is not going to have several large projects underway on it simultaneously, I chose to let only people with that one account's password do remote logins, and thus operate the cluster. Note that /etc/hosts.equiv does not apply to root (usually that requires a similar file existing at /root/.rhosts, with special permissions - see the rhosts(5) manpage for details). On the client node, I typed in beastmaster on its own line in the file, so that only calls from beastmaster would be accepted with supplying a password. In addition, since the cluster sits on its own network, it is harder to gain unauthorized access.
Later down the road, I decided to copy accounts from beastmaster to each client node, so anyone with an account on beastmaster could use the cluster. This is a functional system that works as long as multiple users do not try to run jobs simultaneously, thus slowing down the entire system. Initially, I was going to use trusted rsh once again, this time with the special root permissions, so that beastmaster could push the account information out to each node. Unfortunately, I found that making this work is harder than it looks. After several attempts to get it configured, I decided to take the easy way out and put a cron job on each node to fetch the account files (/etc/passwd and /etc/group) from beastmaster once an hour. This would require one extra NFS mount, which will be discussed later.
--------------------------------------------------
What Time Is It? - Using the time daemon and rdate
Clustering applications can be picky about all nodes having synced clocks. I wanted to automatically set the clocks on all nodes to the server's clock once every day. Unless precision is of the utmost importance, the time daemon and rdate are a good way of doing this. The time daemon resides on the server, allowing a remote node to request the server's internal clock time. The program rdate is the client end, which is called from the command line:
# rdate -s beastmaster
The -s tells the client not to report the time on the screen, but rather to set the local machine's time according to the server's time. I added a cron job to do this once a day. In the folder /etc/cron.daily/, I created a new file called check_time, which holds the above command, after the obligatory "#!/bin/sh".
If high precision synchronization was a concern in this cluster, I would consider using ntpd, or something similar. The ntp system, or Network Time Protocol, is much higher precision than the time daemon, and performs advanced operations such as determining the drift of a machine's clock with respect to the server's time, and automatically adjusting itself at regular intervals. Ntp also tends to account for network delays better than the time daemon. However, the time daemon was handy, and I wasn't too concerned with a few extra decimals of precision.
-------------------------------------------------------------------
It's an NFS World - Relying on the server for files and executables
The Network File System is the *nix equivalent of Windows' Network Neighborhood (SMB/CIFS) or Apple's AppleTalk/AppleShare, and within a cluster, it is the way that executables and data files are transferred between nodes. Just as I can mount a CD-ROM on a mountpoint, say /mnt/cdrom, I can also mount a share from another machine on any mountpoint I choose. Even, say, /usr/local or /home, which is just what I did.
The /etc/exports file defines all NFS shares from that machine and any special options or restrictions on any of them. The shares I wanted to export were on beastmaster, which also has a network interface to the SU network. As such, I was not comfortable with leaving any files openly visible, let alone writable, to that network. Fortunately, the exports file offers a range of options to restrict how a share is exported. On beastmaster, I added to /etc/exports:
/usr/local 10.0.0/255.255.255.0(ro)
/home 10.0.0/255.255.255.0(rw,sync)
The first column is the folder that is exported. Then comes the options. The dotted octets are fragments of an IP address and a subnet mask. I effectively told the NFS daemon to allow any machine on a subnet of 255.255.255.0 with an IP address of 10.0.0.* to mount the share with the options stated in parentheses. Since the SU network uses a netmask of 255.255.240.0, this provides a decent amount of protection. See the exports(5) manpage for more options and examples. To add beastmaster's /etc, for easy access to account information, add:
/etc 1.0.0/255.255.255.0(ro)
Then I modified the /etc/fstab file on all nodes:
beastmaster:/usr/local /usr/local nfs defaults 0 0
beastmaster:/home /home nfs defaults 0 0
And for my cron job to copy user accounts:
beastmaster:/etc /etc/beastmaster nfs defaults 0 0
After rebooting, or calling mount /usr/local and mount /home, all the files at in those spots on the server will magically list on each node. Instead of programming the transfer of data files from machine to machine, and manually copying binaries around, I let the operating system do the work for me. Of course, there is a slight delay involved in executing a binary over the network, but on a fast network and for a large job, it won't be noticeable compared to the processing time. Also note that now the nodes may not boot properly if the server is down - most boot sequences are picky about the /etc/fstab file being accurate to what really exists.
------------------------------------------------------
Who, that? That's my clone - Cloning the configuration
Everything seemed to be configured on babybeast1, so I wanted to find a way to copy this configuration to the other nodes. I wrote a bash script called subc-config, which I have included as an appendix. It details an interactive yes/no approach to configuring individual nodes. It is written specifically for the SUBC, but should be easy to hack for other systems.
------------------------------------------------
Testing it out - a quick and dirty sample script
Jacek Radajewski and Douglas Eadline put together a couple of documents that I highly recommend. I referred to them extensively in the process of configuring this cluster. The documents that I used were a bit dated (1998-1999), but still largely applicable. In the document titled "Beowulf HOWTO", there is a sample script and some C-code to try out. It uses rsh and nfs, and no special message passing libraries. It is simple, but demonstrates the power of a cluster with minimal effort.
The script sums the square roots of all numbers from one to one-billion. In the appendix I have included scripts and some C-code based off their example, that is good to show the cluster in action. Their example is more complete; I highly recommend taking a look at theirs. Mine just gives the basic idea. Note that the file named "output" is a named pipe, and can be create with the command mkfifo(1).
-------------------------------------------
Off to the library - Communicating with MPI
That test script is cool, and proves how powerful the command line is on a *nix system. But for real performance and power, I needed to use a compiled language like C or Fortran, with more streamlined communication versus bulky rsh calls and data transfer via NFS.
Historically, this functionality has been provided by the PVM standard, which has been replaced in popularity recently by the MPI standard, or the Message Passing Interface. It is a collection of coding libraries for C and Fortran77 which allow socket level communication between cluster nodes.
My choice for an MPI implementation is LAM-MPI, which rates highly on the performance scale and offers a lot of monitoring and debugging capabilities. It is slightly different from some implementations in that every node must run a special daemon, which is used to call and sometimes control the cluster job. This seems to allow more debugging, and doesn't seem to slow things down noticeably, so I consider it a feature. The user also has the option of running cluster jobs in a direct sockets mode, which bypasses the daemon for most of the job. See www.lam-mpi.org for more information. For a list of several free and commercial MPI implementations, see www-unix.mcs.anl.gov/mpi/.
I downloaded and installed lam-6.5.6 into /usr/local/lam, and tried to make it work. After figuring out a bit about the configuration files, I determined that /usr/local/lam/etc/lam-bhost.def seemed to be the most important configuration file, where all the cluster's nodes are listed. It seems that even the server (or whichever machine the lamboot command is called from) must be entered in this file. In other words, if I wanted to have a server node which strictly served files and monitored the cluster, I would not include it in this file, and I would have to run lamboot (which starts the daemon on all nodes) from another node. So I entered all the babybeasts and beastmaster, and tried to boot. This failed. Then I looked in the online FAQ, and discovered what I mentioned earlier about making a separate line for beastmaster in the /etc/hosts file. If "lamboot -d" (full debug output) lists any nodes at 127.0.0.1, it will fail. So I listed beastmaster as 10.0.0.1 on a separate line from the localhost line in each node's /etc/hosts file, and after some tweaking, got the cluster booted. Note: after setting up the lam-bhost.def file, run "recon -dv" to see what is going on in the cluster, before trying to boot it for the first time.
They provide a downloadable test suite, lamtests-<version>, which I tried. It seemed happy enough, and reported no errors at the end of the run, so it all seemed good to go. Now to test this all out.
If you decide to set up multiple users, be sure to check out the account synchronization that I use in the configuration script in the appendix, along with the client-conf/subc_sync_accts file; it is important to update the $PATH and $LAMHOME variables on each node, not just the server. The script provides one method of doing this; there are others, but lamboot will fail if the path to the binaries is not provided on every machine.
-------------------------------
It's alive! - Using the cluster
I finally got to actually use the cluster as a cluster, using MPI. There is an example included in the lamtest suite, also found in some MPI reference books, for calculating pi from an integral. After typing in the Fortran code, I ran (as our special user sps):
$ mpif77 -L/usr/local/lam/lib -I/usr/local/lam/include find_pi.f -o find_pi
To compile. Note that due to the special install location of lam, I had to use special library and include arguments (-L and -I - see the g77(1) manpage). Then, I booted the cluster and ran the program from the directory the compiled code was in:
$ lamboot -v
$ mpirun N find_pi
.... look at cool program running ....
$ lamhalt -v
On the mpirun line, find_pi was the compiled code; I was in the directory where find_pi lived. The 'N' told mpirun to use all nodes; there is syntax for specifying certain nodes to use, and for more complex runs, a schema file can be created, which specifies details about how to run the MPI application. See the lam docs for further information.
----------
Conclusion
The next item on my agenda is to learn more about programming for the cluster, and begin some physics research projects on it. I invite the reader again to take a look at the resource links listed at the end of the document, and to download, view, and experiment with the resources I have included with this document. A list of various configuration files, scripts, and examples follows below.
-------------------------------------
Appendix - Example code, scripts, etc.
(See the subdirectory marked in parentheses for the following files)
Client node configuration files (client-conf):
chkconfig.txt - Output from /sbin/chkconfig --list
fstab /etc/fstab - Mount file
hosts /etc/hosts - Host to IP alias file
hosts.equiv /etc/hosts.equiv - Trusted hosts file
network /etc/sysconfig/network - Network settings
subc_check_time /etc/cron.daily/subc_check_time - Custom cron job
Configuration/Administration scripts (scripts):
subc-config - Configuration script -- Use at your own risk!
Server configuration files (server-conf):
chkconfig.txt - Output from /sbin/chkconfig --list
exports /etc/exports - NFS exports file
fstab /etc/fstab - Mount file
hosts /etc/hosts - Host to IP alias file
lam-bhost.def /usr/local/lam/etc/lam-bhost.def - Cluster node list file
network /etc/sysconfig/network - Network settings
Rsh/NFS test (test-code):
1-way.txt - Shell script for 1 node
4-way.txt - Shell script for 4 nodes
results.txt - Some basic results from running this code on SUBC
sigmasqrt.c - C code for sigmasqrt (compile with -lm)
------------------------
References and Resources
Brown, Robert G. "Maximizing Beowulf Performance". August 28, 2000.
http://www.usenix.org/publications/library/proceedings/als2000/brownrobert.html
Gropp, William; Lusk, Ewing; Skjellum, Anthony. "Using MPI" Second Edition. Cambridge: The MIT Press, 1999.
LAM-MPI. http://www.lam-mpi.org
Pacheco, Peter S. "Parallel Programming with MPI." San Fransisco: Morgan Kaufmann Publishers, Inc, 1997.
Radajewski, Jacek and Eadline, Douglas. "Beowulf HOWTO", v1.1.1, November 22, 1998. Available at http://www.beowulf-underground.org/doc_project/index.html
Radajewski, Jacek and Eadline, Douglas. "Beowulf Installation and Configuration HOWTO", v0.1.2, June, 1999. Available at http://www.beowulf-underground.org/doc_project/index.html
Red Hat v7.x Support Page. http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/