To truly reap the rewards of a multiprocessor NT system, you have to use threads
Shashi Prasad
Multithreading (MT) is becoming increasingly attractive for applications; it offers one of the best choices for harnessing the power of SMP (symmetric multiprocessing) machines. In my article "Weaving a Thread" (October BYTE), I discussed multiprocessing and MT on Solaris and Windows NT. In this article, I'll take a closer look at the Win32 interface in Windows NT for developing MT applications.
Processes and Threads
A process in NT is a running instance of an application; it has its own virtual address space and owns system resources, such as memory, windows, and open files. When a process is created by a call to CreateProcess, an initial thread is automatically built for the process. You create additional threads by calling the function CreateThread.
The newly created thread starts executing the routine specified by lpStartAddr, and this routine can take the optional argument lpThreadParm. The thread-routine argument is generally a dynamically allocated variable or a global variable. Each thread in NT has its own user and kernel stack, and the size of the stack for the newly created thread can be specified in cbStack.
Threads in NT have 32 different priority levels. The dispatcher -- the module responsible for thread-scheduling -- uses a preemptive priority scheduler. In Windows NT, the highest-priority thread is always scheduled to run. Threads can change their priority by calling the function SetThreadPriority.
NT threads can be suspended and resumed by other threads in the process via calls to SuspendThread and ResumeThread, respectively. You can also create a thread in suspended state, which means it doesn't start execution until the creating thread calls ResumeThread.
A thread can terminate in one of the following ways: It can return from the initial routine; it can call the function ExitThread to terminate itself; or it can be terminated by some other thread in the process that calls TerminateThread. When a thread terminates, the thread object becomes signaled -- all other threads waiting for the thread to terminate are notified. A waiting thread can determine the exit status of a terminated thread with the function GetExitCodeThread.
Each thread has a unique identifier that can be retrieved by calling the function GetCurrentThreadId (this identifier is also returned in the lpIDThread argument during thread creation). However, several Win32 functions require an object's handle, which, for a thread, is separate from its ID. A thread can retrieve a handle to its own thread object by calling the function GetCurrentThread. (A handle is also returned by the function CreateThread.) For example, a thread that wants to change its own priority can pass the handle from GetCurrentThread to SetThreadPriority.
Thread Synchronization
In a multithreaded program, all threads within the process run in a single address space. Threads allow easy data sharing; however, safeguards against corruption of the shared data are required. All access to shared resources must be protected by mutual exclusion.
In NT, mutexes are used to serialize critical sections of code. A critical section is defined as a segment of code in which a thread accesses shared, modifiable data, and where state changes happen over several instructions. Hence, only one thread may execute that section at a given time, and access must be serialized by some form of locking mechanism.
Before entering the critical section, the calling thread acquires the mutex lock by calling the WaitForSingleObject function. If the lock is held by some other thread, the calling thread is suspended until the thread holding the mutex lock releases it by calling ReleaseMutex.
Critical-section objects are similar to mutexes but can be used only by threads of a single process. EnterCriticalSection is used to acquire ownership of a critical section, and LeaveCriticalSection releases ownership. This is one of the fastest mechanisms for mutual exclusion; only a few instructions are executed when there is no contention for the critical section. (If contention occurs, a kernel synchronization object is automatically used.)
In a multithreaded application, it's common to divide work among multiple threads. In such cases, one thread might wait for another thread to reach a particular state before proceeding. NT provides event objects for thread synchronization. One thread can call WaitForSingleObject, thus blocking its execution until a certain condition is satisfied. The other thread, after satisfying the condition, can notify the waiting thread by calling SetEvent.
Semaphore objects are similar to mutexes, except there is no ownership associated with semaphores. Additionally, semaphores have resource counts, which allows multiple threads to acquire a semaphore at the same time.
Finally, NT provides atomic memory operations for integer variables. The functions InterlockedIncrement and InterlockedDecrement increment and decrement a variable, respectively, while the function InterlockedExchange atomically stores a new value in a variable and returns the value it held before.
Threads and Performance
Once you've grasped the basic concepts of NT threads, you need to consider the performance and scalability of threaded applications. A thread in NT can normally be in one of the following states at any given time: waiting for a specified event to occur (it cannot run); ready to run and waiting for an available processor; or running on a processor.
Threads in the ready or running state can take advantage of the CPUs (presuming they are running on a multiprocessor system). Excessive interthread synchronization can cause too many threads to be in the waiting state, and the creation of too many threads can cause multiple threads to be in the ready state. The number of threads in the running state can never be more than the number of processors. When the number of threads in the ready state is much higher than the number of running threads, the kernel spends a lot of time doing thread-context switching.
As an illustration, consider the various threading models used in the design of a multithreaded TCP/IP server. (This assumes you're familiar with Windows socket APIs on Windows NT.)
The first model is single-threaded. The main thread does an accept call on the socket and handles the client request. The disadvantage of this model is that while the server is processing a client request, all other requests are being queued.
The second model is also single-threaded: The main thread does a select call on all the connected sockets. The select call indicates which connected sockets have data available (i.e., they are waiting to be serviced). Now multiple clients can be serviced concurrently, but -- as in the previous model -- this does not exploit the power of multiple CPUs.
In the third model, the main thread creates a thread for each client. This model is extremely easy to program, but it does not scale well for a high number of active clients. Creating multiple threads takes advantage of multiple processors but uses excessive system resources and causes scheduling overhead. The performance of the system degrades under "burst" traffic. As the number of ready threads increases, the system spends lots of time context-switching threads in and out of the running state.
Finally, in the fourth model, a pool of worker threads is created to handle client requests. The main thread does a select on all the connected sockets; each new request gets passed to one of the worker threads. The number of worker threads should be slightly greater than the number of processors, because some of the worker threads might become blocked.
This model uses fewer system resources than the third model, but there's a built-in context switch on every transaction between the main thread and the worker threads. The context switch might not be a problem for longer transactions, but the overhead could be high for short transactions. Also, unless the main thread does some rotation on the results of the select call, this model does not have built-in fairness (i.e., an active client may block other, less active ones).
I/O-Completion Ports
To overcome the limitations of these four models, the engineers of NT 3.5 created a mechanism called I/O-completion ports. These ports are designed to handle asynchronous or overlapped I/O. CreateIoCompletionPort associates a port with a collection of file handles, and the port acts as a synchronization point. When a pending I/O operation on any of the file handles completes, an I/O-completion packet is then queued to that particular port. A number of worker threads can manage I/O for clients by calling GetQueuedCompletionStatus to wait on the I/O-completion port.
I/O-completion ports have built-in concurrency control. The kernel tries to keep the number of runnable threads associated with a port at or below the port's concurrency value (which is specified when the port is created). When a thread calls GetQueuedCompletionStatus, the call returns when a completion packet is available. When one of the threads associated with a completion port is blocked, the kernel selects another thread waiting on the completion port to run. Thus, the system isn't deluged with runnable threads.
Threads that block on a completion port are awakened in last-in/first-out (LIFO) order, while I/O requests are serviced in first-in/first-out (FIFO) order. Running threads -- after completing a transaction -- can pick up the next request without causing any context switch. I/O-completion ports work efficiently under all loads; their performance does not suffer under heavy traffic.
If my sample TCP/IP server were implemented using I/O-completion ports, the main thread would create an I/O-completion port along with a pool of worker threads to wait on the port. This model is the most efficient; it does not suffer from context-switching overhead (as the fourth model would). The thread that reads the transaction services it. Fairness is built into the completion-port model, since I/O requests are satisfied in FIFO order.
The Common Thread
MT on an SMP machine can provide optimal performance and scalability if the applications are designed correctly. You should not be surprised to see poorly designed applications run slower on an SMP machine than they do on a uniprocessor machine.
Windows NT is a good environment for developing multithreaded applications, but it's important to remember that the OS alone is not responsible for performance and scalability. Understanding such features as I/O-completion ports and overlapped I/O is key to building scalable multithreaded applications on Windows NT.
Mutex -- Serializes access to shared data.
Critical-section object -- Faster than a mutex; cannot be shared across processes.
Event object -- Used to signal occurrence of an event.
Semaphore -- Controls multithreaded access to a shared but limited resource.
Interlock call -- Provides atomic access to integer variables.
Shashi Prasad is vice president
shaship@anstec.com
or on BIX c/o "editors."