New NUMA Support with Windows Server 2008 R2 and Windows 7

The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors (LP) on a single computer. New commodity systems are now appearing that leverage non-uniform memory access (NUMA) architectures. In the near future, a system with 4 CPU sockets, 8 processor cores per socket, and Simultaneous Multi-Threading (SMT) enabled on each core will reach 64 Logical Processors. Many high-end server-class solutions may need to be designed with NUMA awareness in order to achieve linear performance scaling on such systems. Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.

Abstract*

The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.

System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.

In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.

The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.

First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the GetNumaHighestNodeNumber function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the GetProcessAffinityMask function. You can determine the node for each processor in the list by using the GetNumaProcessorNode function. Alternatively, to retrieve a list of all processors in a node, use the GetNumaNodeProcessorMask function.
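The grouping step described above can be sketched independently of the Windows headers. This is a minimal illustration, not code from the SDK: BuildNodeMasks and its nodeOfProcessor callback are hypothetical names, and the callback stands in for a wrapper around GetNumaProcessorNode. It assumes a single 64-bit affinity mask such as the one returned by GetProcessAffinityMask.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Groups the processors named in a 64-bit affinity mask by NUMA node.
// nodeOfProcessor is a hypothetical callback that returns the node for a
// processor index, or -1 if the lookup fails; on Windows it could wrap
// GetNumaProcessorNode.
std::map<int, std::uint64_t> BuildNodeMasks(
    std::uint64_t processAffinityMask,
    const std::function<int(unsigned)>& nodeOfProcessor)
{
    assert(nodeOfProcessor);
    std::map<int, std::uint64_t> nodeMasks;
    for (unsigned cpu = 0; cpu < 64; ++cpu)
    {
        const std::uint64_t bit = std::uint64_t{1} << cpu;
        if ((processAffinityMask & bit) == 0)
            continue;                 // processor not available to this process
        const int node = nodeOfProcessor(cpu);
        if (node < 0)
            continue;                 // lookup failed; skip this processor
        nodeMasks[node] |= bit;       // node numbers need not be contiguous
    }
    return nodeMasks;
}
```

A map is used rather than an array because, as noted above, node numbers are not guaranteed to be dense or to match the node count.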

After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the SetProcessAffinityMask function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the SetThreadAffinityMask function.

Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the GetNumaAvailableMemoryNode function. The VirtualAllocExNuma function enables the application to specify a preferred node for the memory allocation. VirtualAllocExNuma does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.

*Note: This article is in part a reprint of pre-release Windows SDK documentation. Technical details are subject to change.

Processor Groups

Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.

Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity.

When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group.
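The 128-processor example above is a ceiling division by the group size. The sketch below shows only that arithmetic; MinimumGroupCount and kMaxProcessorsPerGroup are hypothetical names, and the result is a lower bound, since the operating system also keeps NUMA nodes whole when it forms groups and may therefore create more groups than this.

```cpp
#include <cassert>
#include <cstdint>

// A processor group holds at most 64 logical processors.
constexpr std::uint32_t kMaxProcessorsPerGroup = 64;

// Lower bound on the number of processor groups for a given processor count.
std::uint32_t MinimumGroupCount(std::uint32_t logicalProcessors)
{
    assert(logicalProcessors > 0);
    return (logicalProcessors + kMaxProcessorsPerGroup - 1) / kMaxProcessorsPerGroup;
}
```

For the actual group count on a running system, call GetActiveProcessorGroupCount rather than computing it.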

The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.

For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper Supporting Systems That Have More Than 64 Processors at http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx.

New Functions

The following new functions are used with processors and processor groups. See the Windows SDK header files winbase.h and WinNT.h. These APIs are exposed via "kernel32.dll" and documented within the Windows SDK (which will be available at beta release). See example usage scenarios within the downloads section of this Code Gallery resource page.

CreateRemoteThreadEx
Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.

GetActiveProcessorCount
Returns the number of active processors in a processor group or in the system.

GetActiveProcessorGroupCount
Returns the number of active processor groups in the system.

GetCurrentProcessorNumberEx
Retrieves the processor group and number of the logical processor in which the calling thread is running.

GetLogicalProcessorInformationEx
Retrieves information about the relationships of logical processors and related hardware.

GetMaximumProcessorCount
Returns the maximum number of logical processors that a processor group or the system can support.

GetMaximumProcessorGroupCount
Returns the maximum number of processor groups that the system supports.

GetNumaAvailableMemoryNodeEx
Retrieves the amount of memory that is available in a node specified as a USHORT value.

GetNumaNodeNumberFromHandle
Retrieves the NUMA node associated with the underlying device for a file handle.

GetNumaNodeProcessorMaskEx
Retrieves the processor mask for the NUMA node whose number is specified as a USHORT value.

GetNumaProcessorNodeEx
Retrieves the node number of the specified logical processor as a USHORT value.

GetNumaProximityNodeEx
Retrieves the node number as a USHORT value for the specified proximity identifier.

GetProcessGroupAffinity
Retrieves the processor group affinity of the specified process.

GetProcessorSystemCycleTime
Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).

GetThreadGroupAffinity
Retrieves the processor group affinity of the specified thread.

GetThreadIdealProcessorEx
Retrieves the processor number of the ideal processor for the specified thread.

QueryIdleProcessorCycleTimeEx
Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group.

SetThreadGroupAffinity
Sets the processor group affinity for the specified thread.

SetThreadIdealProcessorEx
Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.

The following new functions are used with thread pools.

QueryThreadpoolStackInformation
Retrieves the stack reserve and commit sizes for threads in the specified thread pool.

SetThreadpoolCallbackPersistent
Specifies that the callback should run on a persistent thread.

SetThreadpoolCallbackPriority
Specifies the priority of a callback function relative to other work items in the same thread pool.

SetThreadpoolStackInformation
Sets the stack reserve and commit sizes for new threads in the specified thread pool.

New Structures

CACHE_RELATIONSHIP
Describes cache attributes.

GROUP_AFFINITY
Contains a processor group-specific affinity, such as the affinity of a thread.

GROUP_RELATIONSHIP
Contains information about processor groups.

NUMA_NODE_RELATIONSHIP
Contains information about a NUMA node in a processor group.

PROCESSOR_GROUP_INFO
Contains the number and affinity of processors in a processor group.

PROCESSOR_RELATIONSHIP
Contains information about affinity within a processor group.

SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX
Contains information about the relationships of logical processors and related hardware.

Usage Scenarios (See the sample code via the "downloads" tab on this page.)

 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. "Core Parking").
   WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
   WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
   Display(L"MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n",  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);

 
   // How many processors per GROUP?  (Both steps share one scope so that wActiveProcessorGroupCount is still visible here.)
   for (WORD groupnum = 0; groupnum < wActiveProcessorGroupCount; groupnum++)
       Display(L"GROUP=0x%02X \tMaximumProcessorCount=%u \tActiveProcessorCount=%u\n", groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));

    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
    //        RelationGroup
    //               The specified logical processors share a single processor group.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = false;
 
    // The first call (NULL buffer) fails with ERROR_INSUFFICIENT_BUFFER and reports the
    // required size in returnLength; allocate that much and call again.
    while (!done)
    {
        BOOL rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &returnLength);
        if (FALSE == rc) 
        {
            DWORD lastError = GetLastError();
            if (lastError == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw((DWORD)ERROR_OUTOFMEMORY);   // malloc does not set the Win32 last error
            } 
            else 
            {
                if (buffer)
                    free(buffer);
                throw(lastError);
            }
        } 
        else
            done = true;
    }
    ASSERT(buffer);
    TRACE(L"Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n", returnLength);
		
    ptr = buffer;
    while (byteOffset < returnLength) 
    {
        TRACE(L"\tbyteOffset=0x%08X : ptr->Size=0x%08X\n", byteOffset, ptr->Size);
    		
        switch (ptr->Relationship) 
        {
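          case RelationProcessorCore:
        	// GroupMask is an array of GroupCount entries; a core belongs to one group, so entry 0 suffices.
        	Display(L"\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016llX\n", 
        	           ptr->Processor.GroupMask[0].Group, 
        	           (ULONGLONG)ptr->Processor.GroupMask[0].Mask);   // KAFFINITY is 64 bits wide on 64-bit Windows
          break;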
 
          case RelationNumaNode:
        	Display(L"\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016llX\n",
        	           ptr->NumaNode.NodeNumber,
        	           ptr->NumaNode.GroupMask.Group,
        	           (ULONGLONG)ptr->NumaNode.GroupMask.Mask);   // KAFFINITY is 64 bits wide on 64-bit Windows
          break;
 
          case RelationCache:
        	Display(L"\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016llX\n",
        	           ptr->Cache.Level,
        	           ptr->Cache.Associativity,
        	           ptr->Cache.LineSize,
        	           ptr->Cache.CacheSize,
        	           GetCacheType(ptr->Cache.Type),
        	           ptr->Cache.GroupMask.Group,
        	           (ULONGLONG)ptr->Cache.GroupMask.Mask);   // KAFFINITY is 64 bits wide on 64-bit Windows
          break;
 
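          case RelationProcessorPackage:
        	// A package can span more than one group (GroupCount entries); only the first entry is shown here.
        	Display(L"\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016llX\n",
        	           ptr->Processor.GroupMask[0].Group,
        	           (ULONGLONG)ptr->Processor.GroupMask[0].Mask);
          break;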
						
          case RelationGroup:
        	Display(L"\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n",
        	           ptr->Group.MaximumGroupCount,
        	           ptr->Group.ActiveGroupCount);
        	for (int c = 0; c < ptr->Group.ActiveGroupCount; c++)
        	     Display(L"\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%016llX\n",
        		ptr->Group.GroupInfo[c].MaximumProcessorCount,
        		ptr->Group.GroupInfo[c].ActiveProcessorCount,
        		(ULONGLONG)ptr->Group.GroupInfo[c].ActiveProcessorMask);   // KAFFINITY is 64 bits wide on 64-bit Windows
          break;
        		
          default:
            Display(L"\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n", ptr->Relationship);
          break;
        }
        byteOffset += ptr->Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer);

Application Awareness of NUMA Locality

Scalable application design requires NUMA awareness from several perspectives. Herb Sutter describes this process as "Maximize Locality, Minimize Contention". Consider, for example, the processor load required to service interrupts from modern 10 Gb/sec network cards. Ideally, the interrupt processing and any Deferred Procedure Calls (DPCs) occur local to the network device. Windows performance expert Mark Friedman has published a detailed analysis. NUMA locality may be applied to processes, threads, devices, interrupts, and memory.

Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.

The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:
1. The thread’s ideal processor.
2. A processor in the thread’s preferred NUMA node.
3. Other processors in the thread affinity mask.
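
The search order above can be modeled with plain bit masks. The sketch below is a deliberately simplified model and not the Windows scheduler, which also weighs priority, parking, and other state. PickProcessor, LowestSetBit, and the idleMask parameter are hypothetical names introduced for illustration; all masks are assumed to describe processors in one group.

```cpp
#include <cassert>
#include <cstdint>

// Returns the index of the lowest set bit, or -1 if the mask is empty.
int LowestSetBit(std::uint64_t mask)
{
    for (int i = 0; i < 64; ++i)
        if (mask & (std::uint64_t{1} << i))
            return i;
    return -1;
}

// Simplified model of the documented search order. idleMask marks the
// processors that are currently free. Returns -1 if no allowed processor is free.
int PickProcessor(unsigned idealProcessor,
                  std::uint64_t preferredNodeMask,
                  std::uint64_t affinityMask,
                  std::uint64_t idleMask)
{
    assert(idealProcessor < 64);
    const std::uint64_t candidates = affinityMask & idleMask;
    const std::uint64_t idealBit = std::uint64_t{1} << idealProcessor;

    if (candidates & idealBit)
        return static_cast<int>(idealProcessor);              // 1. ideal processor
    if (candidates & preferredNodeMask)
        return LowestSetBit(candidates & preferredNodeMask);  // 2. preferred NUMA node
    return LowestSetBit(candidates);                          // 3. rest of the affinity mask
}
```

With an affinity mask of 0xFF, a preferred node of 0x0F, and ideal processor 2, the model picks 2 when it is free, another processor in 0x0F when it is not, and falls back to 0xF0 only when the whole node is busy.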

To specify the group affinity for a thread at creation:
A. Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.

To change the affinity of an existing thread:
B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.

To specify the ideal processor at thread creation:
C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.

The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option "B" above). The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.

DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk->HandleIsValid())
        throw(L"\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n");
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk->Handle(), &numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L"Device \"%ws\" is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%016llX\n", 
	(const wchar_t*)pDisk->Name(), groupAffinityDisk.Group, numaNode, (ULONGLONG)groupAffinityDisk.Mask);
			
    // Create the thread suspended so that its affinity can be adjusted before it runs.
    // NOTE: &numaNode points at a local variable; DemoThreadFunction must not use it after this function returns.
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,  		// start suspended; resumed below 
	    &dwThreadID);   	                // returns the thread identifier 
	
    // CreateThread returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());	
			
    // Thread is paused while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L"\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%016llX\n\n", 
	dwThreadID, groupAffinityThread.Group, (ULONGLONG)groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask & groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // ****  NOTE:  This call uses the three-parameter signature introduced after RC1 (May, 2009):
        // ****  SetThreadGroupAffinity(__in HANDLE hThread, __in CONST GROUP_AFFINITY *GroupAffinity, __out_opt PGROUP_AFFINITY PreviousGroupAffinity).
        // ****  The beta1 SDK headers declared a two-parameter version.  Be sure to use the latest Windows SDK and the released
        // ****  Windows Server 2008 R2 and Windows 7.  See SDK header file "winbase.h" for verification of the API signature.
        if (SetThreadGroupAffinity(hThread, &groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }

    // Let the thread run now that its affinity is set.
    if (ResumeThread(hThread) == (DWORD)-1)
        throw(GetLastError());
    return 1;
}
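
The decision at the heart of the function above, whether the thread's current affinity already lies inside the disk's NUMA node, can be isolated as a small predicate. This is a sketch for illustration only: GroupAffinityLite is a hypothetical stand-in for the Windows GROUP_AFFINITY structure, and NeedsRebind is not an SDK function.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for the Windows GROUP_AFFINITY structure (group number
// plus a 64-bit processor mask); the name is hypothetical.
struct GroupAffinityLite
{
    std::uint16_t group;
    std::uint64_t mask;
};

// True when the thread may run on a processor outside the device's node,
// that is, when the groups differ or the thread mask is not a subset of
// the device mask.
bool NeedsRebind(const GroupAffinityLite& thread, const GroupAffinityLite& device)
{
    assert(device.mask != 0);                  // a node's processor mask should not be empty
    if (thread.group != device.group)
        return true;
    return (thread.mask & ~device.mask) != 0;  // any bit outside the node's mask
}
```

The expression `(thread.mask & ~device.mask) != 0` is equivalent to the `(thread & disk) != thread` form used in the sample; both test whether the thread mask is a subset of the node mask.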

The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options "A" and "C" above). Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.

DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk->HandleIsValid())
        throw(L"\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n");
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk->Handle(), &numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L"Device \"%ws\" is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%016llX\n", 
	(const wchar_t*)pDisk->Name(), groupAffinityDisk.Group, numaNode, (ULONGLONG)groupAffinityDisk.Mask);
	
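    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    bool foundProcessor = false;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Reserved = 0;    // do not leave the reserved field uninitialized
    for (DWORD n = 0; n < numActiveProcs; n++)    // stay within the group's processor numbers
    {
        processorNumber.Number = (BYTE)n;
        if (GetNumaProcessorNodeEx(&processorNumber, &node) && (node == numaNode))
        {
            foundProcessor = true;
            break;
        }
    }
    if (!foundProcessor)
        throw(L"\nMapIoThreadWithDiskNumaNode : No processor found on the disk's NUMA node.\n");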
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &sizeToAlloc);  
    ASSERT(sizeToAlloc > 0);
    if(sizeToAlloc <= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&dwThreadID);   		// returns the thread identifier 
	
    // CreateRemoteThreadEx returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    if (GetThreadGroupAffinity(hThread, &groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L"\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%016llX, IdealProcessor=0x%02X\n\n", 
	dwThreadID, groupAffinityThread.Group, numaNode, (ULONGLONG)groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
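
Because GetNumaNodeProcessorMaskEx already reports the node's processors as a mask, one alternative to probing processor numbers one at a time is to take the lowest set bit of that mask as the ideal processor. The sketch below shows only the bit arithmetic; FirstProcessorInMask is a hypothetical helper, and it assumes the mask is the 64-bit, group-relative mask from the node's GROUP_AFFINITY.

```cpp
#include <cassert>
#include <cstdint>

// Returns the number of the lowest-numbered processor in a node's 64-bit
// affinity mask, suitable as a group-relative processor number.
std::uint8_t FirstProcessorInMask(std::uint64_t nodeMask)
{
    assert(nodeMask != 0);      // a node with no processors has no candidate
    std::uint8_t number = 0;
    while ((nodeMask & 1) == 0)
    {
        nodeMask >>= 1;         // shift until the lowest set bit reaches bit 0
        ++number;
    }
    return number;
}
```

In the sample above, the result would be assigned to processorNumber.Number, with processorNumber.Group taken from the same GROUP_AFFINITY. Picking the lowest bit is an arbitrary policy; an application with several I/O threads would more likely spread them across the bits of the mask.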
