Large-Scale Tuning for Cray XT
|Note: This page originated on the old Lustre wiki. It was identified as likely having value and was migrated to the new wiki. It is in the process of being reviewed/updated and may currently have content that is out of date.|
DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT
This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.
This section applies to Cray XT3 Catamount nodes and explains parameters used with the kptllnd module.
These systems can have a large number of clients and servers, so tuning the various request pools becomes important. This requires changes to the ptllnd module.
|max_nodes||max_nodes is the maximum number of queue pairs, and, therefore, the maximum number of peers with which the LND instance can communicate. Set max_nodes to a value higher than the product of the total number of nodes and maximum processes per node.
max_nodes > (total number of nodes) * (max_procs_per_node)
Setting max_nodes lower than this causes Lustre to report an error. Setting max_nodes higher than necessary consumes excess memory.
|max_procs_per_node||max_procs_per_node is the maximum number of cores (CPUs) on a single Catamount node. Portals must know this value to properly clean up various queues. LNET is not notified directly when a Catamount process aborts. The first information LNET receives is when a new Catamount process with the same Cray portals NID starts and sends a connection request. If the number of processes with that Cray portals NID then exceeds the max_procs_per_node value, LNET removes the oldest one to make space for the new one.|
These two tunables combine to set the size of the ptllnd request buffer pool. The buffer pool must never drop an incoming message, so proper sizing is very important.
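The sizing rule for max_nodes can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of Lustre: the helper names min_max_nodes and kptllnd_options and the node counts are hypothetical, and the printed options line only illustrates standard modprobe syntax for the two parameters named above. Verify it against your site's configuration before use.

```python
def min_max_nodes(total_nodes: int, max_procs_per_node: int) -> int:
    """Return the smallest max_nodes that is strictly greater than
    total_nodes * max_procs_per_node, as the sizing rule requires."""
    if total_nodes < 1 or max_procs_per_node < 1:
        raise ValueError("both inputs must be positive integers")
    return total_nodes * max_procs_per_node + 1


def kptllnd_options(total_nodes: int, max_procs_per_node: int) -> str:
    """Format an illustrative modprobe options line (hypothetical helper)."""
    max_nodes = min_max_nodes(total_nodes, max_procs_per_node)
    return (
        f"options kptllnd max_nodes={max_nodes} "
        f"max_procs_per_node={max_procs_per_node}"
    )


# Hypothetical system: 1000 Catamount nodes with 2 cores each.
print(min_max_nodes(1000, 2))    # 2001
print(kptllnd_options(1000, 2))  # options kptllnd max_nodes=2001 max_procs_per_node=2
```

The sketch returns the minimum value that satisfies the rule. Larger values also work, at the cost of the extra memory noted above.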
The ntx parameter helps to size the transmit (tx) descriptor pool. A tx descriptor is used for each send and each passive RDMA. The maximum number of concurrent sends is equal to credits. Passive RDMA is a response to a PUT or GET of a payload that is too big to fit in a small message buffer. For servers, this only happens on large RPCs (for instance, where a long file name is included), so the MDS could be under pressure in a large cluster. For routers, this is bounded by the number of servers. If the tx pool is exhausted, a console error message appears.
|credits||The credits parameter determines how many sends are in flight at once on ptllnd. Optimally, there are 8 requests in flight per server. The default value is 128, which should be adequate for most applications.|
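The relationship between the credits value and the figure of 8 requests per server can be explored with simple arithmetic. The Python sketch below is an illustration only. It assumes that the credits pool is shared evenly across all servers a node talks to at once, which the text above does not state explicitly, and the function names are hypothetical.

```python
OPTIMAL_IN_FLIGHT_PER_SERVER = 8  # "8 requests in flight per server" (from the text)
DEFAULT_CREDITS = 128             # default credits value (from the text)


def servers_at_optimum(credits: int = DEFAULT_CREDITS,
                       per_server: int = OPTIMAL_IN_FLIGHT_PER_SERVER) -> int:
    """Servers that can each have per_server sends in flight simultaneously,
    assuming the credits pool is shared evenly (an assumption)."""
    return credits // per_server


def credits_for_servers(num_servers: int,
                        per_server: int = OPTIMAL_IN_FLIGHT_PER_SERVER) -> int:
    """Credits needed to keep per_server sends in flight to every server,
    under the same even-sharing assumption."""
    return num_servers * per_server


print(servers_at_optimum())     # 16
print(credits_for_servers(24))  # 192
```

Under this assumption the default of 128 covers 16 servers at the optimal depth. A node that talks to more servers concurrently may still work with the default, but with fewer sends in flight per server.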