Friday, April 29, 2011

Receive Packet Steering (RPS) on Linux 2.6.35

From the patch description (lwn.net: Software receive packet steering): "This patch implements software receive side packet steering (RPS). RPS distributes the load of received packet processing across multiple CPUs. Problem statement: Protocol processing done in the NAPI context for received packets is serialized per device queue and becomes a bottleneck under high packet load. This substantially limits the pps that can be achieved on a single queue NIC and provides no scaling with multiple cores."


If you want to find out whether RPS is working, you have to look at /proc/softirqs rather than /proc/interrupts, because RPS distributes softirq processing across CPUs, not hardware interrupts (e.g. with watch -n1 cat /proc/softirqs):
                CPU0       CPU1
       HI:          0          0
    TIMER:  480622794  476948579
   NET_TX:   25311134   27075847     <-----
   NET_RX: 1388399338 4191697027     <-----
    BLOCK:    4632803          3
 BLOCK_IOPOLL:          0          0
  TASKLET:         21          4
    SCHED:  154913375  158601463
  HRTIMER:    1576760    2361409
      RCU:  421549961  407634645 
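When RPS is working, the NET_RX counters should climb on all of the CPUs named in the mask, not just the one handling the NIC's interrupt. A small sketch for pulling out just that row (the helper name `netrx_counts` is my own; it defaults to reading /proc/softirqs but accepts a file argument):

```shell
# Sketch: print the per-CPU NET_RX softirq counts on one line.
# Takes an optional file argument (defaults to /proc/softirqs),
# finds the NET_RX row, and emits its numeric columns.
netrx_counts() {
  awk '/NET_RX/ {
    for (i = 2; i <= NF; i++)
      printf "%s%s", $i, (i < NF ? " " : "\n")
  }' "${1:-/proc/softirqs}"
}
```

Running it twice a few seconds apart and comparing the deltas per column shows whether receive processing is actually being spread across cores.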

Enabling RPS for a particular interface (per receive queue):

# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
# echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000f
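The value written to rps_cpus is a hexadecimal bitmap of CPUs: bit 0 stands for CPU0, bit 1 for CPU1, and so on, so "f" (binary 1111) selects CPUs 0-3. A minimal sketch for computing the mask covering the first N CPUs (the helper name `cpus_to_rps_mask` is my own, and plain shell arithmetic limits it to fewer than about 62 CPUs; larger systems need the comma-separated 32-bit groups shown above):

```shell
# Sketch: hex RPS mask selecting the first N CPUs.
# (1 << N) - 1 sets the low N bits; printf %x renders it in hex.
# Only valid while N fits in shell integer arithmetic (< ~62).
cpus_to_rps_mask() {
  printf '%x\n' $(( (1 << $1) - 1 ))
}
```

For example, `echo "$(cpus_to_rps_mask 4)" > /sys/class/net/eth0/queues/rx-0/rps_cpus` would reproduce the `echo f` above.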

Concept of Cache Line

cache line - The smallest unit of memory that can be transferred between main memory and the L1/L2 cache.
Rather than reading a single word or byte from main memory at a time, each cache entry usually holds a certain number of words, known as a "cache line" or "cache block", and a whole line is read and cached at once. This takes advantage of the principle of locality of reference: if one location is read, then nearby locations (particularly following locations) are likely to be read soon afterward. It can also take advantage of page-mode DRAM, which allows faster access to consecutive locations.
The cache line is generally fixed in size, typically ranging from 16 to 256 bytes. The effectiveness of the line size depends on the application, and cache circuits may be configurable to a different line size by the system designer. There are also numerous algorithms for dynamically adjusting line size in real time.
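On Linux you can check the line size of your own machine rather than guessing within the 16-256 byte range; two common ways, both of which assume the platform exposes the value (typically 64 bytes on current x86):

```shell
# L1 data cache line size in bytes via glibc's getconf:
getconf LEVEL1_DCACHE_LINESIZE

# The same value as exported through sysfs (path may be absent
# on some kernels/containers, hence the fallback):
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size 2>/dev/null || true
```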