C2758 vs. E5-2609 performance

sunshout 2014. 10. 13. 17:07

DPDK receive performance 


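The Atom C2758 and Xeon E5-2609 cores run at the same clock speed, 2.4 GHz.

But their receive performance differs, because the Atom incurs far more cache misses: perf stat shows a 17.779 % cache-miss ratio and 1.02 instructions per cycle on the Atom, against 0.000 % and 1.82 on the Xeon.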
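To make the gap concrete, here is a rough per-packet cycle budget. It is only a sketch: it assumes a single core at the nominal 2.4 GHz handles all of the forwarding, and it uses the bare-metal L2 forwarding rates reported in this post (13.66 Mpps for the Xeon, 9.78 Mpps for the Atom). The function name is my own.

```python
# Assumption: one core running at the nominal 2.4 GHz does all the forwarding.
CLOCK_HZ = 2.4e9


def cycles_per_packet(mpps, clock_hz=CLOCK_HZ):
    """Average CPU cycles spent per packet at a given packet rate (in Mpps)."""
    return clock_hz / (mpps * 1e6)


if __name__ == "__main__":
    # Bare-metal rates taken from the L2 forwarding results in this post.
    for name, mpps in [("Xeon E5-2609", 13.66), ("Atom C2758", 9.78)]:
        print(f"{name}: {cycles_per_packet(mpps):.1f} cycles/packet")
```

Under these assumptions the Xeon spends roughly 176 cycles per packet and the Atom roughly 245, so at the same clock the Atom needs about 40% more cycles for each packet.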


L2 Forwarding Performance

                             Baremetal                VM(IOMMU)               VM(No IOMMU)             Vhost
Xeon (E5-2609 @ 2.4 GHz)     13.66 Mpps (9179 Mbps)   7.46 Mpps (4777 Mbps)   13.57 Mpps (9122 Mbps)   -
Atom (C2758 @ 2.41GHz)       9.78 Mpps (6576 Mbps)    -                       -                        9.63 Mpps (6476 Mbps)

* SR-IOV w/o IOMMU (private patch)
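The Mpps and Mbps figures can be related with a short calculation. The sketch below assumes 64-byte frames on a 10GbE port, with 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap); the post does not state the frame size, so treat both as assumptions. The function names are my own.

```python
# Assumptions (not stated in the post): 64-byte frames on a 10GbE link.
FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20   # 7 B preamble + 1 B SFD + 12 B inter-frame gap
LINK_MBPS = 10_000


def mpps_to_mbps(mpps, frame_bytes=FRAME_BYTES):
    """Convert a packet rate in Mpps to on-the-wire bandwidth in Mbps."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return mpps * bits_per_frame


def line_rate_mpps(link_mbps=LINK_MBPS, frame_bytes=FRAME_BYTES):
    """Maximum packet rate in Mpps that the link can carry."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_mbps / bits_per_frame


if __name__ == "__main__":
    for name, mpps in [("Xeon baremetal", 13.66), ("Atom baremetal", 9.78)]:
        share = 100 * mpps / line_rate_mpps()
        print(f"{name}: {mpps} Mpps ~ {mpps_to_mbps(mpps):.0f} Mbps "
              f"({share:.1f}% of line rate)")
```

With these assumptions 13.66 Mpps works out to about 9180 Mbps, close to the 9179 Mbps listed for the Xeon on bare metal, and the 10GbE line rate is about 14.88 Mpps.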

 
     

 

 

C2758 Atom Core




root@cnode24-m:~# perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,branch-misses -t 4891 sleep 10

 Performance counter stats for thread id '4891':

         677353724 cache-references                                             [33.31%]
         120428677 cache-misses              #   17.779 % of all cache refs     [33.35%]
       24121523267 cycles                    [33.37%]
       24619412547 instructions              #    1.02  insns per cycle         [50.03%]
        3744803946 branches                                                     [50.00%]
          54565668 branch-misses             #    1.46% of all branches         [49.97%]

      10.001082048 seconds time elapsed


On virtio DPDK
root@server:~/suprem/linux-stable/tools/perf# ./perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,branch-misses -t 2165 sleep 10

 Performance counter stats for thread id '2165':

        89,454,697 cache-references                                             [33.30%]
         3,212,875 cache-misses              #    3.592 % of all cache refs     [33.38%]
     6,635,235,594 cycles                    [33.35%]
     1,744,194,176 instructions              #    0.26  insns per cycle         [50.03%]
       371,715,356 branches                                                     [50.00%]
         6,856,483 branch-misses             #    1.84% of all branches         [49.97%]

      10.001054322 seconds time elapsed

 
C2758 cache size
     *-cache:0
          description: L1 cache
          physical id: 25
          slot: L1-Cache
          size: 448KiB
          capacity: 448KiB
          capabilities: synchronous internal write-back instruction
     *-cache:1
          description: L2 cache
          physical id: 26
          slot: L2-Cache
          size: 4MiB
          capacity: 4MiB
          capabilities: synchronous internal write-back unified


The kernel log reports no last-level iTLB entries for 4KB pages on the Atom:
Last level iTLB entries: 4KB 0
Last level dTLB entries: 4KB 128

E5-2609 Xeon Core

root@cos05-m:~# perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,branch-misses -t 2810 sleep 10

 Performance counter stats for thread id '2810':

         186538457 cache-references                                             [100.00%]
               500 cache-misses              #    0.000 % of all cache refs     [100.00%]
       23988762917 cycles                    [100.00%]
       43773193744 instructions              #    1.82  insns per cycle         [100.00%]
        6520118366 branches                                                     [100.00%]
          25966179 branch-misses             #    0.40% of all branches

      10.001159796 seconds time elapsed
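The ratios perf prints after the `#` sign can be recomputed from the raw counters, which makes the three runs easier to compare side by side. A minimal sketch, with the counter values copied from the perf stat output in this post; the dictionary keys and the function name are my own labels.

```python
# Raw counters copied from the three perf stat runs in this post (10 s each).
RUNS = {
    "Atom C2758": {
        "cache_refs": 677353724, "cache_misses": 120428677,
        "cycles": 24121523267, "instructions": 24619412547,
        "branches": 3744803946, "branch_misses": 54565668,
    },
    "virtio DPDK": {
        "cache_refs": 89454697, "cache_misses": 3212875,
        "cycles": 6635235594, "instructions": 1744194176,
        "branches": 371715356, "branch_misses": 6856483,
    },
    "Xeon E5-2609": {
        "cache_refs": 186538457, "cache_misses": 500,
        "cycles": 23988762917, "instructions": 43773193744,
        "branches": 6520118366, "branch_misses": 25966179,
    },
}


def derived(c):
    """Recompute the derived ratios from one run's raw counters."""
    return {
        "cache_miss_pct": 100.0 * c["cache_misses"] / c["cache_refs"],
        "ipc": c["instructions"] / c["cycles"],
        "branch_miss_pct": 100.0 * c["branch_misses"] / c["branches"],
    }


if __name__ == "__main__":
    for name, counters in RUNS.items():
        d = derived(counters)
        print(f"{name:13s} cache-miss={d['cache_miss_pct']:.3f}%  "
              f"IPC={d['ipc']:.2f}  branch-miss={d['branch_miss_pct']:.2f}%")
```

Over nearly the same number of cycles, the Xeon thread retires about 1.8 instructions per cycle against about 1.0 on the Atom, and its cache-miss count is negligible by comparison.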


E5-2609 cache size
          configuration: cores=4 enabledcores=4 threads=4
        *-cache:0
             description: L1 cache
             physical id: 700
             size: 128KiB
             capacity: 128KiB
             capabilities: internal write-through data
        *-cache:1
             description: L2 cache
             physical id: 701
             size: 1MiB
             capacity: 1MiB
             capabilities: internal write-through unified
        *-cache:2
             description: L3 cache
             physical id: 702
             size: 10MiB
             capacity: 10MiB
             capabilities: internal write-back unified


The LLC (last-level cache) is the last level in the memory hierarchy before main memory. Any memory requests missing here must be serviced by local or remote DRAM, with significant latency. The LLC Miss metric shows a ratio of cycles with outstanding LLC misses to all cycles.
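The metric described above is a cycle-based ratio, which is not the same thing as the count-based "% of all cache refs" figure that perf stat prints. A minimal sketch of the definition follows; the function name and the sample numbers are hypothetical and were not measured in this post.

```python
def llc_miss_metric(cycles_with_outstanding_llc_miss, total_cycles):
    """Ratio of cycles that had an outstanding LLC miss to all cycles."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return cycles_with_outstanding_llc_miss / total_cycles


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    ratio = llc_miss_metric(3_000_000, 24_000_000)
    print(f"LLC Miss metric: {ratio:.3f}")
```

A core can show a small count of misses and still score high on this metric if each miss stalls it for a long time, which is why the two numbers should be read separately.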