The purpose of this document is to understand, from an empirical point of view, why a zero-copy strategy performs better.

I’ll cover this with an example of some Java code that merges several large files into a single target file. For the merge code I will use two different approaches:

Using NIO API (zero copy)
Using IO API

To dig into the reasons behind the better performance of zero copy, I will benchmark these two approaches using JMH. Looking at the results, I will point out some numbers that show why the zero-copy approach performs better.

The Code

NIO

It relies on the NIO API FileChannel#transferTo, which on Linux can use the sendfile() syscall. With this call the kernel takes care of reading data from the source and writing it to the target without the data ever leaving kernel space, hence the name zero-copy.

@Benchmark
public void mergeNIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    try (var out = FileChannel.open(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (FileChannel in = FileChannel.open(f.toPath(), READ)) {
                for (long p = 0, l = in.size(); p < l; ) {
                    p += in.transferTo(p, l - p, out);
                }
            }
        }
    }
}

IO

It reads data chunk by chunk from an InputStream into a buffer and feeds each chunk to the OutputStream. Under the hood, every chunk costs a read() syscall and a write() syscall, plus a buffer copy from kernel space to user space and back again.

@Benchmark
public void mergeIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    try (var out = Files.newOutputStream(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (var is = Files.newInputStream(f.toPath())) {
                byte[] buffer = new byte[16*1024];
                int read;
                while ((read = is.read(buffer, 0, 16*1024)) >= 0) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }
}
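Neither benchmark method shows how the `files` field is populated, and a benchmark is only meaningful if both variants produce the same output. The following is a minimal sketch, not the original setup: the class `MergeCheck` and its helpers `createInputs`, `mergeNio`, `mergeIo` and `mergesMatch` are hypothetical names introduced here. It generates a few input files, merges them with both approaches outside JMH, and compares the results byte for byte.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.WRITE;

public class MergeCheck {

    // Hypothetical helper: creates `count` temp files holding `size` pseudo-random bytes each.
    static List<Path> createInputs(int count, int size) throws IOException {
        Random random = new Random(42); // fixed seed keeps runs reproducible
        List<Path> inputs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte[] data = new byte[size];
            random.nextBytes(data);
            Path input = Files.createTempFile("merge-input", ".bin");
            Files.write(input, data);
            inputs.add(input);
        }
        return inputs;
    }

    // Same loop as mergeNIO, writing into an existing, empty target file.
    static void mergeNio(List<Path> inputs, Path target) throws IOException {
        try (FileChannel out = FileChannel.open(target, WRITE)) {
            for (Path input : inputs) {
                try (FileChannel in = FileChannel.open(input, READ)) {
                    for (long p = 0, l = in.size(); p < l; ) {
                        p += in.transferTo(p, l - p, out);
                    }
                }
            }
        }
    }

    // Same loop as mergeIO, with a 16 KiB user-space buffer.
    static void mergeIo(List<Path> inputs, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target, WRITE)) {
            byte[] buffer = new byte[16 * 1024];
            for (Path input : inputs) {
                try (InputStream is = Files.newInputStream(input)) {
                    int read;
                    while ((read = is.read(buffer, 0, buffer.length)) >= 0) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        }
    }

    // Merges the same inputs both ways and reports whether the outputs are identical.
    static boolean mergesMatch(int count, int size) throws IOException {
        List<Path> inputs = createInputs(count, size);
        Path nioTarget = Files.createTempFile("merge-nio", ".bin");
        Path ioTarget = Files.createTempFile("merge-io", ".bin");
        try {
            mergeNio(inputs, nioTarget);
            mergeIo(inputs, ioTarget);
            // Reading both outputs fully is fine for a check this small.
            return Files.size(nioTarget) == (long) count * size
                    && Arrays.equals(Files.readAllBytes(nioTarget), Files.readAllBytes(ioTarget));
        } finally {
            for (Path input : inputs) {
                Files.deleteIfExists(input);
            }
            Files.deleteIfExists(nioTarget);
            Files.deleteIfExists(ioTarget);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("outputs identical: " + mergesMatch(3, 1 << 20));
    }
}
```

In a real JMH run, something like `createInputs` would presumably be called from a setup method of the benchmark state, with much larger files than the 1 MiB used in this check.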

Benchmark results

For the JMH execution I used the Gradle plugin me.champeau.jmh with the following profiler configuration:

jmh {
    profilers = ["gc", "perf", "perfasm"]
}

NIO

Summary


Benchmark                                   Mode  Cnt       Score      Error   Units
FileBenchmark.mergeNIO                      avgt   25       3.733 ±  0.145       s/op
FileBenchmark.mergeNIO:·cpi                 avgt            0.872           clks/insn
FileBenchmark.mergeNIO:·gc.alloc.rate       avgt   25       0.002 ±  0.001     MB/sec
FileBenchmark.mergeNIO:·gc.alloc.rate.norm  avgt   25    7559.760 ± 44.878       B/op
FileBenchmark.mergeNIO:·gc.count            avgt   25         ≈ 0              counts
FileBenchmark.mergeNIO:·ipc                 avgt            1.147           insns/clk

Perf

Secondary result "org.example.FileBenchmark.mergeNIO:·perf":
Perf stats:
--------------------------------------------------

      56527,762660      task-clock (msec)         #    0,720 CPUs utilized          
             4.268      context-switches          #    0,076 K/sec                  
               378      cpu-migrations            #    0,007 K/sec                  
               385      page-faults               #    0,007 K/sec                  
    44.583.533.420      cycles                    #    0,789 GHz                      (30,73%)
    51.560.575.148      instructions              #    1,16  insn per cycle           (38,44%)
     9.135.306.805      branches                  #  161,607 M/sec                    (38,46%)
       110.707.654      branch-misses             #    1,21% of all branches          (38,45%)
    15.420.453.382      L1-dcache-loads           #  272,794 M/sec                    (24,72%)
     1.013.435.761      L1-dcache-load-misses     #    6,57% of all L1-dcache hits    (15,43%)
       283.399.957      LLC-loads                 #    5,013 M/sec                    (15,45%)
        98.132.015      LLC-load-misses           #   34,63% of all LL-cache hits     (23,11%)
   <not supported>      L1-icache-loads                                             
       303.670.304      L1-icache-load-misses                                         (30,80%)
    15.177.454.388      dTLB-loads                #  268,496 M/sec                    (29,93%)
         3.568.314      dTLB-load-misses          #    0,02% of all dTLB cache hits   (16,71%)
           453.513      iTLB-loads                #    0,008 M/sec                    (15,37%)
           195.070      iTLB-load-misses          #   43,01% of all iTLB cache hits   (23,03%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      78,491809508 seconds time elapsed

PerfASM


....[Hottest Methods (after inlining)]..........................................................
  98.59%   [kernel.kallsyms]  [unknown] 
   0.19%           libjvm.so  ElfSymbolTable::lookup 
   0.14%                      <unknown> 
   0.12%        libc-2.27.so  vfprintf 
   0.07%        libc-2.27.so  _IO_fwrite 
   0.04%        libc-2.27.so  _IO_default_xsputn 
   0.03%           libjvm.so  outputStream::do_vsnprintf_and_write_with_automatic_buffer 
   0.03%  libpthread-2.27.so  __libc_write 
   0.03%           libjvm.so  xmlStream::write_text 
   0.03%           libjvm.so  defaultStream::hold 
   0.02%           libjvm.so  stringStream::write 
   0.02%           libjvm.so  outputStream::update_position 
   0.02%        libc-2.27.so  syscall 
   0.02%           libjvm.so  defaultStream::write 
   0.02%          ld-2.27.so  __tls_get_addr 
   0.02%           libjvm.so  fileStream::write 
   0.02%        libc-2.27.so  vsnprintf 
   0.02%           libjvm.so  RelocIterator::initialize 
   0.02%        libc-2.27.so  [unknown] 
   0.02%           libjvm.so  outputStream::print 
   0.55%  <...other 207 warm methods...>
................................................................................................
 100.00%  <totals>

IO

Summary

Benchmark                                   Mode  Cnt       Score      Error   Units
FileBenchmark.mergeIO                       avgt   25       5.028 ±  0.183       s/op
FileBenchmark.mergeIO:·cpi                  avgt            0.966           clks/insn
FileBenchmark.mergeIO:·gc.alloc.rate        avgt   25       0.029 ±  0.001     MB/sec
FileBenchmark.mergeIO:·gc.alloc.rate.norm   avgt   25  154191.947 ± 51.517       B/op
FileBenchmark.mergeIO:·gc.count             avgt   25         ≈ 0              counts
FileBenchmark.mergeIO:·ipc                  avgt            1.035           insns/clk

Perf

Secondary result "org.example.FileBenchmark.mergeIO:·perf":
Perf stats:
--------------------------------------------------

      72972,795879      task-clock (msec)         #    0,772 CPUs utilized          
             4.971      context-switches          #    0,068 K/sec                  
               546      cpu-migrations            #    0,007 K/sec                  
               911      page-faults               #    0,012 K/sec                  
    57.575.878.331      cycles                    #    0,789 GHz                      (30,73%)
    60.626.232.521      instructions              #    1,05  insn per cycle           (38,43%)
    10.920.368.865      branches                  #  149,650 M/sec                    (38,40%)
       129.224.774      branch-misses             #    1,18% of all branches          (38,44%)
    18.526.168.088      L1-dcache-loads           #  253,878 M/sec                    (25,30%)
     2.885.163.789      L1-dcache-load-misses     #   15,57% of all L1-dcache hits    (17,75%)
       306.897.902      LLC-loads                 #    4,206 M/sec                    (17,32%)
       103.565.396      LLC-load-misses           #   33,75% of all LL-cache hits     (23,08%)
   <not supported>      L1-icache-loads                                             
       438.074.938      L1-icache-load-misses                                         (30,77%)
    18.537.122.526      dTLB-loads                #  254,028 M/sec                    (26,65%)
         8.112.288      dTLB-load-misses          #    0,04% of all dTLB cache hits   (18,35%)
         2.404.472      iTLB-loads                #    0,033 M/sec                    (15,39%)
         8.051.716      iTLB-load-misses          #  334,86% of all iTLB cache hits   (23,05%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      94,524422154 seconds time elapsed

PerfASM

....[Hottest Methods (after inlining)]..........................................................
  88.57%   [kernel.kallsyms]  [unknown] 
   6.70%        runtime stub  StubRoutines::jlong_disjoint_arraycopy 
   0.97%         c2, level 4  java.nio.channels.Channels::writeFully, version 2, compile id 697 
   0.92%         c2, level 4  sun.nio.ch.ChannelInputStream::read, version 2, compile id 706 
   0.24%    Unknown, level 0  sun.nio.ch.NativeThread::current, version 1, compile id 617 
   0.23%    Unknown, level 0  sun.nio.ch.FileDispatcherImpl::write0, version 1, compile id 656 
   0.22%         c2, level 4  org.example.FileBenchmark::mergeIO, version 4, compile id 725 
   0.19%                      <unknown> 
   0.18%  libpthread-2.27.so  __libc_write 
   0.18%  libpthread-2.27.so  __pthread_disable_asynccancel 
   0.15%  libpthread-2.27.so  __pthread_enable_asynccancel 
   0.13%           libjvm.so  ElfSymbolTable::lookup 
   0.13%  libpthread-2.27.so  __libc_read 
   0.11%    Unknown, level 0  sun.nio.ch.FileDispatcherImpl::read0, version 1, compile id 654 
   0.09%        libc-2.27.so  vfprintf 
   0.08%           libnio.so  fdval 
   0.06%           libnio.so  Java_sun_nio_ch_FileDispatcherImpl_write0 
   0.04%        libc-2.27.so  _IO_fwrite 
   0.03%        libc-2.27.so  _IO_default_xsputn 
   0.03%           libjvm.so  xmlStream::write_text 
   0.76%  <...other 201 warm methods...>
................................................................................................
 100.00%  <totals>

Conclusion

Comparing both summary reports, we can see that the NIO approach is faster (3.733 s/op vs 5.028 s/op) and requires almost no heap allocation (0.002 MB/sec vs 0.029 MB/sec) compared to the IO approach.

The perf and PerfASM reports show why NIO is faster:

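It incurs fewer page faults (385 vs 911), most likely because no read() syscall has to fill a user-space buffer
It doesn't spend time in array copy operations (StubRoutines::jlong_disjoint_arraycopy takes 6.70% of the IO profile and does not show up in the NIO one), thus honoring the zero-copy name 😄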
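To put the differences in proportion, the figures quoted above can be turned into ratios. This is only a small arithmetic sketch: the class name `BenchmarkRatios` and the helper `ratio` are made up for illustration, and every input is copied from the JMH and perf reports in this document.

```java
import java.util.Locale;

public class BenchmarkRatios {

    // Returns how many times larger the IO figure is than the NIO figure.
    static double ratio(double io, double nio) {
        return io / nio;
    }

    // Formats one comparison line with a locale-independent decimal point.
    static String line(String label, double io, double nio) {
        return String.format(Locale.ROOT, "%-24s %.2fx", label, ratio(io, nio));
    }

    public static void main(String[] args) {
        System.out.println(line("time per op", 5.028, 3.733));
        System.out.println(line("bytes allocated per op", 154191.947, 7559.760));
        System.out.println(line("page faults", 911, 385));
        System.out.println(line("L1-dcache load misses", 2_885_163_789.0, 1_013_435_761.0));
    }
}
```

On these measurements the IO approach takes about 1.35 times as long per operation, allocates roughly 20 times as many bytes per operation, and incurs about 2.4 times as many page faults and 2.8 times as many L1 data cache load misses.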
