The purpose of this document is to understand, from an empirical point of view, why the zero-copy strategy performs better.
I’ll cover this with an example of Java code that merges large files into a single target file. For the merge code I will use two different approaches:
- Using the NIO API (zero copy)
- Using the IO API
To dig into the reasons for the better performance of zero copy, I will benchmark both approaches with JMH. Looking at the results, I will point out the numbers that show why the zero-copy approach performs better.
The Code
NIO
It relies on the NIO API FileChannel#transferTo, which uses the sendfile() syscall. With this call the kernel takes care of reading data from the source and writing it to the target without the data ever leaving kernel space, hence the term zero-copy.
@Benchmark
public void mergeNIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    try (var out = FileChannel.open(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (FileChannel in = FileChannel.open(f.toPath(), READ)) {
                // transferTo may move fewer bytes than requested,
                // so keep calling it until the whole file has been transferred
                for (long p = 0, l = in.size(); p < l; ) {
                    p += in.transferTo(p, l - p, out);
                }
            }
        }
    }
}
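The benchmark methods live in a JMH @State class exposing the files fixture they iterate over. The original setup code is not shown in this post; the following is only a minimal sketch of what such a fixture could look like, where the file count and sizes are assumptions for illustration.

import org.openjdk.jmh.annotations.*;

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

@State(Scope.Benchmark)
public class FileBenchmark {

    // Hypothetical fixture: the source files merged by both benchmark methods
    List<File> files;

    @Setup(Level.Trial)
    public void setup() throws Exception {
        files = new ArrayList<>();
        for (int i = 0; i < 4; i++) {                // assumed file count
            Path p = Files.createTempFile("source", ".file");
            try (var raf = new RandomAccessFile(p.toFile(), "rw")) {
                raf.setLength(256L * 1024 * 1024);   // assumed size: 256 MB each
            }
            p.toFile().deleteOnExit();
            files.add(p.toFile());
        }
    }
}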
IO
It reads data from an InputStream chunk by chunk into a buffer and feeds each chunk into the OutputStream. Under the hood a read() and a write() syscall are performed for every chunk, along with buffer copies from kernel space to user space and back again.
@Benchmark
public void mergeIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    try (var out = Files.newOutputStream(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (var is = Files.newInputStream(f.toPath())) {
                byte[] buffer = new byte[16 * 1024];
                int read;
                // each iteration copies a chunk from kernel space into the
                // user-space buffer (read) and back into kernel space (write)
                while ((read = is.read(buffer, 0, buffer.length)) >= 0) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }
}
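As an aside (not part of the original benchmark), since JDK 9 the same loop can be written more compactly with InputStream#transferTo. The sketch below is equivalent in behaviour: the default implementation still copies through an intermediate byte[] in user space, so it performs the same read()/write() pattern and is not zero-copy.

@Benchmark
public void mergeIOTransferTo() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    try (var out = Files.newOutputStream(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (var is = Files.newInputStream(f.toPath())) {
                // transferTo loops over an internal user-space buffer,
                // so data still crosses the kernel/user boundary twice
                is.transferTo(out);
            }
        }
    }
}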
Benchmark results
For the JMH execution I used the Gradle plugin me.champeau.jmh with the following profiler configuration:
jmh {
    profilers = ["gc", "perf", "perfasm"]
}
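With this configuration in place, the benchmarks are run through the jmh task the plugin adds (./gradlew jmh). Note that the perf and perfasm profilers rely on the Linux perf tool being installed, and perfasm additionally needs the hsdis library to annotate the generated assembly.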
NIO
Summary
Benchmark Mode Cnt Score Error Units
FileBenchmark.mergeNIO avgt 25 3.733 ± 0.145 s/op
FileBenchmark.mergeNIO:·cpi avgt 0.872 clks/insn
FileBenchmark.mergeNIO:·gc.alloc.rate avgt 25 0.002 ± 0.001 MB/sec
FileBenchmark.mergeNIO:·gc.alloc.rate.norm avgt 25 7559.760 ± 44.878 B/op
FileBenchmark.mergeNIO:·gc.count avgt 25 ≈ 0 counts
FileBenchmark.mergeNIO:·ipc avgt 1.147 insns/clk
Perf
Secondary result "org.example.FileBenchmark.mergeNIO:·perf":
Perf stats:
--------------------------------------------------
56527,762660 task-clock (msec) # 0,720 CPUs utilized
4.268 context-switches # 0,076 K/sec
378 cpu-migrations # 0,007 K/sec
385 page-faults # 0,007 K/sec
44.583.533.420 cycles # 0,789 GHz (30,73%)
51.560.575.148 instructions # 1,16 insn per cycle (38,44%)
9.135.306.805 branches # 161,607 M/sec (38,46%)
110.707.654 branch-misses # 1,21% of all branches (38,45%)
15.420.453.382 L1-dcache-loads # 272,794 M/sec (24,72%)
1.013.435.761 L1-dcache-load-misses # 6,57% of all L1-dcache hits (15,43%)
283.399.957 LLC-loads # 5,013 M/sec (15,45%)
98.132.015 LLC-load-misses # 34,63% of all LL-cache hits (23,11%)
<not supported> L1-icache-loads
303.670.304 L1-icache-load-misses (30,80%)
15.177.454.388 dTLB-loads # 268,496 M/sec (29,93%)
3.568.314 dTLB-load-misses # 0,02% of all dTLB cache hits (16,71%)
453.513 iTLB-loads # 0,008 M/sec (15,37%)
195.070 iTLB-load-misses # 43,01% of all iTLB cache hits (23,03%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
78,491809508 seconds time elapsed
PerfASM
....[Hottest Methods (after inlining)]..........................................................
98.59% [kernel.kallsyms] [unknown]
0.19% libjvm.so ElfSymbolTable::lookup
0.14% <unknown>
0.12% libc-2.27.so vfprintf
0.07% libc-2.27.so _IO_fwrite
0.04% libc-2.27.so _IO_default_xsputn
0.03% libjvm.so outputStream::do_vsnprintf_and_write_with_automatic_buffer
0.03% libpthread-2.27.so __libc_write
0.03% libjvm.so xmlStream::write_text
0.03% libjvm.so defaultStream::hold
0.02% libjvm.so stringStream::write
0.02% libjvm.so outputStream::update_position
0.02% libc-2.27.so syscall
0.02% libjvm.so defaultStream::write
0.02% ld-2.27.so __tls_get_addr
0.02% libjvm.so fileStream::write
0.02% libc-2.27.so vsnprintf
0.02% libjvm.so RelocIterator::initialize
0.02% libc-2.27.so [unknown]
0.02% libjvm.so outputStream::print
0.55% <...other 207 warm methods...>
................................................................................................
100.00% <totals>
IO
Summary
Benchmark Mode Cnt Score Error Units
FileBenchmark.mergeIO avgt 25 5.028 ± 0.183 s/op
FileBenchmark.mergeIO:·cpi avgt 0.966 clks/insn
FileBenchmark.mergeIO:·gc.alloc.rate avgt 25 0.029 ± 0.001 MB/sec
FileBenchmark.mergeIO:·gc.alloc.rate.norm avgt 25 154191.947 ± 51.517 B/op
FileBenchmark.mergeIO:·gc.count avgt 25 ≈ 0 counts
FileBenchmark.mergeIO:·ipc avgt 1.035 insns/clk
Perf
Secondary result "org.example.FileBenchmark.mergeIO:·perf":
Perf stats:
--------------------------------------------------
72972,795879 task-clock (msec) # 0,772 CPUs utilized
4.971 context-switches # 0,068 K/sec
546 cpu-migrations # 0,007 K/sec
911 page-faults # 0,012 K/sec
57.575.878.331 cycles # 0,789 GHz (30,73%)
60.626.232.521 instructions # 1,05 insn per cycle (38,43%)
10.920.368.865 branches # 149,650 M/sec (38,40%)
129.224.774 branch-misses # 1,18% of all branches (38,44%)
18.526.168.088 L1-dcache-loads # 253,878 M/sec (25,30%)
2.885.163.789 L1-dcache-load-misses # 15,57% of all L1-dcache hits (17,75%)
306.897.902 LLC-loads # 4,206 M/sec (17,32%)
103.565.396 LLC-load-misses # 33,75% of all LL-cache hits (23,08%)
<not supported> L1-icache-loads
438.074.938 L1-icache-load-misses (30,77%)
18.537.122.526 dTLB-loads # 254,028 M/sec (26,65%)
8.112.288 dTLB-load-misses # 0,04% of all dTLB cache hits (18,35%)
2.404.472 iTLB-loads # 0,033 M/sec (15,39%)
8.051.716 iTLB-load-misses # 334,86% of all iTLB cache hits (23,05%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
94,524422154 seconds time elapsed
PerfASM
....[Hottest Methods (after inlining)]..........................................................
88.57% [kernel.kallsyms] [unknown]
6.70% runtime stub StubRoutines::jlong_disjoint_arraycopy
0.97% c2, level 4 java.nio.channels.Channels::writeFully, version 2, compile id 697
0.92% c2, level 4 sun.nio.ch.ChannelInputStream::read, version 2, compile id 706
0.24% Unknown, level 0 sun.nio.ch.NativeThread::current, version 1, compile id 617
0.23% Unknown, level 0 sun.nio.ch.FileDispatcherImpl::write0, version 1, compile id 656
0.22% c2, level 4 org.example.FileBenchmark::mergeIO, version 4, compile id 725
0.19% <unknown>
0.18% libpthread-2.27.so __libc_write
0.18% libpthread-2.27.so __pthread_disable_asynccancel
0.15% libpthread-2.27.so __pthread_enable_asynccancel
0.13% libjvm.so ElfSymbolTable::lookup
0.13% libpthread-2.27.so __libc_read
0.11% Unknown, level 0 sun.nio.ch.FileDispatcherImpl::read0, version 1, compile id 654
0.09% libc-2.27.so vfprintf
0.08% libnio.so fdval
0.06% libnio.so Java_sun_nio_ch_FileDispatcherImpl_write0
0.04% libc-2.27.so _IO_fwrite
0.03% libc-2.27.so _IO_default_xsputn
0.03% libjvm.so xmlStream::write_text
0.76% <...other 201 warm methods...>
................................................................................................
100.00% <totals>
Conclusion
If we look at both summary reports we can see that the NIO approach is faster (3.733 s/op vs 5.028 s/op) and allocates barely any heap memory (0.002 MB/sec vs 0.029 MB/sec) compared to the IO approach.
A look at the perf and perfASM reports shows why NIO is faster:
- It incurs fewer page faults (385 vs 911), since no read() syscall copies data into a user-space buffer.
- It spends no time in array copy operations (StubRoutines::jlong_disjoint_arraycopy), thus honoring the zero-copy name 😄