Discuss of speeding up shutil.copytree

Name: Speed up `shutil.copytree` !
Rating: 3.2 (1855 reviews)
Author: mengqinyuan

Write here

This is a discussion on , see: https://discuss.python.org/t/speed-up-shutil-copytree/62078. If you have any ideas, send me please !

Background

shutil is a very useful moudle in Python. You can find it in github: https://github.com/python/cpython/blob/master/Lib/shutil.py

shutil.copytree is a function that copies a folder to another folder.

In this function, it calls _copytree function to copy.

What does _copytree do ?

Ignoring specified files/directories.
Creating destination directories.
Copying files or directories while handling symbolic links.
Collecting and eventually raising errors encountered (e.g., permission issues).
Replicating metadata of the source directory to the destination directory.

Problems

_copytree speed is not very fast when the numbers of files are large or the file size is large.

Test here:

import os
import shutil

os.mkdir('test')
os.mkdir('test/source')

def bench_mark(func, *args):
    import time
    start = time.time()
    func(*args)
    end = time.time()
    print(f'{func.__name__} takes {end - start} seconds')
    return end - start

# write in 3000 files
def write_in_5000_files():
    for i in range(5000):
        with open(f'test/source/{i}.txt', 'w') as f:
            f.write('Hello World' + os.urandom(24).hex())
            f.close()

bench_mark(write_in_5000_files)

def copy():
    shutil.copytree('test/source', 'test/destination')

bench_mark(copy)

The result is:

write_in_5000_files takes 4.084963083267212 seconds
copy takes 27.12768316268921 seconds

What I done

Multithreading

I use multithread to speed up the copying process. And I rename the funtion _copytree_single_threaded add a new function _copytree_multithreaded. Here is the copytree_multithreaded:

def _copytree_multithreaded(src, dst, symlinks=False, ignore=None, copy_function=shutil.copy2,
                            ignore_dangling_symlinks=False, dirs_exist_ok=False, max_workers=4):
    """Recursively copy a directory tree using multiple threads."""
    sys.audit("shutil.copytree", src, dst)

    # get the entries to copy
    entries = list(os.scandir(src))

    # make the pool
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit the tasks
        futures = [
            executor.submit(_copytree_single_threaded, entries=[entry], src=src, dst=dst,
                            symlinks=symlinks, ignore=ignore, copy_function=copy_function,
                            ignore_dangling_symlinks=ignore_dangling_symlinks,
                            dirs_exist_ok=dirs_exist_ok)
            for entry in entries
        ]

        # wait for the tasks
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Failed to copy: {e}")
                raise

I add a judgement to choose use multithread or not.

if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100*1024*1024:
        # multithreaded version
        return _copytree_multithreaded(src, dst, symlinks=symlinks, ignore=ignore,
                                        copy_function=copy_function,
                                        ignore_dangling_symlinks=ignore_dangling_symlinks,
                                        dirs_exist_ok=dirs_exist_ok)

else:
    # single threaded version
    return _copytree_single_threaded(entries=entries, src=src, dst=dst,
                                        symlinks=symlinks, ignore=ignore,
                                        copy_function=copy_function,
                                        ignore_dangling_symlinks=ignore_dangling_symlinks,
                                        dirs_exist_ok=dirs_exist_ok)

Test

I write 50000 files in the source folder. Bench Mark:


def bench_mark(func, *args):
    import time
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start}s")

Write in:


import os
os.mkdir("Test")
os.mkdir("Test/source")

# write in 50000 files
def write_in_file():
    for i in range(50000):
         with open(f"Test/source/{i}.txt", 'w') as f:
             f.write(f"{i}")
             f.close()

Two comparing:


def copy1():
    import shutil
    shutil.copytree('test/source', 'test/destination1')

def copy2():
    import my_shutil
    my_shutil.copytree('test/source', 'test/destination2')

"my_shutil" is my modified version of shutil.

copy1 costs 173.04780609999943s
copy2 costs 155.81321870000102s

copy2 is faster than copy1 a lot. You can run many times.

Advantages & Disadvantages

Use multithread can speed up the copying process. But it will increase the memory usage. But we do not need to rewrite the multithread in the code.

Async

Thanks to "Barry Scott". I will follow his/her suggestion :

You might get the same improvement for less overhead by using async I/O.

I write these code:


import os
import shutil
import asyncio
from concurrent.futures import ThreadPoolExecutor
import time


# create directory
def create_target_directory(dst):
    os.makedirs(dst, exist_ok=True)

# copy 1 file
async def copy_file_async(src, dst):
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, shutil.copy2, src, dst)

# copy directory
async def copy_directory_async(src, dst, symlinks=False, ignore=None, dirs_exist_ok=False):
    entries = os.scandir(src)
    create_target_directory(dst)

    tasks = []
    for entry in entries:
        src_path = entry.path
        dst_path = os.path.join(dst, entry.name)

        if entry.is_dir(follow_symlinks=not symlinks):
            tasks.append(copy_directory_async(src_path, dst_path, symlinks, ignore, dirs_exist_ok))
        else:
            tasks.append(copy_file_async(src_path, dst_path))

    await asyncio.gather(*tasks)
# choose copy method
def choose_copy_method(entries, src, dst, **kwargs):
    if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100 * 1024 * 1024:
        # async version
        asyncio.run(copy_directory_async(src, dst, **kwargs))
    else:
        # single thread version
        shutil.copytree(src, dst, **kwargs)
# test function
def bench_mark(func, *args):
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start:.2f}s")

# write in 50000 files
def write_in_50000_files():
    for i in range(50000):
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")

def main():
    os.makedirs('Test/source', exist_ok=True)
    write_in_50000_files()

    # 单线程复制
    def copy1():
        shutil.copytree('Test/source', 'Test/destination1')

    def copy2():
        shutil.copytree('Test/source', 'Test/destination2')

    # async
    def copy3():
        entries = list(os.scandir('Test/source'))
        choose_copy_method(entries, 'Test/source', 'Test/destination3')

    bench_mark(copy1)
    bench_mark(copy2)
    bench_mark(copy3)

    shutil.rmtree('Test')

if __name__ == "__main__":
    main()

Output:

copy1 costs 187.21s
copy2 costs 244.33s
copy3 costs 111.27s

You can see that the async version is faster than the single thread version. But the single thread version is faster than the multi-thread version. ( Maybe my test environment is not very good, you can try and send your result as a reply to me )

Thank you Barry Scott !

Advantages & Disadvantages

Async is a good choice. But no solution is perfect. If you find some problem, you can send me as a reply.

End

This is my first time to write discussion on python.org. If there is any problem, please let me know. Thank you.

My Github: https://github.com/mengqinyuan
My Dev.to: https://dev.to/mengqinyuan