
I'm working on a small program with three threads, each with one task to run: copy a file from A to B using a bash script. One thread copies a 100GB file; the other two each copy a 10GB file.

The 100GB copy starts first; after a delay I start the copy of the first 10GB file, and when that is done the last 10GB copy starts. From the network speed I can see that the copy of the first file starts at roughly 120MB/s. Then the first 10GB copy starts and there is some noise on the line but no real drop in transfer speed. But when the 10GB files are finished, the 100GB copy continues at a significantly lower rate:

[screenshot: network transfer speed over time; the rate drops after the 10GB copies finish]

import subprocess
import threading
import time


def run(arguments):
    process = subprocess.Popen(
        arguments,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,  # For string output instead of bytes
    )

    # Print each line as it comes
    for line in process.stdout:
        print(line, end="")  # `line` already includes its trailing newline

    process.wait()  # reap the child process once its output is exhausted
    print(f"Done for thread {threading.current_thread().name}")


def create_threads(experiment_paths, experiment_name):
    machine_name = "test"
    threads = []
    for i, path_name in enumerate(experiment_paths):
        cli_args = [
            "bash",
            mount_script,
            f"--folder={experiment_name}",
            f"--thread={i}",
            f"--file-to-copy={path_name}",
            f"--machine-name={machine_name}",
        ]
        new_thread = threading.Thread(target=run, daemon=False, args=(cli_args,))
        threads.append(new_thread)

    return threads

mount_script = '/my/path/to/copy_script.sh'  # placeholder: the bash copy script used above
experiment_name = 'test'
experiment_paths = ['/my/path/to/file/100gb', '/my/path/to/file/10gb', '/my/path/to/file/10gb']
threads = create_threads(experiment_paths, experiment_name=experiment_name)
t0 = time.time()
for t in threads:
    print(f"Starting thread {t.name}")
    t.start()
    time.sleep(80)

for t in threads:
    print(f"Joining thread {t.name}")
    t.join()

How can I ensure that the copy speed of the 100GB file resumes at maximum speed? I'm working on a Linux system, by the way, copying to a mounted CIFS share (Samba).

EDIT: when only transferring the 100GB file, it goes at full speed the whole time. The idea of the 80-second delay was to see if the 100GB transfer at least starts at full speed.

  • Are you working with an HDD or SSD? HDDs usually have fragmentation; it might be that the first 20GB are closer together than the rest of the file. Commented Oct 4 at 7:04
  • Have you tried monitoring the transfer of just the 100GB file? It may be that the 10GB files have nothing to do with what you're seeing. Also, what is the purpose of the 80s delay? Commented Oct 4 at 7:15
  • @Ramrab Transferring just the 100GB file goes at full speed. The 80-second delay was to see if the 100GB transfer at least starts at full speed. Commented Oct 4 at 8:08
  • @Aren Hmm, I'm not even sure! The machine is remote; I assume it is an SSD. Commented Oct 4 at 8:08
  • Please use a capital B for bytes, as a lowercase b looks like bits. Since you have KB/s in your image, I assume you mean you're getting 120MB/s on the bigger file, which is pretty slow by modern standards. I don't think this question can be answered without knowledge of the physical hardware in the remote system. It could be an NVMe SSD, but on a NAS, who knows. The other files may be on a completely different device. I rented a powerful Linode server once, had to page a lot of data between memory and storage, and got around 5GB/s. Commented Oct 4 at 9:23

1 Answer


Linux doesn't write the file directly to the network; instead it stores the data as dirty pages in the page cache and writes it out gradually.

  1. If you're sending a single file, it will go smoothly.
  2. If you're working with two or more files, Linux will also cache those files before writing them over to Samba.
  3. The same appears to hold for writes over Samba (network writes are cached and flushed gradually to disk); you can watch this happening with the monitoring sketch below.
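You can see this happening on the sending machine by watching the kernel's dirty-page counters while the copies run. A minimal monitoring sketch in the question's Python style (the Dirty/Writeback field names in /proc/meminfo are standard on Linux; the 5-second interval is an arbitrary choice):

import time


def dirty_stats():
    # Return the Dirty and Writeback counters from /proc/meminfo, in kB.
    stats = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            stats[key] = int(rest.strip().split()[0])  # rest looks like "  1234 kB"
    return stats["Dirty"], stats["Writeback"]


while True:  # Ctrl-C to stop
    dirty, writeback = dirty_stats()
    print(f"Dirty: {dirty} kB  Writeback: {writeback} kB")
    time.sleep(5)

If the theory holds, you should see Dirty climb while the copies run and drain slowly after the 10GB transfers finish.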

Note: it caches files in chunks. In your case: Linux starts caching your 100GB file and writes it out gradually to Samba; at that point it doesn't have to split resources. But when you start the second transfer, Linux also caches the 10GB file and writes it to Samba, splitting resources between the two files, so you get slightly lower performance. Once the 10GB file is done, Linux needs to clean up that cache.

Also, Samba servers cache writes before committing them to disk, so your Linux machine might be ready to send more data while the Samba server is still flushing to disk. I found a fix for this in the fine-tuning links below, but be aware that disabling caching can cause data corruption.
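One way to apply that kind of fine-tuning from Python is to cap how much dirty data the kernel may accumulate before it starts (and then forces) writeback, so there is never a large backlog to flush when the 10GB copies finish. The vm.dirty_background_bytes and vm.dirty_bytes sysctls are standard Linux interfaces, but the byte values below are illustrative guesses rather than tuned recommendations, and writing them requires root:

def set_sysctl(name, value):
    # sysctl "vm.dirty_bytes" lives at /proc/sys/vm/dirty_bytes (root required)
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path, "w") as f:
        f.write(str(value))


# Start background writeback once 64 MB of data is dirty...
set_sysctl("vm.dirty_background_bytes", 64 * 1024 * 1024)
# ...and throttle writers once 256 MB is dirty.
set_sysctl("vm.dirty_bytes", 256 * 1024 * 1024)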

Edit: Linux NIC samba fine tune

CIFS

And it's more about the way the protocol handles things than an issue with the network or hardware.
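On the CIFS side specifically, the client's caching behaviour can be changed with the cache= mount option (cache=strict is the default; cache=none bypasses the client-side page cache, at some cost in throughput for small I/O). A sketch using subprocess to match the question's code; the share, mountpoint, and credentials below are placeholders:

import subprocess

# Sketch: mount the CIFS share with client-side caching disabled.
# "cache=none" is a standard cifs mount option; the paths are placeholders.
subprocess.run(
    [
        "sudo", "mount", "-t", "cifs",
        "//server/share",                # placeholder: the Samba share
        "/mnt/share",                    # placeholder: local mountpoint
        "-o", "cache=none,username=me",  # placeholder credentials
    ],
    check=True,
)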


3 Comments

Thanks for the detailed explanation! However, what still puzzles me with your answer in mind is: why doesn't the speed of the 100GB transfer get back to top speed when the 10GB files are done? I mean, it should no longer be bothered with caching writes for the 10GB files, since those are finished. Even if the Samba server is still flushing to disk, eventually the 100GB transfer should be able to pump in more data, I'd guess.
Yes, it will eventually, but the system is trying to free up the cache while still caching another file. It won't go up immediately because freeing the cache itself requires resources, and the 100GB transfer is competing for those same resources, so your 100GB transfer finishes before the cache is freed. See superuser.com/questions/601607/…. Also, when cache utilization gets too high, processes that were using RAM start working from disk, which slows everything down.
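(An illustrative sketch of that idea, not from the thread itself: once the small copies finish, force writeback and then drop the clean page cache so the 100GB transfer isn't competing with cache reclaim. os.sync() and /proc/sys/vm/drop_caches are standard Linux interfaces, writing drop_caches requires root, and whether this actually helps in this setup is untested.)

import os

os.sync()  # flush dirty pages out first; drop_caches only frees clean pages
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("1")  # 1 = free page cache (2 = dentries/inodes, 3 = both)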
Thanks, I have upvoted and accepted your answer.
