5

I have a large folder of data files, which I want to copy into subfolders to make a specified number of batches.

Right now I count how many files there are and use that to make ranges, like so:

cd /dump
batches=4
files=$(wc -l < /data/samples.txt)

echo "$files $batches"
for ((i=0; i<=batches; i++))
do
    mkdir -p merged_batches/batch$i
    rm -f merged_batches/batch$i/*

    ls merged/*.sorted.labeled.bam* |
      head -n $(( files / batches * (i + 1) * 2 )) |
      tail -n $(( 2 * files / batches )) |
      xargs -I {} cp {} merged_batches/batch$i

done

Is there a more convenient way to do this?

6
  • You define files= but you don't then use it in your code (other than as a comment). What's its relevance? In fact, I can't see the relevance of /data/samples.txt at all. Commented Aug 7 at 16:54
  • Do the files in each batch need to be consecutive, or could we take the list of files and spread them? For example, for three batches, could we put the first into batch 1, the second into batch 2, the third into batch 3, and then the fourth into batch 1, etc.? Commented Aug 7 at 16:59
  • does /data/samples.txt contain a list of all the files you need to copy in batches? If so, you can use split to split it into 4 roughly equal "chunks" with e.g. split -d -n r/4 /data/samples.txt samples. (the trailing samples. is the output prefix). Then you can just iterate over the contents of samples.00, samples.01, samples.02, and samples.03 (GNU split's numeric suffixes start at 00). Commented Aug 7 at 17:04
  • also, if you're using GNU cp (which is standard on linux), you can use the -t, --target-directory=DIRECTORY option, so you don't need to use -I {} with xargs. e.g. something like xargs -d '\n' cp -t batch.03/ < samples.03. Note: this assumes that none of your filenames in samples.txt contain newlines. If they do, you'll need to regenerate that file using NUL as the separator (both split and xargs, and many other tools, can work with NUL-separated input). BTW, NUL is the ONLY truly safe separator to use because it's the only character that cannot appear in a filename. Commented Aug 7 at 17:09
  • Yes samples.txt has all my files. Commented Aug 7 at 17:44

4 Answers

8

Using find, split, and xargs, all with NUL-separated filenames:

#!/bin/bash

batches=4
data_dir='/path/to/where/your/files/are/stored'

find "$data_dir" -type f -print0 > samples.nul

# start the numeric suffixes at 1 so the chunk names line up with batch1..batchN below
split -a 1 --numeric-suffixes=1 -t'\0' -n r/"$batches" samples.nul samples.nul.
rm -f samples.nul

for i in $(seq 1 "$batches"); do
  mkdir -p "batch$i"
  xargs -0r echo cp -t "batch$i" < "samples.nul.$i"
  rm -f "samples.nul.$i"
done

If necessary, you can use other find predicates such as -mindepth, -maxdepth, -iname etc to control exactly which filenames are found.
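
For instance, a sketch that only picks up the BAM files (and their indexes) sitting directly inside $data_dir, without descending into subdirectories; the -iname pattern here is just a guess based on the question's filenames:

find "$data_dir" -maxdepth 1 -type f -iname '*.sorted.labeled.bam*' -print0 > samples.nul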

NOTE: this is written to be a dry-run. It will only echo what would be copied. Once you've tested that it does what you want, remove the echo from the xargs command. And maybe add -v to cp's options if you want verbose output.
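
For example, once the dry-run output looks right, the xargs line inside the loop would become:

xargs -0r cp -vt "batch$i" < "samples.nul.$i"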

BTW, I used r/$batches instead of l/$batches with split because the round-robin split seems to produce a more equal-ish number of files per batch with the directory I tested with on my system (probably because the paths underneath that dir were of wildly varying lengths, and r/$batches split consecutive long paths between multiple files). This may not be the case with your files.
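
If you want to see the difference between the two modes for yourself, here's a quick throwaway demo with GNU split (r/ even works on a pipe, because round-robin doesn't need to know the input size in advance):

seq 10 | split -d -n r/4 - chunk.
wc -l chunk.*

Each chunk gets every 4th line, so the counts come out as equal as possible: 3, 3, 2 and 2 lines.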

5
#!/bin/sh
batchsize=5
batchcount=10

tdir=1
buffer=0
for i in *; do
    [ $buffer -eq $batchsize ] && tdir=$((tdir + 1)) && buffer=0
    [ $tdir -gt $batchcount ] && break
    [ -d "$tdir" ] || mkdir -p $tdir
    if [ -f "$i" ]; then
        buffer=$((buffer + 1))
        cp "$i" $tdir/
    fi
done

Running this script in a directory will create a directory called 1, put the first 5 files in it, then create a directory called 2 and put the next 5 files in that, and so on until it has created and filled 10 directories.

If you want to test this, replace the cp command with echo cp so it'll show you the copy commands instead of actually running them, and comment out the mkdir command so it won't create the directories.
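
In other words, the dry-run version of the loop body would look something like this (only the mkdir and cp lines change):

for i in *; do
    [ $buffer -eq $batchsize ] && tdir=$((tdir + 1)) && buffer=0
    [ $tdir -gt $batchcount ] && break
    # [ -d "$tdir" ] || mkdir -p $tdir
    if [ -f "$i" ]; then
        buffer=$((buffer + 1))
        echo cp "$i" $tdir/
    fi
done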

Undoing it should be really simple too; just run

rm -f {1..10}/* && rmdir {1..10}

But if you had any other folders named 1 through 10 that weren't created by the script, all data in them would be lost, so be careful with this one.

I don't think there's any more convenient way to do this; this kind of stuff is what bash scripts are for.


Here is an alternative version that automatically fills in either batchcount or batchsize if one of them is not set.

#!/bin/bash
# bash (not plain sh) is required for the [[ ]], =~ and <<< constructs used below
batchsize=auto
batchcount=5

ceil(){ # round a decimal (e.g. bc output like 4.60000...) up to the next whole number
    [[ $1 == *.* && ${1##*.} == *[1-9]* ]] && echo $(( ${1%%.*} + 1 )) || echo "${1%%.*}"
}

if [[ ! "${batchsize}${batchcount}" =~ ^-?[0-9]+$ ]]; then #One of the variables is not an integer
    filecount=$(ls -Ap | grep -v '/' | wc -l)
    if [[ ! "$batchsize" =~ ^-?[0-9]+$ ]] && [[ ! "$batchcount" =~ ^-?[0-9]+$ ]]; then #Neither of the variables is set.
        echo "Error: Batchsize and Batchcount are both unset, please set at least one of them."
        exit
    elif [[ ! "$batchsize" =~ ^-?[0-9]+$ ]]; then #Only batchsize is unset
        batchsize=$(ceil $(bc -l <<< "$filecount / $batchcount"))
    elif [[ ! "$batchcount" =~ ^-?[0-9]+$ ]]; then #Only batchcount is unset
        batchcount=$(ceil $(bc -l <<< "$filecount / $batchsize"))
    fi
fi

tdir=1
buffer=0
for i in *; do
    [ $buffer -eq $batchsize ] && tdir=$((tdir + 1)) && buffer=0
    [ $tdir -gt $batchcount ] && break
    [ -d "$tdir" ] || mkdir -p $tdir
    if [ -f "$i" ]; then
        buffer=$((buffer + 1))
        cp "$i" $tdir/
    fi
done

So with the above configuration it will divide the files into 5 batches; with 23 files, that means four batches of 5 plus a final batch of only 3.

If you set batchcount to auto and batchsize to 3, it will divide the files into batches of 3; with 23 files, that works out to 8 batches, the last one holding only 2.
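
You can check the arithmetic by hand; with 23 files and batchcount=5, bc produces the fractional quotient and the ceil function rounds it up:

bc -l <<< "23 / 5"    # prints 4.60000000000000000000
# ceil turns that into 5, so each batch holds up to 5 files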

4
  • What if I have an odd number of files? Commented Aug 11 at 14:43
  • @rubberduck the last folder will just have fewer files in it then, unless you change the batch size, or the batchcount is small enough that not all the files are covered. If you wanted to automate batchsize based on the number of files, e.g. only set a batchcount and have the batchsize computed from it, that would require changing the script, but it'd be a fairly trivial change. Commented Aug 11 at 14:55
  • I want to automate at least one of the two. Commented Aug 11 at 16:02
  • @rubberduck i have updated the answer with a version that allows you to automate whichever one of those two you want. Set one and the other will set itself. Commented Aug 11 at 19:22
3

This solution spreads the files as equally as it can, numerically speaking, across the number of required batches. It does this by taking the first nrBatches files and putting one into each of the target directories, and then taking the next nrBatches files and spreading those, and so on until we run out of files.

# How many batches
nrBatches=5

# Delete the target batch directories
rm -rf -- "/dump/merged_batches/batch"*

# And off we go…
batch=1
for file in /dump/merged/*.sorted.labeled.bam*
do
    # Only files
    [ -f "$file" ] || continue

    # Target batch directory
    tbDir="/dump/merged_batches/batch$batch"

    echo "Putting $file into batch $batch ($tbDir)" >&2
    mkdir -p -- "$tbDir"
    cp -- "$file" "$tbDir"

    # Move to the next batch
    batch=$(( (batch % nrBatches) + 1 ))
done

For testing purposes you may want to comment out the mkdir and cp commands. If you want a silent solution comment out or omit the echo.
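
For example, a dry-run version of those three lines, with the two write operations disabled, would be:

    echo "Putting $file into batch $batch ($tbDir)" >&2
    # mkdir -p -- "$tbDir"
    # cp -- "$file" "$tbDir"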

3

Here's yet another option, using an array to handle arbitrary filenames as NUL-terminated strings, and taking into consideration the size of the files:

#!/usr/bin/env bash

batches=5

declare -a files

readarray -d '' files < <(
    find merged/*.sorted.labeled.bam* -type f -print0
)

N="${#files[@]}"                # N files found
 
n=$(((N+batches-1) / batches))  # n files per batch
 
rm -rf merged_batches/
eval mkdir -p merged_batches/batch{1..$batches}

for k in "${!files[@]}"
do
    printf 'cp -p %s "merged_batches/batch%d/"\n' "${files[k]@Q}" $((k / n + 1))
done

That script will print the commands to copy your files into the batched directories. Pipe it to less or your favorite pager and inspect the proposed command list:

./test.sh | less

If you approve of that work, exit the pager, and this time, pipe the script into a shell:

./test.sh | bash
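
If you'd rather keep a record of exactly what was run, an equivalent two-step variant (cmds.sh is just a scratch filename) is:

./test.sh > cmds.sh
bash cmds.sh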

If the files are large, or so numerous that they add up to a significant amount of disk space, consider creating hard links instead. Change the printf statement to be:

printf 'ln %s "merged_batches/batch%d/"\n' "${files[k]@Q}" $((k / n + 1))

Using ln instead of cp means that the batched files won't occupy any space except the space the original files take up. If the files are large, this will also make the script run faster. If your plan is to eventually delete the original files and move forward with the batched files, then further down the road, you can eventually:

rm -rf merged/*.sorted.labeled.bam*

and still retain the batched files under the merged_batches/ directory.
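
If you do go the hard-link route and want to reassure yourself before deleting the originals, GNU stat can print each file's link count with the %h format; every batched file should report 2 or more while the originals still exist:

stat -c '%h %n' merged_batches/batch1/*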
