1

I have a Bash script that has a loop inside of which there is a Bash command that calls another Bash script which in turn calls Python scripts.

Each of these bash commands within the loops could be run independently from each other. When I later run it on an actual dataset, it takes some time to execute each command. Therefore, I would like to take advantage and parallelize this part of the script.

I spent a few days going over the options for parallel execution in Bash that also let me choose how many cores to use, so that I won't flood the server. Of the options I found, GNU xargs -P seemed the most reasonable, since it does not require a specific Bash version and works without installing extra libraries. However, I am having difficulty making it work, even though it seems straightforward.

#!/bin/bash

while getopts i:t: option
do
case "${option}"
in
    i) in_f=${OPTARG};;
    t) n_threads=${OPTARG};;
esac
done    

START=$(date +%s)
class_file=$in_f
classes=( $(awk '{print $1}' ./$class_file))
rm -r tree_matches.txt
n="${#classes[@]}"
for i in $(seq 0  $n);
   do
     for j in $(seq $((i+1)) $((n-1)));
         do
            echo ${classes[i]}"    "${classes[j]} >> tree_matches.txt
         done
   done
col1=( $(awk '{print $1}' ./tree_matches.txt ))
col2=( $(awk '{print $2}' ./tree_matches.txt ))


printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

n_pairs="${#col1[@]}"

END=$(date +%s)
DIFF=$(( $END - $START ))
echo "Exec time $DIFF seconds"

You can ignore the initial two nested loops; I just pasted the entire script for completeness. The part to be parallelized is the xargs line near the end of the script:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

This will loop over all pairs which is in my case 1275 in total and will ideally execute myFunction.sh in parallel with the specified number of threads using the variable $n_threads.

However, I am doing something wrong because the iterator k in that line is not indexing my two arrays ${classes[k]} and ${classes[k]}.

The loop keeps iterating 1275 times but it only indexes the first element of both arrays when I echo them. I later changed that line to this for troubleshooting:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" k

It is actually incrementing the value of k each time it loops, however when I change that line to this:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" "$((k))"

it is printing out 0, 1275 times as the value for k. I don't know what I'm doing wrong.

I actually have two vectors of the same size that are the input to the myFunction.sh script. I just want an integer index so that I can index both at the same time and call my function with the two values at that index. I modified my code as follows based on your suggestion:

 for x in {0..10};
    do
        printf "%d\0" "$x"; done| xargs -0 -I @@ -P $n_threads sh markerGenes2TreeMatch.sh -1 ${col1[@@]}-2 ${col2[@@]}

however now when I execute the code I get the following error:

@@: syntax error: operand expected (error token is "@@")

I guess this index @@ is still a string. I just want integer indices to be generated as I loop, so that I can execute this command in parallel.

3
  • I'm not sure why your last command doesn't work, but notice that -I implies -L 1, i.e., only one line of input will be processed at a time. Commented Jan 29, 2019 at 16:12
  • k can only be incremented by xargs if it sees it! ${classes[k]} is expanded by the initial script. Commented Jan 29, 2019 at 16:16
  • Right, I think the problem is the order of evaluation. $(( ... )) is handled by the shell before xargs gets to see it, that's why $(( {} )) (or whatever you use as the argument to -I) doesn't work. You might have to use bash -c in your xargs command, see the manual. Commented Jan 29, 2019 at 16:18

3 Answers

1

For the line in question:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

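${classes[k]} will be expanded by the shell before xargs has a chance to see it. The subscript of an indexed array is evaluated arithmetically, and since k is not set as a shell variable it evaluates to 0, so every invocation receives the first element of classes.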
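
The order of evaluation can be checked with a small experiment. This is only a sketch: the three-element classes array is made-up sample data, and echo stands in for the real worker script.

```shell
#!/bin/bash
# Hypothetical sample data; the real script fills classes from a file.
classes=(alpha beta gamma)

# The calling shell expands "${classes[k]}" once, before xargs starts.
# The subscript is evaluated arithmetically and k is not a shell variable,
# so it counts as 0. xargs then only replaces the literal k in the text.
printf '%s\n' 0 1 2 | xargs -I k echo "index k -> ${classes[k]}"
# prints:
# index 0 -> alpha
# index 1 -> alpha
# index 2 -> alpha
```

The index changes on every line because xargs substitutes it, while the array element stays the same because it was already fixed when xargs was started.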

Perhaps you could reorder to:

for x in {0..1275}; do printf "%s\0" "${classes[$x]}"; done |\
xargs -0 -I @@ -P $n_threads sh myFunction.sh -1 @@ -2 @@
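
If two arrays of equal length have to be passed together, one possible variant of the same idea is to print both values per index and let xargs take them two at a time. The following is a sketch under assumptions: col1, col2 and n_threads hold made-up sample values, and the sh -c body only echoes the command it would run instead of calling myFunction.sh.

```shell
#!/bin/bash
# Hypothetical sample data standing in for col1 and col2 from the question.
col1=(A A B)
col2=(B C C)
n_threads=2

# Emit col1[i] and col2[i] back to back, NUL-terminated, so that every
# group of two items belongs to one worker invocation.
emit_pairs() {
    local i
    for i in "${!col1[@]}"; do
        printf '%s\0%s\0' "${col1[i]}" "${col2[i]}"
    done
}

# -n 2 hands two items to each invocation; -P caps the number of parallel jobs.
# Inside sh -c the two items arrive as $1 and $2 (the _ fills $0).
emit_pairs | xargs -0 -n 2 -P "$n_threads" \
    sh -c 'echo "would run: myFunction.sh -1 $1 -2 $2"' _
```

To run the real worker, the echo could be replaced with sh myFunction.sh -1 "$1" -2 "$2". The output order is not deterministic when more than one job runs at a time.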

4 Comments

I'm getting the same error again. Actually, I have two vectors that are the input to myFunction.sh; they are the same size, and I can use one index value per command execution to call that function.
You could replace sh with echo and check that the expected value is being passed in. Are you sure classes holds valid data?
It's actually two vectors, col1 and col2. I updated my main post to clarify this.
Okay, I finally figured it out. I simplified the myFunction.sh script so that it now takes one argument and then splits (cuts) the line into two, using a comma as the delimiter. I followed your suggested structure and it's working fine. Thanks all for the suggestions.
0

This line isn't working as you think it is:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

What happens is that Bash first expands things like $n_threads and ${classes[k]} into strings and only then calls xargs. Incidentally, ${classes[k]} always yields the first element: the subscript of an indexed array is evaluated arithmetically, and since no shell variable k is set, it evaluates to 0. Writing ${classes[$k]} would not help either, because the k that xargs substitutes is plain text in its arguments, not a shell variable.

Maybe a better approach would be to write the values from classes into a file and use that as input for xargs. You may have to change myFunction.sh to accept a single argument (= one line of input) and take it apart in the script.
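
A minimal sketch of that file-based approach follows. The comma delimiter, the temporary pairs file and the inline sh -c worker are assumptions made for illustration; a real myFunction.sh would do the splitting itself.

```shell
#!/bin/bash
# Hypothetical sample data; the real script reads classes from a file.
classes=(A B C)
n_threads=2
pairs_file=$(mktemp)

# Write each unordered pair once, one comma-separated pair per line.
n=${#classes[@]}
for ((i = 0; i < n; i++)); do
    for ((j = i + 1; j < n; j++)); do
        printf '%s,%s\n' "${classes[i]}" "${classes[j]}"
    done
done > "$pairs_file"

# Each input line becomes the single argument $1 of one worker invocation.
# The worker splits it on the comma with parameter expansion.
results=$(xargs -P "$n_threads" -I{} \
    sh -c 'first=${1%%,*}; second=${1#*,}; echo "first=$first second=$second"' _ {} \
    < "$pairs_file" | sort)
printf '%s\n' "$results"

rm -f "$pairs_file"
```

Passing the line as a positional argument rather than pasting {} into the sh -c string keeps the shell from re-interpreting special characters in the data.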


0

With GNU Parallel you could probably do:

classes=( $(awk '{print $1}' ./$class_file))
parallel markerGenes2TreeMatch.sh -1 {=1 'if($arg[1] ge $arg[2]) { skip() }' =} -2 {2} ::: ${classes[@]} ::: ${classes[@]}

or:

parallel --plus markerGenes2TreeMatch.sh -1 {1choose_k} -2 {2choose_k} ::: ${classes[@]} ::: ${classes[@]}

Then you can skip generating tree_matches.txt and the col1/col2 arrays altogether.

Use parallel --embed to include GNU Parallel directly in your script, so you do not have external dependencies.
