
I'm running a Perl script with 30 threads, each running a subroutine. To each thread I'm supplying 100 data items. In the subroutine, after the code does what it's supposed to, I store the output in a csv file. However, I find that on execution, some of the data in the csv file has overlapped. For example, I'm writing name, age, gender, and country to the csv file this way-

print OUTPUT $name.",".$age.",".$gender.",".$country.",4\n";

The csv file should have outputs as such-

Randy,35,M,USA,4
Tina,76,F,UK,4

etc.

However, in the csv file, I see that some rows have overlapped or been written haphazardly in this way-

Randy,35,M,USA,4
TinaMike,76,UK
23,F,4

Is it because some threads are executing at the same time? What can I do to avoid this? I'm calling print only after I have the data. Any suggestions?

4 is the group id which will remain constant.

Below is the code snippet:

#!/usr/bin/perl

use DBI;
use strict;
use warnings;
use threads;
use threads::shared;

my $host = "1.1.1.1";
my $database = "somedb";
my $user = "someuser";
my $pw = "somepwd";

my @threads;


open(PUT,">/tmp/file1.csv") || die "can not open file";
open(OUTPUT,">/tmp/file2.csv") || die "can not open file";

my $dbh = DBI->connect("DBI:mysql:$database;host=$host", $user, $pw ,) || die "Could not connect to database: $DBI::errstr";
$dbh->{'mysql_auto_reconnect'} = 1;

my $sql = qq{
    -- some SQL to get primary keys
};

my $sth = $dbh->prepare($sql);
$sth->execute();
while(my @request = $sth->fetchrow_array())
{
#get other columns and print to file1.csv
            print PUT $net.",".$sub.",4\n";
            $i++; #this has been declared before
}


for ( my $count = 1; $count <= 30; $count++) {
        my $t = threads->new(\&sub1, $count);
        push(@threads,$t);
}
foreach (@threads) {
        my $num = $_->join;
        print "done with $num\n";
}

sub sub1 {
        my $num = shift;

        # calculate start_num and end_num based on an internal logic

        for(my $x=$start_num; $x<=$end_num; $x++){

                print OUTPUT $name.",".$age.",".$gender.",".$country.",4\n";
                $j++; #this has been declared before
            }

        sleep(1);
        return $num;
}

The problem is in file2, which is written through the OUTPUT handle.

  • You should lock the file before printing to it, and unlock when done. But the simplest way is to print to STDOUT and then just redirect the output to a new csv file.
  • Printing to STDOUT won't help - you'll hit the same problems. Each of your threads is effectively a separate interpreter printing to the same filehandle. There's nothing that guarantees they won't interrupt each other's print statements.

2 Answers


You are multithreading and printing to a file from multiple threads. This will always end badly - print is not an 'atomic' operation, so different prints can interrupt each other.

What you need to do is serialize your output such that this cannot happen. The simplest way is to use a lock or a semaphore:

    use threads::shared;    # needed for the ':shared' attribute

    my $print_lock : shared;

    {
        lock($print_lock);
        print OUTPUT $stuff, "\n";
    }

When the lock goes out of scope, it will be released.
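Putting the pieces together, a minimal self-contained sketch of the lock approach might look like this (the filename, worker logic, and row format here are illustrative, not taken from the question's code; autoflush is enabled so each locked print reaches the file immediately rather than sitting in a per-thread buffer):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use IO::Handle;    # provides the autoflush method

# One shared variable serves as the lock token for all threads.
my $print_lock : shared;

open( my $out_fh, '>', '/tmp/locked_output.csv' ) or die "can not open file: $!";
$out_fh->autoflush(1);    # write through immediately, no per-thread buffering

sub worker {
    my ($id) = @_;
    for my $x ( 1 .. 3 ) {
        # Only one thread at a time may hold the lock, so each print
        # completes as a whole line before another thread can write.
        lock($print_lock);
        print {$out_fh} "thread$id,row$x,4\n";
    }
    return $id;
}

my @threads = map { threads->create( \&worker, $_ ) } 1 .. 4;
$_->join() for @threads;
close $out_fh;
```

Each line in the resulting file is intact; only the ordering of lines between threads is nondeterministic.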

Alternatively, have a separate thread that 'does' file IO, and use Thread::Queue to feed lines to it. Depends somewhat on whether you need any ordering/processing of the contents of 'OUTPUT'.

Something like:

    use Thread::Queue;

    my $output_q = Thread::Queue -> new();


    sub output_thread {
        open( my $output_fh, ">", "output_filename.csv" ) or die $!;

        while ( my $output_line = $output_q->dequeue() ) {
            print {$output_fh} $output_line, "\n";
        }

        close($output_fh);
    }

    sub doing_stuff_thread {
        $output_q->enqueue("something to output");    # "\n" added by the output thread
    }


     my $output_thread = threads -> create ( \&output_thread );
     my $doing_stuff_thread = threads -> create ( \&doing_stuff_thread );

     #wait for doing_stuff to finish - closing the queue will cause output_thread to flush/exit. 
     $doing_stuff_thread -> join();
     $output_q -> end;
     $output_thread -> join();

3 Comments

I don't have any idea about locks. Could you point me to some sites or snippets where I could get an understanding of locks?
Thank you. I'll research a little more then, as these are new to me :)
You may also want to look at autoflush as that may become relevant: perldoc.perl.org/IO/Handle.html - part of the problem is that, by default, file IO is buffered for efficiency. That can cause problems when multithreading. (But not if you have just one thread handling the IO, as then it's irrelevant.)
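The autoflush setting mentioned in the comment above can be enabled per filehandle; a tiny sketch (the filename is illustrative):

```perl
use strict;
use warnings;
use IO::Handle;    # provides the autoflush method on filehandles

open( my $fh, '>', '/tmp/flushed.csv' ) or die "can not open file: $!";
$fh->autoflush(1);    # disable buffering: each print reaches the OS immediately
print {$fh} "Randy,35,M,USA,4\n";
close $fh;
```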

Open the file handle globally, then use flock on it as demonstrated (the LOCK_EX, LOCK_UN, and SEEK_END constants come from the Fcntl module):

use Fcntl qw(:flock :seek);    # LOCK_EX, LOCK_UN, SEEK_END

sub log_write {
    my $line = shift;
    flock(OUTPUT, LOCK_EX)      or die "can't lock: $!";
    seek(OUTPUT, 0, SEEK_END)   or die "can't fast forward: $!";
    print OUTPUT $line;
    flock(OUTPUT, LOCK_UN)      or die "can't unlock: $!";
}

