I've written a script in Perl which takes an XML file, parses it, and builds a CSV. Parsing and sorting the XML goes very smoothly, but once I get into larger datasets (e.g. a CSV with 10,000 rows and 260 columns) the script starts to take a huge amount of time (~1 hour) while building the CSV string. I understand that Perl probably isn't the best at string concatenation, but I would have thought it would be more efficient than this.
Basically, for the sake of sorting, I have two hashes of arrays. One hash contains the arrays I use for sorting. The other hash contains arrays for all of the other columns (columns I want to write into the CSV, but which have no bearing on how I sort). The problem code (and I have verified this is the block taking forever) looks like this:
my $csv = "Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n";
foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    $csv .= $primaryIndex[$index] >= 0 ? "$primaryIndex[$index]," : ",";
    @records = @{$primaryFields{"Important Field 2"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    foreach my $key (@keys) {
        @records = @{$csvContent{$key}};
        if ($key eq $last) {
            $csv .= $records[$index] ? "$records[$index]\n" : "\n";
        } else {
            $csv .= $records[$index] ? "$records[$index]," : ",";
        }
    }
}
I have also tried the same thing using the join method instead of ".=". I've also tried forgoing the string aggregation altogether and writing directly to a file. Neither helped much. I'll be the first to admit that my knowledge of memory management in Perl probably isn't the greatest, so please feel free to school me (constructively). Also, if you think this is something I should consider rewriting outside of Perl, please let me know.
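One cost worth measuring in the loop above: `my @records = @{$primaryFields{"Important Field 1"}}` copies the entire array on every iteration (and the inner loop copies one array per key per row). Holding a reference instead avoids the copies. A minimal sketch with toy stand-in data (the hash and array names are borrowed from the snippet above, the values are made up):

```perl
use strict;
use warnings;

# Toy stand-ins for %primaryFields and @orderedIndecies from the question.
my %primaryFields   = ("Important Field 1" => [ "a", "b", "c", "d", "e" ]);
my @orderedIndecies = (4, 2, 0, 1, 3);

# Instead of copying the whole array each pass:
#   my @records = @{$primaryFields{"Important Field 1"}};   # O(n) copy per iteration
# take the reference once, outside the loop:
my $records = $primaryFields{"Important Field 1"};          # no copy

my @out;
foreach my $index (@orderedIndecies) {
    # Arrow syntax indexes through the reference directly.
    push @out, defined $records->[$index] ? $records->[$index] : "";
}
print join(",", @out), "\n";    # e,c,a,b,d
```

With 260 columns and 10,000 rows, the copying version duplicates every column array 10,000 times, which alone can account for runtimes in the range you describe.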
EDIT: Some sample XML (please keep in mind I'm not in a position to edit the structure of the XML):
<fields>
  <field>
    <Name>IndicesToBeSorted</Name>
    <Records>idx12;idx14;idx18;...idxN-1;idxN</Records>
  </field>
  <field>
    <Name>Important Field1</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Important Field2</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>More Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
</fields>
The position of a record in one field corresponds to the position in every other field. For example, the first item from each "Records" element is associated across fields, and together those items make up a row in my CSV. So basically, my script parses all of this and creates an array of ordered indices (which is what is in @orderedIndecies in my example). @orderedIndecies contains data like...
print "$orderedIndecies[0]\n"; # prints index of location of idx0
print "$orderedIndecies[1]\n"; # prints index of location of idx1
print "$orderedIndecies[2]\n"; # prints index of location of idx2
print "$orderedIndecies[3]\n"; # prints index of location of idx3
I do things this way because the index string from @orderedIndecies arrives out of order, and I didn't want to move all of the data around.
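For reference, an order array like this can be built by sorting positions rather than moving the data: sort the list 0..$#indexField by the value found at each position. A sketch, assuming the IndicesToBeSorted string has already been split on ";" (the sample values here are made up):

```perl
use strict;
use warnings;

# Hypothetical parsed contents of the IndicesToBeSorted field,
# already split on ";" -- positions 0..3 hold idx values out of order.
my @indexField = ("idx14", "idx12", "idx18", "idx13");

# Sort the positions 0..$#indexField by the numeric part of the idx value
# at each position, so $orderedIndecies[0] is the position of the smallest
# index, $orderedIndecies[1] the position of the next smallest, and so on.
my @orderedIndecies = sort {
    my ($na) = $indexField[$a] =~ /(\d+)/;
    my ($nb) = $indexField[$b] =~ /(\d+)/;
    $na <=> $nb;
} 0 .. $#indexField;

print "@orderedIndecies\n";    # 1 3 0 2
```

The data arrays never move; the order array is then used exactly as in the loops above to read each column in sorted order.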
EDIT: FINAL ANSWER
open my $csv_fh, ">", $$fileNameRef or die "$$fileNameRef: $!";
print $csv_fh "Important Field 1,Index Field,Important Field 2,";
# Defining $comma, $endl, and $empty allows me to do something like:
#
#     print $csv_fh $val ? $val : $empty;
#     print $csv_fh $comma;
#
# As opposed to...
#
#     print $csv_fh $val ? "$val," : ",";
#
# Note: the first method avoids building the intermediate string "$val,"
my $comma = ",";
my $endl  = "\n";
my $empty = "";
my @keys = sort(keys %csvContent);
my $last = $keys[-1];
foreach (@keys) {
    print $csv_fh $_;
    print $csv_fh $_ eq $last ? $endl : $comma;
}
# Even though the hash lookup is probably very efficient, I still
# saw no need to redo it constantly, so I defined these here as
# opposed to inline within the for loops
my @ImportantFields1 = @{$primaryFields{"Important Field 1"}};
my @ImportantFields2 = @{$primaryFields{"Important Field 2"}};
print "\n\n--------- BUILD CSV START ---------------\n\n";
foreach my $index (@orderedIndecies) {
    print $csv_fh exists $ImportantFields1[$index] ? $ImportantFields1[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh $originalIndexField[$index] >= 0 ? $originalIndexField[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh exists $ImportantFields2[$index] ? $ImportantFields2[$index] : $empty;
    # If needed, this is where you would make sure to escape commas
    foreach my $key (@keys) {
        print $csv_fh $comma;
        print $csv_fh exists $csvContent{$key}[$index]
            ? $csvContent{$key}[$index]
            : $empty;
    }
    print $csv_fh $endl;
}
print "\n\n------- CSV contents written to file -----------\n\n";
close($csv_fh);
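One caveat with the hand-rolled writer above: if any value can contain a comma, quote, or newline, it must be escaped, as the comment in the inner loop notes. The Text::CSV module from CPAN (which transparently uses the faster Text::CSV_XS when that is installed) handles the escaping, and its print method writes a row straight to the filehandle, so it keeps the no-intermediate-string property. A sketch (the out.csv file name and sample row are made up):

```perl
use strict;
use warnings;
# Not core Perl: Text::CSV comes from CPAN.
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

open my $fh, ">", "out.csv" or die "out.csv: $!";

# Each row is just an array ref; embedded commas, quotes, and
# newlines in the values are quoted/escaped automatically.
$csv->print($fh, ["Important Field 1", "Index Field", "Important Field 2"]);
$csv->print($fh, ["plain", "has,comma", 'has "quotes"']);

close $fh;
```

This would replace the per-field print calls in the row loop with one `$csv->print($fh, \@row)` per row.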
Thanks for the help guys :D