I've written a script in Perl which takes an XML file, parses it, and builds a CSV. Parsing and sorting the XML goes very smoothly, but once I get into larger datasets (e.g. a CSV with 10,000 rows and 260 columns) the script starts to take a huge amount of time (~1 hour) while building the CSV string. I understand that Perl probably isn't the best at string concatenation, but I would have thought it would be more efficient than this.
Basically, for the sake of sorting, I have two hashes of arrays. One hash contains the arrays I use for sorting. The other hash contains arrays for all of the other columns (columns I want to write into the CSV, but which have no bearing on how I sort). The problem code (and I have verified this is the block taking forever) looks like this:
my $csv = "Header1, Header2, Header3, Header4,...,HeaderN-1,HeaderN\n";
foreach my $index (@orderedIndecies) {
    my @records = @{$primaryFields{"Important Field 1"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    $csv .= $primaryIndex[$index] >= 0 ? "$primaryIndex[$index]," : ",";
    @records = @{$primaryFields{"Important Field 2"}};
    $csv .= $records[$index] ? "$records[$index]," : ",";
    foreach my $key (@keys) {
        @records = @{$csvContent{$key}};
        if ($key eq $last) {
            $csv .= $records[$index] ? "$records[$index]\n" : "\n";
        } else {
            $csv .= $records[$index] ? "$records[$index]," : ",";
        }
    }
}
I have also tried the same thing using the join method instead of ".=". I've also tried forgoing the string aggregation altogether and writing directly to a file. Neither helped much. I'll be the first to admit that my knowledge of memory management in Perl probably isn't the greatest, so please feel free to school me (constructively). Also, if you think this is something I should consider rewriting outside of Perl, please let me know.
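One cost worth measuring in the loop above: `my @records = @{$primaryFields{"Important Field 1"}}` copies the entire array on every iteration (and the inner loop copies one array per key per row). Holding a reference instead avoids the copies. A minimal sketch with toy stand-in data (the hash and array names are borrowed from the snippet above, the values are made up):

```perl
use strict;
use warnings;

# Toy stand-ins for %primaryFields and @orderedIndecies from the question.
my %primaryFields   = ("Important Field 1" => [ "a", "b", "c", "d", "e" ]);
my @orderedIndecies = (4, 2, 0, 1, 3);

# Instead of copying the whole array each pass:
#   my @records = @{$primaryFields{"Important Field 1"}};   # O(n) copy per iteration
# take the reference once, outside the loop:
my $records = $primaryFields{"Important Field 1"};          # no copy

my @out;
foreach my $index (@orderedIndecies) {
    # Arrow syntax indexes through the reference directly.
    push @out, defined $records->[$index] ? $records->[$index] : "";
}
print join(",", @out), "\n";    # e,c,a,b,d
```

With 260 columns and 10,000 rows, the copying version duplicates every column array 10,000 times, which alone can account for runtimes in the range you describe.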
EDIT: Some sample XML (please keep in mind I'm not in a position to edit the structure of the XML):
<fields>
  <field>
    <Name>IndicesToBeSorted</Name>
    <Records>idx12;idx14;idx18;...idxN-1;idxN</Records>
  </field>
  <field>
    <Name>Important Field1</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Important Field2</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
  <field>
    <Name>More Records...</Name>
    <Records>val1;val2;;val4;...;valn-1;valn</Records>
  </field>
</fields>
The position of a record in one field corresponds to the position in every other field. For example, the first item from each "Records" element is associated across fields, and together those items make up a row in my CSV. So basically, my script parses all of this and creates an array of ordered indices (which is what is in @orderedIndecies in my example). @orderedIndecies contains data like...
print "$orderedIndecies[0]\n"; # prints index of location of idx0
print "$orderedIndecies[1]\n"; # prints index of location of idx1
print "$orderedIndecies[2]\n"; # prints index of location of idx2
print "$orderedIndecies[3]\n"; # prints index of location of idx3
I do things this way because the index string from @orderedIndecies arrives out of order, and I didn't want to move all of the data around.
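For reference, an order array like this can be built by sorting positions rather than moving the data: sort the list 0..$#indexField by the value found at each position. A sketch, assuming the IndicesToBeSorted string has already been split on ";" (the sample values here are made up):

```perl
use strict;
use warnings;

# Hypothetical parsed contents of the IndicesToBeSorted field,
# already split on ";" -- positions 0..3 hold idx values out of order.
my @indexField = ("idx14", "idx12", "idx18", "idx13");

# Sort the positions 0..$#indexField by the numeric part of the idx value
# at each position, so $orderedIndecies[0] is the position of the smallest
# index, $orderedIndecies[1] the position of the next smallest, and so on.
my @orderedIndecies = sort {
    my ($na) = $indexField[$a] =~ /(\d+)/;
    my ($nb) = $indexField[$b] =~ /(\d+)/;
    $na <=> $nb;
} 0 .. $#indexField;

print "@orderedIndecies\n";    # 1 3 0 2
```

The data arrays never move; the order array is then used exactly as in the loops above to read each column in sorted order.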
EDIT: FINAL ANSWER
open my $csv_fh, ">", $$fileNameRef or die "$$fileNameRef: $!";
print $csv_fh "Important Field 1,Index Field,Important Field 2,";
# Defining $comma, $endl, and $empty allows me to do something like:
#
#     print $csv_fh $val ? $val : $empty;
#     print $csv_fh $comma;
#
# As opposed to...
#
#     print $csv_fh $val ? "$val," : ",";
#
# Note: the first method avoids building the intermediate string "$val,"
my $comma = ",";
my $endl  = "\n";
my $empty = "";
my @keys = sort(keys %csvContent);
my $last = $keys[-1];
foreach (@keys) {
    print $csv_fh $_;
    print $csv_fh $_ eq $last ? $endl : $comma;
}
# Even though the hash lookup is probably very efficient, I still
# saw no need to redo it constantly, so I defined these here as
# opposed to inline within the for loops
my @ImportantFields1 = @{$primaryFields{"Important Field 1"}};
my @ImportantFields2 = @{$primaryFields{"Important Field 2"}};
print "\n\n--------- BUILD CSV START ---------------\n\n";
foreach my $index (@orderedIndecies) {
    print $csv_fh exists $ImportantFields1[$index] ? $ImportantFields1[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh $originalIndexField[$index] >= 0 ? $originalIndexField[$index] : $empty;
    print $csv_fh $comma;
    print $csv_fh exists $ImportantFields2[$index] ? $ImportantFields2[$index] : $empty;
    # If needed, this is where you would make sure to escape commas
    foreach my $key (@keys) {
        print $csv_fh $comma;
        print $csv_fh exists $csvContent{$key}[$index]
            ? $csvContent{$key}[$index]
            : $empty;
    }
    print $csv_fh $endl;
}
print "\n\n------- CSV contents written to file -----------\n\n";
close($csv_fh);
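One caveat with the hand-rolled writer above: if any value can contain a comma, quote, or newline, it must be escaped, as the comment in the inner loop notes. The Text::CSV module from CPAN (which transparently uses the faster Text::CSV_XS when that is installed) handles the escaping, and its print method writes a row straight to the filehandle, so it keeps the no-intermediate-string property. A sketch (the out.csv file name and sample row are made up):

```perl
use strict;
use warnings;
# Not core Perl: Text::CSV comes from CPAN.
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

open my $fh, ">", "out.csv" or die "out.csv: $!";

# Each row is just an array ref; embedded commas, quotes, and
# newlines in the values are quoted/escaped automatically.
$csv->print($fh, ["Important Field 1", "Index Field", "Important Field 2"]);
$csv->print($fh, ["plain", "has,comma", 'has "quotes"']);

close $fh;
```

This would replace the per-field print calls in the row loop with one `$csv->print($fh, \@row)` per row.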
Thanks for the help guys :D