
I have written Perl code to process a huge number of CSV files and produce output; it takes 0.8326 seconds to complete.

    my $opname = $ARGV[0];
    my @files = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
    my %hash;
    foreach my $file (@files) {
        chomp $file;
        my $time = $file;
        $time =~ s/.*\~(.*?)\..*/$1/;

        open(IN, $file) or print "Can't open $file\n";
        while (<IN>) {
            my $line = $_;
            chomp $line;

            my $severity = (split(",", $line))[6];
            next if $severity =~ m/NORMAL/i;
            $hash{$time}{$severity}++;
        }
        close(IN);
    }
    foreach my $time (sort {$b <=> $a} keys %hash) {
        foreach my $severity ( keys %{$hash{$time}} ) {
            print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
        }
    }

Now I have written the same logic in Java, but it takes 2600 ms, i.e. 2.6 seconds, to complete. My question is: why does Java take so much longer, and how can I achieve the same speed as Perl? Note: I excluded the JVM initialization and class-loading time from the measurement.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MonitoringFileReader {
        static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();
        static String opname;

        public static void testRead(String filepath) throws IOException
        {
            File file = new File(filepath);

            FileFilter fileFilter = new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    // 86400000 ms per day, so this is a difference in days (like find -mtime -10)
                    int timediffindays = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);
                    return timediffindays < 10
                            && pathname.getName().endsWith(".csv")
                            && pathname.getName().contains(opname);
                }
            };

            File[] listoffiles = file.listFiles(fileFilter);
            long time = System.currentTimeMillis();
            for (File mf : listoffiles) {
                String timestamp = mf.getName().split("~")[5].replace(".csv", "");
                BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);
                String line;
                Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
                while ((line = br.readLine()) != null)
                {
                    String severity = line.split(",")[6];
                    if (!severity.equals("NORMAL"))
                    {
                        tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                    }
                }
                store.put(timestamp, tmp);
            }
            time = System.currentTimeMillis() - time;
            System.out.println(time + "ms");
            System.out.println(store);
        }

        public static void main(String[] args) throws IOException
        {
            opname = args[0];
            long time = System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time = System.currentTimeMillis() - time;
            System.out.println(time + "ms");
        }
    }

File name format: A~B~C~D~E~20150715080000.csv, around 500 files of ~1 MB each, with contents like:

    A,B,C,D,E,F,CRITICAL,G
    A,B,C,D,E,F,NORMAL,G
    A,B,C,D,E,F,INFO,G
    A,B,C,D,E,F,MEDIUM,G
    A,B,C,D,E,F,CRITICAL,G

Java Version: 1.7

////////////////////Update///////////////////

As per the comments below, I replaced the split with a precompiled regex, and the performance improved a lot. Now I am running this in a loop, and after 3-10 iterations the performance is quite acceptable.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MonitoringFileReader {
        static Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
        static String opname = "Etis_Egypt";
        // capture the timestamp digits in the file name (group 1, without the trailing dot)
        static Pattern pattern1 = Pattern.compile("(\\d+)\\.");
        // capture one comma-terminated field, quoted (group 1) or unquoted (group 2)
        static Pattern pattern2 = Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
        static long currentsystime = System.currentTimeMillis();

        public static void testRead(String filepath) throws IOException
        {
            File file = new File(filepath);

            FileFilter fileFilter = new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    // 86400000 ms per day, so this is a difference in days (like find -mtime -10)
                    int timediffindays = (int) ((currentsystime - pathname.lastModified()) / 86400000);
                    return timediffindays < 10
                            && pathname.getName().endsWith(".csv")
                            && pathname.getName().contains(opname);
                }
            };

            File[] listoffiles = file.listFiles(fileFilter);
            long time = System.currentTimeMillis();
            for (File mf : listoffiles) {
                Matcher matcher = pattern1.matcher(mf.getName());
                matcher.find();
                //String timestamp=mf.getName().split("~")[5].replace(".csv", "");
                String timestamp = matcher.group(1);
                BufferedReader br = new BufferedReader(new FileReader(mf));
                String line;
                Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
                while ((line = br.readLine()) != null)
                {
                    matcher = pattern2.matcher(line);
                    // advance to the 7th comma-terminated field, the severity column
                    for (int f = 0; f < 7; f++) {
                        matcher.find();
                    }
                    //String severity=line.split(",")[6];
                    // take the captured field; matcher.group() would include the trailing comma
                    String severity = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
                    if (!severity.equals("NORMAL"))
                    {
                        tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                    }
                }
                br.close();
                store.put(timestamp, tmp);
            }
            time = System.currentTimeMillis() - time;
            //System.out.println(time+"ms");
            //System.out.println(store);
        }

        public static void main(String[] args) throws IOException
        {
            //opname = args[0];
            for (int i = 0; i < 20; i++) {
                long time = System.currentTimeMillis();
                testRead("./SMF/data/analyser/archive");
                time = System.currentTimeMillis() - time;
                System.out.println("Time taken for " + i + " is " + time + "ms");
            }
        }
    }

But I have another question now. See the results while running on a small dataset:

    Time taken for 0 is 218ms
    Time taken for 1 is 134ms
    Time taken for 2 is 127ms
    Time taken for 3 is 98ms
    Time taken for 4 is 90ms
    Time taken for 5 is 77ms
    Time taken for 6 is 71ms
    Time taken for 7 is 72ms
    Time taken for 8 is 62ms
    Time taken for 9 is 57ms
    Time taken for 10 is 53ms
    Time taken for 11 is 58ms
    Time taken for 12 is 59ms
    Time taken for 13 is 46ms
    Time taken for 14 is 44ms
    Time taken for 15 is 45ms
    Time taken for 16 is 53ms
    Time taken for 17 is 45ms
    Time taken for 18 is 61ms
    Time taken for 19 is 42ms

For the first few iterations the time taken is higher, and then it decreases. Why?

Thanks,

  • Same goes for Perl. metacpan.org/pod/Text::CSV will be much safer than your own implementation. Commented Jul 15, 2015 at 9:34
  • Perl is basically a text-processing language; it was developed with text processing in mind. Commented Jul 15, 2015 at 9:38
  • There is a lot you can do to make that Perl code go faster! Commented Jul 15, 2015 at 9:50
  • Text::CSV may be safer, but it is probably slower than your existing implementation. Commented Jul 15, 2015 at 13:35
  • OK, this makes more sense. See my answer. Your Java is far from nice; if you're interested in making the code better, post it on Code Review. Please clean it up a bit first. Commented Jul 15, 2015 at 19:20

2 Answers


A few seconds are not enough for Java to get to its full speed because of JIT compilation. Java is optimized for servers running for hours (or years), not for tiny utilities taking just a few seconds.
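For example, you could discard a few warm-up iterations before timing. A minimal sketch reusing the question's testRead (the warm-up count of 10 is an arbitrary choice):

    // Let the JIT compile the hot paths first, then measure the steady state.
    for (int i = 0; i < 10; i++) {
        testRead("./SMF/data/analyser/archive"); // throw-away warm-up runs
    }
    long start = System.currentTimeMillis();
    testRead("./SMF/data/analyser/archive");     // measure the warmed-up run
    System.out.println((System.currentTimeMillis() - start) + "ms");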

Concerning class loading, I guess you don't know about e.g. Pattern and Matcher, which you use indirectly in split and which get loaded as needed.


static Map<String, Map<String,Integer>> store= new TreeMap<String, Map<String,Integer>>(); 

A Perl hash is most like a Java HashMap, but you're using a TreeMap, which is slower. I guess this doesn't matter much; just note that there are far more differences than you might think.
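If you only need the keys ordered when printing, a closer match to the Perl (which sorts only at output time) would be a plain HashMap plus a one-off sort when producing the report. A minimal sketch, assuming store is switched to a HashMap and java.util.List/ArrayList/Collections are imported:

    // Sort the keys once at output time, like Perl's sort {$b <=> $a} keys %hash.
    // The timestamps are fixed-width digit strings, so string order equals numeric order.
    List<String> times = new ArrayList<String>(store.keySet());
    Collections.sort(times, Collections.reverseOrder());
    for (String t : times) {
        System.out.println(t + " -> " + store.get(t));
    }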


 int timediffinhr=(int) ((System.currentTimeMillis()-pathname.lastModified())/86400000);

You're reading the current time anew for every file, and you do it even for files whose names don't end with ".csv". That's surely not what find does.
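For comparison, a sketch of a filter closer to what find does: read the clock once up front and run the cheap name tests before touching file metadata:

    final long cutoff = System.currentTimeMillis() - 10L * 24 * 60 * 60 * 1000; // 10 days ago
    FileFilter fileFilter = new FileFilter() {
        @Override
        public boolean accept(File f) {
            String name = f.getName();
            // cheap string checks first; stat the file only when the name matches
            return name.endsWith(".csv")
                    && name.contains(opname)
                    && f.lastModified() >= cutoff;
        }
    };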


String timestamp=mf.getName().split("~")[5].replace(".csv", "");

Unlike Perl, Java doesn't cache regexes. As far as I know, a split on a single character gets optimized separately, but otherwise you'd be much better off using something like

private static final Pattern FILENAME_PATTERN =
    Pattern.compile("(?:[^~]*~){5}([^~]*)\\.csv");

Matcher m = FILENAME_PATTERN.matcher(mf.getName());
if (!m.matches()) ... // do what you want with unexpected names
String timestamp = m.group(1);

 BufferedReader br = new BufferedReader(new FileReader(mf), 1024*500);

This could be the culprit. By default, it uses platform encoding, which may be UTF-8. This is usually slower than ASCII or LATIN-1. As far as I know Perl works directly with bytes unless instructed otherwise.

The buffer size of half a megabyte is insanely big for anything taking just a few seconds, especially when you allocate it multiple times. Note that there's nothing like this in your Perl code.
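To take encoding out of the equation, you could decode with an explicit single-byte charset and the default 8 KB buffer. A sketch, assuming the input is plain ASCII/Latin-1 (StandardCharsets requires Java 7):

    // Explicit ISO-8859-1 decoding; each byte maps to one char, no UTF-8 state machine.
    // Needs java.io.FileInputStream, java.io.InputStreamReader, java.nio.charset.StandardCharsets.
    BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(mf), StandardCharsets.ISO_8859_1));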


That all said, Perl and find might indeed be faster for such short tasks.


9 Comments

You make some good points. I would add that the OP has implemented the Perl code exactly, including making store (%hash in Perl) a hash of hashes. The corresponding Java TreeMap of HashMaps is far less obvious and presumably much slower than the Perl original. Perl buffers input streams in an 8KB buffer by default, but its split doesn't use the regex engine. I don't know how FileFilter works, but I believe Perl can emulate the filter faster than shelling out to find. In the end I think it is down to the OP to present what he has. I haven't seen a good case for a rewrite.
Thanks for your answer, it makes sense. I would like to add a few things: 1. I got the class-loading part; the class is loaded when it's required. 2. On the TreeMap part: in Perl I am also sorting the map, hence the TreeMap in Java, and with a HashMap there was no improvement either.
3. The split's internal regex part is new to me. 4. Initially I used the default buffer size for the reader, but then I thought the performance might be slow due to more disk I/O, hence I increased it. But nothing changed in terms of timings.
@user3080158 I'd bet it'll get better when iterating multiple times. A typical benchmark makes 5-20 throw-away iterations before it starts measuring the time. The question is how good it gets. Without closing the readers, you'll soon run out of file descriptors. +++ If you could provide the data, someone could try harder to optimize.
@RBanerjee That's JIT compilation. At first, the code gets interpreted and stats get collected. Concurrently, a simple compiler (C1) runs and produces some medium quality code, which gets used when ready. Then, a better compiler (C2) runs to produce highly optimized code. And this all holds for each relevant part of the code (parts executed just a few times usually need no compilation). It's actually a bit more complicated (google out OSR or deoptimization).

One obvious thing: the use of split() will slow you down. According to the JDK source code I can find online, Java does not cache compiled regexps (please correct me if I am wrong).

Make sure you use pre-compiled regexps in your Java code.
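For example, a minimal sketch (note the other answer's caveat that a single-character split is already fast-pathed in the JDK, so this matters mainly for more complex patterns):

    // Compile the pattern once and reuse it for every line.
    private static final Pattern COMMA = Pattern.compile(",");

    // inside the per-line loop, instead of line.split(","):
    String severity = COMMA.split(line)[6];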

1 Comment

Thanks!! I will try and let you know.
