
I would like to merge the CSV files in a folder into one file. Suppose there are 12,000 files in the folder and every file has 20,000 records. Could anyone suggest a good approach, ideally using multi-threading? I have written the code below, but I don't think it will cope with large data:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class Test {

    public static void main(String[] args) throws IOException {
        String path = "Desktop//Files//";
        List<Path> paths = getListOfFileInFolder(path);
        List<String> mergedLines = getMergedLines(paths);
        Path target = Paths.get(path + "temp.csv");
        System.out.println(target);
        Files.write(target, mergedLines, StandardCharsets.UTF_8);
    }

    public static List<Path> getListOfFileInFolder(String path) {
        List<Path> results = new ArrayList<>();
        File[] files = new File(path).listFiles();
        for (File file : files) {
            if (file.isFile()) {
                results.add(Paths.get(path + file.getName()));
            }
        }
        return results;
    }

    private static List<String> getMergedLines(List<Path> paths) throws IOException {
        List<String> mergedLines = new ArrayList<>();
        for (Path p : paths) {
            List<String> lines = Files.readAllLines(p, StandardCharsets.UTF_8);
            if (!lines.isEmpty()) {
                if (mergedLines.isEmpty()) {
                    mergedLines.add(lines.get(0)); // add the header only once
                }
                mergedLines.addAll(lines.subList(1, lines.size()));
            }
        }
        return mergedLines;
    }
}
  • Why do you want to multi-thread this? Commented Jul 11, 2017 at 14:12
  • How exactly do you want them merged? Commented Jul 11, 2017 at 14:16
  • Hi slim, for better performance I would like to process it with threads. Commented Jul 11, 2017 at 14:17
  • 12,000 files in one directory will cause performance problems on some filesystems. Commented Jul 11, 2017 at 14:18
  • Hi K.Krol, the CSVs have the same header but different data. Commented Jul 11, 2017 at 14:19

1 Answer


It is unlikely that multi-threading will improve performance in this case. Multi-threading can speed up batch operations when the CPU is the bottleneck, by utilising more than one core. But in your process the bottleneck will be disk reads: a single CPU core will handle the merge as quickly as the filesystem can deliver the bytes.

The biggest concern with the number of files you're proposing is that the initial directory listFiles() call will take some time, and the resulting File[12000] array will consume a lot of memory.
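
As a lightweight alternative for the listing step only (a sketch, not the full approach this answer builds up to), the JDK's Files.newDirectoryStream() iterates the directory lazily, so no big array ever exists; the directory path here is just a placeholder:

    try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("Desktop//Files"))) {
        for (Path file : dir) {
            if (Files.isRegularFile(file)) {
                // handle one file at a time; only the current entry is held in memory
            }
        }
    }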

Likewise, for 20,000-record files, slurping the whole thing into memory with readAllLines() uses a lot of memory and works the GC hard for no good reason.

And you're collecting the results into a single list of Strings, which will hold 240,000,000 entries by the time all 12,000 * 20,000 lines are in it. For 80-character lines that is roughly 19 GB of raw character data (and Java's UTF-16 Strings need about twice that on the heap), plus object overheads.

Better to read a small amount at a time, write it to your output file, and drop it from memory as early as possible.
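
As a rough illustration of that shape, the JDK's Files.lines() reads a file lazily instead of slurping it; here p stands in for one input file's Path and writer for an already-open Writer on the output file (both placeholders):

    try (Stream<String> lines = Files.lines(p, StandardCharsets.UTF_8)) {
        lines.skip(1).forEach(line -> {    // skip(1) drops each input file's header
            try {
                writer.write(line);
                writer.write('\n');
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }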

You could do this by going event-driven, using java.nio.file.Files.walkFileTree():

    try (OutputStream out = new FileOutputStream(outpath)) {
        Files.walkFileTree(inputDirectoryPath, Collections.emptySet(), 1, new MergeToOutputStreamVisitor(out));
    }

... where MergeToOutputStreamVisitor is something along the lines of:

    public class MergeToOutputStreamVisitor extends SimpleFileVisitor<Path> {

        private final OutputStream outstream;

        public MergeToOutputStreamVisitor(OutputStream outstream) {
            this.outstream = outstream;
        }

        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            FileUtils.copyFile(file.toFile(), outstream); // Commons-IO: stream the whole file into outstream
            return FileVisitResult.CONTINUE;
        }
    }

I've used Apache Commons-IO's FileUtils.copyFile() to squirt each file's contents into the OutputStream. If you can't use Commons-IO, write your own version of this. If you need to do more, for example skipping the first line, you can roll your own with something like:

    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
        Writer writer = new OutputStreamWriter(outstream, StandardCharsets.UTF_8);
        reader.readLine(); // header - throw it away
        String line = reader.readLine();
        while (line != null) {
            writer.write(line);
            writer.write('\n'); // readLine() strips the line terminator, so restore it
            line = reader.readLine();
        }
        writer.flush(); // flush but don't close: closing the writer would close the shared stream
    }
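
If Commons-IO isn't available and the raw bytes are all you need, the JDK's own Files.copy(Path, OutputStream) is a drop-in replacement for the copyFile() call above:

    Files.copy(file, outstream); // streams the whole file into outstream, no Commons-IO needed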

Using this approach, your program holds at most one directory entry and one line of data in memory at any moment (plus whatever buffering the library routines do, which is a good thing). It streams lines directly from the input files to the output file. I promise that you'll get nowhere near 100% usage of a single core, so multi-threading would not make it any faster.
