I'm trying to parse a tab-separated dump from IMDb. (The actual dump contains an inconsistent number of tabs in each line.)

$, Claw         "OnCreativity" (2012)  [Himself]

$, Homo         Nykytaiteen museo (1986)  [Himself]  <25>
                   Suuri illusioni (1985)  [Guests]  <22>

$hutter         Battle of the Sexes (2017)  (as $hutter Boy)  [Bobby Riggs Fan]  <10>
                   NVTION: The Star Nation Rapumentary (2016)  (as $hutter Boy)  [Himself]  <1>
                   Secret in Their Eyes (2015)  (uncredited)  [2002 Dodger Fan]
                   Steve Jobs (2015)  (uncredited)  [1988 Opera House Patron]
                   Straight Outta Compton (2015)  (uncredited)  [Club Patron/Dopeman]

$lim, Bee Moe       Fatherhood 101 (2013)  (as Brandon Moore)  [Himself - President, Passages]
                   For Thy Love 2 (2009)  [Thug 1]
                   Night of the Jackals (2009) (V)  [Trooth]
                   "Idle Talk" (2013)  (as Brandon Moore)  [Himself]
                   "Idle Times" (2012) {(#1.1)}  (as Brandon Moore)  

$ly, Yung           Town Bizzness Pt 3 (2014) (V)  [Yung $ly]
                   "From Tha Bottom 2 Tha Top" (2016)  [Yung $ly]
                   "From Tha Bottom 2 Tha Top" (2016) {T-Pain (#1.2)}  [Yung $ly]

$torm, Cuntry       From the Woods: The Discovery of LYB (????)  (as Country $torm)  [Himself]

& Davi, Bruninho   Michel na Balada (2011) (V)  [Themselves]
                   Michel TelÛ: Sunset (2013) (V)  [Themselves]
                   "Programa da Sabrina" (2014) {(2016-01-23)}  [Themselves]

& Dollar Furado, Caio Corsalette    "Som Brasil" (2007) {ZezÈ di Camargo & Luciano (#5.7)}  [Themselves]

& Fabiano, CÈsar Menotti    Nascemos para Cantar (2010) (TV)  [Themselves]
                            Show da Virada (2011) (TV)  [Themselves - Performers]
                            Teleton 2010 (2010) (TV)  [Themselves]
                            "Altas Horas" (2000) {(2013-06-29)}  [Themselves]
                            "Altas Horas" (2000) {(2013-12-14)}  [Themselves]
                            "Eliana" (2009) {(2012-10-21)}  [Themselves]
                            "Tudo … PossÌvel" (2005) {(2008-04-13)}  [Themselves]
                            "TV Xuxa" (2005) {(2013-01-05)}  [Themselves]

My code:

package com.mycompany.imdbproject;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ActorListParser {

Charset charset = Charset.forName("ISO-8859-1");

BufferedReader reader = null;

public ActorListParser() {

    try {

        this.reader = Files.newBufferedReader(
                new File(System.getProperty("user.home") + ("/IMDBLogs" + "/dataDirectory" + "/acsshort.txt")).toPath(), charset);

        String line = null;

        while ((line = reader.readLine()) != null) {
            String[] lineAsArray = null;

            Pattern startsWithTab = Pattern.compile("^\t.*$");

            Matcher tab = startsWithTab.matcher(line);

            boolean startsWithTabMatcher = tab.matches();

            if (!startsWithTabMatcher) {

                lineAsArray = line.split("\t");

                for (int i = 0;i < lineAsArray.length; i++) {

                    System.out.println("Length: " + lineAsArray.length +", Value:"+ i +"  "+ lineAsArray[i]);
                }
            }else{
            //parse lines that start with a tab (actor's other movies)
            }

        }
    } catch (IOException ex) {

        Logger.getLogger(ActorListParser.class.getName()).log(Level.SEVERE, null, ex);

    }
}

public static void main(String[] args) {
    ActorListParser acp = new ActorListParser();
}

}

The output:

Length: 2, Value:0  $, Claw
Length: 2, Value:1          "OnCreativity" (2012)  [Himself]
Length: 1, Value:0  
Length: 2, Value:0  $, Homo
Length: 2, Value:1          Nykytaiteen museo (1986)  [Himself]  <25>
Length: 1, Value:0    
Length: 2, Value:0  $hutter
Length: 2, Value:1          Battle of the Sexes (2017)  (as $hutter Boy)  [Bobby Riggs Fan]  <10>
Length: 1, Value:0  
Length: 2, Value:0  $lim, Bee Moe
Length: 2, Value:1      Fatherhood 101 (2013)  (as Brandon Moore)  [Himself - President, Passages]
Length: 1, Value:0  
Length: 2, Value:0  $ly, Yung
Length: 2, Value:1      Town Bizzness Pt 3 (2014) (V)  [Yung $ly]
Length: 1, Value:0  
Length: 2, Value:0  $torm, Cuntry
Length: 2, Value:1      From the Woods: The Discovery of LYB (????)  (as Country $torm)  [Himself]
Length: 1, Value:0  
Length: 2, Value:0  & Davi, Bruninho
Length: 2, Value:1  Michel na Balada (2011) (V)  [Themselves]
Length: 1, Value:0  
Length: 2, Value:0  & Dollar Furado, Caio Corsalette
Length: 2, Value:1  "Som Brasil" (2007) {Zezà di Camargo & Luciano (#5.7)}  [Themselves]
Length: 1, Value:0  
Length: 2, Value:0  & Fabiano, CÃsar Menotti
Length: 2, Value:1  Nascemos para Cantar (2010) (TV)  [Themselves]

As you can see, I take the first appearance of the actor's name and parse the name and movie from it (for later use in a Map). I will get the other movies attributed to the actor with a separate regex.

Unfortunately, there is an array of length 1 with no value that keeps appearing in my output. What am I doing incorrectly that is creating this empty array?

  • Two tabs in a row would generate that empty value, maybe split on \t+ ? Commented Jul 20, 2016 at 22:14
  • I'd suggest lineAsArray = line.trim().split("\t+"); Commented Jul 20, 2016 at 22:17
  • What do you think the return value of "".split("\t") is? Hint: It's an array of one value: An empty string. See javadoc: If the expression does not match any part of the input then the resulting array has just one element, namely this string. Commented Jul 20, 2016 at 22:18
  • Any reason not to use one of the many excellent open source CSV parsing libraries? Commented Jul 20, 2016 at 22:19
  • @dnault. Thanks. I just don't want to. I like the exercise. Commented Jul 20, 2016 at 22:20

1 Answer

Your input contains empty lines. They don't start with a tab character, so they fall into the first branch of the if statement. Splitting an empty string on any pattern returns an array of length 1 whose only element is the empty string; for example, "".split("blah") returns exactly that. That's just how String.split works.

So the solution is to add a check for !line.isEmpty() before splitting the line.
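A minimal sketch of that check, not the asker's actual class: the class name BlankLineFilterSketch and the helper parseActorLines are made up for illustration, and it works on an in-memory list instead of the dump file. It uses trim().isEmpty() and split("\t+"), both taken from suggestions in the comments, so whitespace-only lines are skipped and a run of tabs counts as one delimiter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlankLineFilterSketch {

    // Hypothetical helper: returns the tab-split fields of every line that
    // starts a new actor entry. Blank separator lines and continuation
    // lines (those beginning with a tab) are skipped.
    public static List<String[]> parseActorLines(List<String> lines) {
        List<String[]> result = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank line between two actors: nothing to split
            }
            if (line.startsWith("\t")) {
                continue; // continuation line: the actor's other movies
            }
            result.add(line.split("\t+")); // a run of tabs is one delimiter
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample lines modeled on the dump in the question (tabs assumed).
        List<String> sample = Arrays.asList(
                "$, Claw\t\t\t\"OnCreativity\" (2012)  [Himself]",
                "",
                "$, Homo\t\t\tNykytaiteen museo (1986)  [Himself]  <25>",
                "\t\t\tSuuri illusioni (1985)  [Guests]  <22>");
        for (String[] fields : parseActorLines(sample)) {
            System.out.println("Length: " + fields.length + ", Name: " + fields[0]);
        }
    }
}
```

With the blank-line check in place, only the two name lines in the sample reach split, so no length-1 array is produced.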

@Andreas said it best in a comment:

Yup. See javadoc: If the expression does not match any part of the input then the resulting array has just one element, namely this string.
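The quoted Javadoc behavior is easy to check in isolation. The class SplitEmptyDemo and its fieldCount helper below are hypothetical names used only for this demonstration:

```java
public class SplitEmptyDemo {

    // Hypothetical helper for illustration: how many fields split() yields.
    public static int fieldCount(String line, String regex) {
        return line.split(regex).length;
    }

    public static void main(String[] args) {
        System.out.println(fieldCount("", "\t"));        // 1: the only element is ""
        System.out.println(fieldCount("a\tb", "\t"));    // 2
        System.out.println(fieldCount("a\t\tb", "\t"));  // 3: empty field between the two tabs
        System.out.println(fieldCount("a\t\tb", "\t+")); // 2: a run of tabs is one delimiter
    }
}
```

The first call is the case from the question: an empty line has nothing to match, so split returns the input itself as a single element.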

4 Comments

Yup. See javadoc: If the expression does not match any part of the input then the resulting array has just one element, namely this string.
@janos. That's it. Thank you. I can't accept your answer until 3 minutes have passed. I really goofed that.
Or !line.trim().isEmpty().
Thank you Andreas and Wiktor.
