I'm writing a short application that finds the most popular hero in a large dataset of Marvel movies, based on each hero's number of appearances.

I've installed PySpark both from the Anaconda environment and from the console to try to solve this error, without success. I also installed the Java JDK through conda, but that didn't help either.

The error I'm getting is the following:

    py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
    : java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2018:19
    at org.apache.hadoop.fs.Path.initialize(Path.java:205)
    at org.apache.hadoop.fs.Path.<init>(Path.java:171)
    at org.apache.hadoop.fs.Path.<init>(Path.java:93)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2018:19
    at java.net.URI.checkPath(URI.java:1823)
    at java.net.URI.<init>(URI.java:745)
    at org.apache.hadoop.fs.Path.initialize(Path.java:202)
    ... 30 more

And here is my code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 20 13:33:45 2019

@author: Carlos
"""
from pyspark import SparkConf, SparkContext
import collections

conf = SparkConf().setMaster("local").setAppName("personaje_mas_popular")
sc=SparkContext(conf=conf)

def numerocoapariciones(linea):
    elementos = linea.split()
    return (int(elementos[0]), len(elementos)-1)

def codificarnombres(linea):
    fields = linea.split('\"')
    return (int(fields[0]), fields[1].encode("utf8"))

nombres = sc.textFile("./Marvel/Marvel-names.txt")
nombresrdd=nombres.map(codificarnombres)

lines = sc.textFile("./Marvel/Marvel-graph.txt")

emparejar = lines.map(numerocoapariciones)
totalapariciones = emparejar.reduceByKey(lambda x,y :x + y)

flipped = totalapariciones.map(lambda xy: (xy[1], xy[0]))

maspopular = flipped.max()

nombremaspopular = nombresrdd.lookup(maspopular[1])[0]
print("Héroe más popular: \n" + str(nombremaspopular))

Am I missing some library? Is it a version problem? I'm running Python 3.7.0 and PySpark 2.4.0, and writing the code in Spyder (Anaconda environment).

  • This might help you, and maybe try file names without dashes? stackoverflow.com/questions/25334604/… Commented Mar 21, 2019 at 14:46
  • @JoeA Yeah, I think the issue is in the file path, and I'm trying multiple things but with no result. I think I would need the 'mac' version of this answer: stackoverflow.com/questions/39552235/… Commented Mar 21, 2019 at 15:54
  • Just tried the full path but got the same error: "/Users/Carlos/Desktop/UEM/2018\:19/GrandesVolúmenesDatos/Marvel/Marvel-names.txt" Commented Mar 21, 2019 at 16:05
  • The useful line in the error message is: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2018:19, which probably tells you that there is a problem with your file path. Try to escape the path string properly. Commented Mar 21, 2019 at 16:09
  • Thank you both, Joe and Alexandros, for your answers; it was a path issue. I'll write a full answer since it may be useful for other users. Commented Mar 21, 2019 at 16:15

1 Answer

After trying different things, the cause turned out to be the simplest of them all: a file path issue.

The original path of both my .py file and my .txt files included a ':' (the folder was named 2018:19, the same string that appears in the error message), and as I've read on multiple sites, a colon can cause errors when file paths are read. I just moved both the .py and the .txt files to another path (the Desktop, to keep things as simple as possible) and it worked perfectly.
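If moving the project is not convenient, a possible alternative is to check the input path before handing it to Spark and copy the file somewhere colon-free when needed. The following is only a minimal sketch of that idea, not something I ran against the original setup: the helper names `has_colon_component` and `stage_without_colon` are hypothetical, and it assumes POSIX-style paths (macOS/Linux) and input files small enough to copy.

```python
import os
import shutil
import tempfile


def has_colon_component(path):
    """Return True if any directory or file name in the absolute path contains ':'.

    The path is made absolute first, because a relative path such as
    "./Marvel/Marvel-names.txt" is still resolved against the current
    working directory, which may itself contain a colon.
    """
    absolute = os.path.abspath(path)
    # Drop a Windows drive prefix such as "C:" (empty on macOS/Linux).
    _, rest = os.path.splitdrive(absolute)
    return any(":" in part for part in rest.split(os.sep))


def stage_without_colon(path):
    """Return a colon-free location for `path`, copying the file if needed.

    Hypothetical helper: if the path is already safe it is returned
    unchanged; otherwise the file is copied into a fresh temporary directory.
    """
    if not has_colon_component(path):
        return path
    staging_dir = tempfile.mkdtemp(prefix="spark_input_")
    # Also sanitise the file name itself, in case the colon is there.
    safe_name = os.path.basename(path).replace(":", "_")
    destination = os.path.join(staging_dir, safe_name)
    shutil.copy(path, destination)
    return destination
```

With this in place, the calls in the question would become something like `sc.textFile(stage_without_colon("./Marvel/Marvel-names.txt"))`. Copying is a blunt workaround, so for large datasets moving or renaming the folder is still the simpler fix.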

While solving this I also ran into another issue: after updating PySpark from the terminal, the Anaconda environment wouldn't launch. I updated Python with the following command and it worked like a charm again:

conda update python -yn root

(I know this last issue is off-topic, but hey, I hope it helps someone someday.)
