
I am trying to use Spark to read data stored in a very large table (181,843,820 rows and 50 columns) which is my training set. However, when I load it with spark.table(), the resulting DataFrame's count() is different from the row count of the table itself. I am currently using PyCharm.

I want to preprocess the data in the table before I can use it further as a training set for the model I need to train. When loading the table, I found that the DataFrame I'm loading it into is much smaller than the table (about 10% of the data in this case).

What I have tried:

  1. Raised the spark.kryoserializer.buffer.max capacity (roughly as in the sketch after this list).
  2. Loaded a smaller table (70k rows) into a DataFrame and found no difference between the count() outputs.
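
For reference, this is roughly how I raised the buffer; the app name and the "512m" value are just placeholders, not necessarily what the cluster needs:

from pyspark.sql import SparkSession

# Rebuild the session with a larger Kryo buffer.
# "512m" is only an example value; the app name is a placeholder.
spark = (
    SparkSession.builder
    .appName("training-set-preprocessing")
    .config("spark.kryoserializer.buffer.max", "512m")
    .getOrCreate()
)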

This sample is very similar to the code I ran to investigate the problem:

df = spark.table('myTable')
print(spark.table('myTable').count()) # output: 181,843,820
print(df.count()) # output: 18,261,961

I expect both outputs to be the same (the original ~181M rows), yet they are not, and I don't understand why.
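
What I could still check (a sketch, assuming myTable is a regular table registered in the catalog) is whether refreshing the table metadata and comparing the plans of the two reads shows any difference:

# Invalidate any cached metadata / file listings for the table.
spark.catalog.refreshTable('myTable')

df = spark.table('myTable')
df.explain()                       # physical plan of the DataFrame I keep
spark.table('myTable').explain()   # physical plan of a fresh read

print(df.count())
print(spark.table('myTable').count())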

  • Try to write them both down and compare the output. Commented Jul 9, 2019 at 14:56
  • I can't; the best I can do is run a SELECT query on Hive, but then again, I can't compare them via Hive. I also can't print the output, since I can't get the full content of the original table (it's too big, and that's also part of my question: I can't get it into a DataFrame). (A minimal sketch of that SQL count is below the comments.) Commented Jul 9, 2019 at 18:25
  • Don't know if this will solve your problem, but this Stack Overflow question looks like it may help: stackoverflow.com/questions/48639592/… Commented Jul 10, 2019 at 13:46
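
For reference, a count comparison that stays inside Spark (a minimal sketch using spark.sql, so nothing has to be written out or printed in full) could look like this:

# Count through SQL on the table, then through the DataFrame, and compare.
hive_count = spark.sql('SELECT COUNT(*) FROM myTable').collect()[0][0]
df_count = spark.table('myTable').count()
print(hive_count, df_count)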
