
I am trying to use Spark to read data stored in a very large table (181,843,820 rows and 50 columns) which is my training set. However, when I load it with spark.table(), the resulting DataFrame's count() is different from the row count of the table itself. I am currently using PyCharm.

I want to preprocess the data in the table before I can use it further as a training set for the model I need to train. When loading the table, I found that the DataFrame I'm loading it into is much smaller than the table (about 10% of the data in this case).

What I have tried:

  1. Raised the spark.kryoserializer.buffer.max capacity (roughly as in the sketch after this list).
  2. Loaded a smaller table (70k rows) into a DataFrame and found no difference between the count() outputs.
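
For reference, this is roughly how I raised the buffer; the app name and the "512m" value are just placeholders, not necessarily what the cluster needs:

from pyspark.sql import SparkSession

# Rebuild the session with a larger Kryo buffer.
# "512m" is only an example value; the app name is a placeholder.
spark = (
    SparkSession.builder
    .appName("training-set-preprocessing")
    .config("spark.kryoserializer.buffer.max", "512m")
    .getOrCreate()
)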

This sample is very similar to the code I ran to investigate the problem:

df = spark.table('myTable')
print(spark.table('myTable').count()) # output: 181,843,820
print(df.count()) # output: 18,261,961

I expect both outputs to be the same (the original ~181M rows), yet they are not, and I don't understand why.
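
What I could still check (a sketch, assuming myTable is a regular table registered in the catalog) is whether refreshing the table metadata and comparing the plans of the two reads shows any difference:

# Invalidate any cached metadata / file listings for the table.
spark.catalog.refreshTable('myTable')

df = spark.table('myTable')
df.explain()                       # physical plan of the DataFrame I keep
spark.table('myTable').explain()   # physical plan of a fresh read

print(df.count())
print(spark.table('myTable').count())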

  • Try to write them both down and compare the output. Commented Jul 9, 2019 at 14:56
  • I can't; the best I can do is run a SELECT query on Hive, but then again, I can't compare them via Hive. I also can't print the output, since I can't get the full content of the original table (it's too big, and that's also part of my question: I can't get it into a DataFrame). (A minimal sketch of that SQL count is below the comments.) Commented Jul 9, 2019 at 18:25
  • Don't know if this will solve your problem, but this Stack Overflow question looks like it may help: stackoverflow.com/questions/48639592/… Commented Jul 10, 2019 at 13:46
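
For reference, a count comparison that stays inside Spark (a minimal sketch using spark.sql, so nothing has to be written out or printed in full) could look like this:

# Count through SQL on the table, then through the DataFrame, and compare.
hive_count = spark.sql('SELECT COUNT(*) FROM myTable').collect()[0][0]
df_count = spark.table('myTable').count()
print(hive_count, df_count)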
