I am trying to use Spark to read data stored in a very large table (181,843,820 rows and 50 columns), which is my training set. However, when I load it with spark.table() I noticed that the row count of the table is different from the row count returned by the DataFrame's count(). I am currently using PyCharm.
I want to preprocess the data in the table before I can use it as a training set for a model I need to train. When loading the table I found that the DataFrame I'm loading it into is much smaller than the table itself (about 10% of the data in this case).
What I have tried:
- Raised the spark.kryoserializer.buffer.max capacity.
- Loaded a smaller table (70k rows) into a DataFrame and found no difference between the count() outputs.
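For reference, this is roughly how I raised the buffer limit when building the session (the app name and the 512m value are just placeholders, not my actual settings):

from pyspark.sql import SparkSession

# rough sketch of the session setup; the app name and buffer size are placeholders
spark = (SparkSession.builder
         .appName('myApp')
         .config('spark.kryoserializer.buffer.max', '512m')
         .getOrCreate())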
This sample is very similar to the code I ran in order to investigate the problem:
df = spark.table('myTable')
print(spark.table('myTable').count()) # output: 181,843,820
print(df.count()) # output: 18,261,961
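A slightly expanded version of the same check, which also prints the difference between the two counts and the DataFrame's partition count (a sketch; only the myTable name from above is assumed):

df = spark.table('myTable')
fresh_count = spark.table('myTable').count()  # count from a freshly resolved read of the table
df_count = df.count()                         # count through the DataFrame reference
print(fresh_count, df_count, fresh_count - df_count)
print(df.rdd.getNumPartitions())              # number of partitions the DataFrame scans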
I expect both outputs to be the same (the original ~181M rows), yet they are not, and I don't understand why.