PySpark SQL count returns a different number of rows than pure SQL

Question

I've started using PySpark in one of my projects. While testing different commands to explore the library, I found something I don't understand.

Take this code:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame

# Reuse the running SparkContext if there is one, otherwise create it
sc = SparkContext.getOrCreate()
hc = HiveContext(sc)

hc.sql("use test_schema")
hc.table("diamonds").count()

The last count() operation returns 53941 records. If I instead run select count(*) from diamonds in Hive, I get 53940.

Is that pyspark count including the header?

I've tried to look into:

df = hc.sql("select * from diamonds").collect()
df[0]
df[1]

to see whether the header was included:

df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)

The 0th element doesn't look like the header.

Does anyone have an explanation for this?

Thanks! Ale

What results do you get when you use hc.read.table(...).count()?
– Bala, Commented Feb 6, 2018 at 9:34
I tested it with 1.6 and all of them return the same count. What version are you using? Create two DataFrames that give different counts, subtract one from the other to get the differing row, and let us know.
– Bala, Commented Feb 6, 2018 at 9:40
I've tried different combinations of count operations: hc.sql("select count(*) from diamonds").show(), hc.read.table().count(), and hc.table(...).count(), all returning 53491. Spark version is 2.1.0, pyspark 2.1.1.
– AlessioG, Commented Feb 6, 2018 at 9:47

savagedata · Accepted Answer · 2018-02-06 18:56:42Z

Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:

SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;

Alternatively, refresh the statistics. If your table is not partitioned:

ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;

If it is partitioned:

ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;
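
To make the stale-statistics idea concrete, here is a toy Python model. It is not Hive code, and the ToyTable class and its methods are made up for illustration. It keeps a stored row count that only changes when statistics are recomputed, so a count answered from it can lag behind the actual data:

```python
class ToyTable:
    """Toy stand-in for a table with cached statistics (hypothetical, not a Hive API)."""

    def __init__(self, rows):
        self.rows = list(rows)
        # Statistics captured when the table is created.
        self.stats_row_count = len(self.rows)

    def append(self, row):
        # New data arrives, but the stored statistics are NOT refreshed.
        self.rows.append(row)

    def count(self, use_stats=True):
        # use_stats=True mimics answering COUNT(*) from stored statistics;
        # use_stats=False mimics scanning the data.
        return self.stats_row_count if use_stats else len(self.rows)

    def analyze(self):
        # Mimics ANALYZE TABLE ... COMPUTE STATISTICS.
        self.stats_row_count = len(self.rows)


table = ToyTable(range(3))
table.append(99)

stale = table.count(use_stats=True)     # answered from old statistics
scanned = table.count(use_stats=False)  # answered by scanning the rows
table.analyze()
refreshed = table.count(use_stats=True)

print(stale, scanned, refreshed)
```

In this model the stored count and the scanned count disagree until analyze() is called, which mirrors the two fixes above: bypass the statistics, or refresh them.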

Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row.
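
One way to see why a header line would show up like that: when a text file's header is read as data, numeric columns cannot parse the header text and come back as NULL, while string columns keep it. The plain-Python sketch below imitates this with a three-column sample. The parse_row helper, the schema and the sample data are assumptions made up for illustration and are not part of Spark or Hive.

```python
import csv
import io

# Hypothetical sample: a header line followed by one data line.
raw = "carat,cut,price\n0.23,Ideal,326\n"

# Assumed column types, loosely modelled on the table in the question.
schema = [("carat", float), ("cut", str), ("price", int)]


def parse_row(fields):
    """Cast each field to its column type; unparseable values become None."""
    row = {}
    for (name, cast), value in zip(schema, fields):
        try:
            row[name] = cast(value)
        except ValueError:
            row[name] = None
    return row


rows = [parse_row(fields) for fields in csv.reader(io.StringIO(raw))]

print(rows[0])  # header line read as data
print(rows[1])  # first real record
print(len(rows))
```

The first parsed row has the same shape as df[0] in the question: None in the numeric columns and the column name in the string column, and it adds one to the row count.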

AlessioG Over a year ago

@savagedata you were right, that helps me understand why Hive miscounts the number of rows. Thanks!
