I've started using pyspark in one of my projects. I was testing different commands to explore functionalities of the library and I found something that I don't understand.

Take this code:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame

# Reuse the running context if there is one (e.g. in the pyspark shell)
sc = SparkContext.getOrCreate()
hc = HiveContext(sc)

hc.sql("use test_schema")
hc.table("diamonds").count()

The last count() operation returns 53941 records. If I instead run select count(*) from diamonds in Hive, I get 53940.

Is the pyspark count including the header?

I tried looking at the first two rows:

df = hc.sql("select * from diamonds").collect()
df[0]
df[1]

to see whether the header was included:

df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)

The 0th element doesn't look like the header.

Does anyone have an explanation for this?

Thanks! Ale

  • What results do you get when you use hc.read.table(...).count()? Commented Feb 6, 2018 at 9:34
  • @Bala when I run that I get 53941 Commented Feb 6, 2018 at 9:35
  • I tested it with 1.6 and all of them return the same count. What version are you using? Create two DataFrames that give you different counts and subtract one from the other to get the differing row, then let us know. Commented Feb 6, 2018 at 9:40
  • I've tried different combinations of count operations: hc.sql("select count(*) from diamonds").show(), hc.read.table().count(), and hc.table(...).count(), all returning 53941. The Spark version is 2.1.0, pyspark 2.1.1. Commented Feb 6, 2018 at 9:47

1 Answer

Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:

SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;

Alternatively, refresh the statistics. If your table is not partitioned:

ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;

If it is partitioned:

ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;

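Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row: the string columns hold their own column names ('cut', 'color', 'clarity'), and the numeric columns are None, which is what you would expect when header text fails to cast to a number.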
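To illustrate why a header line could show up as that df[0] row, here is a minimal pure-Python sketch. It assumes the table is backed by a delimited text file whose first line is the header, and that a failed numeric cast yields a null; neither is confirmed by the question. The SCHEMA mapping and the helper names (cast_cell, read_typed, looks_like_header) are hypothetical, with column types guessed from the Row output above.

```python
import csv
import io

# Hypothetical column types, guessed from the Row printed in the question.
SCHEMA = {
    "carat": float, "cut": str, "color": str, "clarity": str, "depth": float,
    "table": int, "price": int, "x": float, "y": float, "z": float,
}


def cast_cell(value, target):
    """Cast one text cell to its column type; return None if the cast fails."""
    try:
        return target(value)
    except ValueError:
        return None


def read_typed(text):
    """Parse delimited text with no header handling, so a header line becomes a data row."""
    names = list(SCHEMA)
    return [
        {name: cast_cell(cell, SCHEMA[name]) for name, cell in zip(names, cells)}
        for cells in csv.reader(io.StringIO(text))
    ]


def looks_like_header(row):
    """True when every non-null value in the row equals its own column name."""
    present = {k: v for k, v in row.items() if v is not None}
    return bool(present) and all(v == k for k, v in present.items())


# Two-line sample: the header plus the first data row from the question.
sample = (
    "carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43\n"
)
rows = read_typed(sample)
print(rows[0])
print([looks_like_header(r) for r in rows])
```

The first parsed row has None in every numeric column and the column name in every string column, which is the same pattern as df[0]. With real data, the dictionary from Row.asDict() could be passed to looks_like_header to check collected rows the same way.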

1 Comment

@savagedata you were right, that helps me understand why Hive miscounts the number of rows. Thanks!
