3 votes
0 answers
56 views

I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector. Details: I have a DataFrame with millions of rows. Writing to Redis works correctly for small ...
gianfranco de siena
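For context, a minimal sketch of a spark-redis write, assuming the connector's `org.apache.spark.sql.redis` source; the Redis host, the `events` key prefix, and the `id` key column are placeholders, not the asker's values:

```python
# Hedged sketch of a spark-redis write; connection settings and names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("redis-write-sketch")
         .config("spark.redis.host", "redis-host")   # placeholder host
         .config("spark.redis.port", "6379")
         .getOrCreate())

(df.write                                   # df: the large DataFrame from the question
   .format("org.apache.spark.sql.redis")
   .option("table", "events")               # key prefix used for each row's hash
   .option("key.column", "id")              # column whose value completes the Redis key
   .mode("append")
   .save())
```

Because each row becomes a hash keyed by the key column, duplicate key values overwrite each other, which is one possible (though unconfirmed here) explanation for rows appearing to go missing at scale.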
0 votes
0 answers
35 views

I'm using a PySpark notebook inside of Azure Synapse. This is my schema definition qcew_schema = StructType([ StructField( 'area_fips', dataType = CharType(5), ...
Vijay Tripathi
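Related note: PySpark readers generally do not accept `CharType`/`VarcharType` in a user-supplied schema, so a common workaround is `StringType`. A minimal sketch; only `area_fips` comes from the question, the remaining field and the path are placeholders:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

qcew_schema = StructType([
    StructField("area_fips", StringType(), True),   # was CharType(5); fixed width not enforced
    StructField("year", IntegerType(), True),       # placeholder field
    # ... remaining fields omitted ...
])

df = (spark.read
      .option("header", True)
      .schema(qcew_schema)
      .csv("abfss://container@account.dfs.core.windows.net/qcew/"))  # placeholder path
```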
1 vote
0 answers
51 views

I am running a data ingestion ETL pipeline orchestrated by Airflow using PySpark to read data from MongoDB (using the MongoDB Spark Connector) and load it into a Delta Lake table. The pipeline is ...
Tavakoli • 1,433
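A minimal sketch of the read/write path, assuming MongoDB Spark Connector v10-style options (older versions use the `mongo` source and `spark.mongodb.input.uri` instead); the URI, database, collection, and target table are placeholders:

```python
# Read from MongoDB with the v10 connector, then append into a Delta table.
mongo_df = (spark.read
            .format("mongodb")
            .option("connection.uri", "mongodb://user:pass@host:27017")  # placeholder
            .option("database", "source_db")
            .option("collection", "events")
            .load())

(mongo_df.write
 .format("delta")
 .mode("append")
 .saveAsTable("bronze.events"))   # placeholder Delta table
```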
0 votes
0 answers
15 views

I have an application using EKS in AWS that runs a spark session that can run multiple workloads. In each workload, I need to access data from S3 in another AWS account, for which I have STS ...
md12345
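A hedged sketch of one way to scope cross-account S3 access per bucket with hadoop-aws, letting the assumed-role credential provider refresh STS credentials itself rather than injecting them manually; the bucket name and role ARN are placeholders, and per-bucket overrides are assumed to be supported by the image's hadoop-aws version:

```python
# Configure the S3A assumed-role provider for one bucket (placeholders throughout).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
bucket = "other-account-bucket"

hconf.set(f"fs.s3a.bucket.{bucket}.aws.credentials.provider",
          "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
hconf.set(f"fs.s3a.bucket.{bucket}.assumed.role.arn",
          "arn:aws:iam::123456789012:role/cross-account-read")

df = spark.read.parquet(f"s3a://{bucket}/path/to/data/")
```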
Advice
0 votes
2 replies
65 views

I am trying to convert DataStage code into PySpark. In the existing DataStage code, the Standardize stage is used to standardize US Address, US Area, and US Name. I want to replicate the same logic in ...
SK ASIF ALI
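There is no built-in PySpark equivalent of the DataStage Standardize stage; one commonly suggested substitute (an assumption, not the asker's approach) is a Python UDF around an address-parsing library such as `usaddress`, sketched below with a placeholder column name:

```python
import usaddress
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

@F.udf(MapType(StringType(), StringType()))
def standardize_us_address(raw):
    """Parse a raw US address into labelled components (rough stand-in for Standardize)."""
    if raw is None:
        return None
    try:
        tagged, _ = usaddress.tag(raw)     # OrderedDict of component -> value
        return dict(tagged)
    except usaddress.RepeatedLabelError:
        return None                        # unparseable address

df = df.withColumn("addr_components", standardize_us_address("us_address"))
```

US Name and US Area standardization would need separate rule sets or libraries; this only covers the address piece.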
0 votes
1 answer
61 views

# ===================================================== # 🧊 Step 4. Write Data to Iceberg Table (Glue Catalog) # ===================================================== table_name = "glue_catalog....
Mohammed Suhail
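Since the table name in the snippet is truncated, the sketch below uses a placeholder `glue_catalog.analytics.events`; it assumes the session is already configured with an Iceberg `SparkCatalog` named `glue_catalog` backed by the AWS Glue catalog:

```python
table_name = "glue_catalog.analytics.events"   # placeholder; original name is truncated

# First load: create the Iceberg table through the Glue-backed catalog
# df.writeTo(table_name).using("iceberg").createOrReplace()

# Subsequent loads: append to the existing table
df.writeTo(table_name).append()
```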
3 votes
1 answer
146 views

In my Databricks cluster I'm trying to write a DataFrame to my table with the following code: df.write.jdbc(url=JDBCURL, table=table_name, mode="append") And this line fails with ...
Be Chiller Too
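A frequent cause of failures on that line is a missing JDBC driver class or credentials, so here is a hedged sketch passing them explicitly; the driver and credentials are placeholders, since the actual error text is not shown in the excerpt:

```python
connection_properties = {
    "user": "db_user",                                         # placeholder credentials
    "password": "db_password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",  # assumed target database
}

df.write.jdbc(url=JDBCURL,
              table=table_name,
              mode="append",
              properties=connection_properties)
```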
2 votes
0 answers
98 views

I've got a multiline CSV file which is about 150GB and I've been trying to load it using the usual code e.g. df = spark.read.format('csv').option('header', True).option('multiLine', True).load('path/...
rocket porg
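One relevant detail: with `multiLine`, a CSV file is not splittable, so a single 150 GB file ends up parsed by one task; supplying the schema at least avoids an extra inference pass over the data. A sketch with placeholder columns:

```python
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([                        # placeholder columns
    StructField("col1", StringType()),
    StructField("col2", StringType()),
])

df = (spark.read.format("csv")
      .option("header", True)
      .option("multiLine", True)
      .option("escape", '"')                 # often needed with quoted embedded newlines
      .schema(schema)
      .load("path/to/file.csv"))             # placeholder path
```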
0 votes
0 answers
85 views

I’m trying to load JSON data into an Iceberg table. The source files are named with timestamps that include colons (:), so I need to read them as plain text first. Additionally, each file is in a ...
Raj Mhatre
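A minimal sketch of the read-as-text-then-parse approach; the JSON fields, the landing path, and the Iceberg table name are placeholders, because the question truncates before describing the file layout:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([                   # placeholder fields
    StructField("id", StringType()),
    StructField("payload", StringType()),
])

raw = spark.read.text("s3://bucket/landing/")            # placeholder path
parsed = (raw
          .withColumn("rec", F.from_json("value", json_schema))
          .select("rec.*"))

parsed.writeTo("glue_catalog.db.events").append()        # placeholder Iceberg table
```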
0 votes
0 answers
66 views

I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior). When I ...
shiva • 2,781
1 vote
1 answer
82 views

Hi, I'm trying to implement a state processor for my custom logic. Ideally, we are streaming and I want custom logic that calculates packet loss from the previous row. I implemented the state processor ...
Pranav ramachandran
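Depending on the runtime, the newer StatefulProcessor API may look different, but a widely available way to carry the previous row's value per key in Structured Streaming is `applyInPandasWithState`. A hedged sketch in which the column names (`device_id`, `event_time`, `packets`) and the loss formula are assumptions:

```python
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout
from pyspark.sql.types import StructType, StructField, StringType, LongType

output_schema = StructType([StructField("device_id", StringType()),
                            StructField("packet_loss", LongType())])
state_schema = StructType([StructField("prev_packets", LongType())])

def compute_loss(key, pdfs, state: GroupState):
    """Carry the last packet count per device across micro-batches and emit deltas."""
    prev = state.get[0] if state.exists else None
    losses = []
    for pdf in pdfs:
        for packets in pdf.sort_values("event_time")["packets"]:
            if prev is not None:
                losses.append(int(packets) - prev)   # placeholder definition of "loss"
            prev = int(packets)
    if prev is not None:
        state.update((prev,))
    yield pd.DataFrame({"device_id": [key[0]] * len(losses), "packet_loss": losses})

# stream_df: the streaming DataFrame from the question
result = (stream_df.groupBy("device_id")
          .applyInPandasWithState(compute_loss, output_schema, state_schema,
                                  "append", GroupStateTimeout.NoTimeout))
```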
1 vote
1 answer
247 views

What could be a cause of the following error of my code in a Databricks notebook, and how can we fix the error? ImportError: cannot import name 'pipelines' from 'pyspark' (/databricks/python/lib/...
nam • 24.2k
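`pyspark.pipelines` only ships with runtimes that include Spark Declarative Pipelines; on other Databricks runtimes the Delta Live Tables interface is imported as `dlt` instead, and it only runs inside a pipeline, not a plain notebook. A hedged sketch with a placeholder table and source:

```python
import dlt                                  # available inside a DLT / Lakeflow pipeline

@dlt.table(name="bronze_trips")             # placeholder table name
def bronze_trips():
    return spark.read.table("samples.nyctaxi.trips")   # placeholder source
```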
0 votes
0 answers
44 views

We have a scenario where we need to read a VSAM file directly, along with a copybook to understand the column lengths, and we were using the COBRIX library as part of the Spark read. However, we find it is not properly ...
Rocky1989 • 409
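For reference, the basic Cobrix read shape (the za.co.absa.cobrix package must be attached to the cluster; the paths are placeholders, and no claim is made here about why the asker's columns come out wrong):

```python
df = (spark.read
      .format("cobol")
      .option("copybook", "/path/to/layout.cpy")   # copybook defining field lengths
      .load("/path/to/vsam_export"))               # placeholder data path

df.printSchema()   # verify the lengths Cobrix derived from the copybook
```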
0 votes
0 answers
45 views

Why does the following code produce the desired dataframe without issue? data = [ ("James,,Smith", ["Java","Scala","C++"], ["Spark","Java"], ...
Billy Pilgrim
2 votes
2 answers
102 views

As part of a function I create df1 and df2, and I aim to stack them and output the results. But the results do not display within the function, nor if I output the results and display them afterwards. results = ...
platyfish800
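A minimal sketch of stacking two frames and forcing output from inside a function; the auto-display that notebooks apply to the last expression of a cell does not fire for a bare expression inside a function body, so an explicit `show()` (or `display()` on Databricks) is needed:

```python
def build_results(df1, df2):
    results = df1.unionByName(df2)     # stack the rows by column name
    results.show(truncate=False)       # explicit action; `results` alone prints nothing here
    return results

combined = build_results(df1, df2)
combined.show()                        # or display(combined) in a Databricks notebook
```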
0 votes
0 answers
74 views

I have small files in .csv.gz compressed format in a GCS bucket; I have mounted it and created external volumes on top of it in Databricks (Unity Catalog enabled). So when I try to read a file with ...
Tony • 311
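For reference, Spark decompresses `.csv.gz` transparently as long as the extension is preserved, and Unity Catalog volumes are addressed with a `/Volumes/...` path; a sketch with a placeholder path:

```python
df = (spark.read
      .option("header", True)
      .csv("/Volumes/my_catalog/my_schema/my_volume/data/file.csv.gz"))  # placeholder path

df.show(5)
```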
0 votes
0 answers
94 views

I am relatively new to Spark Streaming but quite experienced with normal batch processing. I grab data from Event Hubs in Azure using the Kafka connector. Cluster: Standard_DS3_v2 with 16 GB RAM, 4 cores. It ...
Samuel Demir
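A hedged sketch of the Event Hubs Kafka-endpoint read that this setup usually implies; the namespace, event hub name, and connection string are placeholders, and the SASL username is literally `$ConnectionString`:

```python
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."   # placeholder

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
             .option("subscribe", "<event-hub-name>")
             .option("kafka.security.protocol", "SASL_SSL")
             .option("kafka.sasl.mechanism", "PLAIN")
             .option("kafka.sasl.jaas.config",
                     'org.apache.kafka.common.security.plain.PlainLoginModule required '
                     f'username="$ConnectionString" password="{connection_string}";')
             .option("startingOffsets", "latest")
             .load())
```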
0 votes
0 answers
99 views

I keep running into this issue when running PySpark. I was able to connect to my database and retrieve data, but whenever I try to do operations like .show() or .count(), or when I try to save a Spark ...
Siva Indukuri
0 votes
2 answers
79 views

I have created this Docker Compose file: # Command: docker stack deploy streaming-stack --compose-file docker/spark-kstreams-stack.yml # Gary A. Stafford (2022-09-14) # Updated: 2022-12-28 version: ...
Vasileios Anagnostopoulos
0 votes
1 answer
102 views

I am new to Python and PySpark. I'm trying to run it on Windows Server 2022. I have environment variables HADOOP_HOME=C:\spark\hadoop JAVA_HOME=C:\Program Files\Microsoft\jdk-17.0.16.8-hotspot ...
EdH • 621
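A minimal local-mode sanity check using the paths from the question; on Windows, `winutils.exe` is normally expected under `%HADOOP_HOME%\bin` as well (that requirement is an assumption about this setup, since the rest of the question is truncated):

```python
import os

os.environ["HADOOP_HOME"] = r"C:\spark\hadoop"
os.environ["JAVA_HOME"] = r"C:\Program Files\Microsoft\jdk-17.0.16.8-hotspot"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("win-sanity-check").getOrCreate()
spark.range(5).show()
```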
1 vote
1 answer
67 views

I want to parse a JSON request and create multiple columns out of it in pyspark as follows: { "ID": "abc123", "device": "mobile", "Ads": [ { ...
Codegator • 659
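A sketch of the usual `from_json` plus `explode` pattern for this shape; the fields inside each `Ads` element and the source column name are placeholders because the snippet is truncated:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

ad_schema = StructType([StructField("ad_id", StringType())])        # placeholder fields
request_schema = StructType([
    StructField("ID", StringType()),
    StructField("device", StringType()),
    StructField("Ads", ArrayType(ad_schema)),
])

parsed = (df.withColumn("req", F.from_json("json_payload", request_schema))  # assumed column
            .select("req.ID", "req.device", F.explode("req.Ads").alias("ad"))
            .select("ID", "device", "ad.*"))
```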
0 votes
1 answer
113 views

Given an arbitrary pyspark.sql.column.Column object (or, similarly, a pyspark.sql.connect.column.Column object), is there a way to get a datatype back -- either as a DDL string or pyspark.sql.types....
Philip Kahn
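One caveat worth noting: a `Column` has no resolved type on its own; it only acquires one when analyzed against a DataFrame, so the usual trick is to select it and read the resulting schema. A sketch with placeholder names:

```python
from pyspark.sql import functions as F

expr = F.col("amount") * 2                      # arbitrary Column (placeholder)

dtype = df.select(expr).schema[0].dataType      # e.g. DoubleType()
ddl = dtype.simpleString()                      # e.g. 'double'
```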
0 votes
0 answers
77 views

I use Spark + Hudi to write data into S3. I was writing data in bulk_insert mode, which caused there to be many small parquet files in the Hudi table. Then I tried to schedule clustering on the Hudi table: ...
Rinze • 834
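A hedged sketch of enabling inline clustering on the write path so the small bulk_insert files get rewritten; the option names come from Hudi's clustering configuration, while the table name, path, and size thresholds are placeholders:

```python
hudi_options = {
    "hoodie.table.name": "my_hudi_table",                                   # placeholder
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.small.file.limit": "104857600",        # cluster files < ~100 MB
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1073741824",  # aim for ~1 GB files
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://bucket/hudi/my_hudi_table"))   # placeholder path
```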
2 votes
1 answer
148 views

Persist() is helpful when the same dataframe is used repeatedly in the code. But what about cases where transformations are stacked on top of each other? a = spark.createDataFrame(data) trigger action on a ...
Yuji Reda • 151
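A short sketch of the distinction: persisting pays off when one DataFrame feeds more than one downstream action, whereas a purely linear chain gains little from it. The filter/groupBy transformations below are illustrative placeholders:

```python
a = spark.createDataFrame(data)        # `data` as in the question
a = a.persist()                        # cache the shared ancestor once

b = a.filter("value > 0")              # placeholder transformations
c = a.groupBy("key").count()

b.count()                              # both actions reuse the cached `a`
c.count()
a.unpersist()
```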
0 votes
0 answers
51 views

I have a PySpark job that ingests data into a Delta table originally partitioned by year, month, day, and hour. The job takes 2 hours to complete. The job runs daily, ingesting the previous day's full data. ...
steve • 305
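Since the job reloads a full previous day each run, one common pattern (an assumption about the intended fix, not the asker's code) is to overwrite only that day's partitions with Delta's `replaceWhere` option; the column values and table name below are placeholders:

```python
(df.write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", "year = 2024 AND month = 1 AND day = 15")   # placeholder values
   .saveAsTable("ingest.events"))                                      # placeholder table
```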
