PySpark StackOverflowError - Union Multiple Dataframes

Problem

I was generating test data from an existing dataset using PySpark. My approach was:

1. Load the existing data into a dataframe.
2. Apply some random data manipulation, such as changing each timestamp to a random timestamp.
3. Repeat step 2 1,000 times.
4. Use union to join the resulting dataframes together.

This is the code:

As a result, I received the Java StackOverflowError below:

Solution

This is a very old-school error message that I haven't seen for a long time, so the first feeli…

Spark On Yarn Error - Failed to send RPC

Problem

I am new to the big data world, and I am trying to build a Hadoop cluster using Docker. The Spark shell did not work and failed with the error message below:

Diagnosis

It looked like Spark could not connect to the IP address, so I started by testing the connection from Spark to that address. The ping went through without problems. Then I looked at YARN to see whether I could find any logs there, and discovered the error messages below:

It looks obvious that the job containers were killed because the …