SparkException – java.io.NotSerializableException: java.io.PrintStream

When I tried to parse a CSV dataset with Apache Spark, I decided to go through it and print every record to the console using System.out.println() inside the foreach method of a Dataset object.
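For context, here is a minimal sketch of the kind of foreach call that produces exactly this pair of exceptions. The dataset variable is assumed to be a Dataset<Row> loaded as shown later in this post; the method-reference form System.out::println is one common trigger, because it binds the PrintStream instance into the closure:

    import org.apache.spark.api.java.function.ForeachFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // foreach runs on the executors, so Spark serializes the closure.
    // The method reference below captures System.out (a PrintStream),
    // and PrintStream is not Serializable - hence the exception.
    dataset.foreach((ForeachFunction<Row>) System.out::println);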

But all I got were exceptions:

org.apache.spark.SparkException: Task not serializable

and

java.io.NotSerializableException: java.io.PrintStream

Fixing NotSerializableException: java.io.PrintStream in Spark

This simply means that Spark tried to serialize the System.out object, which is of type java.io.PrintStream and does not implement the Serializable interface. The function you pass to foreach is shipped to the executors, so everything it captures must be serializable.
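As an aside, a lambda that merely calls System.out.println() reads the static System.out field at call time on the executor instead of capturing the stream, so it usually serializes without trouble. This is an assumption worth verifying on your Spark version; a minimal sketch:

    // The lambda captures nothing: System.out is looked up as a static
    // field when the lambda runs on the executor, so no PrintStream
    // instance needs to be serialized with the task.
    dataset.foreach((ForeachFunction<Row>) row -> System.out.println(row));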

What you can do in such a case is use Spark SQL instead. Below is my code sample for viewing the records in the dataset.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    // Create a local Spark session.
    SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark View Dataset Records")
                .master("local")
                .getOrCreate();

    // Define the schema for the two CSV columns.
    StructType schema = new StructType()
                .add("Id", "long")
                .add("ProductId", "string");

    // Load the CSV file (FILE_PATH points to the dataset on disk).
    Dataset<Row> dataset = spark.read()
                .schema(schema)
                .csv(FILE_PATH);

    // Register the dataset as a temporary view and query it with SQL.
    dataset.createOrReplaceTempView("review");
    Dataset<Row> sqlResult = spark.sql("select * from review");

    sqlResult.show();

As you can see, I used the “good old” SELECT * FROM table_name query to see the result.

NOTE: sqlResult.show() prints only the first 20 rows by default. But it is still good for checking the correctness of your dataset and schema.
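If you need more rows, show() also accepts an explicit row count, and an overload takes a second boolean flag that controls whether long values are truncated:

    // Show up to 100 rows without truncating long column values.
    sqlResult.show(100, false);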
