When I tried to parse a CSV dataset using Apache Spark, I decided to go through it and print every record to the console by calling System.out.println() inside the foreach method of a Dataset object.
But all I got was an exception:
org.apache.spark.SparkException: Task not serializable
Fixing NotSerializableException: java.io.PrintStream in Spark
This means that Spark tried to serialize the System.out object, which is of type java.io.PrintStream and does not implement the Serializable interface.
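For reference, here is a minimal, self-contained sketch (not the original code) of one way this failure can arise: capturing a PrintStream reference inside the foreach closure forces Spark to serialize it along with the task. The class name and the in-memory dataset built with spark.range are placeholders for illustration.

```java
import java.io.PrintStream;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SerializationFailureSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Serialization Failure Sketch")
                .master("local")
                .getOrCreate();

        // A tiny in-memory dataset stands in for the CSV data.
        Dataset<Row> dataset = spark.range(5).toDF();

        // Capturing a PrintStream in the closure makes Spark try to
        // serialize it, and PrintStream is not Serializable.
        PrintStream out = System.out;
        try {
            dataset.foreach((ForeachFunction<Row>) row -> out.println(row));
        } catch (Exception e) {
            System.out.println("Caught: " + e.getClass().getName());
        }

        spark.stop();
    }
}
```

When run, the job fails before any row is printed, because Spark checks that the closure is serializable when it submits the task.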
What you can do in such a case is use Spark SQL instead. Below is a code sample that views the records in the dataset.
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark View Dataset Records")
    .master("local")
    .getOrCreate();

StructType schema = new StructType()
    .add("Id", "long")
    .add("ProductId", "string");

Dataset<Row> dataset = spark.read()
    .schema(schema)
    .csv(FILE_PATH);

dataset.createOrReplaceTempView("review");

Dataset<Row> sqlResult = spark.sql("select * from review");
sqlResult.show();
As you can see, I used the “good old” SELECT * FROM table_name query to see the result.
NOTE: sqlResult.show() prints only the first 20 rows by default. But it is still good enough for checking that your dataset and schema are correct.
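If 20 rows is not enough, show() also accepts an explicit row count and a truncation flag. A small self-contained sketch, using spark.range as a stand-in for the CSV-backed view (the class name and data are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShowMoreRows {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Show More Rows")
                .master("local")
                .getOrCreate();

        // spark.range stands in for the CSV-backed "review" view.
        Dataset<Row> rows = spark.range(50).toDF("Id");
        rows.createOrReplaceTempView("review");

        Dataset<Row> sqlResult = spark.sql("select * from review");
        sqlResult.show();          // prints at most 20 rows
        sqlResult.show(50, false); // prints up to 50 rows, without truncating cell values

        spark.stop();
    }
}
```

The second argument (false) disables the 20-character truncation of long column values, which is handy when inspecting free-text fields.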