How to Write a CSV File in PySpark Easily in Azure Databricks

In this article, we will demo how to write a CSV file in PySpark in an Azure Databricks notebook.

In PySpark, we can use the csv() method (dataframeObj.write.csv) of the DataFrameWriter instance (dataframeObj.write) to write data to disk or a file system such as DBFS, Azure Storage, AWS S3, or HDFS.

In this article, we will use PySpark to write CSV files with several options: with or without a header, with a specific delimiter, and with different save modes, and we will verify the data after writing each file.

Now let us see how we can write a CSV file in PySpark:

Table of Contents:
  1. Read Source Data in PySpark: Quick Recap for demo
  2. Understanding CSV Writer method in PySpark
  3. Write PySpark to CSV file
  4. PySpark Write CSV Options
  5. Write Saving modes
How to Write CSV in PySpark Azure Databricks Notebook
Read Source Data in PySpark: Quick Recap for demo:

For this demo, we will read a CSV file from DBFS (Databricks File System), filter the data, and write the result to a new CSV file in a new directory, “CSVWrite”.

RECOMMENDATION:

Read CSV in PySpark: please refer to this link for details on how to read a CSV file in PySpark.

Now let us review the source CSV data that will be filtered before writing to a new CSV file:

Azure Databricks Notebook Read CSV data in PySpark
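
To make this recap concrete, the read step roughly corresponds to the sketch below. The file path and the use of inferSchema are assumptions for illustration; in a Databricks notebook the spark session is already available.

    # Read the source CSV from DBFS (path is an assumption for this demo)
    df = (spark.read
          .option("header", True)        # first row of the source file is the header
          .option("inferSchema", True)   # infer numeric types so we can filter on price
          .csv("dbfs:/FileStore/tables/source_data.csv"))
    df.show()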
Understanding CSV Writer method in PySpark:

Now let us understand what the dataframeObj.write property and the dataframeObj.write.csv() method are, because these are what we need to write a CSV file in PySpark.

DataFrame.write as an instance of DataFrameWriter:

The DataFrame object exposes a write property (dataframeObj.write), which is an instance of the DataFrameWriter class.

This DataFrameWriter instance (dataframeObj.write) provides the csv() method to write data to a CSV file.

Azure Databricks Notebook DataFrameWriter Instance in PySpark.

In the image below, we can confirm that dataframeObj.write is of type DataFrameWriter by using the type() function:

Azure Databricks Notebook: write is DataFrameWriter Type in PySpark
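
A minimal way to check this yourself, assuming df is an existing DataFrame:

    # write is a property that returns a DataFrameWriter instance
    print(type(df.write))   # <class 'pyspark.sql.readwriter.DataFrameWriter'>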
DataFrame.write.csv() method signature:

In the image below, we can see the different input parameters of the csv() method by using Python's help() function.

We will use the path parameter for this demo. It is mandatory and specifies where the content of the DataFrame is saved in CSV format.

Azure Databricks Notebook DataFrameWriter CSV Method Signature in PySpark
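
You can print the same signature in your own notebook with Python's built-in help(), assuming df is an existing DataFrame:

    # Show the csv() method signature and its supported options
    help(df.write.csv)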
Write PySpark to CSV file:

We can write a CSV file in PySpark with or without a header.

PySpark Write to CSV without Header:

To create a CSV file without a header, we will do the following, as shown in the image below:

  • Read the source data from a CSV file in DBFS (Databricks File System); see the first line of code.
  • Next, filter the records based on price.
  • Finally, call the resulting DataFrame's write.csv() method to create a new CSV file in DBFS.

Here we will only provide the path parameter value to create the CSV file; we will not set the “header” option to “true” via the option() function.

Once we execute the command below, it will create a partitioned CSV file under the specified path in DBFS, as shown in the image below.

Azure Databricks Notebook PySpark Write to CSV without Header
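
The command in the image is roughly equivalent to the sketch below. The paths, the price column, and the filter threshold are assumptions for illustration.

    # Read the source data, filter it, and write a CSV without a header
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("dbfs:/FileStore/tables/source_data.csv"))
    filtered_df = df.filter(df.price > 45)
    filtered_df.write.csv("dbfs:/FileStore/tables/CSVWrite/csvwithoutheader")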
Quick verification of written data without header:

As shown below, to verify that our previous PySpark code wrote a proper CSV file, we run the code below with the path of the partitioned CSV file created by PySpark.

Here we copied the full path of the partitioned CSV file created by PySpark from DBFS.

Since we did not provide the header option when creating the file, PySpark generated default column names such as _c0, _c1, and so on, as shown below.

Azure Databricks Notebook verification of written data without header
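
A verification sketch, assuming the output directory used above (you can also point Spark at the full part-file path copied from DBFS):

    # Read the generated CSV back without a header option
    verify_df = spark.read.csv("dbfs:/FileStore/tables/CSVWrite/csvwithoutheader")
    verify_df.show()   # columns appear as _c0, _c1, ... because no header row was written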

If we set the header option to True when reading the generated CSV file, it will use the first data row as the header of the file, as shown in the image below.

Azure Databricks Notebook Read CSV without header as ROW in PySpark
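
The same read with the header option enabled, again assuming the path above, shows the first data row being promoted to column names:

    # header=True on a header-less file mistakenly uses the first data row as column names
    wrong_df = (spark.read
                .option("header", True)
                .csv("dbfs:/FileStore/tables/CSVWrite/csvwithoutheader"))
    wrong_df.show()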

Want to know how to create a PySpark DataFrame the easy and simple way? See: https://sarnendude.com/how-to-create-pyspark-dataframe-easy-and-simple-way/

PySpark Write to CSV with Header:

To write a CSV file with a header in PySpark, we will do the following, as shown in the image below:

  • Read the source data from a CSV file in DBFS (Databricks File System); see the first line of code.
  • Next, filter the records based on price.
  • Finally, call the resulting DataFrame's write.csv() method to create a new CSV file in DBFS.
  • To write the CSV with a header, we set the “header” option to “true” in the option() function so that the CSV file is created with a header row.

Once we execute the command below, it will create a partitioned CSV file under the specified path in DBFS, as shown in the image below.

Azure Databricks Notebook PySpark Write to CSV with Header
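
A sketch of the write-with-header step; the output path and the filtered DataFrame from the earlier step are assumptions for illustration.

    # Write the filtered data as CSV with a header row
    (filtered_df.write
        .option("header", True)
        .csv("dbfs:/FileStore/tables/CSVWrite/csvwithheader"))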
Quick verification of written data with header:

As shown below, to verify that our previous PySpark code wrote a proper CSV file, we run the code below with the path of the partitioned CSV file created by PySpark.

Here we read the CSV file back with the header option enabled.

Since we provided the header option when creating the file, it contains the actual column names from the DataFrame object, as shown below.

Azure Databricks Notebook verification of written data with header
PySpark Write CSV Options:

We can also use PySpark to write a CSV file with a specific delimiter.

Write CSV with specific delimiter option:

In this demo, we will use PySpark to write a pipe-delimited file, as shown in the image below, and we will use the same delimiter to read the generated file.

Here we provide the delimiter value in the option() function along with the header option.

Azure Databricks Notebook Write CSV with delimiter option in PySpark
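
A sketch of the pipe-delimited write; the output path and the filtered DataFrame are assumptions for illustration.

    # Write the filtered data as a pipe-delimited CSV with a header row
    (filtered_df.write
        .option("header", True)
        .option("delimiter", "|")
        .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))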
Quick verification of written data with specific delimiter:

As shown below, to verify that our previous PySpark code wrote a proper CSV file, we run the code below with the path of the partitioned CSV file created by PySpark.

Since we created the file with a pipe (|) delimiter, we must provide the same pipe delimiter as an option when reading the generated partitioned CSV file, as shown below.

When we read the pipe-delimited file with this option, PySpark parses the CSV file correctly.

Note that it displays only two rows because of the filter price > 45. In the next section, we will overwrite this file with new filter logic, price > 50, to get only one row.

Azure Databricks Notebook Read CSV with delimiter in PySpark
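
A verification sketch, assuming the output directory used above:

    # Read the pipe-delimited file back with the matching delimiter
    verify_df = (spark.read
                 .option("header", True)
                 .option("delimiter", "|")
                 .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))
    verify_df.show()   # two rows, per the price > 45 filter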

Now let us read the generated CSV file with the wrong delimiter, for example a comma. PySpark cannot parse the file correctly and instead treats each entire line as a single column, as shown below.

Azure Databricks Notebook Read CSV with wrong delimiter in PySpark
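
A sketch of the wrong-delimiter read, assuming the same path; each line ends up in a single column because the comma never matches:

    # Reading a pipe-delimited file with a comma delimiter collapses each row into one column
    wrong_df = (spark.read
                .option("header", True)
                .option("delimiter", ",")
                .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))
    wrong_df.show(truncate=False)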
PySpark Write Saving modes:
PySpark Write CSV with Overwrite mode:

In PySpark, the mode() method of DataFrameWriter lets us control how the write behaves when the target already exists, using the save modes below.

  • Overwrite: overwrites the data that already exists at the target path.
  • Append: adds the new data to the data that already exists at the target path.
  • Ignore: silently skips the write operation when data already exists at the target path.
  • Error (errorifexists): the default; throws an error when data already exists at the target path.

For this demo, let’s do the following:

  • Overwrite the previously created pipe-delimited CSV file with a new delimiter and overwrite mode, i.e., we will use PySpark to write the CSV in overwrite mode.
  • Filter records with the new logic price > 50 to get only one row, instead of the earlier price > 45, which returned two rows.
  • Provide a comma (,) as the delimiter instead of the previously used pipe (|).

Once we execute the code below, it will overwrite the previously created CSV file with one record.

Azure Databricks Notebook Write CSV with overwrite option in PySpark
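
A sketch of the overwrite step; the path, the price column, and the new filter threshold are assumptions for illustration.

    # Overwrite the earlier output with a comma delimiter and a stricter filter
    (df.filter(df.price > 50)
       .write
       .mode("overwrite")
       .option("header", True)
       .option("delimiter", ",")
       .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))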
Quick verification of overwritten data:

To verify, let’s copy the path of the new file under the same directory (there is only one file under the csvwithdelimiter directory) and run the code below to display the file’s data, as shown below.

It wrote only one record, per the new filter logic price > 50, and we parse it with the new comma delimiter.

Azure Databricks Notebook Read CSV overwrite mode in PySpark
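
A final verification sketch, assuming the same directory; only one row should remain:

    # Read the overwritten output; the default delimiter is already a comma
    verify_df = (spark.read
                 .option("header", True)
                 .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))
    verify_df.show()   # single row, per the price > 50 filter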

So in this article we saw how to write CSV files using PySpark with several options.

More from Azure Synapse Tutorial:

Azure Synapse Intelligent Cache for Apache Spark: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/

Flowlet transformation in Azure Data Factory and Azure Synapse pipeline: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/

Azure Synapse Tutorial: Three In ONE Service: https://sarnendude.com/azure-synapse-tutorial/

More from Azure Data Factory Tutorial:

Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/
