How to Write a CSV File in PySpark Easily in Azure Databricks

In this article, we will demo how to write a CSV file in PySpark in an Azure Databricks notebook.

In PySpark, we can use the csv() method (dataframeObj.write.csv) of the DataFrameWriter instance (dataframeObj.write) to write data to disk or a file system such as DBFS, Azure Storage, AWS S3, or HDFS.

In this article, we will use PySpark to write CSV files with several options: with or without a header, with a specific delimiter, and with different save modes, and we will verify the data after writing each file.

Now let us see how we can write a CSV file in PySpark:

Table of Contents:
  1. Read Source Data in PySpark: Quick Recap for demo
  2. Understanding CSV Writer method in PySpark
  3. Write PySpark to CSV file
  4. PySpark Write CSV Options
  5. Write Saving modes
How to Write CSV in PySpark Azure Databricks Notebook
Read Source Data in PySpark: Quick Recap for demo:

For this demo, we will read a CSV file from DBFS (Databricks File System), filter the data, and write the result to a new CSV file in a new directory, “CSVWrite”.

RECOMMENDATION:

Read CSV in PySpark: please refer to this link for details on how to read a CSV file in PySpark.

Now let us review the source CSV data that will be filtered before writing to a new CSV file:

Azure Databricks Notebook Read CSV data in PySpark
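
To make this recap concrete, the read step roughly corresponds to the sketch below. The file path and the use of inferSchema are assumptions for illustration; in a Databricks notebook the spark session is already available.

    # Read the source CSV from DBFS (path is an assumption for this demo)
    df = (spark.read
          .option("header", True)        # first row of the source file is the header
          .option("inferSchema", True)   # infer numeric types so we can filter on price
          .csv("dbfs:/FileStore/tables/source_data.csv"))
    df.show()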
Understanding CSV Writer method in PySpark:

Now let us understand what the dataframeObj.write property and the dataframeObj.write.csv() method are, because these are what we need to write a CSV file in PySpark.

DataFrame.write as an instance of DataFrameWriter:

The DataFrame object exposes a write property (dataframeObj.write), which is an instance of the DataFrameWriter class.

This DataFrameWriter instance (dataframeObj.write) provides the csv() method to write data to a CSV file.

Azure Databricks Notebook DataFrameWriter Instance in PySpark.

In the image below, we can confirm that dataframeObj.write is of type DataFrameWriter by using the type() function:

Azure Databricks Notebook: write is DataFrameWriter Type in PySpark
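
A minimal way to check this yourself, assuming df is an existing DataFrame:

    # write is a property that returns a DataFrameWriter instance
    print(type(df.write))   # <class 'pyspark.sql.readwriter.DataFrameWriter'>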
DataFrame.write.csv() method signature:

In the image below, we can see the different input parameters of the csv() method by using Python's help() function.

We will use the path parameter for this demo. It is mandatory and specifies where the content of the DataFrame is saved in CSV format.

Azure Databricks Notebook DataFrameWriter CSV Method Signature in PySpark
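
You can print the same signature in your own notebook with Python's built-in help(), assuming df is an existing DataFrame:

    # Show the csv() method signature and its supported options
    help(df.write.csv)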
Write PySpark to CSV file:

We can write a CSV file in PySpark with or without a header.

PySpark Write to CSV without Header:

To create a CSV file without a header, we will do the following, as shown in the image below:

  • Read the source data from a CSV file in DBFS (Databricks File System); see the first line of code.
  • Next, filter the records based on price.
  • Finally, call the resulting DataFrame's write.csv() method to create a new CSV file in DBFS.

Here we will only provide the path parameter value to create the CSV file; we will not set the “header” option to “true” via the option() function.

Once we execute the command below, it will create a partitioned CSV file under the specified path in DBFS, as shown in the image below.

Azure Databricks Notebook PySpark Write to CSV without Header
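
The command in the image is roughly equivalent to the sketch below. The paths, the price column, and the filter threshold are assumptions for illustration.

    # Read the source data, filter it, and write a CSV without a header
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("dbfs:/FileStore/tables/source_data.csv"))
    filtered_df = df.filter(df.price > 45)
    filtered_df.write.csv("dbfs:/FileStore/tables/CSVWrite/csvwithoutheader")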
Quick verification of written data without header:

As shown below, to verify that our previous PySpark code wrote a proper CSV file, we run the code below with the path of the partitioned CSV file created by PySpark.

Here we copied the full path of the partitioned CSV file created by PySpark from DBFS.

Since we did not provide the header option when creating the file, PySpark generated default column names such as _c0, _c1, and so on, as shown below.

Azure Databricks Notebook verification of written data without header
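
A verification sketch, assuming the output directory used above (you can also point Spark at the full part-file path copied from DBFS):

    # Read the generated CSV back without a header option
    verify_df = spark.read.csv("dbfs:/FileStore/tables/CSVWrite/csvwithoutheader")
    verify_df.show()   # columns appear as _c0, _c1, ... because no header row was written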

If we set the header option to True when reading the generated CSV file, it will use the first data row as the header of the file, as shown in the image below.

Azure Databricks Notebook Read CSV without header as ROW in PySpark
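
The same read with the header option enabled, again assuming the path above, shows the first data row being promoted to column names:

    # header=True on a header-less file mistakenly uses the first data row as column names
    wrong_df = (spark.read
                .option("header", True)
                .csv("dbfs:/FileStore/tables/CSVWrite/csvwithoutheader"))
    wrong_df.show()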

Want to know how to create a PySpark DataFrame the easy and simple way? See: https://sarnendude.com/how-to-create-pyspark-dataframe-easy-and-simple-way/

PySpark Write to CSV with Header:

To write a CSV file with a header in PySpark, we will do the following, as shown in the image below:

  • Read the source data from a CSV file in DBFS (Databricks File System); see the first line of code.
  • Next, filter the records based on price.
  • Finally, call the resulting DataFrame's write.csv() method to create a new CSV file in DBFS.
  • To write the CSV with a header, we set the “header” option to “true” in the option() function so that the CSV file is created with a header row.

Once we execute the command below, it will create a partitioned CSV file under the specified path in DBFS, as shown in the image below.

Azure Databricks Notebook PySpark Write to CSV with Header
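
A sketch of the write-with-header step; the output path and the filtered DataFrame from the earlier step are assumptions for illustration.

    # Write the filtered data as CSV with a header row
    (filtered_df.write
        .option("header", True)
        .csv("dbfs:/FileStore/tables/CSVWrite/csvwithheader"))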
Quick verification of written data with header:

As shown below, to verify that our previous PySpark code wrote a proper CSV file, we run the code below with the path of the partitioned CSV file created by PySpark.

Here we read the CSV file back with the header option enabled.

Since we provided the header option when creating the file, it contains the actual column names from the DataFrame object, as shown below.

Azure Databricks Notebook verification of written data with header
PySpark Write CSV Options:

We can also use PySpark to write a CSV file with a specific delimiter.

Write CSV with specific delimiter option:

In this demo, we will use PySpark to write a pipe-delimited file, as shown in the image below, and we will use the same delimiter to read the generated file.

Here we provide the delimiter value in the option() function along with the header option.

Azure Databricks Notebook Write CSV with delimiter option in PySpark
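
A sketch of the pipe-delimited write; the output path and the filtered DataFrame are assumptions for illustration.

    # Write the filtered data as a pipe-delimited CSV with a header row
    (filtered_df.write
        .option("header", True)
        .option("delimiter", "|")
        .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))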
Quick verification of written data with specific delimiter:

As shown below, to verify that our previous PySpark code wrote a proper CSV file, we run the code below with the path of the partitioned CSV file created by PySpark.

Since we created the file with a pipe (|) delimiter, we must provide the same pipe delimiter as an option when reading the generated partitioned CSV file, as shown below.

When we read the pipe-delimited file with this option, PySpark parses the CSV file correctly.

Note that it displays only two rows because of the filter price > 45. In the next section, we will overwrite this file with new filter logic, price > 50, to get only one row.

Azure Databricks Notebook Read CSV with delimiter in PySpark
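
A verification sketch, assuming the output directory used above:

    # Read the pipe-delimited file back with the matching delimiter
    verify_df = (spark.read
                 .option("header", True)
                 .option("delimiter", "|")
                 .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))
    verify_df.show()   # two rows, per the price > 45 filter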

Now let us read the generated CSV file with the wrong delimiter, for example a comma. PySpark cannot parse the file correctly and instead treats each entire line as a single column, as shown below.

Azure Databricks Notebook Read CSV with wrong delimiter in PySpark
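
A sketch of the wrong-delimiter read, assuming the same path; each line ends up in a single column because the comma never matches:

    # Reading a pipe-delimited file with a comma delimiter collapses each row into one column
    wrong_df = (spark.read
                .option("header", True)
                .option("delimiter", ",")
                .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))
    wrong_df.show(truncate=False)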
PySpark Write Saving modes:
PySpark Write CSV with Overwrite mode:

In PySpark, the mode() method of DataFrameWriter lets us control how the write behaves when the target already exists, using the save modes below.

  • Overwrite: overwrites the data that already exists at the target path.
  • Append: adds the new data to the data that already exists at the target path.
  • Ignore: silently skips the write operation when data already exists at the target path.
  • Error (errorifexists): the default; throws an error when data already exists at the target path.

For this demo, let’s do the following:

  • Overwrite the previously created pipe-delimited CSV file with a new delimiter and overwrite mode, i.e., we will use PySpark to write the CSV in overwrite mode.
  • Filter records with the new logic price > 50 to get only one row, instead of the earlier price > 45, which returned two rows.
  • Provide a comma (,) as the delimiter instead of the previously used pipe (|).

Once we execute the code below, it will overwrite the previously created CSV file with one record.

Azure Databricks Notebook Write CSV with overwrite option in PySpark
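
A sketch of the overwrite step; the path, the price column, and the new filter threshold are assumptions for illustration.

    # Overwrite the earlier output with a comma delimiter and a stricter filter
    (df.filter(df.price > 50)
       .write
       .mode("overwrite")
       .option("header", True)
       .option("delimiter", ",")
       .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))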
Quick verification of overwritten data:

To verify, let’s copy the path of the new file under the same directory (there is only one file under the csvwithdelimiter directory) and run the code below to display the file’s data, as shown below.

It wrote only one record, per the new filter logic price > 50, and we parse it with the new comma delimiter.

Azure Databricks Notebook Read CSV overwrite mode in PySpark
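
A final verification sketch, assuming the same directory; only one row should remain:

    # Read the overwritten output; the default delimiter is already a comma
    verify_df = (spark.read
                 .option("header", True)
                 .csv("dbfs:/FileStore/tables/CSVWrite/csvwithdelimiter"))
    verify_df.show()   # single row, per the price > 50 filter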

So in this article we saw how to write CSV files using PySpark with several options.

More from Azure Synapse Tutorial:

Azure Synapse Intelligent Cache for Apache Spark: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/

Flowlet transformation in Azure Data Factory and Azure Synapse pipeline: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/

Azure Synapse Tutorial: Three In ONE Service: https://sarnendude.com/azure-synapse-tutorial/

More from Azure Data Factory Tutorial:

Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/
