In this article we will have a DEMO on How to Read CSV file in PySpark and load into a DataFrame in several ways using a Azure Databricks Notebook.
PySpark provides us CSV() and Load() methods to read and load data from:
- Single CSV file
- Multiple CSV file from different directory
- And multiple CSV file from same directory.
We will read CSV file using different option like delimiter/separator , inferSchema, custom schema using PySpark read csv options.
PySpark provides options to read CSV file having different delimiter/separator like comma, pipe, tab, space etc.
PySpark also provides parquet function, JSON function and text function to read respective type of files.
Table of Contents:
- First Know the Data Source for this DEMO
- Read and Load Single CSV file
- Read and Load Multiple CSV file
- Read Single CSV File with delimiter
- Read Single CSV File with inferSchema
- Read CSV File with custom schema
First Know the Data Source for this DEMO: DBFS (Databricks File System)
For this POC, we are using Databricks File System ( DBFS ) as data source where we created two directory “Product” and “ProductOld”under “FileStore” as shown below image (pyspark read csv from local file system):
The Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters
Source Directory Structure in Azure Databricks DBFS:
In one of the scenarios, we will load CSV files from these different directory as both directory contains CSV file having same schema.
Two CSV files under same directory in Azure Databricks DBFS:
In Azure Databricks file system, “Product” directory contains two CSV files “ProductCurrent.csv” and “ProductNew.csv” as shown below:
In another scenario, we will be loading multiple CSV files from this directory (“Product”)
One CSV file under one directory in Azure Databricks DBFS:
To test “Multiple CSV file from different directory” scenario this CSV file will be used as one csv file from “ProductOld” directory along with another file from “Product”directory.
“ProductOld” directory contains one CSV file “ProductOld.csv”as shown below:
Read and Load Single CSV file:
Read Single CSV file with CSV() method:
Reading Single CSV file without header option (read csv file in pyspark databricks):
As shown in below image, here
- we are loading single CSV file data into a PySpark DataFrame using csv() method of spark.read i.e. DataFrameReader instance.
- and header option is to use the first row of input csv file as the dataframe’s column names
- at last it will display the schema of the input file as well using printSchema() method of dataframe.
Once we run the above notebook, it displays data of Single CSV file as shown below:
It displayed all data including the header row as data itself and we have not provided header option in csv() method above
- Column names are _c0, _c1, _c2, _c3 which are default generated.
- and actual headers row of “ProdID”,”ProdName”, “Category” and “Price” is added in dataset as file data.
It displays the schema with default generated column and string data type of all column as CSV infer all column data type as string.
We will solve this data type issue using inferSchema and custom Schema in the later section of this article.
Read Single CSV file with header option:
This is continuation of above notebook, everything is same but here we are passing header option in CSV method as Header = True as shown in below image:
- we are loading single CSV file data into a PySpark DataFrame using csv() method of spark.read i.e. DataFrameReader instance.
- and header option is to use the first row of input csv file as the dataframe’s column names.
- at last it will print the schema of the input file as well using printSchema() method of dataframe.
After running the above notebook, it display data of Single CSV file with actual header as shown in below image:
- Actual headers contains “ProdID”,”ProdName”, “Category” and “Price” column names i.e. spark read csv with different number of columns.
- Displayed Schema also contains these column names.
So once we provided Header=True option in CSV() method, it displays header row properly and column name in schema as expected.
Read Single CSV file with Load() method:
Reading Single CSV file using load method without header
Similar to above CSV method, as shown in below image, here
- we are loading single CSV file data into a PySpark DataFrame using load() method of spark.read i.e. DataFrameReader instance.
- Here we are providing CSV file path as input of Load method.
- and header option is to use the first row of input csv file as the dataframe’s column names
- at last it will display the schema of the input file as well using printSchema() method of dataframe.
Display data of CSV file using load method without header:
Once we run the above notebook, it displays data of Single CSV file as shown below:
- It displayed all including the header row as data itself as we have not provided header option in load() method above
- Column names are _c0, _c1, _c2, _c3 which are default generated
- and actual headers row of “ProdID”,”ProdName”, “Category” and “Price” is added in dataset as file data.
- It prints the schema with default generated column and string data type of all column as CSV infer all column data type as string i.e. pyspark read csv without header.
Reading CSV file using load method with header option in PySpark:
This is continuation of above notebook, everything is same but here we are passing header value in option function as Key= ‘header’ and value=True (pyspark read csv with header) as shown in below image Option function is a method of spark.read i.e. DataFrameReader instance.
- We are loading single CSV file data into a PySpark DataFrame using load() method of spark.read i.e. DataFrameReader instance.
- Here we are providing CSV file path as input of Load method.
- and we are passing header value in option function as Key= ‘header’ and value=True to use the first row of input csv file as the dataframe’s column names.
- at last it will print the schema of the input file as well using printSchema() method of dataframe.
Display data of CSV file using load method with header:
After running the above notebook, it display data of Single CSV file with actual header as shown in below image:
- Actual headers contains “ProdID”,”ProdName”, “Category” and “Price” column names
- Displayed Schema also cotains these column names.
So once provided Header details as key value in option function of dataframe reader, it displays header row properly and column name in schema as expected.
Read and Load Multiple CSV file:
Read and load multiple CSV file from different directory in PySpark:
As shown in below image, here
- We are loading multiple CSV file data of different directory into a PySpark DataFrame using csv method of spark.read i.e. DataFrameReader instance.
- in CSV method, multiple file path are provided as array of string as shown in below image.
- and header option is to use the first row of input csv file as the dataframe’s column names
- At last it will display the schema of the input file as well using printSchema() method of dataframe.
Once we run the above notebook, it merged data from two CSV file and loads the full data into PySpark DataFrame.Using display function, it displays data of dataframe as shown below:
Read multiple CSV Files from same directory in PySpark:
As shown in below image, here
- We are loading multiple CSV file data of same directory into a PySpark DataFrame using csv method of spark.read i.e. DataFrameReader instance.
- in CSV method, directory path containing all files is provided as string for “path” parameter shown below image.
- and header option is to use the first row of input csv file as the dataframe’s column names
- At last it will display the schema of the input file as well using printSchema() method of dataframe.
Once we run the above notebook, it merged data from two CSV file and loads the full data into PySpark DataFrame.Using display function, it displays data of dataframe as shown below:
Read Single CSV File with delimiter in PySpark:
As shown in below image, here
- We are loading a single CSV file using csv method with an option of delimiter value to identity how the data of source file separated.
- Here in Option function, we are passing delimiter value as ‘,’ comma as data of source file is separated by comma. Other delimiter can be pipe |, tab or space as well.
- and header option is to use the first row of input csv file as the dataframe’s column names
- At last it will print the schema of the input file as well using printSchema() method of dataframe.
Once we run the above notebook, it parse the CSV file with input delimiter and loads the data of CSV file into PySpark DataFrame.Using display function, it displays data of dataframe as shown below:
Read Single CSV File with inferSchema in PySpark:
As shown in below image, here
- We are loading a single CSV file using csv method with inferSchema details in Option function.
- PySpark will use inferSchema option to infer the column data type from CSV file. Here now it will infer data typeof each input column correctly like ProdID as Integer etc.
- and header option is to use the first row of input csv file as the dataframe’s column names
- At last it will print the schema of the input file as well using printSchema() method of dataframe.
Once we run the above notebook, it PySpark infers the column data type of input columns as shown below.
Now instead of default data type as string for all column of CSV file, it infers the ProdID as Integer, ProdName as String, Category as String and Price as double i.e. pyspark read csv schema.
Read CSV File with custom schema in PySpark:
As shown in below image, here
- We are passing custom schema as a input parameter in CSV method to load a CSV file.
- To provide, custom schema we need StructType and StructField class; for that we are importing functions from pyspark.sql.types module.
- StructType instance is the structure of dataframe i.e. list of columns.
- StructField is the instance of each column containing the column name and data type.
- StructType contains the array of StructField as shown in below
- We added all columns & their data type in StructField instance and added each StructField instance in StructType.
- next we provided inputSchema as schema parameter in csv method.
- At last it will print the schema of the input file as well using printSchema() method of dataframe.
Once we run the above notebook, it will display the the data and required custom schema as shown in below image.
So in this tutorial, we learnt several ways How to Read CSV file in PySpark in Azure Databricks Notebook i.e. PySpark Read CSV file into DataFrame.
More from Azure Synapse Tutorial:
Azure Synapse Intelligent Cache for Apache Spark: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/
Flowlet transformation in Azure Data Factory and Azure Synapse pipeline: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/
Azure Synapse Tutorial: Three In ONE Service: https://sarnendude.com/azure-synapse-tutorial/
More from Azure Data Factory Tutorial:
Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/