We can create a PySpark DataFrame using different functions of the SparkSession instance (pyspark.sql.SparkSession).
Here we will discuss how to create a PySpark DataFrame using the createDataFrame() method with hard-coded values in an Azure Databricks notebook.
What is DataFrame?
- A DataFrame is a Dataset organized into named columns.
- It is equivalent to a table in a relational database, as a DataFrame holds both its own schema and its data.
Data sources to create PySpark DataFrame:
- Azure Data Storage
- Amazon S3
- DBFS
- HDFS
- Existing RDD
- RDBMS
- NoSQL databases & many more.
File formats to create PySpark DataFrame:
- CSV
- Parquet
- JSON
- ORC
- TXT
- Avro, and a few more.
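For example, files in some of these formats can be loaded into a DataFrame with the spark.read API. Here is a minimal sketch, where the file paths are hypothetical placeholders:

# Read a CSV file with a header row; let Spark infer the column types
df_csv = spark.read.csv("/mnt/data/topics.csv", header=True, inferSchema=True)

# Read a Parquet file (the schema is stored in the file itself)
df_parquet = spark.read.parquet("/mnt/data/topics.parquet")

# Read a JSON file (one JSON object per line)
df_json = spark.read.json("/mnt/data/topics.json")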
Create DataFrame using createDataFrame() method:
We can create a DataFrame using different functions of the SparkSession instance (pyspark.sql.SparkSession). In a Databricks notebook, spark is the default SparkSession instance.
DataFrame Creation:
We will create a PySpark DataFrame using the createDataFrame method (pyspark.sql.SparkSession.createDataFrame) of the SparkSession instance.
createDataFrame() Signature:
SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
From the signature above, we can see that only the data input is mandatory; all other parameters are optional, as they have default values.
- data: the data parameter provides the data that the DataFrame will hold, just as a table holds its own data.
- Input data can be:
- list of lists
- tuples
- dictionaries
- pyspark.sql.Row objects
- pandas DataFrame
- RDD consisting of such a list
- schema: this parameter provides the structure of the DataFrame, i.e., the column names and data types.
- samplingRatio: determines the fraction of rows used for schema inference.
- verifySchema: verifies the data types of every row against the schema. It is enabled by default.
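Below is a minimal sketch of a few of these accepted data inputs; the column names (TopicID, TopicName) and sample values are just illustrative, and sc is the default SparkContext instance available in a Databricks notebook:

from pyspark.sql import Row
import pandas as pd

# list of tuples
df1 = spark.createDataFrame([(1, "Azure Synapse"), (2, "Azure Databricks")], ["TopicID", "TopicName"])

# list of pyspark.sql.Row objects (column names come from the Row fields)
df2 = spark.createDataFrame([Row(TopicID=1, TopicName="Azure Synapse"), Row(TopicID=2, TopicName="Azure Databricks")])

# pandas DataFrame
df3 = spark.createDataFrame(pd.DataFrame({"TopicID": [1, 2], "TopicName": ["Azure Synapse", "Azure Databricks"]}))

# RDD of tuples
df4 = spark.createDataFrame(sc.parallelize([(1, "Azure Synapse"), (2, "Azure Databricks")]), ["TopicID", "TopicName"])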
About the createDataFrame method with the help() method:
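In a notebook cell, Python's built-in help() function displays the documentation of the method, including its signature and parameter descriptions:

# Show the documentation of createDataFrame in the notebook output
help(spark.createDataFrame)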
Create DataFrame with data input only:
First, let's create a DataFrame with the data parameter only; no schema will be provided, as shown in the image below.
It creates the DataFrame, but without proper column names: the columns are shown as _1 and _2.
This is because we have not provided proper column names or a schema.
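Here is a minimal sketch of this step; the sample values are illustrative:

# Data only: a list of tuples, no schema provided
data = [(1, "Azure Synapse"), (2, "Azure Databricks"), (3, "Azure Data Factory")]

df = spark.createDataFrame(data)

# display() is a Databricks notebook function that renders the DataFrame as a table;
# the columns appear with the generic names _1 and _2
display(df)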
Inferred default schema:
As we have not provided any schema, Spark inferred a default schema from the input data itself.
It inferred both generic column names (_1 and _2) and data types from the input data, as shown in the image below.
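With the sample data from the sketch above, the inferred schema would look roughly like this:

df.printSchema()
# root
#  |-- _1: long (nullable = true)
#  |-- _2: string (nullable = true)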
Create DataFrame with data input and column names only:
As shown in the image below, when we provide the input data and only the column names (not data types) as the schema, it creates the DataFrame with those column names.
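A sketch of the same call, passing only the column names as the schema (sample values are illustrative):

data = [(1, "Azure Synapse"), (2, "Azure Databricks"), (3, "Azure Data Factory")]

# Only column names are provided; data types are still inferred from the data
df = spark.createDataFrame(data, ["TopicID", "TopicName"])
display(df)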
Inferred data types only:
As we provided only the column names as part of the schema, it creates the DataFrame accordingly but infers the data types from the input data.
The schema details below are displayed using the printSchema() function of the DataFrame.
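For the sketch above, the output of printSchema() would look roughly like this:

df.printSchema()
# root
#  |-- TopicID: long (nullable = true)
#  |-- TopicName: string (nullable = true)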
Create DataFrame with data input and full schema: column names & data types
Here we provide both the column names and the data types in the schema.
To provide the data types as well, we need StructType (pyspark.sql.types.StructType) and StructField (pyspark.sql.types.StructField).
- StructType – Defines the structure of the DataFrame.
- StructField – Defines the metadata of the DataFrame column:
- column name(String)
- column data type (DataType)
- whether the column is nullable (Boolean)
- metadata (MetaData)
StructType contains a list of StructField instances, and each StructField holds the metadata of an individual column.
Here we have provided details for two columns, each with a StructField: TopicID with data type Integer, and TopicName with data type String.
As shown in the image below, we created the schema using the following StructType and StructField details.
- StructType:
- StructField("TopicID", IntegerType())
- StructField("TopicName", StringType())
Accordingly, the DataFrame is created as shown in the image below.
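A minimal sketch of the full flow with an explicit schema; the sample values are illustrative:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, "Azure Synapse"), (2, "Azure Databricks"), (3, "Azure Data Factory")]

# Full schema: column names, data types, and nullability
schema = StructType([
    StructField("TopicID", IntegerType(), True),
    StructField("TopicName", StringType(), True)
])

df = spark.createDataFrame(data, schema)
display(df)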
Let's display the schema of the DataFrame using its printSchema() function.
As shown in the image below, it now shows the column names and data types that we provided when creating the input schema.
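With the schema above, printSchema() would show the provided column names and data types:

df.printSchema()
# root
#  |-- TopicID: integer (nullable = true)
#  |-- TopicName: string (nullable = true)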
So in this tutorial we learned how to create a PySpark DataFrame manually using a Databricks notebook.
More from Azure Synapse Tutorial:
Azure Synapse Intelligent Cache for Apache Spark: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/
Flowlet transformation in Azure Data Factory and Azure Synapse pipeline: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/
Azure Synapse Tutorial: Three In ONE Service: https://sarnendude.com/azure-synapse-tutorial/
More from Azure Data Factory Tutorial:
Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/