PySpark - create DataFrame from scratch
These snippets show how to build a DataFrame from scratch, using a list of values. This is mainly useful when creating small DataFrames for unit tests. Imagine we would like a table with an id column identifying a user, plus two columns for the number of cats and dogs they have.
The version below uses the SQLContext approach. To test this directly in the pyspark shell, where sc already exists, omit the line where sc is created.
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
df = sqlContext.createDataFrame(vals, columns)
SparkSession is now generally recommended over SQLContext, so the same example is adapted for SparkSession below. To run it in the pyspark shell, skip to the # make some test data section.
from pyspark.sql.session import SparkSession
# instantiate Spark
spark = SparkSession.builder.getOrCreate()
# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
To check the DataFrame you have created, try df.show() and df.printSchema().