PySpark - create DataFrame from scratch
These snippets show how to build a DataFrame from scratch, using a list of values. This is mainly useful when creating small DataFrames for unit tests. Imagine we would like a table with an id column identifying a user, plus two columns for the number of cats and dogs they have.
The version below uses the SQLContext approach. To test this directly in the pyspark shell, where sc already exists, omit the line where sc is created.
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
df = sqlContext.createDataFrame(vals, columns)
SparkSession is now generally recommended over SQLContext, so the same example is adapted for SparkSession below. To run it in the pyspark shell, skip to the # make some test data section.
from pyspark.sql.session import SparkSession
# instantiate Spark
spark = SparkSession.builder.getOrCreate()
# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
To check the DataFrame you have created, try df.show() and df.printSchema().