Available for Enterprise Edition only.

Basic Spark Submit

The following sections cover Scala pipelines, PySpark pipelines, and runtime configuration variables for use with custom orchestration solutions.

Scala Spark pipelines

Prerequisites:
  • Optional: Modify ivysettings.xml to point to a custom Maven mirror.
Given a Scala pipeline named “demo_pipeline” with a JAR artifact from PBT called demo_pipeline-1.0.jar, you could run the following command to invoke the Main class from the JAR file and run the pipeline on a local Spark cluster.
Make sure to use the correct version of io.prophecy:prophecy-libs_2.12 for your pipeline. You can find this version in the pom.xml or pbt_project.yml in the pipeline’s source code directory. Alternatively, use a tool like jdeps on the JAR file itself.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 4g \
  --executor-cores 4  \
  --packages io.prophecy:prophecy-libs_2.12:3.5.0-8.0.29 \
  --class io.prophecy.pipelines.demo_pipeline.Main \
  demo_pipeline-1.0.jar -i default -O "{}"
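If you want to cross-check the prophecy-libs version used in --packages against the pipeline's pom.xml, a small script can scan for it. This is an illustrative sketch, not part of PBT, and it assumes a conventional Maven dependency layout where the version tag directly follows the artifactId:

```python
import re

def find_prophecy_libs_version(pom_text: str):
    """Return the <version> declared alongside a prophecy-libs <artifactId>,
    or None if no such dependency appears in the given pom.xml text."""
    match = re.search(
        r"<artifactId>prophecy-libs[^<]*</artifactId>\s*<version>([^<]+)</version>",
        pom_text,
    )
    return match.group(1) if match else None

# Hypothetical pom.xml fragment matching the demo_pipeline example above:
pom = """
<dependency>
  <groupId>io.prophecy</groupId>
  <artifactId>prophecy-libs_2.12</artifactId>
  <version>3.5.0-8.0.29</version>
</dependency>
"""
print(find_prophecy_libs_version(pom))  # → 3.5.0-8.0.29
```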

PySpark pipelines

Prerequisites:
  • Install Python dependencies by installing the WHL file using pip.
    • pip install ./demo_pipeline-1.0-py3-none-any.whl
  • Gather necessary Maven dependencies and put into the --jars (local) or --packages (repo) option.
    • PBT will have a command to generate dependencies or pom.xml for PySpark projects.
  • Optional: Modify ivysettings.xml to point to a custom Maven mirror or PyPI mirror.
Given a PySpark pipeline named “demo_pipeline” with a WHL artifact from PBT called demo_pipeline-1.0-py3-none-any.whl, you could run the following command to invoke the main() method from the WHL file using a customized launcher script.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 4g \
  --executor-cores 4  \
  --packages io.prophecy:prophecy-libs_2.12:3.5.0-8.0.29 \
  --py-files demo_pipeline-1.0-py3-none-any.whl \
  launcher.py -i default -O "{}"
In this example, launcher.py imports the pipeline package from the WHL file and calls its main() entrypoint like so:
This launcher must import the name of your specific pipeline package!
from demo_pipeline import main

if __name__ == "__main__":
    main()

Set Runtime Configuration variables

In some cases you may want to override runtime configuration variables of a pipeline. We offer several options for changing the pipeline configuration at runtime. Each example shows a sample as “parameters” (e.g. for a Databricks job) and as “sysargs” (e.g. for passing at the end of a spark-submit command).

Sample Configuration Schema for the examples below:
Name       Type
str_var    string
bool_var   boolean
float_var  float

-i set the pipeline Configuration instance

A pipeline may be run with a different pipeline Configuration instance by using the -i option and providing the name of the configuration profile instance. For more information on configuration instances and overrides, see Pipeline configuration.
-i examples
  • as parameters: ['-i', 'default']
  • as sysargs: -i default

-O override many parameters as a JSON blob

This may be used in conjunction with -i; only the parameters you provide are overridden. You must specify the name and value of each variable that you want to override.
-O examples
  • as parameters: ['-O', '{"str_var":"overridden", "float_var":0.5}']
  • as sysargs: -O "{\"str_var\":\"overridden\",\"float_var\":0.5}"
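Escaping the JSON blob by hand is error-prone. One way to generate both forms programmatically is to let json.dumps and shlex.quote do the quoting; this helper is illustrative, not part of Prophecy, and reuses the sample schema fields:

```python
import json
import shlex

overrides = {"str_var": "overridden", "float_var": 0.5}

# As parameters (e.g. for a Databricks job):
parameters = ["-O", json.dumps(overrides)]

# As sysargs, quoted safely for appending to a spark-submit command line:
sysargs = f"-O {shlex.quote(json.dumps(overrides))}"

print(parameters)  # → ['-O', '{"str_var": "overridden", "float_var": 0.5}']
print(sysargs)     # → -O '{"str_var": "overridden", "float_var": 0.5}'
```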

-C override individual parameters

This may be used in conjunction with -i; only the parameters you provide are overridden. This option may be used more than once.
-C examples
  • as parameters: ['-C', 'str_var=test1', 'float_var=0.5']
  • as sysargs: -C str_var=test1 float_var=0.5
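To illustrate how repeated name=value pairs map onto configuration fields, here is a minimal parser sketch. This is not Prophecy's actual implementation, just a model of the semantics:

```python
def parse_overrides(pairs):
    """Split 'name=value' strings into a dict. Only the first '=' splits
    the pair, so values may themselves contain '=' characters."""
    overrides = {}
    for pair in pairs:
        name, _, value = pair.partition("=")
        overrides[name] = value
    return overrides

print(parse_overrides(["str_var=test1", "float_var=0.5"]))
# → {'str_var': 'test1', 'float_var': '0.5'}
```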

-f set configuration using a file

This option sets all parameters for a pipeline using a JSON file that is locally accessible to the spark-submit command.
All Configuration Schema fields must be provided in this file.
-f examples
  • as parameters: ['-f', '/path/to/somefile.json']
  • as sysargs: -f /path/to/somefile.json
Example JSON file:
{
  "str_var": "vendor1",
  "bool_var": true,
  "float_var": 0.5
}
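Because every Configuration Schema field must be present in the file, a quick pre-flight check before calling spark-submit can catch omissions early. This is an illustrative sketch; the required field set comes from the sample schema above:

```python
import json

# Fields from the sample Configuration Schema above (illustrative).
REQUIRED_FIELDS = {"str_var", "bool_var", "float_var"}

def check_config_file(path):
    """Load a pipeline config file and fail fast if any schema field is missing."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_FIELDS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing required fields: {sorted(missing)}")
    return cfg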