Available for Enterprise Edition only.

Basic Spark Submit

The following sections cover Scala pipelines, PySpark pipelines, and runtime configuration variables for use with custom orchestration solutions.

Scala Spark pipelines

Prerequisites:
  • Optional: Modify ivysettings.xml to point to a custom Maven mirror.
Given a Scala pipeline named “demo_pipeline” with a JAR artifact from PBT called demo_pipeline-1.0.jar, you could run the following command to invoke the Main class from the JAR file and run the pipeline on a local Spark cluster.
Make sure to use the correct version of io.prophecy:prophecy-libs_2.12 for your pipeline. You can find this version in the pom.xml or pbt_project.yml in the pipeline’s source code directory. Alternatively, use a tool like jdeps on the JAR file itself.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 4g \
  --executor-cores 4  \
  --packages io.prophecy:prophecy-libs_2.12:3.5.0-8.0.29 \
  --class io.prophecy.pipelines.demo_pipeline.Main \
  demo_pipeline-1.0.jar -i default -O "{}"
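If you want to cross-check the prophecy-libs version used in --packages against the pipeline's pom.xml, a small script can scan for it. This is an illustrative sketch, not part of PBT, and it assumes a conventional Maven dependency layout where the version tag directly follows the artifactId:

```python
import re

def find_prophecy_libs_version(pom_text: str):
    """Return the <version> declared alongside a prophecy-libs <artifactId>,
    or None if no such dependency appears in the given pom.xml text."""
    match = re.search(
        r"<artifactId>prophecy-libs[^<]*</artifactId>\s*<version>([^<]+)</version>",
        pom_text,
    )
    return match.group(1) if match else None

# Hypothetical pom.xml fragment matching the demo_pipeline example above:
pom = """
<dependency>
  <groupId>io.prophecy</groupId>
  <artifactId>prophecy-libs_2.12</artifactId>
  <version>3.5.0-8.0.29</version>
</dependency>
"""
print(find_prophecy_libs_version(pom))  # → 3.5.0-8.0.29
```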

PySpark pipelines

Prerequisites:
  • Install Python dependencies by installing the WHL file using pip.
    • pip install ./demo_pipeline-1.0-py3-none-any.whl
  • Gather necessary Maven dependencies and put into the --jars (local) or --packages (repo) option.
    • PBT will have a command to generate dependencies or pom.xml for PySpark projects.
  • Optional: Modify ivysettings.xml to point to a custom Maven mirror or PyPI mirror.
Given a PySpark pipeline named “demo_pipeline” with a WHL artifact from PBT called demo_pipeline-1.0-py3-none-any.whl, you could run the following command to invoke the main() method from the WHL file using a customized launcher script.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 4g \
  --executor-cores 4  \
  --packages io.prophecy:prophecy-libs_2.12:3.5.0-8.0.29 \
  --py-files demo_pipeline-1.0-py3-none-any.whl \
  launcher.py -i default -O "{}"
In this example, launcher.py imports the pipeline package from the WHL file and calls its main() entrypoint like so:
This launcher must import the name of your specific pipeline package!
from demo_pipeline import main

if __name__ == "__main__":
    main()

Set Runtime Configuration variables

In some cases you may want to override runtime configuration variables of a pipeline. We offer several options for changing the pipeline configuration at runtime. Each example shows a sample as “parameters” (e.g. for a Databricks job) and as “sysargs” (e.g. for passing at the end of a spark-submit command).

Sample Configuration Schema for the examples below:
Name       Type
str_var    string
bool_var   boolean
float_var  float

-i set the pipeline Configuration instance

A pipeline may be run with a different pipeline Configuration instance by using the -i option and providing the name of the configuration profile instance. For more information on configuration instances and overrides, see Pipeline configuration.
-i examples
  • as parameters: ['-i', 'default']
  • as sysargs: -i default

-O override many parameters as a JSON blob

This may be used in conjunction with -i; only the parameters you provide are overridden. You must specify the name and value of each variable that you want to override.
-O examples
  • as parameters: ['-O', '{"str_var":"overridden", "float_var":0.5}']
  • as sysargs: -O "{\"str_var\":\"overridden\",\"float_var\":0.5}"
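Escaping the JSON blob by hand is error-prone. One way to generate both forms programmatically is to let json.dumps and shlex.quote do the quoting; this helper is illustrative, not part of Prophecy, and reuses the sample schema fields:

```python
import json
import shlex

overrides = {"str_var": "overridden", "float_var": 0.5}

# As parameters (e.g. for a Databricks job):
parameters = ["-O", json.dumps(overrides)]

# As sysargs, quoted safely for appending to a spark-submit command line:
sysargs = f"-O {shlex.quote(json.dumps(overrides))}"

print(parameters)  # → ['-O', '{"str_var": "overridden", "float_var": 0.5}']
print(sysargs)     # → -O '{"str_var": "overridden", "float_var": 0.5}'
```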

-C override individual parameters

This may be used in conjunction with -i; only the parameters you provide are overridden. This option may be used more than once.
-C examples
  • as parameters: ['-C', 'str_var=test1', 'float_var=0.5']
  • as sysargs: -C str_var=test1 float_var=0.5
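To illustrate how repeated name=value pairs map onto configuration fields, here is a minimal parser sketch. This is not Prophecy's actual implementation, just a model of the semantics:

```python
def parse_overrides(pairs):
    """Split 'name=value' strings into a dict. Only the first '=' splits
    the pair, so values may themselves contain '=' characters."""
    overrides = {}
    for pair in pairs:
        name, _, value = pair.partition("=")
        overrides[name] = value
    return overrides

print(parse_overrides(["str_var=test1", "float_var=0.5"]))
# → {'str_var': 'test1', 'float_var': '0.5'}
```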

-f set configuration using a file

This option sets all parameters for a pipeline using a JSON file that is locally accessible to the spark-submit command.
All Configuration Schema fields must be provided in this file.
-f examples
  • as parameters: ['-f', '/path/to/somefile.json']
  • as sysargs: -f /path/to/somefile.json
Example JSON file:
{
  "str_var": "vendor1",
  "bool_var": true,
  "float_var": 0.5
}
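Because every Configuration Schema field must be present in the file, a quick pre-flight check before calling spark-submit can catch omissions early. This is an illustrative sketch; the required field set comes from the sample schema above:

```python
import json

# Fields from the sample Configuration Schema above (illustrative).
REQUIRED_FIELDS = {"str_var", "bool_var", "float_var"}

def check_config_file(path):
    """Load a pipeline config file and fail fast if any schema field is missing."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_FIELDS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing required fields: {sorted(missing)}")
    return cfg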