Contrib

Datalayer contribution notes for Spark on Kubernetes.

Apache Spark

Apache Spark (upstream)

Source: https://github.com/apache/spark.

Docs: https://spark.apache.org/docs/latest/running-on-kubernetes.html.

Automated build: include "K8S" in the PR title to trigger the Kubernetes build.

Apache Spark K8S Fork

Source: https://github.com/apache-spark-on-k8s/spark.

Docs: https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html.

Automated build.

Datalayer Branch

Create the datalayer branch:

cd $DLAHOME/repos/spark
git checkout branch-2.2-kubernetes
datalayer spark-merge
git push -f origin datalayer

Build Distribution

datalayer spark-build-dist
datalayer spark-build-dist-fork

Minikube

Follow the Minikube howto to set up Minikube.
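A minimal start sketch; the CPU and memory values are assumptions, adjust them to your machine:

```shell
# Start Minikube with enough resources for a Spark driver plus executors
minikube start --cpus 4 --memory 8192
# Point the local Docker client at Minikube's Docker daemon (optional)
eval $(minikube docker-env)
# Check that kubectl talks to the new cluster
kubectl cluster-info
```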

APISERVER=$(kubectl config view | grep server | cut -f 2- -d ":" | tr -d " ")
RESOURCESTAGINGSERVER=$(kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}')
echo -e """
APISERVER=$APISERVER
RESOURCESTAGINGSERVER=$RESOURCESTAGINGSERVER
"""
# Example values for a local Minikube cluster
export APISERVER=https://192.168.99.100:8443
export RESOURCESTAGINGSERVER=10.98.123.170
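The grep/cut/tr pipeline above can be sanity-checked offline against a sample of `kubectl config view` output (the sample text is an assumption of the usual Minikube layout):

```shell
# Sample fragment of `kubectl config view` output
sample='clusters:
- cluster:
    server: https://192.168.99.100:8443'

# Same pipeline as above: keep everything after the first ":", strip spaces
APISERVER=$(printf '%s\n' "$sample" | grep server | cut -f 2- -d ":" | tr -d " ")
echo "$APISERVER"   # https://192.168.99.100:8443
```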

Docker Images

# 2.2.0-fork: build and push to Docker Hub
datalayer spark-docker-build-push
# 2.2.0-fork: build and push to the local registry
datalayer spark-docker-build-push-local
# 2.4.0
cd /opt/spark; ./bin/docker-image-tool.sh -r localhost:5000 -t 2.4.0 build
cd /opt/spark; ./bin/docker-image-tool.sh -r localhost:5000 -t 2.4.0 push
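To confirm the pushes landed, the local registry can be queried over the Docker Registry v2 HTTP API; the repository name `spark` is an assumption based on the image tags above:

```shell
# List all repositories known to the local registry
curl -s http://localhost:5000/v2/_catalog
# List the tags of the spark repository; 2.4.0 should appear after the push
curl -s http://localhost:5000/v2/spark/tags/list
```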

Shuffle Service

kubectl delete -f $DLAHOME/manifests/spark/spark-shuffle-service.yaml
kubectl create -f $DLAHOME/manifests/spark/spark-shuffle-service.yaml
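The delete-then-create pair forces a full restart of the shuffle service; when an in-place update is enough, a single apply achieves the same result:

```shell
# Create the objects if absent, update them in place otherwise
kubectl apply -f $DLAHOME/manifests/spark/spark-shuffle-service.yaml
```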

Resource Staging Server

kubectl delete -f $DLAHOME/manifests/spark/spark-resource-staging-server.yaml
kubectl create -f $DLAHOME/manifests/spark/spark-resource-staging-server.yaml
kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}'
RSS_POD=$(kubectl get pods -n default -l "spark-resource-staging-server-instance=default" -o jsonpath="{.items[0].metadata.name}")
echo $RSS_POD
kubectl exec -it $RSS_POD -- bash
minikube service spark-resource-staging-service
kubectl port-forward $RSS_POD 10000:10000
curl http://localhost:10000

Incremental Build

# 2.2.0-fork
cd $DLAHOME/repos/spark/resource-managers/kubernetes/core
datalayer spark-mvn clean -DskipTests
datalayer spark-mvn install -DskipTests
cp $DLAHOME/repos/spark/resource-managers/kubernetes/core/target/spark-kubernetes_*.jar /opt/spark/jars
# datalayer spark-docker-build-push-local
# 2.4.0
cd $DLAHOME/repos/spark/resource-managers/kubernetes/core
datalayer spark-mvn clean -DskipTests
datalayer spark-mvn install -DskipTests
cp $DLAHOME/repos/spark/resource-managers/kubernetes/core/target/spark-kubernetes_*.jar /opt/spark/jars
cd /opt/spark; ./bin/docker-image-tool.sh -r localhost:5000 -t 2.4.0 build
cd /opt/spark; ./bin/docker-image-tool.sh -r localhost:5000 -t 2.4.0 push

Integration Tests

# 2.2.0-fork
datalayer spark-integration-test
# datalayer spark-integration-test-pre
# datalayer spark-integration-test-run
kubectl apply -f $DLAHOME/repos/spark-integration/dev/spark-rbac.yaml
# 2.4.0
cd $DLAHOME/repos/spark-integration
./dev/dev-run-integration-tests.sh \
  --spark-tgz $DLAHOME/packages/spark-2.4.0-SNAPSHOT-bin-hdfs-2.9.0.tgz
cd $DLAHOME/repos/spark-integration
./dev/dev-run-integration-tests.sh \
  --spark-tgz $DLAHOME/packages/spark-2.4.0-SNAPSHOT-bin-hdfs-2.9.0.tgz \
  --image-repo localhost:5000 \
  --image-tag 2.4.0

Test Grid

https://k8s-testgrid.appspot.com/sig-big-data
https://k8s-testgrid.appspot.com/sig-big-data#spark-periodic-default-gke

IDE

Run either main class from the IDE, with -Dscala.usejavacp=true added to the JVM options:

# shell - pass additional properties with -D
org.apache.spark.repl.Main
# submit - pass additional properties with --conf
org.apache.spark.deploy.SparkSubmit
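Outside the IDE, these two entry points are what bin/spark-shell and bin/spark-submit launch; a hedged sketch of the equivalent command-line flags (the class, jar and property values are illustrative):

```shell
# shell: extra JVM system properties go through -D (driver JVM options)
./bin/spark-shell --driver-java-options "-Dscala.usejavacp=true"

# submit: extra Spark properties go through --conf
./bin/spark-submit \
  --master local[2] \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.ui.enabled=false \
  examples/jars/spark-examples_2.11-2.4.0.jar 100
```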

Out-Cluster

APP_NAME=spark-shell-client-mode-out-cluster \
APISERVER=https://192.168.99.100:8443 \
DEPLOY_MODE=client \
RESOURCESTAGINGSERVER=$(kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}') \
DRIVER_POD_NAME=spark-driver \
datalayer spark-spl-shell
APP_NAME=submit-cluster-mode-out-cluster \
APISERVER=https://192.168.99.100:8443 \
DEPLOY_MODE=cluster \
DRIVER_POD_NAME=spark-driver \
RESOURCESTAGINGSERVER=$(kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}') \
datalayer spark-spl-submit
APP_NAME=submit-client-mode-out-cluster \
APISERVER=https://192.168.99.100:8443 \
DEPLOY_MODE=client \
DRIVER_POD_NAME=$HOSTNAME \
RESOURCESTAGINGSERVER=$(kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}') \
datalayer spark-spl-submit
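The datalayer spark-spl-* wrappers hide the underlying command; for reference, a plain 2.4.0 cluster-mode submission against the same Minikube API server presumably looks like the sketch below (the service account, image and example jar are assumptions):

```shell
./bin/spark-submit \
  --master k8s://https://192.168.99.100:8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=localhost:5000/spark:2.4.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```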

In-Cluster

# 2.2.0-fork
kubectl delete pod spark-pod --grace-period 0 --force; kubectl run -it spark-pod --image-pull-policy=Always --image=localhost:5000/spark-driver:2.2.0 --restart=Never -- bash
# 2.4.0
kubectl delete pod spark-pod --grace-period 0 --force; kubectl run -it spark-pod --image-pull-policy=Always --image=localhost:5000/spark:2.4.0 --restart=Never -- sh
APP_NAME=shell-client-mode-in-cluster \
APISERVER=https://kubernetes:443 \
DEPLOY_MODE=client \
DRIVER_POD_NAME=$HOSTNAME \
RESOURCESTAGINGSERVER=10.102.217.130 \
datalayer spark-spl-shell
APP_NAME=submit-cluster-mode-in-cluster \
APISERVER=https://kubernetes:443 \
DEPLOY_MODE=cluster \
DRIVER_POD_NAME=spark-driver \
RESOURCESTAGINGSERVER=10.102.217.130 \
datalayer spark-spl-submit
APP_NAME=submit-client-mode-in-cluster \
APISERVER=https://kubernetes:443 \
DEPLOY_MODE=client \
DRIVER_POD_NAME=$HOSTNAME \
RESOURCESTAGINGSERVER=10.102.217.130 \
datalayer spark-spl-submit
# option-1
kubectl delete -f $DLAHOME/manifests/spark/spark-base.yaml
export POD_NAME=$(kubectl get pods -n default -l spark-base=base -o jsonpath="{.items[0].metadata.name}")
kubectl delete pod $POD_NAME --grace-period 0 --force
kubectl apply -f $DLAHOME/manifests/spark/spark-base.yaml
export POD_NAME=$(kubectl get pods -n default -l spark-base=base -o jsonpath="{.items[0].metadata.name}")
kubectl exec -it $POD_NAME -- bash
# option-2
kubectl attach -it spark-pod
kubectl delete pod spark-exec-1 --grace-period 0 --force; kubectl delete pod spark-exec-2 --grace-period 0 --force
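Deleting executors one by one (spark-exec-1, spark-exec-2) does not scale; assuming the executor pods carry the standard spark-role=executor label set by Spark on Kubernetes, they can all be cleared in one command:

```shell
# Force-delete every executor pod in the current namespace by label
kubectl delete pods -l spark-role=executor --grace-period 0 --force
```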

Benchmarks

#  --class com.bbva.spark.benchmarks.dfsio.TestDFSIO \
#  /src/benchmarks/spark-benchmarks/dfsio/target/scala-2.11/spark-benchmarks-dfsio-0.1.0-with-dependencies.jar \
#    write --numFiles 10 --fileSize 1GB --outputDir hdfs://hadoop-k8s-hadoop-k8s-hdfs-nn:9000/benchmarks/DFSIO

Contributions

Each entry lists: description, issue, repository/branch, doc, PR, status.

- Client Mode. Issue: SPARK-23146. Branches: Apache datalayer-contrib:k8s-client-mode; Apache Fork datalayer-contrib:client-mode-datalayer-dev. Doc: [WIP] Describe Spark submit in relation to client mode (+ Hadoop and dependencies), datalayer-contrib:k8s-client-mode. PR: #456. Status: OPEN.
- Integration Tests for Client Mode. Issue: SPARK-23146. Branch: datalayer-contrib:client-mode. PR: #45. Status: OPEN.
- Refactor Kubernetes code for configuring driver/executor pods to use a consistent and cleaner abstraction. Issue: SPARK-22839. Branch: mccheah:spark-22839-incremental. Doc: Initial framework for pod construction architecture refactor, #20910.
- Refactor the Steps Orchestrator based on the Chain Pattern. Ref: #604. Example: include and exclude driver and executor steps (with an etcd example).
- [INTEGRATION_TESTS] Random failure of tests (java.net.ConnectException). Issue: https://github.com/apache-spark-on-k8s/spark/issues/571.
- Use a pre-installed Minikube instance for integration tests. Ref: #521.
- Application names should support whitespaces and special characters. Issue: https://github.com/apache-spark-on-k8s/spark/issues/551.
- [ShuffleService] Need for spark.local.dir? Issue: https://github.com/apache-spark-on-k8s/spark/issues/549.
- Spark UI: while an application runs, Spark serves a web UI to manage and monitor jobs and configuration (http://localhost:4040). This could be enhanced with a dedicated Kubernetes tab.
- Docker Logging Handler. Branch: datalayer-contrib:spark/docker-logging-handler. PR: #576. Status: OPEN.
- Disable the SSL test for the staging server if the current classpath contains the Jetty shaded classes. Branch: datalayer-contrib:spark/jetty-sslcontext. Refs: #463, #573. Status: OPEN.
- Develop and build the Kubernetes modules in isolation, without the other Spark modules. Branch: datalayer-contrib:spark/kubernetes-parent. PR: #570. Status: OPEN.
- Add libc6-compat in spark-bash to allow Parquet. Branch: datalayer-contrib:spark/libc6-compat. Refs: #504, #550. Status: OPEN.
- Add documentation for Zeppelin with Spark on Kubernetes. Branch: datalayer-contrib:spark-docs/zeppelin. PR: #21. Status: OPEN.
- [WIP] [SPARK-19552] [BUILD] Upgrade Netty version to 4.1.8 final. PR: https://github.com/apache/spark/pull/16888.