PRODUCT
4 min read
Published on 04/15/2018
Last updated on 03/21/2024
Collecting Spark History Server event logs in the cloud
Share
Apache Spark on Kubernetes series: Introduction to Spark on Kubernetes Scaling Spark made simple on Kubernetes The anatomy of Spark applications on Kubernetes Monitoring Apache Spark with Prometheus Spark History Server on Kubernetes Spark scheduling on Kubernetes demystified Spark Streaming Checkpointing on Kubernetes Deep dive into monitoring Spark and Zeppelin with Prometheus Spark Streaming Checkpointing on Kubernetes Deep dive into monitoring Spark and Zeppelin with Prometheus Apache Spark application resilience on Kubernetes
Apache Zeppelin on Kubernetes series: Running Zeppelin Spark notebooks on Kubernetes Running Zeppelin Spark notebooks on Kubernetes - deep dive
Apache Kafka on Kubernetes series: Kafka on Kubernetes - using etcdIn our last blogpost we described how to configure
spark-submit
and Spark History Server
to enable gathering event logs to Amazon S3. Since then, we've added more supported providers to Pipeline, and broadened the available options to easily capture Spark event logs to Amazon AWS S3, Microsoft Azure WASB and Google Cloud Storage. Lets see how this works.
You can use our Helm deployment charts directly.
We have the following umbrella charts:
- spark deploys all the background infrastructure you need in place to run a
spark-submit
job:Spark Resource Staging Server
,Shuffle Service
andSpark History Server
should be enabled (by default they're not). - zeppelin-spark deploys all of the above, plus a Zeppelin Server deployment and an externally accessible service.
Behind the scenes
Note: the following steps are automated by Pipeline, but are listed in order to aid in understanding what goes on behind the scenes, and to serve as a comprehensive guide, in case you'd like to reproduce them in your own environment without using Pipeline.Let's see what's happening behind the scenes. If you prefer to do things manually, you'll need to resolve the following steps:
1. Configuration
Enable event logging in Spark Driver and configure the event logging folders for bothspark-submit
and History server
. We've already thoroughly covered this topic in our previous blog
2. Building images
You need an image that includes Hadoop FileSystem drivers for each cloud storage option:AWS
libraries are included by default in Spark's distributionAzure
SDK can be included using thehadoop-2.7
profileGoogle
Connector, has to be included as a dependency to thehadoop-cloud
module.
- SPARK-7481 introduces the
spark-hadoop-cloud
module, which is not present in Spark k8s and has to be cherry-picked from the master branch. - Add hadoop-cloud profile dependency. This is a necessary fix, since, by default, the
spark-hadoop-cloud
module is not included in the docker bundle. - Add gcs connector dependency to hadoop-cloud module. This includes the
Google Connector
dependency in thespark-hadoop-cloud
module. It also updates Guava to a newer version in the docker bundle, as the current one is quite old.
3. Access to different Cloud Storages
Access is granted either by providing different access keys - this works on all cloud providers - or on the basis of policies/rules. Let's see what you need for setup on each cloud provider:- on
Amazon
it's possible to gain access to S3 storage using policies. For example, you can add the following policies to your instance profile:{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ ... "s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:ListObjects", "s3:DeleteObject" ], "Resource": "*" } ] }
- on
Google Cloud
, if you create your bucket and cluster with the sameStorage Account
, the only thing you have to add is the following scopes to your node config:Config: &gke.NodeConfig{ MachineType: nodePoolModel.NodeInstanceType, ServiceAccount: nodePoolModel.ServiceAccount, OauthScopes: []string{ ... "https://www.googleapis.com/auth/devstorage.read_write", }, },
- on
Azure
there's no role-based access so far via the Hadoop FS connector, so you have to provide yourStorage Account
credentials, azureStorageAccountName and azureStorageAccessKey toHistory Server
, andspark-submit
options:-Dspark.hadoop.fs.azure.account.key.{{ azureStorageAccountName }}.blob.core.windows.net={{ azureStorageAccessKey }}
Kubernetes
operator that automatically creates buckets on any cloud provider.
Subscribe to
the Shift!
Get emerging insights on emerging technology straight to your inbox.
Unlocking Multi-Cloud Security: Panoptica's Graph-Based Approach
Discover why security teams rely on Panoptica's graph-based technology to navigate and prioritize risks across multi-cloud landscapes, enhancing accuracy and resilience in safeguarding diverse ecosystems.
Related articles
Subscribe
to
the Shift
!Get on emerging technology straight to your inbox.
emerging insights
The Shift keeps you at the forefront of cloud native modern applications, application security, generative AI, quantum computing, and other groundbreaking innovations that are shaping the future of technology.