Performance tuning - SPSS Analytic Server

Analytic Server is a component in the Ambari framework that utilizes other components such as HDFS, YARN, and Spark. Common performance tuning techniques for Hadoop, HDFS, and Spark apply to Analytic Server workloads.

Every Analytic Server workload is different; therefore, tuning experimentation is required based on your specific deployment's workload.

The following properties and tuning tips are key changes that have impacted the results of the Analytic Server benchmarking and scaling tests.

When the first job runs on Analytic Server, the server starts a persistent Spark application that remains active until Analytic Server is shut down. The persistent Spark application allocates and holds all of the cluster resources assigned to it for as long as Analytic Server is running, even when no Analytic Server job is actively running. Give careful thought to the amount of resources allocated to the Analytic Server Spark application. If all cluster resources are allocated to the Analytic Server Spark application, other jobs could be delayed or prevented from running: those jobs are queued waiting for sufficient free resources while the resources are held by the Analytic Server Spark application.

If multiple Analytic Server services are configured and deployed, each service instance could potentially allocate its own persistent Spark application.

For example, if two Analytic Server services are deployed to support high availability failover, then you could see two persistent Spark applications active, each allocating cluster resources.

An additional complexity is that, in certain situations, Analytic Server may start a MapReduce job that requires cluster resources. These MapReduce jobs need resources that are not allocated to the Spark application. The specific components that require MapReduce jobs are PSM model builds.


The following properties can be configured to allocate resources to the Spark application. If they are set in the spark-defaults.conf of the Spark installation, they are applied to all Spark jobs run in the environment. If they are set in the Analytic Server configuration as custom properties under the “Custom analytic.cfg” section, they are applied to the Analytic Server Spark application only.

spark.executor.memory
    Amount of memory to use per executor process.
spark.executor.instances
    The number of executor processes to start.
spark.executor.cores
    The number of executor worker threads per executor process.

The following example shows one way to set the three key Spark properties. Assume there are 10 data nodes in an HDFS cluster, each data node has 24 logical cores and 48 GB of memory, and the nodes are only running HDFS processes. The configuration assumes you are only running Analytic Server jobs in this environment and want maximum allocation to a single Analytic Server Spark application (a configuration sketch follows the list).

  • Set spark.executor.instances=20
    This would attempt to run 2 Spark executor processes per data node.

  • Set spark.executor.memory=22G
    This would set the maximum heap size for each Spark executor process to 22 GB, allocating 44 GB on each data node. The remaining memory is needed by other JVMs and the operating system.

  • Set spark.executor.cores=5
    This will provide 5 worker threads for each Spark executor, for a total of 10 worker threads per data node.
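
As a sketch only (the right values depend on your own cluster), these settings might appear as follows in spark-defaults.conf, where each property and its value are separated by whitespace; if you set them as custom properties under “Custom analytic.cfg” instead, enter the same names and values there.

  spark.executor.instances   20
  spark.executor.memory      22g
  spark.executor.cores       5

With these values, each data node runs 2 executors holding 2 x 22 GB = 44 GB of executor heap and 2 x 5 = 10 worker threads, leaving roughly 4 GB of memory and 14 logical cores for the HDFS daemons and the operating system.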

Monitor the Spark UI for running jobs

If you see tasks spilling to disk, performance could be impacted.
Some possible solutions are:

  1. Increase memory and allocate it to the Spark executors via spark.executor.memory.

  2. Reduce the number of spark.executor.cores. This will reduce the number of concurrent worker threads allocating memory, but it will also reduce the amount of parallelism for the jobs.

  3. Change the Spark memory properties. spark.shuffle.memoryFraction and spark.storage.memoryFraction control the percentage of the Spark executor heap that is allocated to the shuffle and to Spark storage (caching), respectively (see the sketch after this list).
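
As an illustrative sketch only (the right values depend on your workload), solutions 2 and 3 might look like the following in spark-defaults.conf or under “Custom analytic.cfg”. Note that spark.shuffle.memoryFraction and spark.storage.memoryFraction are legacy Spark 1.x properties, and the values shown are assumptions for illustration, not recommendations.

  # Fewer concurrent worker threads per executor reduces memory pressure (and parallelism)
  spark.executor.cores           3
  # Shift part of the executor heap from storage (default 0.6) to the shuffle (default 0.2)
  spark.shuffle.memoryFraction   0.3
  spark.storage.memoryFraction   0.5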

Ensure the name node has enough memory
If the number of blocks in HDFS is large and growing, ensure that your name node heap size increases to accommodate this growth. This is a common HDFS tuning recommendation.
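
As a hedged example only, outside of Ambari the name node heap is typically raised in hadoop-env.sh; on an Ambari-managed cluster, adjust the equivalent NameNode heap setting in the HDFS service configuration instead. The 8 GB value below is an assumption for illustration; size the heap to your actual block count.

  # hadoop-env.sh: pass a larger heap to the NameNode JVM (example value)
  export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g ${HADOOP_NAMENODE_OPTS}"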

Alter the amount of memory used for caching
By default, spark.storage.memoryFraction has a value of 0.6. This can be increased up to 0.8 when the HDFS block size of the data is 64 MB. If the HDFS block size of the input data is greater than 64 MB, increase this value only if the memory allocated per task is greater than 2 GB.
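
For example, under the conditions above, a minimal sketch of the change in spark-defaults.conf or under “Custom analytic.cfg” (legacy Spark 1.x property; the value is an assumption for illustration):

  # Raise the storage (cache) share of the executor heap from the 0.6 default
  spark.storage.memoryFraction   0.8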

Spark map-side join
The Analytic Server Spark join implementation does not support map-side join functionality (the Spark join is mainly reduce-side), so it does not take advantage of map-side joins to optimize the case where one join input is small. Not taking advantage of map-side joins can result in an extremely resource-intensive Spark job that eventually fails.

