Cluster Validation - HDFS - HDP

For every PS engagement that involves installation, upgrade, or migration, it is your responsibility to ensure that the installation validation checklist is completely executed. This will ensure:

  • Proper configuration of the cluster
  • Increased customer satisfaction
  • Decreased volume to the Technical Support team

The Importance of the Checklist

The installation validation checklist is a tool that you should refer to throughout the duration of your engagement. It was assembled by the Technical Support team and contains information and best practices gathered while working on multiple projects with multiple clients. The checklists cover the key areas that you should focus on to ensure proper configuration of the components of your cluster, resulting in an operational customer environment.

HDFS Checklist Items

There are multiple items that should be checked to ensure that HDFS is configured properly. The checklist below covers the key areas of focus along with the rationale behind them:

  • Verify that TestDFSIO has targeted 10% of live nodes. See the section below, "Running TestDFSIO."
  • Put 1 GB of data into HDFS. This should take approximately 10-17 seconds depending on the infrastructure. Perform the put several times to see whether the resulting times vary. You can run the following commands:

    mkfile 1g testfile
    hadoop fs -put testfile /tmp
  • Verify the NameNode, DataNode, and JournalNode memory settings. The young generation (NameNode new-generation size) should be set to 1/8 of the total heap size; in general, this is a good starting point. Refer to the Memory Sizing document located in the Hortonworks wiki.
  • Validate that log, data, and metadata mounts have enough room to grow. Log directories should NOT (if possible) be on the same mount as the root OS. Create a separate partition for logs and check the partition to ensure at least a few gigabytes of space are available. Consider providing the hdp-log-archive.sh script to the operations team for log maintenance.
  • Verify ulimits for all service users. Limits should be set to the following values for the hdfs, MR, YARN, and HBase service users, and for root (in a secure cluster only); be sure the root limits are set correctly on secure clusters:

    - nofile 32768
    - nproc 65536

    You can modify the ulimit -l value that the DataNode runs with; this value is usually configured in /etc/security/limits.conf.

    Configure the open file descriptor (nofile) ulimit:

    - the default of 1024 is too low
    - use 16K for DataNodes
    - use 64K for master nodes
  • Verify that user groups, IDs, and home directories are set up uniformly on all nodes.
  • If the NFS Gateway is used, verify that it was mounted with a soft mount. The scratch directory needs to have enough room, and at least 3 GB of memory is required.

    You should have multiple redundant directories for NameNode metadata:

    - one of the dfs.name.dir directories should be on NFS
    - the NFS soft mount should use the options tcp,soft,intr,timeo=20,retrans=5
  • Verify that all relevant mounts are used for the dfs.data.dir property. For example, this should not be set to /tmp or /var/log. Also check whether you have more than 12 mounts for DataNode directories, and that DataNode volume failure toleration (dfs.datanode.failed.volumes.tolerated) is configured to be at least 2.
  • Verify that configurations and environment variables are consistent on all nodes, especially if Ambari is not used. If Ambari is not set up, it is best to have all configurations managed by Puppet (Chef, etc.) or to have one "master" copy distributed across all nodes.
  • Validate that the Hadoop classpath is set up consistently on all nodes. Check this if Ambari is not set up. Below is an example for HDP 2.3:

    /usr/hdp/2.3.0.0-2557/hadoop/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/:/usr/hdp/2.3.0.0-2557/hadoop/.//:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//:::/usr/share/java/mysql-connector-java.jar
  • Verify that all unsupported CDH and HDP 1.x configurations are removed, along with any remaining CDH and HDP 1.x entries in the logs and all CDH and HDP 1.x repositories.
  • Verify the transceiver and handler counts on the NameNodes and DataNodes. Be sure to verify that:

    - dfs.datanode.max.transfer.threads is at least 4096
    - the NameNode handler count is equal to 128 for more than 100 DataNodes
    - the DataNode handler count is between 10 and 20
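
The ulimit values in the checklist are typically set in /etc/security/limits.conf (or a file under /etc/security/limits.d/). A sketch using the values above; the account names (hdfs, yarn, mapred, hbase) are assumptions, so match them to the service accounts actually created on the cluster:

```
# /etc/security/limits.conf -- account names are assumptions; adjust as needed
hdfs    -   nofile   32768
hdfs    -   nproc    65536
yarn    -   nofile   32768
yarn    -   nproc    65536
mapred  -   nofile   32768
mapred  -   nproc    65536
hbase   -   nofile   32768
hbase   -   nproc    65536
# root entries are required on secure clusters only:
root    -   nofile   32768
root    -   nproc    65536
```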
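
The transceiver, handler, and volume-failure settings named in the checklist live in hdfs-site.xml. A sketch with the checklist's values filled in; these are starting points to tune for the cluster's size, not definitive settings:

```xml
<!-- hdfs-site.xml: values taken from the checklist; tune to cluster size -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>   <!-- should be at least 4096 -->
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>128</value>    <!-- for clusters with more than 100 DataNodes -->
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>20</value>     <!-- 10-20 is the recommended range -->
</property>
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>      <!-- per the dfs.data.dir checklist item -->
</property>
```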
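
The mkfile command used in the 1 GB put test above exists only on Solaris; on Linux, dd produces the same 1 GB test file. A minimal sketch (the /tmp target path is taken from the checklist; run the hadoop commands on a cluster node):

```shell
# 'mkfile 1g testfile' is Solaris-only; on Linux, dd creates the same 1 GB file.
dd if=/dev/zero of=testfile bs=1M count=1024 status=none

# On a cluster node, time the put and repeat a few times to see whether the
# timings vary (10-17 seconds is typical per the checklist):
#   time hadoop fs -put testfile /tmp/testfile
#   hadoop fs -rm -skipTrash /tmp/testfile   # remove between runs
#   rm -f testfile                           # local cleanup when done
wc -c < testfile   # prints 1073741824 (1 GB)
```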

Running TestDFSIO

Below are the steps you should take to run TestDFSIO to target 10% of live nodes:

  1. Run TestDFSIO in write mode and create data with the command:

    yarn jar $YARN_EXAMPLES/hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

  2. Run TestDFSIO in read mode with the command:

    yarn jar $YARN_EXAMPLES/hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

  3. Clean up the TestDFSIO data with the command:

    yarn jar $YARN_EXAMPLES/hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar TestDFSIO -clean
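
To size -nrFiles to roughly 10% of live nodes, the live DataNode count can be read from hdfs dfsadmin -report. A sketch, assuming the "Live datanodes (N):" line format of the HDP 2.x report output (verify against your version); the compute_nrfiles helper is a hypothetical name introduced here:

```shell
# compute_nrfiles: map a live-node count to an -nrFiles value of roughly 10%,
# never returning fewer than 1 file.
compute_nrfiles() {
  local live=$1
  local n=$(( live / 10 ))
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

# On a cluster node (report line format is an assumption; verify first):
#   live=$(hdfs dfsadmin -report | awk '/^Live datanodes/ {gsub(/[^0-9]/,"",$0); print; exit}')
#   yarn jar $YARN_EXAMPLES/hadoop-mapreduce-client-jobclient-2.1.0-beta-tests.jar \
#       TestDFSIO -write -nrFiles "$(compute_nrfiles "$live")" -fileSize 1000
```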

Points to remember

  1. The HDFS checklist should be completed to ensure optimal configuration of HDFS in the cluster
  2. You should run TestDFSIO to target at least 10% of live nodes