
Configuring Hadoop 0.23.0 and Later [reposted]

From http://blog.sina.com.cn/s/blog_6813fb24010150l5.html

 

Hadoop 0.23.0 was released November 11, 2011. Being the future of the Hadoop platform, it’s worth checking out even though it is an alpha release.

Note: Many of the instructions in this article came from trial and error, and there are lots of alternative (and possibly better) ways to configure the systems. Please feel free to suggest improvements in the comments. Also, all commands were only tested on Mac OS X.

Download

To get started, download the hadoop-0.23.0.tar.gz file from one of the mirrors listed at http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-0.23.0.

Once downloaded, decompress the file. The bundled documentation is available in share/doc/hadoop/index.html.
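For example, assuming the tarball was saved to the current directory (a minimal sketch; adjust paths as needed):

$ tar xzf hadoop-0.23.0.tar.gz
$ cd hadoop-0.23.0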

Notes for Users of Previous Versions of Hadoop

The directory layout of the Hadoop distribution changed in 0.23.0 and 0.20.204 vs. previous versions. In particular, there are now sbin, libexec, and etc directories in the root of the distribution tarball.

scripts and executables

In hadoop 0.23.0, a number of commonly used scripts from the bin directory have been removed or drastically changed. Specifically, the following scripts were removed (vs. 0.20.205.0):

  • hadoop-config.sh

  • hadoop-daemon(s).sh

  • start-balancer.sh and stop-balancer.sh

  • start-dfs.sh and stop-dfs.sh

  • start-jobhistoryserver.sh and stop-jobhistoryserver.sh

  • start-mapred.sh and stop-mapred.sh

  • task-controller

The start/stop mapred-related scripts have been replaced by “map-reduce 2.0” scripts called yarn-*. The start-all.sh and stop-all.sh scripts no longer start or stop HDFS; instead, they are used to start and stop the yarn daemons. Finally, bin/hadoop has been deprecated. Instead, users should use bin/hdfs and bin/mapred.
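As a rough illustration of the split (a sketch, not from the original post; the paths are just examples), commands that previously went through bin/hadoop now look like this:

$ bin/hdfs dfs -ls /        # previously: bin/hadoop fs -ls /
$ bin/mapred job -list      # previously: bin/hadoop job -list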

Hadoop distributions now also include scripts in an sbin directory. The scripts include start-all.sh, start-dfs.sh, and start-balancer.sh (and the stop versions of those scripts).

configuration directories and files

The conf directory that comes with Hadoop is no longer the default configuration directory. Rather, Hadoop looks in etc/hadoop for configuration files. The libexec directory contains the scripts hadoop-config.sh and hdfs-config.sh, which control where Hadoop pulls configuration information from, and it’s possible to override the location of the configuration directory in the following ways:

  • hdfs-config.sh calls hadoop-config.sh in $HADOOP_COMMON_HOME/libexec and $HADOOP_HOME/libexec

  • hadoop-config.sh accepts a --config option for specifying a config directory, or the directory can be specified using $HADOOP_CONF_DIR (see the sketch after this list).

    • This script also accepts a --hosts parameter to specify the hosts/slaves file.

    • This script uses variables typically set in hadoop-env.sh, such as $JAVA_HOME, $HADOOP_HEAPSIZE, $HADOOP_CLASSPATH, $HADOOP_LOG_DIR, $HADOOP_LOGFILE, and more. See the file for a full list of variables.
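For example (a minimal sketch; the directory path is hypothetical), the config directory can be supplied through the environment or passed explicitly with --config:

$ export HADOOP_CONF_DIR=/path/to/alternate/conf
$ sbin/hadoop-daemon.sh start namenode

or, equivalently:

$ sbin/hadoop-daemon.sh --config /path/to/alternate/conf start namenode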

Configure HDFS

To start hdfs, we will use sbin/start-dfs.sh, which pulls configuration from etc/hadoop by default. We’ll be putting configuration files in that directory, starting with core-site.xml. In core-site.xml, we must specify a fs.default.name:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Next, we want to override the locations where the NameNode and DataNode store data, so that the data lives in a non-transient location. The two relevant parameters are dfs.namenode.name.dir and dfs.datanode.data.dir. We also set replication to 1, since we’re using a single DataNode. These properties go in etc/hadoop/hdfs-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/Users/joecrow/Code/hadoop-0.23.0/data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/Users/joecrow/Code/hadoop-0.23.0/data/hdfs/datanode</value>
  </property>
</configuration>

Notes:

  • as of HDFS-456 and HDFS-873, the namenode and datanode dirs should be specified with a full URI.

  • by default, hadoop starts up with 1000 megabytes of RAM allocated to each daemon. You can change this by adding a hadoop-env.sh to etc/hadoop. There’s a template that can be added with:

    $ cp ./share/hadoop/common/templates/conf/hadoop-env.sh etc/hadoop

    • The template sets up a bogus value for HADOOP_LOG_DIR

    • HADOOP_PID_DIR defaults to /tmp, so you might want to change that variable, too (see the sketch below).
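A minimal hadoop-env.sh sketch along those lines (the values are assumptions; adjust them for your machine):

# etc/hadoop/hadoop-env.sh
export JAVA_HOME=/Library/Java/Home                            # example location on Mac OS X
export HADOOP_LOG_DIR=/Users/joecrow/Code/hadoop-0.23.0/logs   # replace the template's bogus value
export HADOOP_PID_DIR=/Users/joecrow/Code/hadoop-0.23.0/pids   # keep pid files out of /tmp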

Start HDFS

Start the NameNode:

sbin/hadoop-daemon.sh start namenode

Start a DataNode:

sbin/hadoop-daemon.sh start datanode

(Optionally) start the SecondaryNameNode (this is not required for local development, but it definitely is for production).

sbin/hadoop-daemon.sh start secondarynamenode

To confirm that the processes are running, issue jps and look for lines for NameNode, DataNode, and SecondaryNameNode:

$ jps
55036 Jps
55000 SecondaryNameNode
54807 NameNode
54928 DataNode
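As an extra sanity check (not part of the original walkthrough), you can ask the NameNode for a cluster report, which should list one live DataNode:

$ bin/hdfs dfsadmin -report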

Notes:

  • the hadoop daemons log to the “logs” dir. Stdout goes to a file ending in “.out” and a log file ends in “.log”. If a daemon doesn’t start up, check the file that includes the daemon name (e.g. logs/hadoop-joecrow-datanode-jcmba.local.out); see the example after these notes.

  • the commands might say “Unable to load realm info from SCDynamicStore” (at least on Mac OS X). This appears to be harmless output; see HADOOP-7489 for details.
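For example, to watch a DataNode log as the daemon starts (the username and hostname in the file name follow the pattern above and are specific to this machine):

$ tail -f logs/hadoop-joecrow-datanode-jcmba.local.log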

Stopping HDFS

Eventually you’ll want to stop HDFS. Here are the commands to execute, in the given order:

sbin/hadoop-daemon.sh stop secondarynamenode
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh stop namenode

Use jps to confirm that the daemons are no longer running.

Running an example MR Job

This section just gives the commands for configuring and starting the Resource Manager, Node Manager, and Job History Server; it doesn’t explain the details of those. Please refer to the References and Links section for more details.

The Yarn daemons use the conf directory in the distribution for configuration by default. Since we used etc/hadoop as the configuration directory for HDFS, it would be nice to use that as the config directory for mapreduce, too. As a result, we update the following files:

In conf/yarn-env.sh, add the following lines under the definition of YARN_CONF_DIR:

export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$YARN_HOME/etc/hadoop}"
export HADOOP_COMMON_HOME="${HADOOP_COMMON_HOME:-$YARN_HOME}"
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"

In conf/yarn-site.xml, update the contents to:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Set the contents of etc/hadoop/mapred-site.xml to:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Now, start up the yarn daemons:

$ bin/yarn-daemon.sh start resourcemanager
$ bin/yarn-daemon.sh start nodemanager
$ bin/yarn-daemon.sh start historyserver
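As with the HDFS daemons, jps should confirm that all three are up (the process IDs below are made up):

$ jps
55342 ResourceManager
55399 NodeManager
55456 JobHistoryServer
55501 Jps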

A bunch of example jobs are available via the hadoop-examples jar. For example, to run the program that calculates pi:

$ bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar pi \
    -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory \
    -libjars modules/hadoop-mapreduce-client-jobclient-0.23.0.jar 16 10000

The command produces a lot of output, but toward the end you’ll see:

Job Finished in 67.705 seconds
Estimated value of Pi is 3.14127500000000000000
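To see the names of the other bundled examples, you can run the jar with no program name; the examples driver then prints the list of valid program names (a general behavior of Hadoop’s example driver, not shown in the original post):

$ bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar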

Notes

  • By default, the resource manager uses a number of IPC ports, including 8025, 8030, 8040, and 8141. The web UI is exposed on port 8088.

  • By default, the JobHistoryServer uses port 19888 for a web UI and port 10020 for IPC.

  • By default, the node manager uses port 9999 for a web UI and port 4344 for IPC. It also appears to open port 8080 and a seemingly random port (65176 in one run), though what these are for is unclear.

  • The resource manager has a “proxy” URL that it uses to link through to the JobHistoryServer UI, e.g.:

    $ curl -I http://0.0.0.0:8088/proxy/application_1322622103371_0001/jobhistory/job/job_1322622103371_1_1
    HTTP/1.1 302 Found
    Content-Type: text/plain; charset=utf-8
    Location: http://192.168.1.12:19888/jobhistory/job/job_1322622103371_1_1/jobhistory/job/job_1322622103371_1_1
    Content-Length: 0
    Server: Jetty(6.1.26)

Conclusion

While Hadoop 0.23 is an alpha release, getting it up and running in pseudo-distributed mode isn’t too difficult. The new architecture will take some getting used to for users of previous releases of Hadoop, but it’s an exciting step forward.

Observations and Notes

There are a few bugs or gotchas that I discovered or verified and that are worth keeping an eye on as you go through these steps. These include:

  • HADOOP-7837 log4j isn’t set up correctly when using sbin/start-dfs.sh

  • HDFS-2574 Deprecated parameters appear in the hdfs-site.xml templates.

  • HDFS-2595 Misleading message when fs.default.name is not set and running sbin/start-dfs.sh

  • HDFS-2553 BlockPoolScanner spinning in a loop (causes the DataNode to peg one CPU at 100%).

  • HDFS-2608 NameNode web UI references missing hadoop.css

References and Links

