Wednesday, June 17, 2009

Hadoop cluster setup (0.20.0)

On the NameNode (master) machine:

1. In the file /etc/hosts, define the IP addresses of the namenode machine and all the datanode machines. Make sure you use the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1) for every machine, including the namenode; otherwise the datanodes will not be able to connect to the namenode machine.

    192.168.1.9    hadoop-namenode
    192.168.1.8    hadoop-datanode1
    192.168.1.7    hadoop-datanode2

    Note: Check that the namenode hostname resolves to the actual IP and not the localhost IP, e.g. with "ping hadoop-namenode".
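
    A quick way to verify this on each machine, assuming a Linux box with getent available:

    # should print the LAN address (e.g. 192.168.1.9), not 127.0.0.1 or 127.0.1.1
    getent hosts hadoop-namenode
    ping -c 1 hadoop-namenode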

2. Configure passwordless SSH login from the namenode to all datanode machines. Refer to Configure passwordless ssh access for detailed instructions; a minimal sketch is also shown below.
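
    A minimal sketch of the usual OpenSSH setup, assuming the daemons run as a user named hadoop on every machine (adjust the user and host names to your environment):

    # on the namenode, generate a key pair (accept the default path, empty passphrase)
    ssh-keygen -t rsa

    # copy the public key to each datanode so ssh no longer asks for a password
    ssh-copy-id hadoop@hadoop-datanode1
    ssh-copy-id hadoop@hadoop-datanode2

    # verify that login works without a prompt
    ssh hadoop@hadoop-datanode1 hostname

    If ssh-copy-id is not available, append the contents of ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each datanode manually.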

3. Download hadoop-0.20.0.tar.gz from the Hadoop website and unpack it to some path on your machine (we'll refer to the Hadoop installation root as $HADOOP_INSTALL_DIR from now on).
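
    For example, assuming the tarball is already in the current directory and /opt is the chosen install location:

    # unpack into /opt and point $HADOOP_INSTALL_DIR at the extracted directory
    tar -xzf hadoop-0.20.0.tar.gz -C /opt
    export HADOOP_INSTALL_DIR=/opt/hadoop-0.20.0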

4. Edit the file $HADOOP_INSTALL_DIR/conf/hadoop-env.sh and set JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun
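
    If you are unsure where your JDK lives, one way to locate it (the exact path varies by distribution and JVM vendor):

    # prints the resolved path of the java binary; JAVA_HOME is the JDK directory above bin/ (or jre/bin/)
    readlink -f $(which java)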

5. Edit the file $HADOOP_INSTALL_DIR/conf/core-site.xml and add the following properties. (This configuration is required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://hadoop-namenode:54310</value>
            <description>The name of the default file system.  A URI whose
            scheme and authority determine the FileSystem implementation. The
            uri's scheme determines the config property (fs.SCHEME.impl) naming
            the FileSystem implementation class.  The uri's authority is used to
            determine the host, port, etc. for a filesystem.
            </description>
        </property>

        <property>
            <name>hadoop.tmp.dir</name>
            <value>/opt/hdfs/tmp</value>
            <description>A base for other temporary directories.</description>
        </property>
    </configuration>

    Note: Remember to replace hadoop-namenode with your real namenode machine name here.

6. Edit the file $HADOOP_INSTALL_DIR/conf/hdfs-site.xml and add the following properties. (This file defines the properties of the namenode and datanode).

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>dfs.name.dir</name>
            <value>/opt/hdfs/name</value>
            <description>Determines where on the local filesystem the DFS name
            node should store the name table (fsimage). If this is a
            comma-delimited list of directories, then the name table is
            replicated in all of the directories, for redundancy.
            </description>
        </property>

        <property>
            <name>dfs.data.dir</name>
            <value>/opt/hdfs/data</value>
            <description>Determines where on the local filesystem the DFS data
            node should store its blocks.  If this is a comma-delimited list of
            directories, then data will be stored in all named directories,
            typically on different devices. Directories that do not exist are
            ignored.
            </description>
        </property>

        <property>
            <name>dfs.replication</name>
            <value>1</value>
            <description>Default block replication.
            The actual number of replications can be specified when the file is
            created. The default is used if replication is not specified in
            create time.
            </description>
        </property>

        <property>
            <name>dfs.datanode.du.reserved</name>
            <value>53687090000</value>
            <description>Reserved space in bytes per volume for non-DFS use.</description>
        </property>
    </configuration>

    Note: Remember to replace the namenode and datanode machine names with real machine names here.
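
    It is a good idea to create the local directories referenced above (dfs.name.dir, dfs.data.dir and hadoop.tmp.dir) up front on every node and make them writable by the user that runs the Hadoop daemons (assumed to be a user called hadoop here; adjust to your setup). Also note that the sample dfs.datanode.du.reserved value of 53687090000 bytes is roughly 50 GB.

    # run as root (or via sudo) on every node in the cluster
    mkdir -p /opt/hdfs/name /opt/hdfs/data /opt/hdfs/tmp
    chown -R hadoop:hadoop /opt/hdfs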

7. Edit $HADOOP_INSTALL_DIR/conf/mapred-site.xml and add the following configuration. (This file defines the configuration for the mappers and reducers, i.e. the MapReduce layer.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
    </configuration>

    Note:
    1. If you only need to use HDFS, leave this file empty.
    2. Remember to replace the namenode and datanode machine names with real machine names here.

8. Edit $HADOOP_INSTALL_DIR/conf/masters and add the machine names where the secondary namenodes will run.

    hadoop-secondarynamenode1
    hadoop-secondarynamenode2
    
    Note: If you want the secondary namenode to run on the same machine as the primary namenode, enter the machine name of the primary namenode machine.

9. Edit $HADOOP_INSTALL_DIR/conf/slaves and add all the datanode machine names. If you are running a datanode on the namenode machine, remember to add that machine as well.

    hadoop-namenode
    hadoop-datanode1
    hadoop-datanode2

    Note: Add the namenode machine name only if you are running a datanode on the namenode machine.
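
    The start/stop scripts simply ssh into every host listed in this file, so it is worth confirming up front that each one is reachable without a password, for example:

    # each slave should print its hostname without prompting for a password
    for host in $(cat $HADOOP_INSTALL_DIR/conf/slaves); do ssh $host hostname; done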

On the DataNode (slave) machines:

1. In the file /etc/hosts, define the IP address of the namenode machine. Make sure you use the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1).

    192.168.1.9    hadoop-namenode

    Note: Check that the namenode hostname resolves to the actual IP and not the localhost IP, e.g. with "ping hadoop-namenode".

2. Configure passwordless SSH login from all datanode machines to the namenode machine. Refer to Configure passwordless ssh access for instructions on how to set up passwordless SSH access.

3. Download hadoop-0.20.0.tar.gz from the Hadoop website and unpack it to some path on your machine (we'll refer to the Hadoop installation root as $HADOOP_INSTALL_DIR from now on).

4. Edit the file $HADOOP_INSTALL_DIR/conf/hadoop-env.sh and set JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun

5. Edit the file $HADOOP_INSTALL_DIR/conf/core-site.xml and add the following properties. (This configuration is required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop-namenode:54310</value>
        <description>The name of the default file system.  A URI whose scheme
        and authority determine the FileSystem implementation. The uri's scheme
        determines the config property (fs.SCHEME.impl) naming the FileSystem
        implementation class.  The uri's authority is used to determine the
        host, port, etc. for a filesystem.
        </description>
    </property>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    </configuration>

6. Edit the file $HADOOP_INSTALL_DIR/conf/hdfs-site.xml and add the following properties. (This file defines the properties of the namenode and datanode).

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>dfs.name.dir</name>
            <value>/opt/hdfs/name</value>
            <description>Determines where on the local filesystem the DFS name
            node should store the name table (fsimage). If this is a
            comma-delimited list of directories, then the name table is
            replicated in all of the directories, for redundancy.
            </description>
        </property>

        <property>
            <name>dfs.data.dir</name>
            <value>/opt/hdfs/data</value>
            <description>Determines where on the local filesystem the DFS data
            node should store its blocks.  If this is a comma-delimited list of
            directories, then data will be stored in all named directories,
            typically on different devices. Directories that do not exist are
            ignored.
            </description>
        </property>

        <property>
            <name>dfs.replication</name>
            <value>1</value>
            <description>Default block replication.
            The actual number of replications can be specified when the file is
            created. The default is used if replication is not specified in
            create time.
            </description>
        </property>

        <property>
            <name>dfs.datanode.du.reserved</name>
            <value>53687090000</value>
            <description>Reserved space in bytes per volume for non-DFS use.</description>
        </property>
    </configuration>

7. Edit $HADOOP_INSTALL_DIR/conf/mapred-site.xml and add the following configuration. (This file defines the configuration for the mappers and reducers, i.e. the MapReduce layer.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
    </configuration>

    Note: If you only need to use HDFS, leave this file empty.

Starting and stopping the Hadoop daemons:

1. Before you start the Hadoop daemons for the first time, you need to format the HDFS filesystem. Execute the following command on the namenode machine.

    $HADOOP_INSTALL_DIR/bin/hadoop namenode -format

2. You need to start/stop the daemons only on the master machine; the scripts will start/stop the daemons on all slave machines.

    To start/stop all the daemons, execute one of the following commands:

    $HADOOP_INSTALL_DIR/bin/start-all.sh
    or
    $HADOOP_INSTALL_DIR/bin/stop-all.sh

    To start/stop only the DFS daemons, execute one of the following commands:

    $HADOOP_INSTALL_DIR/bin/start-dfs.sh
    or
    $HADOOP_INSTALL_DIR/bin/stop-dfs.sh
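
    Once the daemons are up, a couple of quick checks on the namenode confirm that the cluster is healthy (jps ships with the JDK; dfsadmin -report lists the datanodes that have registered):

    # NameNode, SecondaryNameNode and, if running locally, DataNode processes should be listed
    jps

    # shows configured capacity and the live datanodes
    $HADOOP_INSTALL_DIR/bin/hadoop dfsadmin -report

    The NameNode web interface (http://hadoop-namenode:50070 by default) shows the same information in a browser.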