Wednesday, December 09, 2009

HBase setup (0.20.0)

Before you begin:

Before you start configuring HBase, you need a running Hadoop cluster, which will be the storage layer for HBase. Please refer to the Hadoop cluster setup document before continuing.

On the HBaseMaster (master) machine:

1. In the file /etc/hosts, define the IP addresses of the HBase masterserver machine, all the regionserver machines, and the Hadoop namenode machine. Make sure you define the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1) for all the machines, including the masterserver; otherwise the regionservers will not be able to connect to the masterserver machine.

    192.168.1.9    hbase-masterserver
    192.168.1.8    hbase-regionserver1
    192.168.1.7    hbase-regionserver2
    192.168.1.6    hadoop-nameserver

    Note: Check that each machine name resolves to its actual IP and not the localhost IP, e.g. using "ping hbase-masterserver".

2. Configure passwordless login from the masterserver to all regionserver machines. Refer to the Configuring passwordless ssh access document for instructions on how to set up passwordless ssh access.

3. Download and unpack hbase-0.20.0.tar.gz from the HBase website to some path on your machine (we'll refer to the HBase installation root as $HBASE_INSTALL_DIR from now on).

4. Edit the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and define the $JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun

5. Edit the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties. (These configurations are required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>hbase.master</name>
            <value>hbase-masterserver:60000</value>
            <description>The host and port that the HBase master runs at.
                A value of 'local' runs the master and a regionserver in
                a single process.
            </description>
        </property>

        <property>

            <name>hbase.rootdir</name>
            <value>hdfs://hadoop-nameserver:9000/hbase</value>
            <description>The directory shared by region servers.</description>
        </property>

        <property>
            <name>hbase.cluster.distributed</name>
            <value>true</value>
            <description>The mode the cluster will be in. Possible values are
            false: standalone and pseudo-distributed setups with managed
            Zookeeper true: fully-distributed with unmanaged Zookeeper
            Quorum (see hbase-env.sh)
            </description>
        </property>

    </configuration>
                
    Note: Remember to replace the machine names above with your real machine names.

6. Edit $HBASE_INSTALL_DIR/conf/regionservers and add all the regionserver machine names.

    hbase-regionserver1
    hbase-regionserver2
    hbase-masterserver

    Note: Add the masterserver machine name only if you are running a regionserver on the masterserver machine.

On HRegionServer (slave) machine:


1. In the file /etc/hosts, define the IP addresses of the masterserver machine and the Hadoop namenode machine. Make sure you define the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1).

    192.168.1.9    hbase-masterserver

Note: Check that the masterserver machine name resolves to its actual IP and not the localhost IP, e.g. using "ping hbase-masterserver".

2. Configure passwordless login from all regionserver machines to the masterserver machine. Refer to the Configuring passwordless ssh access document for instructions on how to set up passwordless ssh access.

3. Download and unpack hbase-0.20.0.tar.gz from the HBase website to some path on your machine (we'll refer to the HBase installation root as $HBASE_INSTALL_DIR from now on).

4. Edit the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and define the $JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun

5. Edit the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties. (These configurations are required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>hbase.master</name>
            <value>hbase-masterserver:60000</value>
            <description>The host and port that the HBase master runs at.
                A value of 'local' runs the master and a regionserver in
                a single process.
            </description>
        </property>

        <property>

            <name>hbase.rootdir</name>
            <value>hdfs://hadoop-nameserver:9000/hbase</value>
            <description>The directory shared by region servers.</description>
        </property>

        
        <property>
            <name>hbase.cluster.distributed</name>
            <value>true</value>
            <description>The mode the cluster will be in. Possible values are
            false: standalone and pseudo-distributed setups with managed
            Zookeeper true: fully-distributed with unmanaged Zookeeper 
            Quorum (see hbase-env.sh)
            </description>
        </property>
    </configuration>

Start and stop HBase daemons:

You need to start or stop the daemons only on the masterserver machine; it will start or stop the daemons on all the regionserver machines. Execute the following commands to start or stop HBase.

    $HBASE_INSTALL_DIR/bin/start-hbase.sh
    or
    $HBASE_INSTALL_DIR/bin/stop-hbase.sh
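
Once the daemons are up, you can sanity-check the cluster from a Java client by connecting with HBaseAdmin and listing the tables. A minimal sketch (it assumes the configuration file is at /opt/hbase-0.20.0/conf/hbase-site.xml, and the class name HBaseClusterCheck is just illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class HBaseClusterCheck {
        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            // Assumed config path; point this at your own hbase-site.xml.
            conf.addResource(new Path("/opt/hbase-0.20.0/conf/hbase-site.xml"));

            // The constructor throws MasterNotRunningException if the
            // master cannot be contacted.
            HBaseAdmin admin = new HBaseAdmin(conf);
            System.out.println("Master running: " + admin.isMasterRunning());

            // List the tables known to the cluster (none on a fresh install).
            for (HTableDescriptor table : admin.listTables()) {
                System.out.println("Table: " + table.getNameAsString());
            }
        }
    }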



Thursday, June 18, 2009

Secondary indexes in HBase

Creating secondary indexes in HBase-0.19.3:

You need to enable indexing in HBase before you can create secondary indexes on columns. Edit the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties to it.

    <property>
        <name>hbase.regionserver.class</name>
        <value>org.apache.hadoop.hbase.ipc.IndexedRegionInterface</value>
    </property>

    <property>
        <name>hbase.regionserver.impl</name>
        <value>
        org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer
        </value>
    </property>

Adding a secondary index while creating a table:

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    HTableDescriptor desc = new HTableDescriptor("test_table");

    desc.addFamily(new HColumnDescriptor("columnfamily1:"));
    desc.addFamily(new HColumnDescriptor("columnfamily2:"));

    desc.addIndex(new IndexSpecification("column1",
        Bytes.toBytes("columnfamily1:column1")));

    desc.addIndex(new IndexSpecification("column2",
        Bytes.toBytes("columnfamily1:column2")));


    IndexedTableAdmin admin = null;
    admin = new IndexedTableAdmin(conf);

    admin.createTable(desc);

Adding an index to an existing table:

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    IndexedTableAdmin admin = null;
    admin = new IndexedTableAdmin(conf);

    admin.addIndex(Bytes.toBytes("test_table"), new IndexSpecification("column2",
    Bytes.toBytes("columnfamily1:column2")));

Deleting an existing index from a table:

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    IndexedTableAdmin admin = null;
    admin = new IndexedTableAdmin(conf);

    admin.removeIndex(Bytes.toBytes("test_table"), "column2");

Reading from secondary indexed columns:

To read from a secondary index, get a scanner for the index and scan through the data.

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

    // You need to specify which columns to get
    Scanner scanner = table.getIndexedScanner("column1",
        HConstants.EMPTY_START_ROW, null, null, new byte[][] {
        Bytes.toBytes("columnfamily1:column1"),
        Bytes.toBytes("columnfamily1:column2") });

    for (RowResult rowResult : scanner) {
        String value1 = new String(
            rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue());

        String value2 = new String(
            rowResult.get(Bytes.toBytes("columnfamily1:column2")).getValue());

        System.out.println(value1 + ", " + value2);
    }

    table.close();

To get a scanner over a subset of the rows, specify a column filter.

    ColumnValueFilter filter =
        new ColumnValueFilter(Bytes.toBytes("columnfamily1:column1"),
        CompareOp.LESS, Bytes.toBytes("value1-10"));

    scanner = table.getIndexedScanner("column1", HConstants.EMPTY_START_ROW,
        null, filter, new byte[][] { Bytes.toBytes("columnfamily1:column1"),
        Bytes.toBytes("columnfamily1:column2") });

    for (RowResult rowResult : scanner) {
        String value1 = new String(
            rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue());

        String value2 = new String(
            rowResult.get(Bytes.toBytes("columnfamily1:column2")).getValue());

        System.out.println(value1 + ", " + value2);
    }

Example Code:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.client.tableindexed.IndexSpecification;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTable;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTableAdmin;
import org.apache.hadoop.hbase.filter.ColumnValueFilter;
import org.apache.hadoop.hbase.filter.ColumnValueFilter.CompareOp;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexTest {
    public void writeToTable() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

        String row = "test_row";
        BatchUpdate update = null;

        for (int i = 0; i < 100; i++) {
            update = new BatchUpdate(row + i);
            update.put("columnfamily1:column1", Bytes.toBytes("value1-" + i));
            update.put("columnfamily1:column2", Bytes.toBytes("value2-" + i));
            table.commit(update);
        }

        table.close();
    }

    public void readAllRowsFromSecondaryIndex() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

        Scanner scanner = table.getIndexedScanner("column1",
            HConstants.EMPTY_START_ROW, null, null, new byte[][] {
            Bytes.toBytes("columnfamily1:column1"),
                Bytes.toBytes("columnfamily1:column2") });


        for (RowResult rowResult : scanner) {
            System.out.println(Bytes.toString(
                rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue())
                + ", " + Bytes.toString(rowResult.get(
                Bytes.toBytes("columnfamily1:column2")).getValue()
                ));
        }

        table.close();
    }

    public void readFilteredRowsFromSecondaryIndex() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

        ColumnValueFilter filter =
            new ColumnValueFilter(Bytes.toBytes("columnfamily1:column1"),

            CompareOp.LESS, Bytes.toBytes("value1-40"));

        Scanner scanner = table.getIndexedScanner("column1",
            HConstants.EMPTY_START_ROW, null, filter,
            new byte[][] { Bytes.toBytes("columnfamily1:column1"),
                Bytes.toBytes("columnfamily1:column2")

            });

        for (RowResult rowResult : scanner) {
            System.out.println(Bytes.toString(
                rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue())
                + ", " + Bytes.toString(rowResult.get(
                Bytes.toBytes("columnfamily1:column2")).getValue()
                ));
        }

        table.close();
    }

    public void createTableWithSecondaryIndexes() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        HTableDescriptor desc = new HTableDescriptor("test_table");

        desc.addFamily(new HColumnDescriptor("columnfamily1:"));
        desc.addFamily(new HColumnDescriptor("columnfamily2:"));

        desc.addIndex(new IndexSpecification("column1",
            Bytes.toBytes("columnfamily1:column1")));
        desc.addIndex(new IndexSpecification("column2",
            Bytes.toBytes("columnfamily1:column2")));

        IndexedTableAdmin admin = null;
        admin = new IndexedTableAdmin(conf);

        if (admin.tableExists(Bytes.toBytes("test_table"))) {
            if (admin.isTableEnabled("test_table")) {
                admin.disableTable(Bytes.toBytes("test_table"));
            }

            admin.deleteTable(Bytes.toBytes("test_table"));
        }

        if (admin.tableExists(Bytes.toBytes("test_table-column1"))) {
            if (admin.isTableEnabled("test_table-column1")) {
                admin.disableTable(Bytes.toBytes("test_table-column1"));
            }

            admin.deleteTable(Bytes.toBytes("test_table-column1"));
        }

        admin.createTable(desc);
    }

    public void addSecondaryIndexToExistingTable() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTableAdmin admin = null;
        admin = new IndexedTableAdmin(conf);

        admin.addIndex(Bytes.toBytes("test_table"),
            new IndexSpecification("column2",
            Bytes.toBytes("columnfamily1:column2")));

    }

    public void removeSecondaryIndexFromExistingTable() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTableAdmin admin = null;
        admin = new IndexedTableAdmin(conf);

        admin.removeIndex(Bytes.toBytes("test_table"), "column2");
    }

    public static void main(String[] args) throws IOException {
        SecondaryIndexTest test = new SecondaryIndexTest();

        test.createTableWithSecondaryIndexes();
        test.writeToTable();
        test.addSecondaryIndexToExistingTable();
        test.removeSecondaryIndexFromExistingTable();
        test.readAllRowsFromSecondaryIndex();
        test.readFilteredRowsFromSecondaryIndex();

        System.out.println("Done!");
    }
}

Using HBase in Java (0.19.3)

Using HBase in Java

Create an HBaseConfiguration object to connect to an HBase server. You need to tell the configuration object where to read the HBase configuration from. To do this, add a resource to the HBaseConfiguration object.
    
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

Create an HTable object for the table you want to work with. The HTable object connects you to a single table in HBase.

    HTable table = new HTable(conf, "test_table");

Create a BatchUpdate object on a row to perform update operations (like put and delete)

    BatchUpdate batchUpdate = new BatchUpdate("test_row1");
    batchUpdate.put("columnfamily1:column1", Bytes.toBytes("some value"));
    batchUpdate.delete("columnfamily1:column2");

Commit the changes to table using HTable#commit() method.

    table.commit(batchUpdate);

To read one column value from a row, use the HTable#get() method.

    Cell cell = table.get("test_row1", "columnfamily1:column1");
    if (cell != null) {
        String valueStr = Bytes.toString(cell.getValue());
        System.out.println("test_row1:columnfamily1:column1 " + valueStr);
    }

To read a whole row of data, use the HTable#getRow() method.
 
    RowResult singleRow = table.getRow(Bytes.toBytes("test_row1"));
    Cell cell = singleRow.get(Bytes.toBytes("columnfamily1:column1"));
    if(cell!=null) {
        System.out.println(Bytes.toString(cell.getValue()));
    }

    cell = singleRow.get(Bytes.toBytes("columnfamily1:column2"));
    if(cell!=null) {
        System.out.println(Bytes.toString(cell.getValue()));
    }

To get multiple rows, use a Scanner and iterate through it to get the values.

    Scanner scanner = table.getScanner(
        new String[] { "columnfamily1:column1" });


    // First approach: iterate the scanner explicitly.

    RowResult rowResult = scanner.next();
    while (rowResult != null) {
        System.out.println("Found row: " + Bytes.toString(rowResult.getRow())
            + " with value: " +
            rowResult.get(Bytes.toBytes("columnfamily1:column1")));

        rowResult = scanner.next();
    }

    // The other approach is to use a foreach loop; Scanner is Iterable.
    // Note: the scanner was already exhausted by the while loop above, so
    // this loop will not print anything; in practice, use one approach or
    // the other.
    for (RowResult result : scanner) {
        System.out.println("Found row: " + Bytes.toString(result.getRow())
            + " with value: " +
            result.get(Bytes.toBytes("columnfamily1:column1")));

    }

    scanner.close();

Example Code:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {

    public static void main(String args[]) throws IOException {

        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        HTable table = new HTable(conf, "test_table");

        BatchUpdate batchUpdate = new BatchUpdate("test_row1");
        batchUpdate.put("columnfamily1:column1", Bytes.toBytes("some value"));
        // Delete another column in the same row (columns are named "family:qualifier").
        batchUpdate.delete("columnfamily1:column2");
        table.commit(batchUpdate);

        Cell cell = table.get("test_row1", "columnfamily1:column1");
        if (cell != null) {
            String valueStr = Bytes.toString(cell.getValue());
            System.out.println("test_row1:columnfamily1:column1 " + valueStr);
        }

        RowResult singleRow = table.getRow(Bytes.toBytes("test_row1"));
        cell = singleRow.get(Bytes.toBytes("columnfamily1:column1"));
        if(cell!=null) {
            System.out.println(Bytes.toString(cell.getValue()));
        }

        cell = singleRow.get(Bytes.toBytes("columnfamily1:column2"));
        if(cell!=null) {
            System.out.println(Bytes.toString(cell.getValue()));
        }

        Scanner scanner = table.getScanner(
            new String[] { "columnfamily1:column1" });

        // First approach: iterate the scanner explicitly.
        RowResult rowResult = scanner.next();
        while (rowResult != null) {
            System.out.println("Found row: " + Bytes.toString(rowResult.getRow())
                + " with value: " +
                rowResult.get(Bytes.toBytes("columnfamily1:column1")));

            rowResult = scanner.next();
        }

        // The other approach is to use a foreach loop; Scanner is Iterable.
        // Note: the scanner was already exhausted by the while loop above,
        // so this loop will not print anything; in practice, use one
        // approach or the other.
        for (RowResult result : scanner) {
            // print out the row we found and the columns we were looking for
            System.out.println("Found row: " + Bytes.toString(result.getRow())
                + " with value: " +
                result.get(Bytes.toBytes("columnfamily1:column1")));

        }

        scanner.close();
        table.close();
    }
}

Using HDFS in Java (0.20.0)

Below is a code sample showing how to read from and write to HDFS in Java.

1. Creating a Configuration object: To be able to read from or write to HDFS, you need to create a Configuration object and pass configuration parameters to it using the Hadoop configuration files.
 
    // The conf object will read the HDFS configuration parameters from these
    // XML files. You may also set the parameters programmatically if you want.


    Configuration conf = new Configuration();
    conf.addResource(new Path("/opt/hadoop-0.20.0/conf/core-site.xml"));
    conf.addResource(new Path("/opt/hadoop-0.20.0/conf/hdfs-site.xml"));

    If you do not assign the configurations to the conf object (using the Hadoop XML files), your HDFS operations will be performed on the local file system and not on HDFS.

2. Adding a file to HDFS: Create a FileSystem object and use an output stream to write the file.

    FileSystem fileSystem = FileSystem.get(conf);
   
    // Check if the file already exists

    Path path = new Path("/path/to/file.ext");
    if (fileSystem.exists(path)) {
        System.out.println("File " + path + " already exists");
        return;
    }

    // Create a new file and write data to it.
    FSDataOutputStream out = fileSystem.create(path);
    InputStream in = new BufferedInputStream(new FileInputStream(
        new File(source)));


    byte[] b = new byte[1024];
    int numBytes = 0;
    while ((numBytes = in.read(b)) > 0) {
        out.write(b, 0, numBytes);
    }

    // Close all the file descriptors
    in.close();
    out.close();
    fileSystem.close();
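
    As a side note, the manual read/write loop above can be replaced with Hadoop's org.apache.hadoop.io.IOUtils helper. A minimal sketch, reusing the conf, path and source values from the snippets above:

    // Same upload as above, but letting IOUtils do the buffered copy.
    // The final argument ('true') closes both streams once the copy is done.
    FileSystem fileSystem = FileSystem.get(conf);
    InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));
    FSDataOutputStream out = fileSystem.create(path);
    IOUtils.copyBytes(in, out, 4096, true);
    fileSystem.close();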

3. Reading a file from HDFS: Open an input stream to a file in HDFS and read it.

    FileSystem fileSystem = FileSystem.get(conf);

    Path path = new Path("/path/to/file.ext");

    if (!fileSystem.exists(path)) {
        System.out.println("File does not exist");
        return;
    }

    FSDataInputStream in = fileSystem.open(path);


    String filename = path.getName();


    OutputStream out = new BufferedOutputStream(new FileOutputStream(
        new File(filename)));


    byte[] b = new byte[1024];
    int numBytes = 0;
    while ((numBytes = in.read(b)) > 0) {
        out.write(b, 0, numBytes);
    }

    in.close();
    out.close();
    fileSystem.close();

4. Deleting a file from HDFS: Check that the file exists, then delete it.

    FileSystem fileSystem = FileSystem.get(conf);

    Path path = new Path("/path/to/file.ext");
    if (!fileSystem.exists(path)) {
        System.out.println("File does not exist");
        return;
    }

    // Delete the file; the second argument enables recursive deletion
    // (only relevant for directories).
    fileSystem.delete(path, true);


    fileSystem.close();

5. Creating a directory in HDFS: Check that the directory does not already exist, then create it.

    FileSystem fileSystem = FileSystem.get(conf);

    Path path = new Path("/path/to/dir");
    if (fileSystem.exists(path)) {
        System.out.println("Dir " + path + " already exists");
        return;
    }

    // Create directories
    fileSystem.mkdirs(path);


    fileSystem.close();

Code:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSClient {
    public HDFSClient() {

    }

    public void addFile(String source, String dest) throws IOException {
        Configuration conf = new Configuration();

        // Conf object will read the HDFS configuration parameters from these
        // XML files.
        conf.addResource(new Path("/opt/hadoop-0.20.0/conf/core-site.xml"));
        conf.addResource(new Path("/opt/hadoop-0.20.0/conf/hdfs-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);

        // Get the filename out of the file path
        String filename = source.substring(source.lastIndexOf('/') + 1,
            source.length());


        // Create the destination path including the filename.
        if (dest.charAt(dest.length() - 1) != '/') {
            dest = dest + "/" + filename;
        } else {
            dest = dest + filename;
        }

        // System.out.println("Adding file to " + destination);

        // Check if the file already exists
        Path path = new Path(dest);
        if (fileSystem.exists(path)) {
            System.out.println("File " + dest + " already exists");
            return;
        }

        // Create a new file and write data to it.
        FSDataOutputStream out = fileSystem.create(path);
        InputStream in = new BufferedInputStream(new FileInputStream(
            new File(source)));


        byte[] b = new byte[1024];
        int numBytes = 0;
        while ((numBytes = in.read(b)) > 0) {
            out.write(b, 0, numBytes);
        }

        // Close all the file descriptors
        in.close();
        out.close();
        fileSystem.close();
    }

    public void readFile(String file) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/opt/hadoop-0.20.0/conf/core-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);

        Path path = new Path(file);
        if (!fileSystem.exists(path)) {
            System.out.println("File " + file + " does not exist");
            return;
        }

        FSDataInputStream in = fileSystem.open(path);

        String filename = file.substring(file.lastIndexOf('/') + 1,
            file.length());


        OutputStream out = new BufferedOutputStream(new FileOutputStream(
            new File(filename)));


        byte[] b = new byte[1024];
        int numBytes = 0;
        while ((numBytes = in.read(b)) > 0) {
            out.write(b, 0, numBytes);
        }

        in.close();
        out.close();
        fileSystem.close();
    }

    public void deleteFile(String file) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/opt/hadoop-0.20.0/conf/core-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);

        Path path = new Path(file);
        if (!fileSystem.exists(path)) {
            System.out.println("File " + file + " does not exist");
            return;
        }

        fileSystem.delete(new Path(file), true);

        fileSystem.close();
    }

    public void mkdir(String dir) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/opt/hadoop-0.20.0/conf/core-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);

        Path path = new Path(dir);
        if (fileSystem.exists(path)) {
            System.out.println("Dir " + dir + " already exists");
            return;
        }

        fileSystem.mkdirs(path);

        fileSystem.close();
    }

    public static void main(String[] args) throws IOException {

        if (args.length < 1) {
            System.out.println("Usage: hdfsclient add/read/delete/mkdir" +
                " [<local_path> <hdfs_path>]");

            System.exit(1);
        }

        HDFSClient client = new HDFSClient();
        if (args[0].equals("add")) {
            if (args.length < 3) {
                System.out.println("Usage: hdfsclient add <local_path> " +
                "<hdfs_path>");

                System.exit(1);
            }

            client.addFile(args[1], args[2]);
        } else if (args[0].equals("read")) {
            if (args.length < 2) {
                System.out.println("Usage: hdfsclient read <hdfs_path>");
                System.exit(1);
            }

            client.readFile(args[1]);
        } else if (args[0].equals("delete")) {
            if (args.length < 2) {
                System.out.println("Usage: hdfsclient delete <hdfs_path>");
                System.exit(1);
            }

            client.deleteFile(args[1]);
        } else if (args[0].equals("mkdir")) {
            if (args.length < 2) {
                System.out.println("Usage: hdfsclient mkdir <hdfs_path>");
                System.exit(1);
            }

            client.mkdir(args[1]);
        } else {  
            System.out.println("Usage: hdfsclient add/read/delete/mkdir" +
                " [<local_path> <hdfs_path>]");
            System.exit(1);

        }

        System.out.println("Done!");
    }
}

Wednesday, June 17, 2009

HBase setup (0.19.3)

Before you begin:

Before you start configuring HBase, you need a running Hadoop cluster, which will be the storage layer for HBase. Please refer to the Hadoop cluster setup document before continuing.

On the HBaseMaster (master) machine:

1. In the file /etc/hosts, define the IP addresses of the HBase masterserver machine, all the regionserver machines, and the Hadoop namenode machine. Make sure you define the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1) for all the machines, including the masterserver; otherwise the regionservers will not be able to connect to the masterserver machine.

    192.168.1.9    hbase-masterserver
    192.168.1.8    hbase-regionserver1
    192.168.1.7    hbase-regionserver2
    192.168.1.6    hadoop-nameserver

    Note: Check that each machine name resolves to its actual IP and not the localhost IP, e.g. using "ping hbase-masterserver".

2. Configure passwordless login from the masterserver to all regionserver machines. Refer to the Configuring passwordless ssh access document for instructions on how to set up passwordless ssh access.

3. Download and unpack hbase-0.19.3.tar.gz from the HBase website to some path on your machine (we'll refer to the HBase installation root as $HBASE_INSTALL_DIR from now on).

4. Edit the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and define the $JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun

5. Edit the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties. (These configurations are required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>hbase.master</name>
            <value>hbase-masterserver:60000</value>
            <description>The host and port that the HBase master runs at.
            A value of 'local' runs the master and a regionserver in
            a single process.
            </description>
        </property>

        <property>
            <name>hbase.rootdir</name>
            <value>hdfs://hadoop-nameserver:9000/hbase</value>
            <description>The directory shared by region servers.</description>
        </property>

        <property>
            <name>hbase.regionserver.class</name>
            <value>org.apache.hadoop.hbase.ipc.IndexedRegionInterface</value>
            <description>This configuration is required to enable indexing on
            hbase and to be able to create secondary indexes
            </description>
        </property>

        <property>
            <name>hbase.regionserver.impl</name>
            <value>
            org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer
            </value>
            <description>This configuration is required to enable indexing on
            hbase and to be able to create secondary indexes
            </description>
        </property>
    </configuration>
                
    Note: Remember to replace the machine names above with your real machine names.

6. Edit $HBASE_INSTALL_DIR/conf/regionservers and add all the regionserver machine names.

    hbase-regionserver1
    hbase-regionserver2
    hbase-masterserver

    Note: Add the masterserver machine name only if you are running a regionserver on the masterserver machine.

On HRegionServer (slave) machine:


1. In the file /etc/hosts, define the IP addresses of the masterserver machine and the Hadoop namenode machine. Make sure you define the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1).

    192.168.1.9    hbase-masterserver

Note: Check that the masterserver machine name resolves to its actual IP and not the localhost IP, e.g. using "ping hbase-masterserver".

2. Configure passwordless login from all regionserver machines to the masterserver machine. Refer to the Configuring passwordless ssh access document for instructions on how to set up passwordless ssh access.

3. Download and unpack hbase-0.19.3.tar.gz from the HBase website to some path on your machine (we'll refer to the HBase installation root as $HBASE_INSTALL_DIR from now on).

4. Edit the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and define the $JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun

5. Edit the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties. (These configurations are required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>hbase.rootdir</name>
            <value>hdfs://hadoop-nameserver:9000/hbase</value>
            <description>The directory shared by region servers.</description>
        </property>

        <property>
            <name>hbase.regionserver.class</name>
            <value>org.apache.hadoop.hbase.ipc.IndexedRegionInterface</value>
            <description>This configuration is required to enable indexing on
            hbase and to be able to create secondary indexes
            </description>
        </property>

        <property>
            <name>hbase.regionserver.impl</name>
            <value>
            org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer
            </value>
            <description>This configuration is required to enable indexing on
            hbase and to be able to create secondary indexes.
            </description>
        </property>
    </configuration>

Start and stop HBase daemons:

You need to start or stop the daemons only on the masterserver machine; it will start or stop the daemons on all the regionserver machines. Execute the following commands to start or stop HBase.

    $HBASE_INSTALL_DIR/bin/start-hbase.sh
    or
    $HBASE_INSTALL_DIR/bin/stop-hbase.sh
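
Once the daemons are up, you can sanity-check from a Java client that the master is reachable and that the indexing properties above are visible to client code. A minimal sketch (it assumes the configuration file is at /opt/hbase-0.19.3/conf/hbase-site.xml, and the class name IndexedClusterCheck is just illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class IndexedClusterCheck {
        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            // Assumed config path; point this at your own hbase-site.xml.
            conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

            // Both properties should print the indexed classes configured
            // above; null means the client is not picking up hbase-site.xml.
            System.out.println(conf.get("hbase.regionserver.class"));
            System.out.println(conf.get("hbase.regionserver.impl"));

            // The constructor throws MasterNotRunningException if the
            // HBase master cannot be contacted.
            new HBaseAdmin(conf);
            System.out.println("HBase master is reachable");
        }
    }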

Hadoop cluster setup (0.20.0)

On the NameNode (master) machine:

1. In the file /etc/hosts, define the IP addresses of the namenode machine and all the datanode machines. Make sure you define the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1) for all the machines, including the namenode; otherwise the datanodes will not be able to connect to the namenode machine.

    192.168.1.9    hadoop-namenode
    192.168.1.8    hadoop-datanode1
    192.168.1.7    hadoop-datanode2

    Note: Check that the namenode machine name resolves to its actual IP and not the localhost IP using "ping hadoop-namenode".

2. Configure passwordless login from the namenode to all datanode machines. Refer to the Configuring passwordless ssh access document for instructions on how to set up passwordless ssh access.

3. Download and unpack hadoop-0.20.0.tar.gz from the Hadoop website to some path on your machine (we'll refer to the Hadoop installation root as $HADOOP_INSTALL_DIR from now on).

4. Edit the file $HADOOP_INSTALL_DIR/conf/hadoop-env.sh and define the $JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun

5. Edit the file $HADOOP_INSTALL_DIR/conf/core-site.xml and add the following properties. (These configurations are required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://hadoop-namenode:54310</value>
            <description>The name of the default file system.  A URI whose
            scheme and authority determine the FileSystem implementation. The
            uri's scheme determines the config property (fs.SCHEME.impl) naming
            the FileSystem implementation class.  The uri's authority is used to
            determine the host, port, etc. for a filesystem.
            </description>
        </property>

        <property>
            <name>hadoop.tmp.dir</name>
            <value>/opt/hdfs/tmp</value>
            <description>A base for other temporary directories.</description>
        </property>
    </configuration>

    Note: Remember to replace the namenode and datanode machine names with real machine names here.

6. Edit the file $HADOOP_INSTALL_DIR/conf/hdfs-site.xml and add the following properties. (This file defines the properties of the namenode and datanode).

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>dfs.name.dir</name>
            <value>/opt/hdfs/name</value>
            <description>Determines where on the local filesystem the DFS name
            node should store the name table (fsimage). If this is a
            comma-delimited list of directories, then the name table is
            replicated in all of the directories, for redundancy.
            </description>
        </property>

        <property>
            <name>dfs.data.dir</name>
            <value>/opt/hdfs/data</value>
            <description>Determines where on the local filesystem the DFS data
            node should store its blocks.  If this is a comma-delimited list of
            directories, then data will be stored in all named directories,
            typically on different devices. Directories that do not exist are
            ignored.
            </description>
        </property>

        <property>
            <name>dfs.replication</name>
            <value>1</value>
            <description>Default block replication.
            The actual number of replications can be specified when the file is
            created. The default is used if replication is not specified in
            create time.
            </description>
        </property>

        <property>
            <name>dfs.datanode.du.reserved</name>
            <value>53687090000</value>
            <description>This is the reserved space for non dfs use</description>
        </property>
    </configuration>

    Note: Remember to adjust the directory paths above to match your machines.

7. Edit $HADOOP_INSTALL_DIR/conf/mapred-site.xml and add the following configuration. (This file defines the configuration for the mapper and reducer.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
    </configuration>

    Note:
    1. If you only need to use HDFS, leave this file empty.
    2. Remember to replace any machine names you add here with real machine names.

8. Edit $HADOOP_INSTALL_DIR/conf/masters and add the machine names where the secondary namenodes will run.

    hadoop-secondarynamenode1
    hadoop-secondarynamenode2
    
    Note: If you want the secondary namenode to run on the same machine as the primary namenode, enter the machine name of the primary namenode.

9. Edit $HADOOP_INSTALL_DIR/conf/slaves and add all the datanode machine names. If you are running a datanode on the namenode machine, remember to add that as well.

    hadoop-namenode
    hadoop-datanode1
    hadoop-datanode2

    Note: Add namenode machine name only if you are running a datanode on namenode machine.

On DataNode (slave) machine:

1. In the file /etc/hosts, define the IP address of the namenode machine. Make sure you define the actual IP (e.g. 192.168.1.9) and not the localhost IP (e.g. 127.0.0.1).

    192.168.1.9    hadoop-namenode

    Note: Check that the namenode machine name resolves to its actual IP and not the localhost IP using "ping hadoop-namenode".

2. Configure passwordless login from all datanode machines to the namenode machine. Refer to the Configuring passwordless ssh access document for instructions on how to set up passwordless ssh access.

3. Download and unpack hadoop-0.20.0.tar.gz from the Hadoop website to some path on your machine (we'll refer to the Hadoop installation root as $HADOOP_INSTALL_DIR from now on).

4. Edit the file $HADOOP_INSTALL_DIR/conf/hadoop-env.sh and define the $JAVA_HOME.

    export JAVA_HOME=/usr/lib/jvm/java-6-sun

5. Edit the file $HADOOP_INSTALL_DIR/conf/core-site.xml and add the following properties. (These configurations are required on all the nodes in the cluster.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop-namenode:54310</value>
        <description>The name of the default file system.  A URI whose scheme
        and authority determine the FileSystem implementation. The uri's scheme
        determines the config property (fs.SCHEME.impl) naming the FileSystem
        implementation class.  The uri's authority is used to determine the
        host, port, etc. for a filesystem.
        </description>
    </property>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    </configuration>

6. Edit the file $HADOOP_INSTALL_DIR/conf/hdfs-site.xml and add the following properties. (This file defines the properties of the namenode and datanode).

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
            <name>dfs.name.dir</name>
            <value>/opt/hdfs/name</value>
            <description>Determines where on the local filesystem the DFS name
            node should store the name table (fsimage). If this is a
            comma-delimited list of directories, then the name table is
            replicated in all of the directories, for redundancy.
            </description>
        </property>

        <property>
            <name>dfs.data.dir</name>
            <value>/opt/hdfs/data</value>
            <description>Determines where on the local filesystem the DFS data
            node should store its blocks.  If this is a comma-delimited list of
            directories, then data will be stored in all named directories,
            typically on different devices. Directories that do not exist are
            ignored.
            </description>
        </property>

        <property>
            <name>dfs.replication</name>
            <value>1</value>
            <description>Default block replication.
            The actual number of replications can be specified when the file is
            created. The default is used if replication is not specified in
            create time.
            </description>
        </property>

        <property>
            <name>dfs.datanode.du.reserved</name>
            <value>53687090000</value>
            <description>This is the reserved space for non dfs use</description>
        </property>
    </configuration>

7. Edit $HADOOP_INSTALL_DIR/conf/mapred-site.xml and add the following configuration. (This file defines the configuration for the mapper and reducer.)

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
    </configuration>

    Note: If you only need to use HDFS, leave this file empty.

Start and stop Hadoop daemons:

1. Before you start the Hadoop daemons, you need to format the filesystem. Execute the following command to do so.

    $HADOOP_INSTALL_DIR/bin/hadoop namenode -format

2. You need to start or stop the daemons only on the master machine; it will start or stop the daemons on all the slave machines.

    To start/stop all the daemons execute the following command.

    $HADOOP_INSTALL_DIR/bin/start-all.sh
    or
    $HADOOP_INSTALL_DIR/bin/stop-all.sh

    To start only the dfs daemons execute the following command

    $HADOOP_INSTALL_DIR/bin/start-dfs.sh
    or
    $HADOOP_INSTALL_DIR/bin/stop-dfs.sh
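
Once the daemons are running, you can verify that HDFS is reachable with a short Java program that connects to the namenode and lists the root directory. A minimal sketch (it assumes the configuration files are under /opt/hadoop-0.20.0/conf, and the class name HDFSCheck is just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HDFSCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed config paths; adjust them to your installation.
            conf.addResource(new Path("/opt/hadoop-0.20.0/conf/core-site.xml"));
            conf.addResource(new Path("/opt/hadoop-0.20.0/conf/hdfs-site.xml"));

            // FileSystem.get() uses fs.default.name, so this should print an
            // hdfs:// URI (not file:///) if the configuration is picked up.
            FileSystem fileSystem = FileSystem.get(conf);
            System.out.println("Connected to: " + fileSystem.getUri());

            // List whatever is under the HDFS root directory.
            for (FileStatus status : fileSystem.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }

            fileSystem.close();
        }
    }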