Thursday, June 18, 2009

Secondary indexes in HBase

Creating secondary indexes in HBase-0.19.3:

You need to enable indexing in HBase before you can create a secondary index on columns. Edit the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following two properties to it.

    <property>
        <name>hbase.regionserver.class</name>
        <value>org.apache.hadoop.hbase.ipc.IndexedRegionInterface</value>
    </property>

    <property>
        <name>hbase.regionserver.impl</name>
        <value>org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer</value>
    </property>
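
A reader reports in the comments that a class name wrapped across lines inside <value> does not work, so it can be worth checking the file for values with stray whitespace. The sketch below is a hypothetical helper (ConfigValueCheck and findPaddedValues are my names, not part of HBase) that uses only the JDK's XML parser and assumes the usual <configuration>/<property> layout:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of HBase: flags <property> entries whose
// <value> text has leading or trailing whitespace (for example a wrapped line).
final class ConfigValueCheck {

    // Returns the names of properties whose value differs from its trimmed form.
    static List<String> findPaddedValues(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            List<String> padded = new ArrayList<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = textOf(property, "name");
                String value = textOf(property, "value");
                if (name != null && value != null && !value.equals(value.trim())) {
                    padded.add(name.trim());
                }
            }
            return padded;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    // Text of the first child element with the given tag, or null if absent.
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }
}
```

Passing the contents of hbase-site.xml to findPaddedValues returns the names of any properties that need to be put back on one line.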

Adding a secondary index while creating a table:

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    HTableDescriptor desc = new HTableDescriptor("test_table");

    desc.addFamily(new HColumnDescriptor("columnfamily1:"));
    desc.addFamily(new HColumnDescriptor("columnfamily2:"));

    desc.addIndex(new IndexSpecification("column1",
        Bytes.toBytes("columnfamily1:column1")));

    desc.addIndex(new IndexSpecification("column2",
        Bytes.toBytes("columnfamily1:column2")));


    IndexedTableAdmin admin = new IndexedTableAdmin(conf);

    admin.createTable(desc);

Adding an index to an existing table:

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    IndexedTableAdmin admin = new IndexedTableAdmin(conf);

    admin.addIndex(Bytes.toBytes("test_table"),
        new IndexSpecification("column2", Bytes.toBytes("columnfamily1:column2")));

Deleting an existing index from a table:

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    IndexedTableAdmin admin = new IndexedTableAdmin(conf);

    admin.removeIndex(Bytes.toBytes("test_table"), "column2");

Reading from secondary indexed columns:

To read from a secondary index, get a scanner for the index and scan through the data.

    HBaseConfiguration conf = new HBaseConfiguration();
    conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

    IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

    // You need to specify which columns to get
    Scanner scanner = table.getIndexedScanner("column1",
        HConstants.EMPTY_START_ROW, null, null, new byte[][] {
        Bytes.toBytes("columnfamily1:column1"),
        Bytes.toBytes("columnfamily1:column2") });

    for (RowResult rowResult : scanner) {
        // Bytes.toString decodes as UTF-8 instead of the platform default charset
        String value1 = Bytes.toString(
            rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue());

        String value2 = Bytes.toString(
            rowResult.get(Bytes.toBytes("columnfamily1:column2")).getValue());

        System.out.println(value1 + ", " + value2);
    }

    table.close();
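
It may help to picture what an index scanner walks over. The sketch below is a conceptual model only and not HBase's implementation: it assumes an index is a second sorted table keyed by the indexed value plus the base row key (the full example further down hints at this by cleaning up a separate table called test_table-column1). IndexModel and its methods are hypothetical names, and plain TreeMaps stand in for tables:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual model only: sorted maps stand in for the base table and for
// one index table. None of these names are HBase classes.
final class IndexModel {
    // Base table: row key -> value of the indexed column.
    private final Map<String, String> baseTable = new TreeMap<>();
    // Index table: (indexed value + separator + row key) -> row key, so
    // iteration order follows the indexed value rather than the row key.
    private final TreeMap<String, String> indexTable = new TreeMap<>();

    private static String indexKey(String value, String rowKey) {
        return value + '\u0000' + rowKey;
    }

    // Writes a row and keeps the index entry in step with it.
    void put(String rowKey, String value) {
        String previous = baseTable.put(rowKey, value);
        if (previous != null) {
            indexTable.remove(indexKey(previous, rowKey)); // drop the stale entry
        }
        indexTable.put(indexKey(value, rowKey), rowKey);
    }

    // Returns row keys ordered by the indexed column's value.
    List<String> scanByIndex() {
        return new ArrayList<>(indexTable.values());
    }
}
```

In this model every write touches both maps, which is one plausible reason an indexed table costs more to write to than a plain one.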

To scan only a subset of the rows, pass a column value filter to the indexed scanner.

    ColumnValueFilter filter =
        new ColumnValueFilter(Bytes.toBytes("columnfamily1:column1"),
            CompareOp.LESS, Bytes.toBytes("value1-10"));

    scanner = table.getIndexedScanner("column1", HConstants.EMPTY_START_ROW,
        null, filter, new byte[][] {
        Bytes.toBytes("columnfamily1:column1"),
        Bytes.toBytes("columnfamily1:column2") });

    for (RowResult rowResult : scanner) {
        // Bytes.toString decodes as UTF-8 instead of the platform default charset
        String value1 = Bytes.toString(
            rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue());

        String value2 = Bytes.toString(
            rowResult.get(Bytes.toBytes("columnfamily1:column2")).getValue());

        System.out.println(value1 + ", " + value2);
    }
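
One thing to watch with a LESS filter on values like these: the values are stored as bytes, so if the filter compares them byte by byte (I am assuming it does, the way Bytes.compareTo orders byte arrays), then "value1-9" does not sort before "value1-10". The standalone sketch below uses no HBase classes (LexicographicFilterDemo is a made-up name) and lists which of the 100 test values would pass such a comparison:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of lexicographic byte ordering; not HBase code.
final class LexicographicFilterDemo {

    // Unsigned, byte-by-byte comparison; an array that is a prefix of a
    // longer one sorts first.
    static int compareBytes(byte[] left, byte[] right) {
        int common = Math.min(left.length, right.length);
        for (int i = 0; i < common; i++) {
            int a = left[i] & 0xff;
            int b = right[i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return left.length - right.length;
    }

    // Lists which of value1-0 .. value1-99 compare as less than the bound.
    static List<String> valuesLessThan(String bound) {
        byte[] boundBytes = bound.getBytes(StandardCharsets.UTF_8);
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            String value = "value1-" + i;
            if (compareBytes(value.getBytes(StandardCharsets.UTF_8), boundBytes) < 0) {
                matches.add(value);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(valuesLessThan("value1-10")); // prints [value1-0, value1-1]
    }
}
```

If numeric ordering matters, zero-padding the values when writing them (value1-009, value1-010) is one way to make byte order and numeric order agree.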

Example Code:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.client.tableindexed.IndexSpecification;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTable;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTableAdmin;
import org.apache.hadoop.hbase.filter.ColumnValueFilter;
import org.apache.hadoop.hbase.filter.ColumnValueFilter.CompareOp;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexTest {
    public void writeToTable() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

        String row = "test_row";
        BatchUpdate update = null;

        for (int i = 0; i < 100; i++) {
            update = new BatchUpdate(row + i);
            update.put("columnfamily1:column1", Bytes.toBytes("value1-" + i));
            update.put("columnfamily1:column2", Bytes.toBytes("value2-" + i));
            table.commit(update);
        }

        table.close();
    }

    public void readAllRowsFromSecondaryIndex() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

        Scanner scanner = table.getIndexedScanner("column1",
            HConstants.EMPTY_START_ROW, null, null, new byte[][] {
                Bytes.toBytes("columnfamily1:column1"),
                Bytes.toBytes("columnfamily1:column2") });

        for (RowResult rowResult : scanner) {
            System.out.println(Bytes.toString(
                rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue())
                + ", " + Bytes.toString(rowResult.get(
                Bytes.toBytes("columnfamily1:column2")).getValue()
                ));
        }

        table.close();
    }

    public void readFilteredRowsFromSecondaryIndex() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTable table = new IndexedTable(conf, Bytes.toBytes("test_table"));

        ColumnValueFilter filter =
            new ColumnValueFilter(Bytes.toBytes("columnfamily1:column1"),
                CompareOp.LESS, Bytes.toBytes("value1-40"));

        Scanner scanner = table.getIndexedScanner("column1",
            HConstants.EMPTY_START_ROW, null, filter,
            new byte[][] {
                Bytes.toBytes("columnfamily1:column1"),
                Bytes.toBytes("columnfamily1:column2") });

        for (RowResult rowResult : scanner) {
            System.out.println(Bytes.toString(
                rowResult.get(Bytes.toBytes("columnfamily1:column1")).getValue())
                + ", " + Bytes.toString(rowResult.get(
                Bytes.toBytes("columnfamily1:column2")).getValue()
                ));
        }

        table.close();
    }

    public void createTableWithSecondaryIndexes() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        HTableDescriptor desc = new HTableDescriptor("test_table");

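        // Declare the column families (not individual columns), as in the
        // table-creation snippet earlier in this post.
        desc.addFamily(new HColumnDescriptor("columnfamily1:"));
        desc.addFamily(new HColumnDescriptor("columnfamily2:"));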

        desc.addIndex(new IndexSpecification("column1",
            Bytes.toBytes("columnfamily1:column1")));
        desc.addIndex(new IndexSpecification("column2",
            Bytes.toBytes("columnfamily1:column2")));

        IndexedTableAdmin admin = new IndexedTableAdmin(conf);

        if (admin.tableExists(Bytes.toBytes("test_table"))) {
            if (admin.isTableEnabled("test_table")) {
                admin.disableTable(Bytes.toBytes("test_table"));
            }

            admin.deleteTable(Bytes.toBytes("test_table"));
        }

        if (admin.tableExists(Bytes.toBytes("test_table-column1"))) {
            if (admin.isTableEnabled("test_table-column1")) {
                admin.disableTable(Bytes.toBytes("test_table-column1"));
            }

            admin.deleteTable(Bytes.toBytes("test_table-column1"));
        }

        admin.createTable(desc);
    }

    public void addSecondaryIndexToExistingTable() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTableAdmin admin = new IndexedTableAdmin(conf);

        admin.addIndex(Bytes.toBytes("test_table"),
            new IndexSpecification("column2",
            Bytes.toBytes("columnfamily1:column2")));

    }

    public void removeSecondaryIndexToExistingTable() throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));

        IndexedTableAdmin admin = new IndexedTableAdmin(conf);

        admin.removeIndex(Bytes.toBytes("test_table"), "column2");
    }

    public static void main(String[] args) throws IOException {
        SecondaryIndexTest test = new SecondaryIndexTest();

        test.createTableWithSecondaryIndexes();
        test.writeToTable();
        test.addSecondaryIndexToExistingTable();
        test.removeSecondaryIndexToExistingTable();
        test.readAllRowsFromSecondaryIndex();
        test.readFilteredRowsFromSecondaryIndex();

        System.out.println("Done!");
    }
}
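
The cleanup in createTableWithSecondaryIndexes drops test_table and test_table-column1, which suggests that each index lives in its own table named after the base table and the index ID. Assuming that naming convention holds (it is inferred from the example, not confirmed here), a small helper can list every table a full cleanup would need to visit, including test_table-column2. IndexTableNames and both of its methods are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase.
final class IndexTableNames {

    // Assumed convention, inferred from "test_table-column1" in the example.
    static String indexTableName(String baseTable, String indexId) {
        return baseTable + "-" + indexId;
    }

    // The base table followed by one index table per index ID.
    static List<String> tablesToDrop(String baseTable, List<String> indexIds) {
        List<String> names = new ArrayList<>();
        names.add(baseTable);
        for (String indexId : indexIds) {
            names.add(indexTableName(baseTable, indexId));
        }
        return names;
    }
}
```

Looping over the returned names with the same tableExists / disableTable / deleteTable sequence the example already uses would cover both index tables.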

21 comments:

Ramesh said...

Hi Rajeev,

It's wonderful work, thanks.

BTW, I'm experimenting with HBase 0.19.1 using HTable and IndexedTable with a filter; for both of these I am using ColumnValueFilter.

As you have mentioned in this post, I have an IndexedTable with 3 indexes.

I'm using the getIndexedScanner method to read data from the IndexedTable, and the getScanner method to read from the HTable. I inserted the same dataset (3.5 million records, each with 1 family and 12 qualifiers) into both tables.

But what puzzles me is that I'm not getting any better performance from IndexedTable than from HTable.

What I notice is that the following statements take more time with IndexedTable than with HTable:

1. scanner = tableObj.getIndexedScanner(indexId, HConstants.EMPTY_START_ROW, indexColumns, filter, baseColumns);

2. for (RowResult rowResult : scanner) {
.... }

HTable gets through the two statements above in 120 ms and 87 ms respectively, but IndexedTable takes 3412 ms and 10487 ms respectively.

Any clues? Is this the expected behavior?

TIA,
Ramesh

Raj said...

Nice tutorial, Rajeev. It will help me a lot. Thanks.

Kevin said...

I'm not sure if this is just my machine, but the value for hbase.regionserver.impl wraps which doesn't actually work in the config files. You need to put it all on one line. Took me a while to figure out what was happening.

Sandeep Kath said...

Hi Rajeev,

I am trying to compile this program on 0.20 and am getting a compilation error:

org.apache.hadoop.hbase.client.tableindexed
does not exist. Has anything changed in 0.20 for secondary indexes?

Anonymous said...

I'm getting the same exception in 0.20 - ClassNotFoundException org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer (even though the apidocs inform me otherwise)

Rajeev said...

Hi,

I have tried it with HBase-0.19.0. I can't say if this class has been deprecated in new versions.

Anonymous said...

Good work, Rajeev. I believe the classes are in the contrib section of the distribution if you get HBase from the svn repository. You'll have to build the jar with the ant build and add it to your HBase classpath. Hope that helps. -checkwriter

Anonymous said...

For HBase 0.20, you can use IndexedTable by including hbase-0.20.0-transactional.jar found in the contrib directory.

Anonymous said...

For HBase 0.20, you should also copy hbase-0.20-transactional from the contrib directory to the lib directory before launching start-hbase.sh.

Anonymous said...

Thanks for this help. I was able to add indexing to an existing table, and it created another table. The problem I am having now is that I am unable to add data to the table. Can you please advise me on what to do? Thank you.

Paul said...

Hi Rajeev,

I also wanted to thank you for this example. I was able to get it to work with HBase version 0.20.6. Let me know if you would like me to send you my updated version of SecondaryIndexTest or I can post it to my blog with your permission and credit.

Paul

Paul said...

Rajeev, thanks again for this example. It was very helpful. I have been able to update and run the example with HBase Version: 0.20.6, r965666, on Linux. Let me know if you wish me to post the updated version.

Rajeev said...

Hi Paul, thanks for updating the code for the newer version. I couldn't find a way to reach you directly. I would love to put your code on my blog if that's fine with you. Also, please do add the code to your blog as well.

Thanks,
Rajeev

Paul said...

Hi Rajeev,

You can contact me at jazzfan159 at the big search engine company. I will send the new code to you for posting.

Paul

shawny said...

Hi @Rajeev, from the comments I gather you were going to update the test case to HBase 0.20.6. Have you done it?

Sunny said...

Hi,
I'm using HBase 0.20.6, and I get a compile-time error when I try to add an index.

desc.addIndex() - it says this method does not exist in HTableDescriptor.

Can anyone help me out?

Sunny said...

I have created the index table on an existing table. But now, if I add a new record to the existing table, how will it get reflected in the index table?

Anonymous said...

Have you implemented a secondary index with Hadoop MapReduce in HBase 0.90? Second, I need to insert data into multiple tables by bulk import using one MapReduce job. Can you share your thoughts about it?

Paul Sterk said...

Hi Rajeev,

I am responding to the last post. I have a test secondary index class that works with HBase 0.20.6. If you would like the code, please email me directly at jazzfan159 at gmail dot com.

Paul

Ruby on Rails said...

Can I have a partial row key scan sample?

Thanks
Hussain

Shengjie Min said...

HTableDescriptor has become read-only now, so what's the best way to create secondary indexes: a coprocessor or IHBase?