HBase for storing users?

We had a meeting last night about the state of the Sproozi project: where we want to be and the results of a few tests we’d run. We more or less concluded that we need to push on with the social aspect of the site sooner rather than later. We’re still trying to figure out what that means in terms of funding: whether we need to raise any, and how we’d go about it if we do. It also poses some interesting questions, the first and biggest of which is how we’re going to store user data.

We’re already running HBase and storing lots of data in it, and I’d like the application to scale as easily as possible. Running another framework or service just to store user data seems like overkill, and it would be one more system to worry about. So I’m going to run a little experiment in storing users in HBase.

The downside of HBase is that it’s not easy to find a single row by anything other than its id. It’s easy to fetch a row by its id, to scan the table in order from that point, or to start at the beginning and go all the way through, but it’s not easy to quickly pick out an arbitrary row by the value of one of its other fields. You’d have to start a MapReduce job and crunch through the data until you found what you were looking for.
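To make the difference concrete, here is a rough sketch in plain Python. It is not the HBase client API: `SortedTable` and `find_by_field` are hypothetical names, and the table is just an in-memory structure that keeps its rows sorted by row key, which is the property that matters here.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> dict of column values

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def get(self, key):
        """Direct lookup by row key: the cheap case."""
        return self._rows.get(key)

    def scan(self, start_key=None):
        """Yield (key, row) pairs in key order, optionally from start_key."""
        i = 0 if start_key is None else bisect.bisect_left(self._keys, start_key)
        for key in self._keys[i:]:
            yield key, self._rows[key]


def find_by_field(table, field, value):
    """Look up by a non-key field: every row may have to be examined.

    Returns (row_key, row, rows_examined).
    """
    examined = 0
    for key, row in table.scan():
        examined += 1
        if row.get(field) == value:
            return key, row, examined
    return None, None, examined


users = SortedTable()
users.put("0001", {"email": "alice@example.com", "username": "alice"})
users.put("0002", {"email": "bob@example.com", "username": "bob"})
users.put("0003", {"email": "carol@example.com", "username": "carol"})

by_id = users.get("0002")                                   # one lookup
from_second = [key for key, _ in users.scan("0002")]        # ordered scan
key, row, examined = find_by_field(users, "email", "carol@example.com")
```

With three rows the scan is harmless, but the number of rows examined grows with the size of the table, which is exactly the problem for a user lookup by email.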

Take the simple example of a user with a long id, an email address, a username and a password. It would be easy to get the user by id, but not by email address or username. So I’m toying with how to make that work, probably by creating some additional tables that store keys for the columns I want to search and link back to the correct user. Sort of like making my own indices.
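A minimal sketch of that idea, again in plain Python rather than against HBase itself. The dicts stand in for tables, and `UserStore`, `by_email` and `by_username` are names I've made up for illustration: one main table keyed by user id, plus one extra table per searchable column whose row key is the column value and whose value is the user id.

```python
class UserStore:
    """Main users table plus hand-rolled index tables (all in memory)."""

    def __init__(self):
        self.users = {}        # user id  -> user record
        self.by_email = {}     # email    -> user id (index table)
        self.by_username = {}  # username -> user id (index table)

    def add_user(self, user_id, email, username, password_hash):
        if email in self.by_email or username in self.by_username:
            raise ValueError("email or username already taken")
        # Three separate writes. Against a real store these would not be
        # atomic, so a failure part-way through could leave an index entry
        # pointing at nothing, or a user with no index entry.
        self.users[user_id] = {
            "id": user_id,
            "email": email,
            "username": username,
            "password_hash": password_hash,
        }
        self.by_email[email] = user_id
        self.by_username[username] = user_id

    def get_by_id(self, user_id):
        return self.users.get(user_id)

    def get_by_email(self, email):
        # Two key lookups instead of a scan: index table, then main table.
        user_id = self.by_email.get(email)
        return None if user_id is None else self.users.get(user_id)

    def get_by_username(self, username):
        user_id = self.by_username.get(username)
        return None if user_id is None else self.users.get(user_id)


store = UserStore()
store.add_user(1, "alice@example.com", "alice", "not-a-real-hash")
store.add_user(2, "bob@example.com", "bob", "not-a-real-hash")

found = store.get_by_email("bob@example.com")
```

The trade-off is that every write now touches several tables, and changing a user's email means deleting the old index row and adding a new one, so keeping the tables consistent becomes the application's job.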

Once I get some code written and tested I’ll probably throw up another post with some more details on whether or not it worked.

Any thoughts?

Update: Check out this Vitamin article

  • Tim, that sounds interesting. I've not played with Solr, but I am creating a Lucene index using some of the fields in some of the tables. The cluster is running some highly customised Nutch jobs based on the code here: http://github.com/andrewmccall/nutchbase. I considered putting the user ids in a Lucene index and using that to find users, but I was a bit reluctant to implement it because I felt there was too much I didn't know.

    Thinking about it again, I may just look at both implementations in more depth, because it may be a better way to go, especially as indexes start to pile up.
  • Just an idea.
    Have you considered using a search server for the indexing/searching and HBase for storing?
    You'd have to keep them in sync of course, but Solr is quite useful. You can optimize for just returning ids by indexing fields and not storing them, and there is progress on sharding if you really need it. It doesn't scale to billions of rows of course, but it's unlikely that will be a problem for users. You can do exact matches on any of the fields, and of course use full-text search where appropriate.
  • Thanks for that, Jonathan, good to know I'm more or less on the right path. Now that you mention it, I remember reading about it in the docs but had since forgotten. I looked again and saw this:

    http://hadoop.apache.org/hbase/docs/current/api/index.html?org/apache/hadoop/hbase/regionserver/transactional/package-summary.html

    Which I'll look into and post about if it's useful.
  • Basic secondary indexing on HBase is done as you describe: create an additional table for each index, where the row id is the indexed field. This is also included as an integrated feature in TransactionalHBase, which will take care of managing the secondary tables for you. It uses OCC (optimistic concurrency control) for safety.

    In my own usage, I manage the secondary tables at the application level. This is faster but less safe.

    I have plans to add a less safe but fast server-side implementation of this in the future for my own purposes. But I've also heard there's a chance OCC will be pluggable in the current implementation, in which case I'd just use that. Sign up to the mailing list; the 0.20.0 release is coming up soon, and that will be decided for that release.