Archive for the 'Hadoop' tag
EC2 Spot instances
Posted on December 14, 2009
The thing I really like about Amazon’s cloud stuff is that they’re constantly undermining themselves with new innovations, and spot instances are another great example. Taking the utility metaphor a step further, you can now rent their services more cheaply when nobody else is using them, like buying electricity at night.
A lot of the tasks I’m envisioning for Sproozi aren’t really time dependent. While it’s important to show you a page in a timely manner, to crawl and index a brand new website quickly when you add it, and basically to be interactive, there is also a lot I have to do in the background. The huge and growing list of URLs people add, all of which need to be re-crawled and re-indexed regularly, is just one of many examples of processing vast amounts of data. These tasks are always running, always in the background.
Spot instances are a perfect fit: I can bid the price I want on extra capacity and spin up some extra instances to join the cluster when they’re cheap. Over the next few weeks I’ll probably try to add some spot instances to my Hadoop dev cluster and see what happens.
Scaling up vs scaling out
Posted on June 24, 2009
Jeff Atwood goes into some calculations about the cost of scaling up vs scaling out and makes an interesting point: scaling out quickly becomes impractical if you’re not using open source software. I think Jeff slightly missed the point though. It’s not about open or closed source; it’s that scaling out is simply impractical if you’re paying traditional software licences.
This is something we came across when building Sproozi. If we wanted to store petabytes of data and run hundreds or thousands of concurrent processors, there was no way we could ever afford to do it on machines running Windows that we were paying for by the box. But it’s not because we’d have to pay for software, per se, it’s how we’d have to pay for it.
Software has traditionally been licensed by machine. When machines got bigger, vendors wanted to cash in, so the licences got a little bigger. After all, they had to cover their losses when you threw a few new processors in the machine rather than getting a new one to put alongside. It has always been in their best interest, though, for you to get a bigger box rather than more cheap ones: scaling out is very hard and the software doesn’t do it well. Most RDBMSs just can’t do it well and they certainly can’t get anywhere near the scale of something like Hadoop. If you want to scale out, forget SQL servers; you need software that’s designed to scale out.
But let’s forget the specific software for the time being and just assume that the big boys (MS, Oracle, IBM) will have a scaling out solution soon – don’t worry this isn’t going to kill them, but it will change them. They will still want to licence an operating system and a data storage and retrieval system to you.
What I’m almost positive you’re going to see is these companies introducing new pricing schemes to meet the needs of the cloud. They have to, or they’re going to lose all that revenue to the open source projects that have a head start on them. Just look at EC2: you can already provision MS and other software there, and I think that’s a trend that’s just going to continue.
So Jeff is right that if I wanted to buy as many cheap boxes as I could for the hardware cost of a big iron server and put Windows and SQL on them, it would all cost a small fortune. But it’s not really a fair argument: you’re taking an old big iron way of thinking and trying to apply it to the cloud. What it fails to take into account is how much more powerful your new cloud cluster is than the big iron box. Let the software vendors figure out the economics of making their software an attractive ROI when compared to OSS, because if they want to compete in the cloud they’re going to have to.
Reverse HTTP and the cloud
Posted on March 13, 2009

I recently read the IETF draft RFC for Reverse HTTP, and it looks like a pretty simple and elegant solution to a number of problems I’ve seen, especially with the move to cloud computing.
The cloud brings with it some great possibilities, but with them some great challenges. Computing on demand is great: if I need more power for a computationally intensive task I can just spin up a few instances for as long as I need them and shut them down when I’m done. That’s great in an ideal world, but RPC, cluster management and many of the other tasks you have to take on to run nodes in the cloud can be troublesome.
Apache Hadoop, for example, is a great, free, open source Map/Reduce framework, but it makes assumptions based on a view of the world in which a traditional datacenter is full of real hardware that is always there. One of the biggest and most troublesome for the cloud is the fact that a master needs to be aware of the slaves before they try to connect. Implementing access controls in a secure manner for node connections is no small task, because the whole system, from end to end, is based on a custom client/server model written specifically for the task.
I’m not singling Hadoop out here, just using it as an example because it’s well known and I’m familiar with it.
Let’s take a very simple API. Imagine there is no cluster, just one node. A client submits a job to the server, the server processes it and returns the result. Now let’s make it a little more complicated: let’s make it a Map/Reduce job and add a few nodes to the cluster. As far as the client is concerned the same thing is happening. They’re just submitting the job to the server and it’s handling everything else: it breaks the job down into work units and submits them to the nodes in the cluster, then all the results are merged together and passed back to the client.
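To make that flow concrete, here is a minimal single-process sketch of it in Python. Everything in it is an assumption for illustration: the word-count job, the function names and the round-robin split are mine, and the "nodes" are just function calls rather than real machines.

```python
from collections import Counter
from typing import Iterable, List


def split_job(lines: List[str], num_nodes: int) -> List[List[str]]:
    """Break the job into one work unit per node (round-robin)."""
    return [lines[i::num_nodes] for i in range(num_nodes)]


def map_unit(unit: Iterable[str]) -> Counter:
    """The work a single node does: count the words in its unit."""
    counts = Counter()
    for line in unit:
        counts.update(line.split())
    return counts


def reduce_results(partials: Iterable[Counter]) -> Counter:
    """Merge the per-node results into one answer for the client."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def submit_job(lines: List[str], num_nodes: int = 3) -> Counter:
    """What the client sees: one call in, one merged result out."""
    units = split_job(lines, num_nodes)
    # In a real cluster each unit would be sent to a different node.
    partials = [map_unit(unit) for unit in units]
    return reduce_results(partials)
```

The point is that `submit_job` looks the same to the caller whether there is one node or many; only the splitting and merging behind it change.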
In order to implement this you’re going to need at least a basic client/server API between the master and each slave. You could do it using traditional HTTP, but you’d run into a scalability issue. Imagine you have 10,000 nodes in your cluster. The server is going to need to have 10,000 open HTTP connections and each of them is going to have to poll the server at fixed intervals just to ask “Any work boss?”, “Nope, not at the moment. Take 5.” Sure, you could increase the interval between asking, but 10,000 nodes doing nothing for 30 seconds is almost 3 1/2 days of computing power wasted.
To get around the problem you’ve got to design your architecture to push jobs to the nodes as soon as they come in, which means writing your own client/server architecture and your own access control mechanisms, amongst other things. If we flipped things around though, and the slave connected to the master over HTTP and then told the master it wanted to be the server, we’d have achieved exactly what we wanted. The master, knowing nothing about a slave beforehand, can now interact with the slave as if it were a client, and it can submit a job as soon as it comes in.
An added benefit: the master/slave API can be the same as the user/master one! After all, the master would be doing almost the exact same thing on a slave as the user is doing when connecting via HTTP and submitting a job. No more custom client/server and vastly simplified code.
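The arithmetic behind that figure is easy to check. A small sketch, where the function name is just something I made up for the purpose:

```python
SECONDS_PER_DAY = 60 * 60 * 24


def idle_cost_days(nodes: int, idle_seconds: float) -> float:
    """Total compute time lost, in days, when every node idles for idle_seconds."""
    return nodes * idle_seconds / SECONDS_PER_DAY


# 10,000 nodes x 30 seconds = 300,000 node-seconds, or about 3.47 days.
print(f"{idle_cost_days(10_000, 30):.2f} days")
```

That cost is paid for every 30 second stretch in which the cluster has nothing to do, so it adds up quickly.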
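Here is a rough sketch of the role reversal using nothing but Python’s standard library. It is not the wire format from the draft: I’m assuming a made-up `REGISTER` hello and one JSON document per line in place of real HTTP requests and responses, purely to show the worker dialling out and then serving requests on the connection it opened.

```python
import json
import socket
import threading


def worker(master_addr, handler):
    """Dial out to the master, then answer requests on that same socket."""
    with socket.create_connection(master_addr) as sock, \
            sock.makefile("r", encoding="utf-8") as rfile, \
            sock.makefile("w", encoding="utf-8") as wfile:
        wfile.write("REGISTER\n")  # hypothetical hello message
        wfile.flush()
        for line in rfile:  # from here on the worker acts as the server
            wfile.write(json.dumps(handler(json.loads(line))) + "\n")
            wfile.flush()


def run_demo():
    """Start a master, let one worker dial in, and push it a single job."""
    with socket.create_server(("127.0.0.1", 0)) as listener:  # port 0: any free port
        thread = threading.Thread(
            target=worker,
            args=(listener.getsockname(), lambda job: {"sum": sum(job["numbers"])}),
            daemon=True,
        )
        thread.start()
        conn, _ = listener.accept()  # the master first hears of the worker here
        with conn, conn.makefile("r", encoding="utf-8") as rfile, \
                conn.makefile("w", encoding="utf-8") as wfile:
            if rfile.readline().strip() != "REGISTER":
                raise RuntimeError("unexpected hello from worker")
            # Roles flip: the master pushes a job as soon as it has one.
            wfile.write(json.dumps({"numbers": [1, 2, 3]}) + "\n")
            wfile.flush()
            result = json.loads(rfile.readline())
    thread.join(timeout=5)
    return result
```

Calling `run_demo()` should return the worker’s answer for the one job pushed to it. The master never needed the worker’s address, and the worker never had to poll.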
It would be easy to make it even more robust and allow for multiple tiers of masters and sub nodes. Just add a call to the API which asks the server how many slots it has free for jobs. Useful to a user from a management perspective, but also it would allow the master to partition the work into chunks based on the cluster size and based on the number of nodes served by any particular master. This would also be useful in terms of best use of resources given network topology issues – not all nodes are in the same rack or even datacentre.
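As a sketch of how a master might use that free-slots call, assume each downstream node (or sub-master) has already reported a number of free slots. The function name and the node names in the usage note are hypothetical, and a real scheduler would also weigh things like rack locality.

```python
from typing import Dict, List, Tuple


def partition_by_slots(
    work_units: List[str], free_slots: Dict[str, int]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Give each node at most as many work units as it has free slots.

    Returns the per-node assignment plus any units left waiting.
    """
    assignment: Dict[str, List[str]] = {}
    start = 0
    for node, slots in free_slots.items():
        take = max(0, slots)  # ignore nonsensical negative reports
        assignment[node] = work_units[start:start + take]
        start += take
    return assignment, work_units[start:]
```

For example, `partition_by_slots(["u1", "u2", "u3", "u4", "u5"], {"rack-a": 2, "rack-b": 1})` assigns two units to `rack-a`, one to `rack-b`, and leaves `u4` and `u5` queued until slots free up.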
Add to this the simplicity and power of adding something like HTTP digest authentication at the server end and you’ve got ready-made access controls. One set of credentials for clients, one for slaves. Clients can submit jobs, slaves get the work, and there is no real need to know anything about a slave before the first time it connects.
Why is this better than something like XMPP, I can hear you asking? It’s not better. Not for any real reason, and yes, it has some crossover in functionality with other technologies that are already out there. In the right situations though, it gives developers the option to simplify things, and that’s never a bad thing.
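For a feel of how little is involved, here is a sketch of the basic digest calculation from RFC 2617 (the older form without `qop`), with one shared secret per role. The account names, passwords and role labels are all invented for the example, and a real deployment would lean on its HTTP server’s own digest support rather than hand-rolling this.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical accounts: username -> (password, role).
ACCOUNTS = {
    "client": ("s3cret-client", "submit"),
    "worker": ("s3cret-worker", "work"),
}


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str) -> str:
    """RFC 2617 response without qop: MD5(HA1:nonce:HA2)."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")
    ha2 = _md5_hex(f"{method}:{uri}")
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")


def role_for(username: str, realm: str, method: str, uri: str,
             nonce: str, response: str) -> Optional[str]:
    """Return the caller's role if the digest checks out, else None."""
    account = ACCOUNTS.get(username)
    if account is None:
        return None
    password, role = account
    expected = digest_response(username, realm, password, method, uri, nonce)
    # Constant-time comparison avoids leaking how much of the digest matched.
    return role if hmac.compare_digest(expected, response) else None
```

The server only needs to know the two shared secrets, not anything about an individual slave, which is what lets a brand new node connect and start taking work.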