Archive for the 'random' category
HBase for storing users?
Posted on June 26, 2009
We had a meeting last night about the state of the Sproozi project, where we wanted to be and the results of a few tests we’d run. We more or less came to the conclusion that we’re going to need to push on with the social aspect of the site sooner rather than later. We’re still trying to figure out what that means in terms of funding: whether we need to raise any, and how we go about it if we decide we do. It does start to pose some interesting questions. The first and biggest is how we’re going to store user data.
We’re already running HBase and storing lots of data in there, and I’d like the application to scale as easily as possible. Running another framework or service just to store user data seems like overkill, and one more system to worry about. So I’m going to run a little experiment in storing users in HBase.
The downside to HBase is that it’s not easy to find a single row by anything other than its id. It’s easy to pick a row by its id, to scan the table in order from there, or to start at the beginning and go all the way through, but it’s not easy to quickly pick out an arbitrary row by the value of one of its other fields. You’d have to start a map reduce task and crunch the data until you found what you were looking for.
Take the simple example of a user with a long id, an email address, a username and a password. It would be easy to get the user by its id, but not very easy to get the user by email address or username. So I’m toying with how to get it to work, probably by creating some additional tables, keyed by the columns I want to search, that link back to the correct user. Sort of like making my own indices.
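To make the index-table idea a bit more concrete, here is a rough sketch. It uses plain Python dicts as stand-ins for HBase tables, so there are no real HBase client calls in it, and the table names (users_by_email, users_by_name) and function names are made up purely for illustration.

```python
# Plain dicts stand in for HBase tables: row key -> row contents.
# Nothing here talks to a real HBase cluster; all names are hypothetical.
users = {}           # main table, keyed by user id
users_by_email = {}  # index table: email -> user id
users_by_name = {}   # index table: username -> user id


def put_user(user_id, email, username, password_hash):
    """Write the user row, then one row per index table pointing back at it."""
    users[user_id] = {
        "email": email,
        "username": username,
        "password": password_hash,
    }
    users_by_email[email.lower()] = user_id
    users_by_name[username.lower()] = user_id


def get_user_by_email(email):
    """Two key lookups (index row, then user row) instead of a table scan."""
    user_id = users_by_email.get(email.lower())
    return users.get(user_id) if user_id is not None else None


def get_user_by_username(username):
    """Same pattern as the email lookup, using the username index."""
    user_id = users_by_name.get(username.lower())
    return users.get(user_id) if user_id is not None else None


put_user(1001, "Ann@example.com", "ann", "not-a-real-hash")
print(get_user_by_email("ann@example.com")["username"])
```

The catch with this approach is that the writes to the main table and the index tables are separate operations, so a real version would have to cope with one of them failing, and with index rows going stale when a user changes their email or username.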
Once I get some code written and tested I’ll probably throw up another post with some more details on whether or not it worked.
Any thoughts?
Update: Check out this Vitamin article
Scaling up vs scaling out
Posted on June 24, 2009
Jeff Atwood goes into some calculations about the cost of scaling up vs scaling out and makes an interesting point: scaling out quickly becomes impractical if you’re not using open source software. I think Jeff slightly missed the point though. It’s not about open or closed source, it’s that scaling out is simply impractical if you’re paying traditional software licences.
This is something we came across when building Sproozi. If we wanted to store petabytes of data and run hundreds or thousands of concurrent processors, there was no way we could ever afford to do it on machines running Windows that we were paying for by the box. But it’s not because we’d have to pay for software, per se, it’s how we’d have to pay for it.
Software has traditionally been licensed by machine. When machines got bigger, vendors wanted to cash in, so the licences got a little bigger. After all, they had to cover their losses when you threw a few new processors in the machine rather than getting a new one to put alongside. It has always been in their best interest, though, for you to get a bigger box rather than more cheap ones – scaling out is very hard and the software doesn’t do it well. Most RDBMSs just can’t do it well, and they certainly can’t get anywhere near the scale of something like Hadoop. If you want to scale out, forget SQL servers; you need software that’s built to scale out.
But let’s forget the specific software for the time being and just assume that the big boys (MS, Oracle, IBM) will have a scaling out solution soon. Don’t worry, this isn’t going to kill them, but it will change them. They will still want to licence an operating system and a data storage and retrieval system to you.
What I’m almost positive you’re going to see is these companies introducing new pricing schemes to meet the needs of the cloud. They have to, or they’re going to lose all that revenue to the open source projects that have a head start on them. Just look at EC2: you can already provision MS and other software, and I think that’s a trend that’s just going to continue.
So Jeff is right that if I bought as many cheap boxes as I could for the hardware cost of a big iron server and put Windows and SQL on them, it would all cost a small fortune. But it’s not really a fair argument: you’re taking an old big iron way of thinking and trying to apply it to the cloud. What it fails to take into account is how much more powerful your new cloud cluster is than the big iron box. Let the software vendors figure out the economics of making their software an attractive ROI when compared to OSS, because if they want to compete in the cloud they’re going to have to.
I dropped the ball
Posted on June 23, 2009
I really dropped the ball on the whole post a day for this month thing. It’s been over a week and I haven’t written anything. It’s like all the other things you’re meant to do, then time goes by and you don’t and it gets to a point where addressing the issue becomes awkward.
Things between us, dear reader, have become awkward. It’s my fault and I’m sorry. I’ve just been quite busy. I know it’s not an excuse, but there it is anyway.
So let’s try to get back on track in the form of an update.
Sproozi is coming along. I’ve had a few setbacks with the code and we’ve changed the scope some. We had to make a decision about content for a real launch, one we’re selling to investors and actual users as opposed to the few that happen upon our non-functional demo.
The problem was that although the crawler worked pretty well, better than I’d hoped, the results were a bit lacklustre. We weren’t confident that the data we brought back was going to set the world on fire. So it then became a question of what is worse: no data or crap data?
I’ve also been working on getting a coworking space and some geeky events going in Hebden Bridge. Lots of work sending emails to see if there would be funding available, trying to gauge interest and all sorts of other aspects of it. I’ll have a lot more to say about this in the next few days once I’ve had a chance to digest where I am.
In the meantime, go take the survey, it’s here.
Lastly, I’ve actually been working on this site in the background. It’s on GitHub already, over here. I like the simplicity of the look but I want to bring a few more things on to the front page. I really want to integrate my Tumblr feed and start using that a lot more for posting links, pictures and music. Also, now that I’m back into posting a little more regularly (the irony I’m sure isn’t lost on you), I want to put the blog posts on the front page. Which is where I start to get into things I don’t like about the Hemingway theme.
Toying with ideas: geotagged podcast
Posted on June 8, 2009

Came up with a good distraction tonight and thought briefly about being an iPhone developer. Was chatting with @Simon_Chapman (not sure I’d bother clicking there, nothing but tumbleweed) about the various services out there and was trying to come up with a unique way to use some of the new features offered by the iPhone 3.0 software.
Specifically we were talking about how to use the new support for in-app purchases and location to build a compelling service. The first thing that came to mind was pretty obvious: create a service to search for and buy tickets for events near you. No doubt you’d just end up a small fish in a big pond with some monsters.
The next idea that came to mind was to create a map based podcasting application. Allow any geotagged podcasts to be placed on a map. Browse the map and get some audio or video about things around you. Revenue could be generated either through advertising or through access to premium content.
There you go, that one’s free, unless I find the spare time to develop it myself.
User privacy
Posted on June 2, 2009
Sproozi generates a lot of data and when we launch we’re going to be generating a lot more. Some of it is overt and displayed on the website for all to see. Some of it sits in a database somewhere: things like searches, locations, session information and clicks on outbound links. Some of that information could lead back to the people who are searching. It’s happened before to other sites, even after they thought they’d taken steps to protect users.
We’re also not the only ones; it’s pretty much par for the course in search and on the web. If you didn’t know it before now, rest assured that every click you make on any major website is tracked, including, if they can manage it, the fact that it was you who clicked. Also know that from that data they can learn an awful lot about you, possibly even who you are.
Which brings up an obvious question which many people rightly ask: if a company is concerned at all about privacy, why in the world would they keep any of this data? Well, there are a few very good reasons. First and foremost for us is one related to a previous post of mine about Testing for Search Result Quality. It boils down to a very real problem: testing user interaction is hard when the results the system produces for any given input change between submissions of the same input, by design. To measure how we’re performing we need to measure the actual user behaviour as opposed to measuring what we spit back. Without collecting the data to measure how changes are improving (or worsening) user experience we’re more or less releasing code and hoping it’s better. Not exactly a professional, or analytical, approach to take.
On the other hand my privacy is important to me and it makes me uncomfortable to think about all the data I generate lying about the web. I’m not sure exactly what makes me uncomfortable, but I also know it’s not just me. I want to be in control of what others know about me. I feel like I need to treat everyone else’s data with the same respect I want for mine.
Let’s face it, I don’t want to keep personally identifiable data about anyone for any longer than I need to. After a while they become just a statistic anyway. So we’re looking into what to do with the data and at which points we can start to anonymise and filter the data to remove personally identifiable information, but we’re starting the process from the beginning as opposed to the end – if we’re keeping something, I want a reason to keep it.
Some argue that data is the most valuable thing a business has in the modern world, but we tend to take a smaller, more pragmatic, more local view – we can build up a data set for some metric we want to start measuring, but we can never rebuild trust if we lose it. Plus what value will a click stream have in 2, 3, 6 or even 12 months time? So why bother collecting what we’re not using?
Some might ask, why bring this all up before you’ve launched? Well, partly because I’ve been thinking about it recently and partly because I think it’s important to be upfront about the data we collect, what we use it for and what we’ll do with it. I also think it’s important to show that privacy isn’t an afterthought. It’s something we considered before we ever collected any data, something we continue to consider as we develop, and something we take very seriously.
A post a day for the month of June – Day 2
A post a day for the month of June
Posted on June 1, 2009
Once upon a time I tried to set myself the goal of a post a day forever. It was hard and I failed almost immediately. Forever, it turns out, is a long time. Days, weeks or months later I look back and think, this was supposed to be the future. Where are the flying cars, the robots, my posts?
So I’m going to try an experiment. I’m going to write something here every day for the whole of June. I’m going to try to avoid writing the posts all in a day and scheduling them, because that’s not writing something here every day.
I’ve just been to BarCamp Leeds 2009, Sproozi is coming along well, we’re re-launching GoRoam to focus more on the consulting work we do and want to get more involved with, and I’m helping set up some coworking, open coffees and other things to collaborate more with the local entrepreneurs and the digital community in Hebden Bridge. So really there isn’t any excuse; I have enough to say to easily fill 30 days.
So, a post a day for the month of June – Day 1.
Does older browser support matter anymore?
Posted on May 29, 2009
I’ve run into a few problems with some of the web apps I’ve been building lately, as I’m sure others have. The explosion of the cloud is only going to lead more and more people down the exact same path. Let me briefly explain the architecture, which will help you understand why it’s a problem.
Basically I have a site which, as is all too common, is made up of static HTML, CSS, images and JavaScript. The magic all happens through XMLHttpRequest.
Ideally what I want to be able to do is to push all this static stuff somewhere fast and as close to the user as possible. CloudFront is an obvious choice, but then I get hit by the same origin policy. To work around the issue and because I want to publish the API for developers to use I’ve implemented JSONP callbacks.
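For anyone who hasn’t seen JSONP, the server side amounts to wrapping the JSON body in a call to a function the page names in the query string, so it can be loaded with a script tag from another origin. A minimal sketch, assuming a hypothetical helper called jsonp_response (this is not the actual Sproozi API code, and the callback name check is my own addition):

```python
import json
import re

# Only accept plain (optionally dotted) JavaScript identifiers as callback
# names, so the caller can't inject arbitrary script into the response.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*")


def jsonp_response(payload, callback=None):
    """Return (content_type, body) for a JSON or JSONP response.

    With no callback the payload is returned as ordinary JSON; with a
    callback it is wrapped as `callback(<json>);`.
    """
    body = json.dumps(payload)
    if callback is None:
        return "application/json", body
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name")
    return "application/javascript", f"{callback}({body});"


print(jsonp_response({"ok": True}, "handleResult")[1])
```

The page then defines handleResult and adds a script tag pointing at the API URL with the callback name as a parameter, which sidesteps the same origin policy because script tags aren’t subject to it.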
I did come across the W3C access control work, though, and noted that we may be close to a solution, at least for the newest browsers. That brought up a whole new question that’s been plaguing web designers and developers for years: how old does a browser need to be before you take it out back and shoot it?
Looking through browser market share and my own tiny sample set, I’m seeing that browsers are more or less up to date. Which is great news. There are a few IE6 holdouts showing up in my visitor stats, but for the most part everyone appears to be more or less current. Has software update actually solved the issue? Probably not, but I, for one, welcome any progress.
So that brings us to the real question: with the vast majority of the browsers out there running at or beyond the most current release, do we need to care about those that don’t? Do we even want to deal with the train wreck that is IE6 compatibility anymore?
Reverse HTTP and the cloud
Posted on March 13, 2009

I recently read the IETF draft RFC for Reverse HTTP, and it looks like a pretty simple and elegant solution to a number of problems I’ve seen, especially with the move to cloud computing.
The cloud brings with it some great possibilities but with them some great challenges. Computing on demand is great: if I need more power for a computationally intensive task I can just spin up a few instances for as long as I need them and shut them down when I’m done. Great in an ideal world, but RPC, cluster management and many of the tasks you have to take on to run nodes in the cloud can be troublesome.
Apache Hadoop, for example, is a great, free, open source Map/Reduce framework, but it makes assumptions based on a traditional view of the world: a datacenter full of real hardware that is always there. One of the biggest and most troublesome for the cloud is that a master needs to be aware of the slaves before they try to connect. Implementing access controls in a secure manner for node connections is no small task because the whole system, from end to end, is based on a custom client/server model written specifically for the task.
I’m not singling Hadoop out here, just using it as an example because it’s well known and I’m familiar with it.
Let’s take a very simple API. Imagine there is no cluster, just one node. A client submits a job to the server, the server processes it and returns the result. Now let’s make it a little more complicated: let’s make it a Map/Reduce job and add a few nodes to the cluster. As far as the client is concerned the same thing is happening. They’re just submitting the job to the server and it’s handling everything else: it breaks the job down into work units, submits them to the nodes in the cluster, and all the results are merged together and passed back to the client.
In order to implement this you’re going to need at least a basic client/server API between the master and each slave. You could do it using traditional HTTP but you’d run into a scalability issue, imagine you have 10,000 nodes in your cluster. The server is going to need to have 10,000 open HTTP connections and each of them is going to have to poll the server at fixed intervals just to ask “Any work boss?”, “Nope, not at the moment. Take 5.” Sure you could increase the interval between asking, but 10,000 nodes doing nothing for 30 seconds is almost 3 1/2 days of computing power wasted.
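The arithmetic behind that figure, worked through as a quick check (10,000 nodes and a 30 second interval are the numbers from the paragraph above; the variable names are just for illustration):

```python
nodes = 10_000
idle_seconds_per_poll = 30

# Every node sitting idle for one full polling interval.
wasted_node_seconds = nodes * idle_seconds_per_poll

seconds_per_day = 24 * 60 * 60
wasted_days = wasted_node_seconds / seconds_per_day

print(wasted_node_seconds)    # 300000
print(round(wasted_days, 2))  # 3.47
```

And that is per polling round, in the worst case where a job arrives just after every node has asked.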
To get around the problem you’ve got to design your architecture to push jobs to the nodes as soon as they come in. Which means writing your own client/server architecture and your own access control mechanisms, amongst other things. If we flipped things around though, and the slave connected to the master over HTTP and then told the master it wanted to be the server, we’d have achieved exactly what we wanted. The master, knowing nothing about a slave beforehand, can now interact with the slave as if it were a client and can submit a job as soon as it comes in.
An added benefit: the master/slave API can be the same as the user/master one! After all, the master would be doing almost exactly the same thing on a slave as the user is doing when connecting via HTTP and submitting a job. No more custom client/server, and vastly simplified code.
It would be easy to make it even more robust and allow for multiple tiers of masters and sub nodes. Just add a call to the API which asks the server how many slots it has free for jobs. Useful to a user from a management perspective, but also it would allow the master to partition the work into chunks based on the cluster size and based on the number of nodes served by any particular master. This would also be useful in terms of best use of resources given network topology issues – not all nodes are in the same rack or even datacentre.
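As a rough illustration of the partitioning step, here is one way a master could split work units across downstream servers in proportion to the free slots each one reports. The function name and the shape of the free_slots mapping are assumptions for the sketch; nothing like this is defined in the Reverse HTTP draft itself.

```python
def partition_by_slots(work_units, free_slots):
    """Split work_units across servers in proportion to their free slots.

    free_slots maps a server name to the number of slots it reported free.
    Cumulative rounding is used so every work unit is assigned exactly once.
    """
    total_slots = sum(free_slots.values())
    if total_slots == 0:
        raise ValueError("no free slots available")

    assignments = {}
    start = 0
    slots_so_far = 0
    for name, slots in free_slots.items():
        slots_so_far += slots
        end = len(work_units) * slots_so_far // total_slots
        assignments[name] = work_units[start:end]
        start = end
    return assignments


units = list(range(10))
slots = {"rack-a": 4, "rack-b": 1, "rack-c": 5}
for server, chunk in partition_by_slots(units, slots).items():
    print(server, chunk)
```

A sub-master receiving its chunk could run the same function again over its own nodes, which is what makes the multi-tier arrangement fall out almost for free.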
Add to this the simplicity and power of something like HTTP Digest authentication at the server end and you’ve got ready made access controls. One set of credentials for clients, one for slaves. Clients can submit jobs, slaves get the work, and there is no real need to know anything about a slave before the first time it connects.
Why is this better than something like XMPP, I can hear you asking? It’s not better. Not for any real reason, and yes, it has some crossover in functionality with other technologies that are already out there. In the right situations though, it gives developers the option to simplify things, and that’s never a bad thing.
Customer Support, a tale of three companies.
Posted on March 2, 2009
Over the last few months I’ve dealt with a whole bunch of companies as a customer. Three stand out. One awful, one irritating and one, today, fantastic.
Virgin Media
Last year when we moved home we moved out of a cable area. We were generally happy with Virgin Media though, and didn’t want to pay the early termination fee, so we opted to go with their ADSL service. Of course these things never go smoothly: we had to have the ADSL line already here terminated, a new account created when that happened, and it was all going to take a few weeks.
So I cancelled my account. They told me I’d have to pay the termination fee, but that it would be refunded when our new account was opened. A few weeks later, more or less on time, the ADSL line was connected and we were online. About a week after that I got a nasty letter from the Virgin Media collections department warning me to pay what I owed.
I rang them (not a free phone number, might I add) and after some time got to the bottom of it. They dropped the termination fee from their demand but I did have to pay a month that was still outstanding on the account. I paid that on the card, on the spot.
fail – Virgin really should have some sense and the ability to link my new account and old account, and know that I’m just moving home and shouldn’t need to pay the early termination fee. The fact that I can cancel within the connection period just isn’t good enough; I shouldn’t be penalised for shortcomings in your systems. Also, what is up with needing me to pay the last month by card? What happened to the direct debit straight from my bank account which they used every other month? And all that’s before we get into the fact that I’m an existing customer and they’re sending me nasty letters. If the connection hadn’t been set up already we probably would have gone elsewhere, just because of that. Having said that, they’re a decent ISP most of the time.
3
Last year some time I signed up with 3 for a USB data card. When I was in the store I was offered a £15/month Skype phone contract, which would halve the cost of the data card. It seemed like a good deal and I thought at the very least the Skype phone might be useful (it wasn’t), so I agreed. Some time later, after the 6 month contract was over, I called to cancel the phone. I was on the phone for at least 30 minutes and must have told the operator on the other end at least a dozen times that I just wanted to cancel the contract. I said I was happy with my existing options for a backup phone and that even at £5/month with a brand new phone, it just wasn’t worth it for me to continue; I said it again, and again. Eventually he put me on hold for 5 minutes, then came back and announced my account was cancelled.
fail - The problem with someone hard selling to me when I’m trying to cancel the account is that I can’t just put the phone down; if I do, I won’t get the account cancelled. No means no. After saying no 5 or 6 times I wasn’t likely to change my mind. If he’d looked into the account he would have seen that I hadn’t even switched the phone on in months. If he can’t see that, he should be able to.
Apple
Last week, completely different. My poor used and abused MacBook needed some fixing. It’s over 2 1/2 years old and, to be fair, has travelled thousands of miles, so a bit of wear and tear is to be expected. The top case was a bit cracked, a somewhat common problem, and the wire at the MagSafe connector was a bit frayed. So I booked an appointment at the Genius Bar in the Trafford Centre this morning, got in the car and drove an hour there. Since the power supply was out of warranty, I could get it replaced under a repair and it would cost less than the retail part. Then he took a look, said that they’d replace the top case for free and book the power supply in under the same ticket so I’d get that free too. They said to drop back in an hour and it should be finished. So I left the machine with them and half an hour later they called to tell me it was fixed. 30 minutes, and my Mac looks as good as new!
impressed – I honestly couldn’t be more impressed. I booked myself in online in a few minutes and since I had an appointment I was served as soon as I got in the store. It took the Genius behind the bar a few minutes to assess the problems and book the machine in. 30 minutes later they had it fixed.
To follow or to nofollow…
Posted on February 19, 2009
Sorry if you came here wondering whether you should nofollow comments or something else on your site or blog; I’m looking at it more from a crawler’s perspective. Though if that is why you’re here, you may well have something to add.
Basically nofollow, either in a meta tag or as a rel attribute of an a tag, is a hint to a crawler telling it, in its most basic terms, not to follow the link. But what does that really mean to me, if I run the crawler? More importantly, how is it actually being used?
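From the crawler’s side, spotting the hint is the easy part. Here is a small sketch using Python’s standard html.parser that pulls out each link and whether it is marked nofollow, either by its own rel attribute or by a page-wide robots meta tag. The class and function names are invented for the example, and what to do with the flag afterwards is a policy decision the code deliberately leaves open.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, nofollow) pairs and notes a page-wide robots nofollow."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            # e.g. <meta name="robots" content="noindex, nofollow">
            content = (attrs.get("content") or "").lower()
            if "nofollow" in [token.strip() for token in content.split(",")]:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            # rel is a space separated list, e.g. rel="external nofollow"
            rel_tokens = (attrs.get("rel") or "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel_tokens))


def extract_links(html):
    """Return [(href, nofollow)], applying any page-wide hint to every link."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return [(href, flag or parser.page_nofollow) for href, flag in parser.links]


sample = '<a href="/a">a</a> <a rel="external nofollow" href="/b">b</a>'
print(extract_links(sample))
```

Keeping the flag alongside the link, rather than dropping nofollow links at parse time, leaves the crawler free to treat it as a hint: follow the link but pass no juice, for example.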
On blogs, personal websites and even Wikipedia the nofollow policy is pretty clear and transparently aimed at preventing spam – and it works.
But how are they applied? Well, not even the big search engines treat them consistently: Google completely blanks them, Yahoo indexes them but adds no juice, and Ask doesn’t even support them [Granted this article may be somewhat out of date].
One interesting thing that came out of the answers though, was the one from Google:
On a related note, though, and echoing Matt’s earlier sentiments… we hope and expect that more and more sites — including Wikipedia — will adopt a less-absolute approach to no-follow… expiring no-follows, not applying no-follows to trusted contributors, and so on.
So even within Google, with the strictest application of a nofollow policy, there is certainly a strong argument and use case for treating this as a hint as opposed to a policy. I’m not even sure why Google needs Wikipedia to make a policy change; everything they hope for could be implemented at their end.
A link on Wikipedia, if it’s been there long enough, probably deserves some juice. And that same logic applies to any site, even your blog. Nofollow is a hint to a search engine. It’s there to deter spam, but if a link sticks around then, by not removing it, the site is to a point endorsing it.
For our application we’re working from a slightly different angle: there aren’t a lot of geotagged URLs and there isn’t much spam, so at least initially we want as many as we can get. So we may not index or pass juice on to the site, at least initially, but we do want to follow the link.