Archive for June, 2009

Welcome to the cloud

Posted on June 3, 2009

The great thing about the cloud is the extremely low barrier to entry. It’s very cheap to get up and running, it’s very cheap to scale and it’s very cheap to store data. I’m still not fully convinced that, over the long term and with thousands of nodes, it’s going to be cheaper than provisioning and hosting your own servers, but I’m more than happy to be convinced – saving money one way or another can’t be a bad thing.

Long term, though, one thing businesses really need to be wary of is being tied too tightly to a platform. There is of course the issue of being tied to an API and not being able to choose a new provider, but this isn’t really what concerns me – we use Debian AMIs on EC2 with a full open source stack top to bottom. I’m talking about the issue of economic lock-in. The massive scale and masses of data the cloud allows users to store could quickly lead a company to a very expensive decision when it chooses to, or is forced into, moving providers.

Moving data in or out costs about $0.10/GB (a nice easy number to work with). Let’s pretend it’s about the same for all other providers, so shifting your data from one to another is going to cost you $0.20/GB. That could quickly add up to a massive cost just to choose a new provider. 1TB will cost you over $200; 1 petabyte, which isn’t going to be an unheard-of amount of data in the next few years, is going to cost over a whopping $200,000! Just to move to a new provider. That’s some kind of lock-in, probably a lot more than the cost of any new or changed APIs. Not to mention how long it’s going to take to transfer that amount of data, and the fact that you’ve already paid $100,000 to get it there!
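To make the arithmetic above concrete, here is a minimal sketch of the sum. The two per-GB rates are the rough figures from this post, not anyone’s real price list, and the function and constant names are made up for illustration. It assumes binary units (1024 GB to the TB), which is why the totals come out a little over the round numbers.

```python
# Rough figures from the post, NOT real pricing: actual rates are
# tiered, differ between providers and change over time.
EGRESS_USD_PER_GB = 0.10   # assumed cost to move data out of the old provider
INGRESS_USD_PER_GB = 0.10  # assumed cost to move data into the new provider
GB_PER_TB = 1024           # binary units, an assumption
GB_PER_PB = 1024 * 1024


def migration_cost_usd(size_gb: float) -> float:
    """Cost of moving size_gb out of one provider and into another."""
    return size_gb * (EGRESS_USD_PER_GB + INGRESS_USD_PER_GB)


def upload_cost_usd(size_gb: float) -> float:
    """What was already paid to get size_gb into the first provider."""
    return size_gb * INGRESS_USD_PER_GB


if __name__ == "__main__":
    for label, size_gb in [("1 TB", GB_PER_TB), ("1 PB", GB_PER_PB)]:
        print(f"{label}: ${migration_cost_usd(size_gb):,.2f} to switch, "
              f"${upload_cost_usd(size_gb):,.2f} already spent getting it there")
```

With decimal units (1000 GB to the TB) the switching costs come out at exactly $200 and $200,000, so either way the order of magnitude is the point, not the last digit.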

Before anyone starts, yes, I’m aware you can send Amazon a big storage device and they’ll put all your data on that and send it back to you. Then you could probably send that data to your new provider and they’d put it in the cloud for you. I won’t get into how good a solution to the problem this is, because I haven’t really thought it through, nor do I have any idea what that sort of storage would cost or what sort of redundancy you’d want it to have to safely truck the thing around.

What I’m trying to get across is that an open stack you control is probably something you really want to own. And it’s probably something you want to deploy across more than one provider for redundancy and your own peace of mind. Just imagine if your cloud provider launches a competitive service, shuts down or for whatever reason decides not to service your account anymore.

There are mitigation strategies you can apply to this. With sproozi, for example, we’re holding a lot of data on the nodes so that we can work with it. A lot of this can be rebuilt and reacquired, so we don’t need to truck all our data about. Saving the list of places that users have submitted and that we’ve discovered is more than enough to re-crawl everything and rebuild all the indexes. This is just one example though, and we’re not saving things like images and other data critical for users, so we’re likely the exception here, not the rule.

We run only on EC2 at the moment, but when we actually start getting more data we’re going to spread it out across a few providers – just in case.

User privacy

Posted on June 2, 2009

Sproozi generates a lot of data and when we launch we’re going to be generating a lot more. Some of it is overt and displayed on the website for all to see. Some of it sits in a database somewhere: things like searches, locations, session information and clicks on outbound links. Some of that information could lead back to the people doing the searching. It’s happened before to other sites, even after they thought they had taken steps to protect users.

We’re also not the only ones; it’s pretty much par for the course in search and on the web. If you didn’t know it before now, rest assured that every click you make on any major website of any significance is tracked – including, if they can manage it, the fact that it was you who clicked. Also know that from that data they can learn an awful lot about you, possibly even who you are.

Which brings up an obvious question which many people rightly ask – if a company is concerned at all about privacy, why in the world would they keep any of this data? Well, there are a few very good reasons. First and foremost for us is one related to a previous post of mine about Testing for Search Result Quality. It boils down to a very real problem: testing user interaction is hard when the results the system produces for any given input change, by design, between submissions of the same input. To measure how we’re performing we need to measure the actual user behaviour as opposed to measuring what we spit back. Without collecting the data to measure how changes are improving (or worsening) user experience, we’re more or less releasing code and hoping it’s better. Not exactly a professional or analytical approach to take.

On the other hand, my privacy is important to me and it makes me uncomfortable to think about all the data I generate lying about the web. I’m not sure exactly what makes me uncomfortable, but I also know it’s not just me. I want to be in control of what others know about me. I feel like I need to treat everyone else’s data with the same respect I want for mine.

Let’s face it, I don’t want to keep personally identifiable data about anyone for any longer than I need to. After a while they become just a statistic anyway. So we’re looking into what to do with the data and at which points we can start to anonymise and filter the data to remove personally identifiable information, but we’re starting the process from the beginning as opposed to the end – if we’re keeping something, I want a reason to keep it.
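As a rough illustration of what “anonymise after a while” could look like, here is a minimal sketch. It is not our actual pipeline: the 30-day window, the field names and the `anonymise` function are all hypothetical, chosen only to show the idea of stripping identifying fields once a record has served its purpose.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values for illustration only.
RETENTION = timedelta(days=30)
IDENTIFYING_FIELDS = ("session_id", "ip_address")


def anonymise(record: dict, now: datetime) -> dict:
    """Return a copy of record, minus identifying fields if it is older than RETENTION."""
    if now - record["timestamp"] < RETENTION:
        return dict(record)  # still inside the window: keep as-is
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    # Coarsen the timestamp to the day so the record is harder to
    # line up with other logs.
    cleaned["timestamp"] = record["timestamp"].replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return cleaned


if __name__ == "__main__":
    click = {
        "timestamp": datetime(2009, 6, 2, 14, 30, tzinfo=timezone.utc),
        "query": "coffee shops",
        "clicked_url": "http://example.com/",
        "session_id": "abc123",
        "ip_address": "192.0.2.1",
    }
    print(anonymise(click, now=datetime(2009, 7, 15, tzinfo=timezone.utc)))
```

Dropping a session ID is only a first step. The search text itself can still point back to a person, which is exactly how other sites have been caught out, so the harder question is which fields need filtering at all.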

Some argue that data is the most valuable thing a business has in the modern world, but we tend to take a smaller, more pragmatic, more local view – we can build up a data set for some metric we want to start measuring, but we can never rebuild trust if we lose it. Plus, what value will a click stream have in 2, 3, 6 or even 12 months’ time? So why bother collecting what we’re not using?

Some might ask, why bring this all up before you’ve launched? Well, partly because I’ve been thinking about it recently and partly because I think it’s important to be upfront about the data we collect, what we use it for and what we’ll do with it. I also think it’s important to show that privacy isn’t an afterthought. It’s something we considered before we ever collected any data, it’s something we continue to consider as we develop, and we take the privacy of our users very seriously.

A post a day for the month of June – Day 2


A post a day for the month of June

Posted on June 1, 2009

Once upon a time I tried to set myself the goal of a post a day forever – it was hard and I failed almost immediately. Forever, it turns out, is a long time. Days, weeks or months later I look back and think, this was supposed to be the future; where are the flying cars, the robots, my posts?

So I’m going to try an experiment. I’m going to write something here every day for the whole of June. I’m going to try to avoid writing the posts all in a day and scheduling them, because that’s not writing something here every day.

I’ve just been to BarCamp Leeds 2009, sproozi is coming along well, and we’re re-launching GoRoam to focus more on the consulting work we do and want to do more of. I’m also helping to set up some coworking, open coffees and other things so I can collaborate more with the local entrepreneurs and the digital community in Hebden Bridge. So really there isn’t any excuse: I have enough to say to easily fill 30 days.

So, a post a day for the month of June – Day 1.
