Saturday, 5 July 2008

Cloud or Mesh. Relational or Hierarchical. Highly Distributed Logical Data Centres

As you know I'm really interested in how web applications are going to be architected as the internet age moves on. One of the dichotomies I'm trying to resolve in my mind is how data is stored with highly distributed applications.
What do I mean by distributed? For the purposes of this post let's just assume this means an application that is accessible from different devices, and is not bound to a single machine. Classically, this is a web site, or a client application that uses some kind of API to store data on the web.

Seems that the approach up until recently was to store your data on servers in a co-lo or dedicated data centre. Meaning that as an application developer and/or operations dude I have to scale my application based on physical architecture I know about. Generally as my app scales that means I need, eventually, to horizontally scale my database across more than one logical database. This is not straightforward, and even with the introduction of Hibernate Shards I really need to think about that. And this probably means I'm going to denormalize my database and have to work out how to synchronize some of the data I'm storing across these different logical DBs.
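To make the "more than one logical database" problem concrete, here's a minimal sketch of routing a record to a shard by hashing its key. The shard names and the `shard_for` function are made up for illustration; this is not how Hibernate Shards does it, just the simplest version of the idea.

```python
import hashlib

# Hypothetical names for three logical databases.
SHARDS = ["db0", "db1", "db2"]

def shard_for(user_id: str, shards=SHARDS) -> str:
    """Pick a logical database by hashing the key.

    The same key always maps to the same shard, so reads and writes
    for one user stay together.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Note that with plain modulo hashing like this, adding a fourth shard changes where most keys live, which is part of why this is not straightforward.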

It strikes me though that with "cloud storage", things like Amazon SimpleDB or Google's App Engine, I may want to start with a hierarchical database that is denormalized by default. No more joins. I guess we've had this option for a long time with things like Oracle Objects, but seriously, have you as a developer ever tried to use that beast? Not fun. Google and Amazon (and soon Microsoft, with SQL Server Data Services) will have solved that "synchronize data in a denormalized, logically partitioned database in many data centres" problem for me. So should I start by using that approach? Should I offload my database to these guys and just pay transactionally for what I do? This means a significant mindset change for me: I'm so used to drawing out relational diagrams, and so used to using ORM or other mapping tools to abstract that away. But I guess the benefit of this approach is that from the beginning, provided these big guys aren't lying to me, I have an app that will scale, that will respond consistently, that is backed up and disaster-resistant, and that I only need to pay for on demand. This is Goodness.
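Here's a rough sketch of the mindset change, using plain Python dictionaries to stand in for both kinds of store. All the names and sample data are invented for illustration, and this doesn't use the real SimpleDB or App Engine APIs.

```python
# Relational shape: three "tables", stitched together at read time.
authors = {1: {"name": "Tim"}}
posts = {10: {"author_id": 1, "title": "Cloud or Mesh"}}
comments = [
    {"post_id": 10, "text": "Nice post"},
    {"post_id": 10, "text": "Cuckoo land!"},
]

def load_post_relational(post_id):
    """Emulate the joins an ORM would normally hide from me."""
    post = posts[post_id]
    return {
        "title": post["title"],
        "author": authors[post["author_id"]]["name"],
        "comments": [c["text"] for c in comments if c["post_id"] == post_id],
    }

# Hierarchical shape: one self-contained, denormalized item per key.
store = {
    "post:10": {
        "title": "Cloud or Mesh",
        "author": "Tim",
        "comments": ["Nice post", "Cuckoo land!"],
    }
}

def load_post_hierarchical(post_id):
    """A single key lookup, no joins."""
    return store[f"post:{post_id}"]
```

Both functions return the same thing. The price of the second shape is that the author's name is copied into every post, so changing it means touching every copy.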

I can't help thinking, though, that this approach still requires a bunch of data centres and the associated power, and that this will eventually turn into a cost for me as an app developer.

This brings me to Mesh, or Grid computing. If you're reading this, your PC is on right now, and, as I am using Blogger to host my blog, you're pulling data back from Google. Now, I don't have the world's most read blog, I don't get thousands of hits a second, but still, for everyone who has read this blog there's a good chance that all this text is cached on their machines. And it originated from the machine that I type this on.

You're familiar with swarm-based file sharing, right? Where somebody seeds a file, and then others leech it, and when they have downloaded it, they become another seed on the network? Indeed, they can start sharing partial data as soon as they've downloaded it. There's no central store of this data, just some metadata that tracks where the bits are. This is how BitTorrent works (and indeed, how the BBC iPlayer works in offline mode, which is why they ask you to dedicate 20GB of hard drive space).
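The "just some metadata that tracks where the bits are" part can be sketched in a few lines. This is a toy model with made-up names (`Tracker`, `announce`, `peers_for`), not the actual BitTorrent protocol.

```python
from collections import defaultdict

class Tracker:
    """Holds no file data, only a record of which peer has which piece."""

    def __init__(self, piece_count):
        self.piece_count = piece_count
        self.holders = defaultdict(set)  # piece index -> set of peer names

    def announce(self, peer, pieces):
        """A peer reports the pieces it can now share, even a partial set."""
        for piece in pieces:
            self.holders[piece].add(peer)

    def peers_for(self, piece):
        """Who can I fetch this piece from?"""
        return sorted(self.holders[piece])

    def is_seed(self, peer):
        """A peer holding every piece counts as a seed."""
        return all(peer in self.holders[i] for i in range(self.piece_count))


tracker = Tracker(piece_count=3)
tracker.announce("seed", [0, 1, 2])   # the original seeder has everything
tracker.announce("leech", [1])        # a leecher shares what it has so far
```

Once the leecher announces pieces 0 and 2 as well, `is_seed("leech")` becomes true and there are two seeds on the network.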

Why don't we have this approach for other forms of application?

I envisage a future where logical hierarchical databases are partitioned across end nodes, such as the PC you're reading this on, and where your PC can take part in large map/reduce calculations, and that (best of all) you can have your PC and broadband for free, because application developers are renting space on it.
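As a toy illustration of end nodes taking part in a map/reduce calculation, here's a word count where each node only looks at the data it holds and a coordinator merges the partial results. The node names and documents are invented, and everything runs in one process; a real grid would have to ship the work over the network.

```python
from collections import Counter
from functools import reduce

def map_on_node(documents):
    """Map step: runs on one node, over only that node's partition."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-node partial counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Hypothetical partitions held by two end nodes.
nodes = {
    "home-hub-1": ["cloud or mesh", "mesh mesh"],
    "pc-2": ["cloud storage"],
}

total = reduce_counts(map_on_node(docs) for docs in nodes.values())
```

The nice property is that the map step never needs to see another node's data, so only the small partial counts have to travel.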

Google and Amazon are busy building out compute and storage in the cloud with all their data centres, and they have to pay for the power to run them. Good for them. But telcos already have the makings of a grid which could, with some clever software, compete with all this, and at a much lower cost base.

In my house I have a BT Home Hub. This is a wireless router, and connects back to the internet through BT as my ISP. What's more, it's just an embedded Linux device. Further, unlike my PC, I tend to leave it on the whole time. There's also enough space in it to throw in a hard drive, or some solid state storage. It could act as a node in this grid I'm envisaging. It could even negotiate with the PCs connected to it and utilise their storage and CPU.

BT could give this to me, for free, and then charge back to application developers the cost of storage and compute, without ever needing to build data centres, and while offloading the cost of all the power required to run server farms.

I understand that there are issues around latency, concurrency, routing, and a whole bunch of other problems to solve. But I reckon that, rather than attempting to replicate what Amazon and Google and a host of others are doing, telcos should concentrate on taking their existing deployed Customer Premises Equipment assets and building out storage, compute and content distribution based on this.

What do you think? Am I in cloud cuckoo land again?

Tim Stevens
