NoSQL vs SQL
Mark Woodward
markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org
Tue Dec 7 09:08:47 EST 2010
On 12/07/2010 07:29 AM, Dan Ritter wrote:
> On Tue, Dec 07, 2010 at 12:07:13AM -0500, Mark Woodward wrote:
>
>> The most important aspect of these scalability discussions, and one
>> which I frequently find lacking in the "NoSQL" camp is a critique of
>> just how many full scale transactions a SQL database can have.
>>
> ...
>
>
>> average head positioning time. This gives us a worst case average of
>> 77 arbitrary write procedures per second.
>>
>> The actual iops is higher based on seek time these days, but lets use
>> the worst case scenario. (We could use a RAID system and multiply
>> performance)
>>
>> In an ACID database configuration, assume 1/2 maximum, i.e 36 writes per
>> second on the database. That's 36 transactions (read/write) per second.
>> Assuming 1 transaction per page view, that amounts to about 90 million
>> page views a month (on average) as a sustainable number.
>>
>> So, a good SQL database with no scaling tricks on a bog stock modern PC
>> based server will serve a web site as busy as all but the very most
>> popular sites on the web. Someone, please tell me, what are the NoSQL
>> guys going on about with regard to scalability?
>>
> There are basically no sites that use a database and only want one record
> view/update per page view.
>
For the sake of argument, "one transaction" is short hand for one
write/update operation. I'm assuming a databases page cache and/or
memcache (or something like it) will cope with repeat session based
data. But, that's a design detail, but your point is taken.
> Let's say we have, oh, a dating site. The login page may only need one DB
> query. How about the second page, where they will helpfully show you the
> names, photos and blurbs of ten or so eligible people, plus a check on
> how many messages you've received, plus another on how many you've sent,
> and... you can get into 50 queries pretty quickly. Add some billing and
> logging records and you could have a couple of updates in there, too.
>
Again, not to be pedantic, common queries where nothing is modified are
a lot less expensive. Obviously Facebook is a completely different
animal, i mean, jeez, they practically have the population of the earth
on-line. That's not what I'm talking about, a "dating site" is an
interesting case.
Lets break it down, how many members? How many members on-line at any
time? What are the usage patters? Let's assume you use PHP with some
sort of distributed session cache mechanism. I'm using a system I
designed for DMN a few years back as the basis of my response.
A login, yes one transaction two disk I/O operations, retrieval of data
and storage of change of state. The login also brings the user's
information into the page cache of the database, so most of the
subsequent select operations that deal with deal directly with user info
will be hitting the page cache and avoiding the disk. The first page
view will always be the hardest, that brings the most data into the
various caching subsystems (php session cache, any memcache object
definitions, and, of course, database disk page buffers.) Subsequent
page views as the user uses the system, tend to hit the page cache and
only alterations of data and retrieval of data not currently in the page
cache force I/O operations.
The efficacy of the page cache is directly proportional to the number of
members in total, and the number of members on-line and active at any
one time. You aren't going to have a thousand users synchronizing their
watches and pressing a button at the exact same time. So, unless your
load forces the database to read disk pages for every single select
operation because the variability of the requests exceeds your page
cache, user actions (multiple disk I/Os per page) can be averaged out
over a number of page views.
> Now, a clever system will aggregate the most often requested data into
> a single row, when it can... that helps with the second page, but not
> with profile pages or a mail-analogue system. And a clever system will
> cache as much frequently-requested data as possible.
>
Many databases do this transparently, but yes, systems like memcached
can be very helpfull.
> But then there's Facebook, and anything even vaguely like it. The obvious
> first thing to do is to optimize time at the cost of storage: store a
> copy of the new message with each of the recipients, so that when they
> read their update page, that takes one transaction instead of as many
> as they have friends. And pretty quickly you run out of disk space. So
> you decide that you have a read-mostly environment, and have many, many
> read-only slave databases to handle lookups, and most of those lookups
> aren't needing anything near the power of SQL -- really, you could get
> away with a key-value lookup, if only it were FAST -- and before you
> know it, you've re-invented MongoDB.
>
Facebook, like the video I posted, Facebook *has* to implement these
things and for good reason, the scale at which they operate is HUGE.
Unbelievably huge. But, it took them several years to get where they are
and, like I was saying, all but the most busy web sites would well be
handled by a standard database.
Yes, if you are facebook, you have a different set of problems. Hell, if
you are L.L. Bean you have a different set of problems. However, 99% of
the web sites out there will never have those problems. As the saying
goes, "Let's hope we have to address these problems later."
> By way of contrast, I know there are sites which do real work on
> data people care about in sophisticated ways -- and they
> generally use full SQL databases, because that's what they need.
>
> SQL and ACID solve some tough problems. Key/value systems solve
> some speed issues, as do caches. There are lots of solutions
> because there are lots of problems.
>
My issue with the NoSQL mentality is the idea of "no." You can't have a
conversation when it starts with "no." Then there is always the refrain:
"Use the right tool for the job" which is a euphemism for limiting your
options and usually the defense of the indefensible. Well, more often
than not you can't ever really know what the right tool is or the level
of sophistication required at the beginning of a project, especially
with AGILE type development. So, you start down the road myopically
focusing on one aspect of the project "scaling" whilst completely
ignoring vast parts of the project, like, what are you going to do with
the data once you get it? What MIGHT you WANT to do with the data? Sure,
you're covered for the next 5 years of (wishful thinking) scaling at the
expense of accessing your data.
Most all of the NoSQL solutions are dreadfully lacking in any sort of
capability to allow you to capitalize on your data beyond mere
regurgitation to a web page. Your data will never become "information"
and information is where the money comes from.
My general rule of thumb is, "it is better to have it and not need it,
than to need it and not have it." So, I would say, the question
shouldn't be "Do I need a SQL database" it should be "Are there areas
where a SQL database doesn't work?"
Mark.
More information about the Discuss
mailing list