Re: Database design issue: seems really bad, yet I'm having trouble coming up with anything better -- MIDRANGE-L

On Wed, Dec 10, 2014 at 9:09 PM, James H. H. Lampert
<jamesl@xxxxxxxxxxxxxxxxx> wrote:

On 12/10/14 5:27 PM, Evan Harris wrote:
Something about [the mapping tables] strikes me as being not much of
an improvement over the initial hierarchical solution that seemed so wrong.

It's not much of an improvement (if any), in terms of disk storage.
Actually, the thing you are calling the "hierarchical" solution and
the thing that Matt was describing, which I guess we've kind of agreed
is called the "mapping table" solution, really amount to the same
thing, conceptually. Matt was just a bit more explicit about having
proper IDs for everything. Your original description of the
hierarchical tables was a little more hand-wavy.

Honestly, I don't think you should be worrying about disk space. You
should be worrying about what is easy to implement, while being
correct, robust, and reasonably efficient in terms of execution time.

Let the database do what it's built to do. Finding records where one
field exactly matches your desired value is pretty much the simplest
SQL there is (or the simplest RLA loop there is, if you absolutely
must), and it will be blazing fast if you have the proper index.

Cramming multiple values into one field and then scanning that field
is NOT what the database was built to do. It can do it, but it's not
as happy (or as fast) as when it's just matching whole fields. You
also increase complexity and decrease maintainability. Neither of
those is usually worth the savings in disk space.

Putting space for up to 10 "protein" codes, space-delimited, in each
"starch" and "vegetable" record, would add 39 to 41 characters (depending on
whether we need leading and/or trailing spaces for the entire field) to each
"starch" and "vegetable" record. With 20 "protein" codes per record, it
would be 79-81 characters. Then, the search would work out to nothing more
than, say, looking for all "starch" records for which the "protein list"
field contains the code for the desired "protein."

I would guess that the above proposal is one of the slowest things you
could do. If you are going to try ANY fancy list-in-a-field type of
scheme, you might as well just throw all the valid starches and all
the valid veggies into big fields in the protein record. This is
similar to your two-bitfields-in-the-protein-record idea, except with
a variable number of actual IDs in long fields, rather than a fixed
number of flags.

This would be especially fast if ALL you needed were the starch IDs
and the veggie IDs. But if you are going to have to look up
additional information for each starch and each veggie, then pretty
much all the speed advantage evaporates, and you might as well have
just done the query to grab the starch and veggie records in the first
place.

I think you will find that the hierarchical or mapping solutions are
actually NOT very redundant. They are conceptually quite clean. They
just have a big constant factor in terms of disk space used. If you
can get over that, I think you'll be happy.

(Hey, a lot of people use date tables instead of doing date
calculations. I am a bit of a mathematician at heart, and I'm old
enough to remember the days when disk was precious; so I get a little
nauseous when I think of all the disk space being used when you could
instead just have a few clever algorithms and zero disk usage. But
really, when your big tool is the database, just let it do the heavy
lifting. It's already designed and optimized for serving disk and
running queries. Doing feats of raw calculation is probably the thing
it is least suited for. Indeed, my desktop PC can run circles around
our IBM i when it comes to straight number-crunching. That doesn't
mean I want my PC trying to handle the workload our i handles.)

John Y.