> Does
> that mean that deduplication applies to objects that are copied to the
> cloud only? Or does it apply to objects stored in the VTL?
"It Depends." :-)
Different vendors do it different ways. ETI SPHiNX, for example, can do
it either way: either dedup locally and then send the dedupped file, or
leave the local copy un-dedupped and do the deduplication on the way
to the cloud.
Cybernetics does it both locally and for the cloud. They have two ways
to do the deduplication, on the fly and post-backup. This is only from
memory, but the post-backup method performs better. The 49 to 1 I
mentioned was a post-backup dedup situation.
VTL = Virtual "TAPE" Library, so yeah, they are tape devices. To IBM i
they behave exactly like an IBM-supported tape library.
I have had arguments that 'Deduplication IS compression.' OK, I guess,
but they are such different ways of accomplishing the goal that I treat
them differently.
Effectively, deduplication is looking for things that are the same and
eliminating the multiple copies. There are many ways to do this. For
example, the ZFS file system hashes every data block, and that hash
becomes the key for the block. When a new block is written, ZFS looks in
the hash table (kept in memory, so have lots of that!) to see if the
block already exists. If so, there is no need to write the new block; it
is thus 'de-dupped'. Now, nobody I know actually USES ZFS dedup, but it
exists.
Better deduplication occurs when there is knowledge of the data.
Cybernetics has that. They base the deduplication on a master tape: you
do a SAVE 21 and tag it as the master. Subsequent backups compare to
that master tape and need not write the duplicate parts to disk again.
It's likely they identify the beginnings of libraries and objects as
they are being written.
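I don't know their internals, so this is only a guess at the shape of
it: a Python sketch of comparing today's objects against hashes from
the tagged master and writing only what changed. All the names here are
my invention, not anything from Cybernetics:

import hashlib

def backup_vs_master(objects: dict, master: dict) -> list:
    """objects maps object name -> content for today's backup.
    master maps object name -> sha256 digest from the master backup.
    Returns the names that must actually be written; unchanged objects
    are stored as references back to the master copy."""
    changed = []
    for name, content in objects.items():
        if master.get(name) != hashlib.sha256(content).digest():
            changed.append(name)
    return changed

A real product would surely work at a finer grain than whole objects,
so that only the new records of a grown file get written, but the
compare-to-master idea is the same.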
A simple example was the TAATOOL Source Archive. It had an option for
compression. It turns out all that Jim did was replace long strings of
blanks and '*' characters with a two-character value: one character was
the 'flag' that this was a compressed run of one of those two
characters, and the other was a binary value of how many. It often
compressed source code by 85 to 90%! It worked because he had knowledge
of what most source looked like.
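Something like this Python toy shows the trick. The flag bytes and the
run threshold are my guesses, not Jim's actual encoding:

FLAG_BLANK = b'\x01'  # hypothetical flag bytes; safe because source
FLAG_STAR  = b'\x02'  # members won't contain control characters

def compress_line(line: bytes) -> bytes:
    """Replace runs of blanks or '*' with a flag byte plus a count byte."""
    out = bytearray()
    i = 0
    while i < len(line):
        ch = line[i:i + 1]
        if ch in (b' ', b'*'):
            run = 1
            while i + run < len(line) and line[i + run:i + run + 1] == ch and run < 255:
                run += 1
            if run > 2:  # shorter runs aren't worth the two-byte token
                out += (FLAG_BLANK if ch == b' ' else FLAG_STAR) + bytes([run])
            else:
                out += ch * run
            i += run
        else:
            out += ch
            i += 1
    return bytes(out)

line = b'     C' + b' ' * 40 + b'MOVE' + b' ' * 20 + b'X'
print(len(line), len(compress_line(line)))  # 71 -> 12

It pays off exactly because source members are mostly runs of blanks.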
If you think about deduplication, take this example. It's real, by the
way; only the names are changed.
Library on the system full of data files. We'll call it BOBLIB.
First file object in the library is AAAFile. It is just your basic file
that grows and shrinks daily with records added, deleted, and changed.
It trends larger over time by several MB a day.
Second file object is BBBFile. Big sucker. ONLY Writes occur. NEVER are
records changed. File is over 40GB.
Third file object is CCCFile. Just like BBBFile but shorter records and
about 25GB.
And lots more.
So you're backing up BOBLIB every day. Nearly every day you need a
little more space for AAAFile because it is growing. If you purely write
the data coming in the Fibre port to the disk on your VTL, that growth
will 'shift' the data that comes from BBBFile behind it. This means that
while you expect 99.5% of BBBFile to be the same as yesterday and to
dedup down to only the new records, that ain't happening. This would
happen, for example, if you used ZFS deduplication: only when the
beginning of BBBFile lined up on a block boundary would it dedup well.
The same problem occurs with CCCFile, because both AAAFile and BBBFile
grow daily.
With KNOWLEDGE, though, you would see the beginning of BBBFile and line
it up on a block boundary as you write it to disk every day, and thus
you achieve the 99.5% dedup you expected.
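Here's a toy Python demonstration of both halves of that: the shift
killing fixed-block dedup, and padding each object to a block boundary
(a crude stand-in for real knowledge of the data) bringing it back.
Sizes and names are invented; no vendor works exactly this way:

import hashlib, os

BLOCK = 4096

def block_hashes(stream: bytes) -> set:
    # Hash every fixed-size block of the backup stream, ZFS-style.
    return {hashlib.sha256(stream[i:i + BLOCK]).digest()
            for i in range(0, len(stream), BLOCK)}

aaa = os.urandom(10_000)    # toy AAAFile: grows a little every day
bbb = os.urandom(400_000)   # toy BBBFile: append-only, rarely changes

yesterday = aaa + bbb
today = aaa + os.urandom(37) + bbb  # 37 new AAAFile bytes shift BBBFile

# Naive stream dedup: only blocks entirely inside AAAFile still match.
print(len(block_hashes(today) & block_hashes(yesterday)))      # 2

def align(obj: bytes) -> bytes:
    # With knowledge of object boundaries, pad each to a block boundary.
    return obj + b'\x00' * ((-len(obj)) % BLOCK)

y_aligned = align(aaa) + align(bbb)
t_aligned = align(aaa + os.urandom(37)) + align(bbb)
print(len(block_hashes(t_aligned) & block_hashes(y_aligned)))  # 100

That second number is essentially all of BBBFile deduplicating, which
is the 99.5% you expected in the first place.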
Each vendor does it differently. From my experience so far, Cybernetics
does it best, Dell/EMC Data Domain is second, and ETI SPHiNX is third
best. Note that I have NOT worked with every vendor!!
- Larry "DrFranken" Bolhuis
www.Frankeni.com
www.iDevCloud.com - Personal Development IBM i timeshare service.
www.iInTheCloud.com - Commercial IBM i Cloud Hosting.
On 1/18/2020 10:05 AM, Nathan Andelin wrote:
> On Fri, Jan 17, 2020 at 5:44 PM DrFranken <midrange@xxxxxxxxxxxx> wrote:
>> But the VTL vendors like to throw around 100 to 1 so 49 to 1 is clearly
>> not an unexpected number.
> Rob Berendt indicated in his last message that BRMS enables you to select
> the date of the file or object that you want to restore. One can imagine a
> lot of duplication day by day for objects that don't change regularly. Does
> that mean that deduplication applies to objects that are copied to the
> cloud only? Or does it apply to objects stored in the VTL? I seem to recall
> Cybernetics VTL literature indicating that deduplication is a post-save
> operation as opposed to something that occurs during a save. Other
> literature indicates that deduplication differs from compression in that
> deduplication eliminates duplicate files, while compression eliminates bits
> from bytes.
> IBM i SAV and SAVLIB commands support CLEAR(*REPLACE), which if I
> understand correctly behaves differently based on whether the output device
> is type *OPT or type *TAP. In the case of *OPT, the existing "label" is
> replaced, whereas with *TAP, the entire device is cleared. This is a way to
> eliminate duplication with *OPT type devices. That was one reason I asked
> whether a VTL is categorized as *TAP or *OPT?