> Does
> that mean that deduplication applies to objects that are copied to the
> cloud only? Or does it apply to objects stored in the VTL?
"It Depends." :-)
Different vendors do it different ways. ETI SPHiNX, for example, can do
it either way: either dedup locally and then send the dedupped file, or
leave the local copy un-dedupped and do the deduplication on the way
to the cloud.
Cybernetics does it both locally and for the cloud. They have two ways
to do the deduplication, on the fly and post-backup. This is only from
memory, but the post-backup method performs better. The 49 to 1 I
mentioned was a post-backup dedup situation.
VTL = Virtual "TAPE" Library, so yeah, they are tape devices. To IBM i
they behave exactly like an IBM-supported tape library.
I have had arguments that 'Deduplication IS compression.' OK, I guess,
but they are such different ways of accomplishing the goal that I treat
them differently.
Effectively, deduplication is looking for things that are the same and
eliminating the multiple copies. There are many ways to do this. For
example, the ZFS file system hashes every data block, and that hash
becomes the key for the block. When a new block is written, ZFS looks in
the hash table (kept in memory, so have lots of that!) to see if the
block already exists. If so, there is no need to write the new block; it
is thus 'de-dupped'. Now, nobody I know actually USES ZFS dedup, but it
exists.
Better deduplication occurs when there is knowledge of the data.
Cybernetics has that. They base the deduplication on a master tape: you
do a SAVE 21 and tag it as the master. Subsequent backups compare to
that master tape and need not write the duplicate parts to disk again.
It's likely they identify the beginnings of libraries and objects as
they are being written.
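I don't know their internals, so this is only a guess at the shape of
it: a Python sketch of comparing today's objects against hashes from
the tagged master and writing only what changed. All the names here are
my invention, not anything from Cybernetics:

import hashlib

def backup_vs_master(objects: dict, master: dict) -> list:
    """objects maps object name -> content for today's backup.
    master maps object name -> sha256 digest from the master backup.
    Returns the names that must actually be written; unchanged objects
    are stored as references back to the master copy."""
    changed = []
    for name, content in objects.items():
        if master.get(name) != hashlib.sha256(content).digest():
            changed.append(name)
    return changed

A real product would surely work at a finer grain than whole objects,
so that only the new records of a grown file get written, but the
compare-to-master idea is the same.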
A simple example was the TAATOOL Source Archive. It had an option for
compression. It turns out all that Jim did was replace long strings of
blanks and '*' characters with a two-character value: one character was
the 'flag' that this was a compressed run of one of those two
characters, and the other was a binary value of how many. It often
compressed source code by 85 to 90%! It worked because he had knowledge
of what most source looked like.
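Something like this Python toy shows the trick. The flag bytes and the
run threshold are my guesses, not Jim's actual encoding:

FLAG_BLANK = b'\x01'  # hypothetical flag bytes; safe because source
FLAG_STAR  = b'\x02'  # members won't contain control characters

def compress_line(line: bytes) -> bytes:
    """Replace runs of blanks or '*' with a flag byte plus a count byte."""
    out = bytearray()
    i = 0
    while i < len(line):
        ch = line[i:i + 1]
        if ch in (b' ', b'*'):
            run = 1
            while i + run < len(line) and line[i + run:i + run + 1] == ch and run < 255:
                run += 1
            if run > 2:  # shorter runs aren't worth the two-byte token
                out += (FLAG_BLANK if ch == b' ' else FLAG_STAR) + bytes([run])
            else:
                out += ch * run
            i += run
        else:
            out += ch
            i += 1
    return bytes(out)

line = b'     C' + b' ' * 40 + b'MOVE' + b' ' * 20 + b'X'
print(len(line), len(compress_line(line)))  # 71 -> 12

It pays off exactly because source members are mostly runs of blanks.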
If you think about deduplication, take this example. It's real, by the
way; only the names are changed.
Library on the system full of data files. We'll call it BOBLIB.
First file object in the library is AAAFile. It is just your basic file
that grows and shrinks daily with records added, deleted, and changed.
It trends larger over time by several MB a day.
Second file object is BBBFile. Big sucker. ONLY Writes occur. NEVER are
records changed. File is over 40GB.
Third file object is CCCFile. Just like BBBFile but shorter records and
about 25GB.
And lots more.
So you're backing up BOBLIB every day. Nearly every day you need a
little more space for AAAFile because it is growing. If you purely write
the data coming in the Fibre port to the disk on your VTL, that growth
will 'shift' the data that comes from BBBFile behind it. This means that
while you expect 99.5% of BBBFile to be the same as yesterday and to
dedup down to only the new records, that ain't happening. This would
happen, for example, if you used ZFS deduplication: only when the
beginning of BBBFile lined up on a block boundary would it dedup well.
The same problem occurs with CCCFile, because both AAAFile and BBBFile
grow daily.
With KNOWLEDGE, though, you would see the beginning of BBBFile and line
it up on a block boundary as you write it to disk every day, and thus
you achieve the 99.5% dedup you expected.
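Here's a toy Python demonstration of both halves of that: the shift
killing fixed-block dedup, and padding each object to a block boundary
(a crude stand-in for real knowledge of the data) bringing it back.
Sizes and names are invented; no vendor works exactly this way:

import hashlib, os

BLOCK = 4096

def block_hashes(stream: bytes) -> set:
    # Hash every fixed-size block of the backup stream, ZFS-style.
    return {hashlib.sha256(stream[i:i + BLOCK]).digest()
            for i in range(0, len(stream), BLOCK)}

aaa = os.urandom(10_000)    # toy AAAFile: grows a little every day
bbb = os.urandom(400_000)   # toy BBBFile: append-only, rarely changes

yesterday = aaa + bbb
today = aaa + os.urandom(37) + bbb  # 37 new AAAFile bytes shift BBBFile

# Naive stream dedup: only blocks entirely inside AAAFile still match.
print(len(block_hashes(today) & block_hashes(yesterday)))      # 2

def align(obj: bytes) -> bytes:
    # With knowledge of object boundaries, pad each to a block boundary.
    return obj + b'\x00' * ((-len(obj)) % BLOCK)

y_aligned = align(aaa) + align(bbb)
t_aligned = align(aaa + os.urandom(37)) + align(bbb)
print(len(block_hashes(t_aligned) & block_hashes(y_aligned)))  # 100

That second number is essentially all of BBBFile deduplicating, which
is the 99.5% you expected in the first place.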
Each vendor does it differently. From my experience so far, Cybernetics
does it best, Dell/EMC Data Domain is second, and ETI SPHiNX is third
best. Note that I have NOT worked with every vendor!!
- Larry "DrFranken" Bolhuis
www.Frankeni.com
www.iDevCloud.com - Personal Development IBM i timeshare service.
www.iInTheCloud.com - Commercial IBM i Cloud Hosting.
On 1/18/2020 10:05 AM, Nathan Andelin wrote:
> On Fri, Jan 17, 2020 at 5:44 PM DrFranken <midrange@xxxxxxxxxxxx> wrote:
>> But the VTL vendors like to throw around 100 to 1 so 49 to 1 is clearly
>> not an unexpected number.
> Rob Berendt indicated in his last message that BRMS enables you to select
> the date of the file or object that you want to restore. One can imagine a
> lot of duplication day by day for objects that don't change regularly. Does
> that mean that deduplication applies to objects that are copied to the
> cloud only? Or does it apply to objects stored in the VTL? I seem to recall
> Cybernetics VTL literature indicating that deduplication is a post-save
> operation as opposed to something that occurs during a save. Other
> literature indicates that deduplication differs from compression in that
> deduplication eliminates duplicate files, while compression eliminates bits
> from bytes.
> IBM i SAV and SAVLIB commands support CLEAR(*REPLACE), which if I
> understand correctly behaves differently based on whether the output device
> is type *OPT or type *TAP. In the case of *OPT, the existing "label" is
> replaced, whereas with *TAP, the entire device is cleared. This is a way to
> eliminate duplication with *OPT type devices. That was one reason I asked
> whether a VTL is categorized as *TAP or *OPT?