On 09 Sep 2013 11:31, Jeff Young wrote:
I have a file that contains more than 1M records that I need to copy
to an identical file that will be empty to start.

So /filling/ vs /replacing/ is the topic; there are additional considerations when replacing rows, so the distinction is germane.

Will it be faster using SQL INSERT or CPYF Fromrcd(1) ?

Most likely CPYF. However, nuances of the actual data, files, system, system activity, and requirements could each make CPYF less clearly the faster choice.

If fast-copy is enabled for the particular invocation of the Copy File (CPYF) utility, then CPYF surely will be fastest. The "fast copy" feature of CPYF is effectively what CRTDUPOBJ does [it just copies the entire dataspace], and it requires both that the FROMMBR can be exclusively allocated and that "identical" is an accurate claim for the Record Format, including the RcdFmt Level Identifier [FMTOPT(*NONE) is required either way, IIRC]. The use of CRTFILE(*YES) helps ensure the /identical/ requirements, and thus that part of the eligibility for fast-copy, is met.
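
Just for illustration, an invocation along those lines might look like the following; MYLIB, SRCFILE, and TGTFILE are placeholder names, and whether fast-copy is actually used still depends on the conditions above:

   /* CRTFILE(*YES) creates TGTFILE as a duplicate of SRCFILE if it */
   /* does not already exist, helping keep the formats identical.   */
   CPYF FROMFILE(MYLIB/SRCFILE) TOFILE(MYLIB/TGTFILE) +
        FROMRCD(1) MBROPT(*ADD) CRTFILE(*YES) FMTOPT(*NONE)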

If FMTOPT(*NOCHK) is acceptable [NULL values preclude that choice], again CPYF is likely to be faster, because there is no data mapping. If *MAP and/or *DROP are required Format Options, then the choice between SQL and CPYF becomes less obvious... except for some specific optimizations based on the TOFILE() attributes that are built into CPYF but must be effected by the user when using SQL INSERT; i.e. CPYF may implicitly disable and re-enable keyed access paths according to some metrics, choosing what to do based on a presumption of which actions make the overall request complete faster.
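
For comparison, when mapping is needed, the two requests might look something like this; the names and the column list are placeholders [the explicit columns in the SQL merely stand in for whatever mapping *MAP and *DROP would perform], and the SQL is shown with system naming:

   CPYF FROMFILE(MYLIB/SRCFILE) TOFILE(MYLIB/TGTFILE) +
        MBROPT(*ADD) FMTOPT(*MAP *DROP)

   INSERT INTO MYLIB/TGTFILE (COLA, COLB, COLC)
     SELECT COLA, COLB, COLC FROM MYLIB/SRCFILE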

Both requests can benefit from maximizing SEQONLY(), assuming the system is not already I/O constrained when copying the data. And...
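
A sketch of maximizing blocking with overrides, assuming the copy runs in the same job that issues them; the per-block record count is illustrative only, and the names are placeholders:

   OVRDBF FILE(SRCFILE) SEQONLY(*YES 1000)
   OVRDBF FILE(TGTFILE) SEQONLY(*YES 1000)
   CPYF FROMFILE(MYLIB/SRCFILE) TOFILE(MYLIB/TGTFILE) MBROPT(*ADD)
   DLTOVR FILE(*ALL)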

If the 1M rows can fit in memory [esp. if protected from being paged out], then a Set Object Access (SETOBJACC) request for the data would benefit either method; i.e. it prevents page faulting on the database reads, albeit sequential read is optimized to limit faults anyhow, and maximizing the number of rows in the sequential buffer further optimizes sequential read.
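
A sketch of pre-loading the source data into memory; *SHRPOOL2 is just an example pool that would need to be sized for the data, and the names are placeholders:

   SETOBJACC OBJ(MYLIB/SRCFILE) OBJTYPE(*FILE) POOL(*SHRPOOL2) MBRDATA(*DATA)
   /* ... run the copy ... */
   SETOBJACC OBJ(MYLIB/SRCFILE) OBJTYPE(*FILE) POOL(*PURGE)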

For other than fast-copy, the best results for a large number of rows often come from having the logical access paths over the target member disabled for by-row maintenance [when allowed; i.e. non-unique] and then re-enabled to rebuild or catch up in parallel after the copy completes.
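
One way to effect that for a non-unique logical file [the LF name is a placeholder] is to switch the access path maintenance before and after the copy:

   CHGLF FILE(MYLIB/TGTLF) MAINT(*REBLD)  /* no per-row maintenance during the copy */
   /* ... run the copy ... */
   CHGLF FILE(MYLIB/TGTLF) MAINT(*IMMED)  /* access path brought current, then maintained immediately */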

Irrespective of SQL or CPYF, there is benefit when an existing target member is pre-allocated for the known number of rows to be copied [optionally with some additional allocation for presumed growth]. The SIZE() and ALLOCATE(*YES) parameters of the Change Physical File (CHGPF) command effect the pre-allocation.
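
A sketch of pre-allocating for roughly 1M rows plus some growth; the numbers [initial records, increment, maximum increments] are illustrative, and the file name is a placeholder:

   CHGPF FILE(MYLIB/TGTFILE) SIZE(1100000 50000 100) ALLOCATE(*YES)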

The combination of FROMRCD() and TORCD() with CPYF can enable a somewhat simplistic parallel implementation across multiple jobs [or... presumably across multiple threads]. Doing the same with RRN() in SQL is probably not a good option, and selection via portions of a key is probably expensive due to random I/O, but selection implemented via an EVI that identifies big, sequentially aligned chunks of data potentially could avoid the random effect of reading via the keyed access path [probably best not relied upon].
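
A sketch of splitting the copy across two batch jobs by relative record number ranges; the boundary value and job names are placeholders, and both jobs add into the same target member:

   SBMJOB JOB(COPYPART1) CMD(CPYF FROMFILE(MYLIB/SRCFILE) +
          TOFILE(MYLIB/TGTFILE) MBROPT(*ADD) FROMRCD(1) TORCD(500000))
   SBMJOB JOB(COPYPART2) CMD(CPYF FROMFILE(MYLIB/SRCFILE) +
          TOFILE(MYLIB/TGTFILE) MBROPT(*ADD) FROMRCD(500001) TORCD(*END))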

