RE: Suggestions for finding unique values in million+ records -- MIDRANGE-L

Not sure I fully understand the requirement, but would something like this
work?

Create this view called mytemp:

  CREATE VIEW mytemp AS
    SELECT DISTINCT field1 AS myfield FROM myfile
    UNION
    SELECT DISTINCT field2 AS myfield FROM myfile
    UNION
    SELECT DISTINCT field3 AS myfield FROM myfile

Then run this:

  SELECT DISTINCT myfield FROM mytemp

Or something along those lines using a fancy CTE? Throw in some encoded
vector indexes over the 3 fields and it might even perform OK.
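
For what it's worth, here is a minimal sketch of the single-statement CTE
version. It runs against an in-memory SQLite database from Python only so
that it is self-contained. The names myfile, field1, field2 and field3 come
from the example above; SQLite standing in for DB2 for i, the helper names
unique_values and demo, and the four sample rows are assumptions for
illustration. Since UNION (unlike UNION ALL) already removes duplicates, the
sketch leaves out the DISTINCTs.

```python
import sqlite3

# One statement: the CTE plays the role of the mytemp view.
QUERY = """
WITH mytemp (myfield) AS (
    SELECT field1 FROM myfile
    UNION
    SELECT field2 FROM myfile
    UNION
    SELECT field3 FROM myfile
)
SELECT myfield FROM mytemp ORDER BY myfield
"""


def unique_values(conn):
    """Return the distinct values found in any of the three fields."""
    return [row[0] for row in conn.execute(QUERY)]


def demo():
    """Build a tiny sample table (assumed data) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE myfile (field1 TEXT, field2 TEXT, field3 TEXT)")
    conn.executemany(
        "INSERT INTO myfile VALUES (?, ?, ?)",
        [
            ("ABC", "ABC", "123"),
            ("", "XYZ", ""),
            ("123", "ABC", ""),
            ("456", "", ""),
        ],
    )
    return unique_values(conn)


if __name__ == "__main__":
    print(demo())
```

Whether this performs acceptably on a million-plus rows depends on the
indexes available and on what the optimizer does with them, so it would need
to be tried on the real file.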

Neill

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Vinay Gavankar
Sent: 14 December 2009 22:47
To: Midrange Systems Technical Discussion
Subject: Re: Suggestions for finding unique values in million+ records

Create 3 logicals, one keyed on each field. In the program, read the first
record, SETGT on the value found, read the next record, and so on, storing
each value in an array. Process the other 2 logicals the same way, adding
only the values that are not already in the array.

Not very elegant, probably not very fast, but you will be reading maybe 1500
records (assuming 500 unique values exist in all 3 fields).
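
The read/SETGT loop can be sketched roughly as follows, in Python rather than
RPG purely for illustration. A sorted list stands in for each keyed logical,
bisect_right plays the role of SETGT (position just past the last record with
the current key), and a set stands in for the array. The function names and
the sample rows are hypothetical.

```python
from bisect import bisect_right


def distinct_keys(sorted_keys):
    """Walk one sorted 'logical', touching one record per distinct key."""
    found = []
    pos = 0
    while pos < len(sorted_keys):
        key = sorted_keys[pos]  # READ: next record in key order
        found.append(key)
        # SETGT: jump past every record that has this same key
        pos = bisect_right(sorted_keys, key, pos)
    return found


def unique_across_fields(rows):
    """Collect the distinct values over all three fields."""
    unique = set()
    for col in range(3):
        # Stand-in for a logical keyed on this field; on the real system
        # the access path already exists and nothing is sorted at run time.
        logical = sorted(row[col] for row in rows)
        unique.update(distinct_keys(logical))
    return sorted(unique)


def demo():
    rows = [
        ("ABC", "ABC", "123"),
        ("", "XYZ", ""),
        ("123", "ABC", ""),
        ("456", "", ""),
    ]
    return unique_across_fields(rows)


if __name__ == "__main__":
    print(demo())
```

Each pass through the loop costs one read plus one reposition, so the work
grows with the number of distinct values rather than with the number of
records in the file.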

Vinay Gavankar

On Mon, Dec 14, 2009 at 4:37 PM, John Allen <jallen@xxxxxxxxxxx> wrote:

I have a file with over a million records.
The file has three fields that can contain various values (yes the values
can be in any of the three fields)
Example:
Record 1   Field 1 = ABC   Field 2 = ABC   Field 3 = 123
Record 2   Field 1 =       Field 2 = XYZ   Field 3 =
Record 3   Field 1 = 123   Field 2 = ABC   Field 3 =
Record 4   Field 1 = 456   Field 2 =       Field 3 =

I need to display a list of unique values from the combined three fields
such as this for the 4 record example above:
Blank
ABC
XYZ
123
456

This file will have over a million records in it,
and the resulting list will usually have fewer than 500 unique values.

I am trying to determine what is the best way to get the list of unique
values into a list (in an interactive job) to be displayed to the user.

I am pretty sure reading a million + records and looking up every value in
an array and having the array only contain values that are unique will be
VERY SLOW

I thought about keeping a secondary file containing a separate list of the
unique values as they are entered into the primary file. But then I have to
maintain this file by removing values that are removed from the primary
file, and determining when a value is no longer in the primary file and it
is time to remove it from the secondary file would become an issue in itself.

Anybody have any suggestions?

Thanks

John
