Re: Suggestions for finding unique values in million+ records -- MIDRANGE-L

Hi John
Am I having deja vu all over again?
Wasn't this question posted yesterday?
You have had multiple replies, but one question comes to mind.
Assume that each field is 3 characters and that each character can be
a letter A through Z (26), a digit 0 through 9 (10), or a blank (1).
That gives 37 possible values for each of the 3 characters,
which means there are 37 x 37 x 37 = 50,653 different combinations. Yet you
stipulate that there are usually fewer than 500 unique values.
Is this figure of 500 from experience with these fields,
or .....
The reason I am asking is that your answer could determine the
choice to be made,
although I must admit I like the one supplied by Mike Wills.
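
To make the arithmetic concrete, here is a minimal Python sketch of the upper bound. It assumes the character set described above (A-Z, 0-9, and blank); the names ALPHABET, FIELD_LENGTH, and max_distinct_values are made up for illustration and do not come from the original file.

```python
import string

# Assumption from the message: each position holds A-Z, 0-9, or a blank.
ALPHABET = string.ascii_uppercase + string.digits + " "
FIELD_LENGTH = 3  # assumed field width in characters


def max_distinct_values(alphabet_size: int, length: int) -> int:
    """Upper bound on the number of distinct fixed-length values."""
    return alphabet_size ** length


print(len(ALPHABET))                                     # 37
print(max_distinct_values(len(ALPHABET), FIELD_LENGTH))  # 50653
```

If the real fields allow other characters (lowercase, punctuation), the bound grows accordingly, but it is still a fixed ceiling that does not depend on the number of records.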

Alan Shore
Programmer/Analyst, Direct Response
E:AShore@xxxxxxxx
P:(631) 200-5019
C:(631) 880-8640
"If you're going through Hell, keep going" - Winston Churchill

"John Allen"
<jallen@xxxxxxxxxom>
Sent by: midrange-l-bounces@xxxxxxxxxxxx
12/15/2009 12:25 PM
Please respond to: Midrange Systems Technical Discussion <midrange-l@midrange.com>
To: <MIDRANGE-L@xxxxxxxxxxxx>
cc:
Subject: Suggestions for finding unique values in million+ records

I have a file with over a million records.
The file has three fields that can contain various values (yes, the values
can be in any of the three fields).
Example:
Record 1   Field 1 = ABC   Field 2 = ABC   Field 3 = 123
Record 2   Field 1 =       Field 2 = XYZ   Field 3 =
Record 3   Field 1 = 123   Field 2 = ABC   Field 3 =
Record 4   Field 1 = 456   Field 2 =       Field 3 =

I need to display a list of unique values from the combined three fields
such as this for the 4 record example above:
Blank
ABC
XYZ
123
456

This file will have over a million records in it,
and the resulting list will usually have fewer than 500 unique values.

I am trying to determine what is the best way to get the list of unique
values into a list (in an interactive job) to be displayed to the user.

I am pretty sure that reading a million+ records, looking up every value in
an array, and keeping only the unique values in that array will be
VERY SLOW.
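
As a rough illustration of the single-pass idea described above, here is a minimal Python sketch that uses a hashed set instead of a linear array search. It is not the approach from any reply on the list. The function name unique_values and the tuple layout are assumptions, and a blank field is represented as an empty string.

```python
from typing import Iterable, Set, Tuple

# Hypothetical record layout: the three fields of one record.
Record = Tuple[str, str, str]


def unique_values(records: Iterable[Record]) -> Set[str]:
    """Collect the distinct values found in any of the three fields."""
    seen: Set[str] = set()
    for record in records:
        # set.update adds each field value; duplicates are ignored.
        seen.update(record)
    return seen


# The four-record example from the message ("" stands for a blank field).
sample = [
    ("ABC", "ABC", "123"),
    ("", "XYZ", ""),
    ("123", "ABC", ""),
    ("456", "", ""),
]

print(sorted(unique_values(sample)))  # ['', '123', '456', 'ABC', 'XYZ']
```

Each set lookup is constant time on average, so the work is dominated by reading the records once. The set itself stays small because it only ever holds the distinct values.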

I thought about keeping a secondary file containing a separate list of the
unique values as they are entered into the primary file. But then I have to
maintain this file by removing values that are removed from the primary
file, and determining when a value is no longer in the primary file (so that
it is time to remove it from the secondary file) would become an issue in
itself.
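
One possible way to handle the removal problem described above is to store a usage count next to each value in the secondary list, so that a value is dropped only when its count reaches zero. The Python sketch below illustrates that idea under stated assumptions. The class UniqueValueIndex and its methods are hypothetical, and on a real system the counts would live in a keyed file or table rather than in memory.

```python
from collections import Counter
from typing import Iterable, List


class UniqueValueIndex:
    """Hypothetical secondary list: each value plus how often it occurs."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def add_record(self, fields: Iterable[str]) -> None:
        # Count every field occurrence, including repeats within a record.
        self._counts.update(fields)

    def remove_record(self, fields: Iterable[str]) -> None:
        for value in fields:
            self._counts[value] -= 1
            if self._counts[value] <= 0:
                # No record uses this value any more, so drop it.
                del self._counts[value]

    def values(self) -> List[str]:
        return sorted(self._counts)


index = UniqueValueIndex()
index.add_record(("ABC", "ABC", "123"))
index.add_record(("123", "ABC", ""))
index.remove_record(("ABC", "ABC", "123"))
print(index.values())  # ['', '123', 'ABC']
```

An update to a record would be treated as a remove of the old field values followed by an add of the new ones. The trade-off is that every write to the primary file also has to touch the secondary list.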

Anybody have any suggestions?

Thanks

John

--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list
To post a message email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.