Re: Locate duplicate accounts in a file -- MIDRANGE-L

On 24-Feb-2015 14:19 -0600, John McKee wrote:

I have a file that is supposed to have one record per account
number. There is at least one duplicate account in the file.

Either a PRIMARY KEY constraint or UNIQUE KEY constraint would be beneficial to be preventive of that negative effect. Otherwise a non-constraint keyed access path could protect the integrity of the data as well, either a UNIQUE INDEX with SQL DDL or similarly a DDS file-level UNIQUE keyword in the PF or a LF.

I'm thinking there is a simple(?) way to locate the duplicates with
SQL.

At least somewhat dependent on what is the definition of /locate/ being alluded, but probably still /simple/ irrespective that definition.

But, I am missing something. I have:

select siacct, count(siacct)
from spbsobdp
group by siacct
order by siacct

The above query really intends to count the rows, not the values of SIACCT. Very often the coding of COUNT(field_name) is seen being coded, where clearly the intent is COUNT(*). Failing to code COUNT(*) where intended possibly can lead to unexpected results if the file definition and\or the data changes, and cause poorer performance due to the query having to perform more work than necessary; in any case, avoid counting _values_ of a column when actual intent is to count rows.

Which produces a nice list. How can I change the output so only
those records with count > 1 show?

Add the HAVING clause to the existing query to see the list of duplicate values [however the detail /records/ having that aggregation will still not be output by that query]; e.g.:
HAVING count(*) > 1

My first thought is to redirect to a file and then query that. Is
there another option?

Unfortunately that single query even with the HAVING clause is not necessarily all that helpful, directly, because the actual RRN of each of the rows and the other field values for the row with the duplicated keys are not revealed in the result-set. Knowing only the duplicate key values requires probing the data again, to discover those other details. The following query can directly reveal the entire row along with RRN, and would be formatted with the conspicuous intent to display in a report to visually emphasize both the RRN and the Key-value, with the entire row appearing beyond the colon character; the possibly differing values of the fields across the rows sharing the same\duplicate key value, aligned according to the report writer, from which the ability to /see/ which row has the more desirable values than another row is likely:

select rrn(a), a.siacct, char(':'), a.*
from spbsobdp a
where a.siacct in ( select s.siacct
from spbsobdp s
group by s.siacct
having count(*) > 1
)
order by siacct

This mailing list archive is Copyright 1997-2025 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].

Operating expenses for this site are earned using the Amazon Associate program and Google Adsense.