The query to identify duplicate records in Oracle in a single table is fairly straightforward. Things get tricky in a two-table (master-detail) scenario, in which the master table holds the header information (id, name, etc) and the detail table has sets of values associated with the master records.
What if we need to find those master records that have identical sets of values in the detail table? Can a single SQL list master records with identical detail records?
This post explains how this can be done.
The Master:Detail Case
Given two tables:
- list_group: The table that stores header information (list_id, list_name,etc) for list_val.
- list_val: The table that stores values associated with each element of list_group
The contents of the tables are:
SQL> select list_id 2 , list_name 3 from list_group; LIST_ID LIST_NAME -------- ------------ CALLSTAT Call Status MAILSTAT Mail Status FAXSTAT Fax Status POSTSTAT Post Status SQL> select list_id 2 , value_code 3 , value_name 4 from list_val; LIST_ID VALUE_CO VALUE_NAME -------- -------- ------------------ CALLSTAT NOANS Unanswered CALLSTAT NUMNA Number Unavailable CALLSTAT SUCCESS Call Successful MAILSTAT DELIV Delivered MAILSTAT TRANSIT In-Transit MAILSTAT RETURNED Returned FAXSTAT DELIV Delivered FAXSTAT TRANSIT In-Transit POSTSTAT DELIV Delivered POSTSTAT TRANSIT In-Transit POSTSTAT RETURNED Returned 11 rows selected.
The sets of values are considered duplicate if the value "codes" in list_val are the same for the same master. In this example, MAILSTAT and POSTSTAT have duplicate values in list_val. FAXSTAT has a subset of those values, so it is not considered a "duplicate" though there is partial match.
The aim is to write a SQL to identify records in master table list_group that have duplicate sets of values in list_val.
Cut to the Chase: The Final SQL
Here’s a SQL that will identify master records with duplicate sets of detail records. An explanation and step-wise breakdown follows in the next section.
with cte as (select g1.list_id glist1 , g2.list_id glist2 from list_group g1 , list_group g2 where g1.list_id <> g2.list_id and not exists ( select 1 from list_val v1 where v1.list_id = g1.list_id and not exists ( select 1 from list_val v2 where v2.list_id = g2.list_id and v2.value_code = v1.value_code ) ) ) select c1.* from cte c1 , cte c2 where c1.glist1 = c2.glist2 and c1.glist2 = c2.glist1 and c1.glist1 < c1.glist2;
The Approach
The SQL above follows this algorithm:
- Pair every element in the master table list_group with every other element, except itself. Say the master pair is (GLIST1, GLIST2).
- Exclude from the master pairs, those in which GLIST1 is not a subset of GLIST2 (i.e. all list_val detail values of GLIST1 are not present in GLIST2)
- Exclude from the master pairs, those in which GLIST2 is not a subset of GLIST1 (i.e. all list_val detail values of GLIST2 are not present in GLIST1)
What remains are master record pairs with perfect equivalence i.e. GLIST1 = GLIST2, i.e. all detail records of GLIST1 are present in GLIST2, and vice versa.
A step-wise run of parts of the SQL is shown below.
1. Pair every element in the master table list_group with every other element, except itself.
Say the master pair is (GLIST1, GLIST2).
-- Pair every element in master table -- list_group with every other element, -- except itself select g1.list_id glist1 , g2.list_id glist2 from list_group g1 , list_group g2 where g1.list_id <> g2.list_id;
When run:
SQL> -- Pair every element in master table SQL> -- list_group with every other element, SQL> -- except itself SQL> select 2 g1.list_id glist1 3 , g2.list_id glist2 4 from list_group g1 5 , list_group g2 6 where g1.list_id <> g2.list_id; GLIST1 GLIST2 -------- -------- CALLSTAT MAILSTAT CALLSTAT FAXSTAT CALLSTAT POSTSTAT MAILSTAT CALLSTAT MAILSTAT FAXSTAT MAILSTAT POSTSTAT FAXSTAT CALLSTAT FAXSTAT MAILSTAT FAXSTAT POSTSTAT POSTSTAT CALLSTAT POSTSTAT MAILSTAT POSTSTAT FAXSTAT 12 rows selected.
2. Exclude from the master pairs, those in which GLIST1 is not a subset of GLIST2
(i.e. all list_val detail values of GLIST1 are not present in GLIST2)
Correlated subquery comes in handy when comparing the ids from the outer query in the detail subquery.
-- Exclude from the master pairs, those in -- which GLIST1 is not a subset of GLIST2 -- (i.e. all list_val detail values of GLIST1 -- are not present in GLIST2) select g1.list_id glist1 , g2.list_id glist2 from list_group g1 , list_group g2 where g1.list_id <> g2.list_id and not exists ( select 1 from list_val v1 where v1.list_id = g1.list_id and not exists ( select 1 from list_val v2 where v2.list_id = g2.list_id and v2.value_code = v1.value_code ) );
When run:
SQL> -- Exclude from the master pairs, those in SQL> -- which GLIST1 is not a subset of GLIST2 SQL> -- (i.e. all list_val detail values of GLIST1 SQL> -- are not present in GLIST2) SQL> select 2 g1.list_id glist1 3 , g2.list_id glist2 4 from list_group g1 5 , list_group g2 6 where g1.list_id <> g2.list_id 7 and not exists 8 ( 9 select 1 10 from list_val v1 11 where v1.list_id = g1.list_id 12 and not exists 13 ( 14 select 1 15 from list_val v2 16 where v2.list_id = g2.list_id 17 and v2.value_code = v1.value_code 18 ) 19 ); GLIST1 GLIST2 -------- -------- MAILSTAT POSTSTAT FAXSTAT MAILSTAT FAXSTAT POSTSTAT POSTSTAT MAILSTAT
3. Exclude from the master pairs, those in which GLIST2 is not a subset of GLIST1
(i.e. all list_val detail values of GLIST2 are not present in GLIST1)
We make use of a Common Table Expression (CTE) to factor out the repeating portion of the query.
-- Make a cte of the filtered master -- pairs. Exclude from the master -- pairs, those in which GLIST2 is -- not a subset of GLIST1. with cte as (select g1.list_id glist1 , g2.list_id glist2 from list_group g1 , list_group g2 where g1.list_id <> g2.list_id and not exists ( select 1 from list_val v1 where v1.list_id = g1.list_id and not exists ( select 1 from list_val v2 where v2.list_id = g2.list_id and v2.value_code = v1.value_code ) ) ) select c1.* from cte c1 , cte c2 where c1.glist1 = c2.glist2 and c1.glist2 = c2.glist1 and c1.glist1 < c1.glist2;
When run:
SQL> -- Make a cte of the filtered master SQL> -- pairs. Exclude from the master SQL> -- pairs, those in which GLIST2 is SQL> -- not a subset of GLIST1. SQL> with cte as 2 (select 3 g1.list_id glist1 4 , g2.list_id glist2 5 from list_group g1 6 , list_group g2 7 where g1.list_id <> g2.list_id 8 and not exists 9 ( 10 select 1 11 from list_val v1 12 where v1.list_id = g1.list_id 13 and not exists 14 ( 15 select 1 16 from list_val v2 17 where v2.list_id = g2.list_id 18 and v2.value_code = v1.value_code 19 ) 20 ) 21 ) 22 select c1.* 23 from cte c1 24 , cte c2 25 where c1.glist1 = c2.glist2 26 and c1.glist2 = c2.glist1 27 and c1.glist1 < c1.glist2; GLIST1 GLIST2 -------- -------- MAILSTAT POSTSTAT
And there you have it — a single SQL to find master records with identical detail records.