Fastest technique for deleting duplicate data

Question

After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.

In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:

-- Step 1: This step took 13 minutes. Insert only one of the n duplicate rows into a temp table
select
    MAX(prikey) as MaxPriKey, -- identity(1, 1)
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
into #dupTemp
FROM sourceTable
group by
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
having COUNT(*) > 1

Next,

-- Step 2: This step is taking the 3+ hours
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete
from sourceTable
from sourceTable
inner join #dupTemp on
    sourceTable.a = #dupTemp.a and
    sourceTable.b = #dupTemp.b and
    sourceTable.c = #dupTemp.c and
    sourceTable.d = #dupTemp.d and
    sourceTable.e = #dupTemp.e and
    sourceTable.f = #dupTemp.f and
    sourceTable.g = #dupTemp.g and
    sourceTable.h = #dupTemp.h and
    sourceTable.i = #dupTemp.i and
    sourceTable.PriKey != #dupTemp.MaxPriKey

Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.

Thanks so much.

UPDATE:
I had to stop step 2 at the 9-hour mark. I tried OMG Ponies' method, and it finished in 40 minutes. I tried my step 2 with Andomar's batch delete; it ran for 9 hours before I stopped it. UPDATE: I ran a similar query with one fewer field to get rid of a different set of duplicates, and the query ran for only 4 minutes (8000 rows) using OMG Ponies' method.

I will try the CTE technique at the next opportunity; however, I suspect OMG Ponies' method will be hard to beat.

sql sql-server sql-server-2008 etl

asked by O.O on 18 August 2010 at 15:07

6 Answers

Can you afford to have the source table unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. It is basically the approach you are already using with the temp table, except that you create a "regular" table instead.

Then drop the original table and rename the intermediate table so that it has the same name as the old table.
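A rough sketch of what that could look like in T-SQL, using the column names from the question. The name sourceTable_dedup is hypothetical, and the sketch assumes that prikey plus a through i are the table's ten columns. Treat it as a starting point to test on a copy, not a finished script.

```sql
-- Step 1: build a regular table that keeps one row per duplicate group
-- (the row with the highest prikey). Groups of one row are kept unchanged.
-- sourceTable_dedup is a hypothetical name chosen for this sketch.
SELECT s.*
INTO sourceTable_dedup
FROM sourceTable AS s
WHERE s.prikey IN (
    SELECT MAX(prikey)
    FROM sourceTable
    GROUP BY a, b, c, d, e, f, g, h, i
);

-- Step 2: swap the tables. The source table is unavailable during this step.
BEGIN TRANSACTION;

DROP TABLE sourceTable;
EXEC sp_rename 'sourceTable_dedup', 'sourceTable';

COMMIT TRANSACTION;
```

SELECT ... INTO copies only columns and data. Indexes, constraints, triggers, and permissions on the original table are not carried over and have to be recreated after the rename, and it is worth checking that prikey still has its identity property on the new table.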
