What is the fastest way to check if files are identical?

Question

If your results don't have to be on another sheet, you can simply convert your data into a table. Select cells A1:D8 and click Insert -> Table. Make sure "My table has headers" is checked, and voila!

Once the data is formatted as a table, you can filter on the product ID as needed.

If you need to show these results on another sheet, VBA would be my solution of choice. Maybe something like this:

Public Sub FilterResults()
    Dim findText As String
    Dim lastRow As Long
    Dim foundRow As Long
    Dim i As Long

    'If there's nothing to search for, then just stop the sub
    findText = LCase(Worksheets("Sheet2").Range("D4"))
    If findText = "" Then Exit Sub

    'Clear any old search results
    lastRow = Worksheets("Sheet2").Cells(Rows.Count, 4).End(xlUp).Row
    If lastRow > 5 Then
        For i = 6 To lastRow
            Worksheets("Sheet2").Range("C" & i).ClearContents
            Worksheets("Sheet2").Range("D" & i).ClearContents
            Worksheets("Sheet2").Range("E" & i).ClearContents
            Worksheets("Sheet2").Range("F" & i).ClearContents
        Next i
    End If

    'Start looking for new results
    lastRow = Worksheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Row
    foundRow = 6

    For i = 2 To lastRow
        If InStr(1, LCase(Worksheets("Sheet1").Range("B" & i)), findText) <> 0 Then
            Worksheets("Sheet2").Range("C" & foundRow) = Worksheets("Sheet1").Range("A" & i)
            Worksheets("Sheet2").Range("D" & foundRow) = Worksheets("Sheet1").Range("B" & i)
            Worksheets("Sheet2").Range("E" & foundRow) = Worksheets("Sheet1").Range("C" & i)
            Worksheets("Sheet2").Range("F" & foundRow) = Worksheets("Sheet1").Range("D" & i)
            foundRow = foundRow + 1
        End If
    Next i

    'If no results were found, then open a pop-up that notifies the user
    If foundRow = 6 Then MsgBox "No Results Found", vbCritical + vbOKOnly
End Sub

31

file language-agnostic comparison

asked by ojblass 24 April 2009 at 05:42

14 answers

An MD5 hash will be faster than a comparison, but slower than a plain CRC check. You have to decide how much reliability you want from the comparison.

-1

answered 27 November 2019 at 22:11

Beyond Compare: sync two folders, super fast! We use it all the time, every day.

0

answered 27 November 2019 at 22:11

Why reinvent the wheel? How about a third-party app? Granted, it doesn't have an API, but I don't imagine you put yourself in this situation often. I like this app, DoubleKiller; just make a backup before you start. :) It's fast and free!

0

answered 27 November 2019 at 22:11

First compare the file lengths of all million. If you have a cheap way to do so, start with the largest files. If they all pass that then compare each file using a binary division pattern; this will fail faster on files that are similar but not the same. For information on this method of comparison see Knuth-Morris-Pratt method.

2

answered 27 November 2019 at 22:11

Using cksum is not as reliable as using something like md5sum. But I would go for maximum reliability, which means a byte-by-byte comparison using cmp.

You have to read every byte in both files for all of the checking methods, so you may as well choose the one that is most reliable.

As a first pass, you can check the directory listing to see whether the sizes differ. That is a quick way to get faster feedback for files that differ.
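For illustration only, here is a minimal Python sketch of that two-step check: sizes first, then a byte-by-byte comparison in the spirit of cmp. The function name `files_identical` and the 64 KiB chunk size are assumptions made for this example, not something the answer specifies.

```python
import os


def files_identical(path_a, path_b, chunk_size=1 << 16):
    """Return True only if both files contain exactly the same bytes."""
    # Cheap first pass: files of different sizes can never be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return False  # stop at the first differing block
            if not block_a:
                return True  # both files ended without a difference
```

Because the loop returns at the first differing block, files that differ early are rejected without being read to the end.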

1

answered 27 November 2019 at 22:11

I would run something like this

find -name \*.java -print0 | xargs -0 md5sum | sort

and then see which files have different MD5 sums. This will group the files by checksum.

You can replace md5sum with sha1sum or even rmd160 if you like.
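If a shell pipeline is not available, the same grouping idea can be sketched in Python with the standard hashlib module. The helper names `file_digest` and `group_by_digest` are made up for this example; the algorithm name is passed to `hashlib.new`, so "sha1" can be swapped in for "md5".

```python
import hashlib
from collections import defaultdict


def file_digest(path, algorithm="md5", chunk_size=1 << 16):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def group_by_digest(paths, algorithm="md5"):
    """Map each digest to the list of paths that produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path, algorithm)].append(path)
    return dict(groups)
```

If the result has exactly one key, every file hashed to the same value; more than one key means at least two files differ.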

0

answered 27 November 2019 at 22:11

Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

If you have control over the output, have the program that creates the files or output compute an MD5 on the fly and embed it in the file or output stream, or pipe the output through a program that computes the MD5 along the way and stores it alongside the data somehow. The point is to do the calculation while the bytes are already in memory.

If you can't pull this off then, as others have said, check the file sizes and then do a straight byte-by-byte comparison on same-sized files. I don't see how any sort of binary division or MD5 calculation is better than a straight comparison: you have to touch every byte to prove equality any way you cut it, so you might as well cut the amount of computation needed per byte and gain the ability to stop as soon as you find a mismatch.

The MD5 calculation would be useful if you plan to compare these files again later against new outputs, but then you're basically back to my first point of calculating the MD5 as early as possible.
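As a rough sketch of the "hash while the bytes are already in memory" idea, the producing program could write through a small wrapper like the one below. `HashingWriter` is a hypothetical class invented for this example; how the digest is then stored next to the data is left open, as in the answer.

```python
import hashlib


class HashingWriter:
    """Hypothetical wrapper: forwards writes to a file object and hashes them in passing."""

    def __init__(self, fileobj, algorithm="md5"):
        self._file = fileobj
        self._hash = hashlib.new(algorithm)

    def write(self, data):
        # The bytes are already in memory here, so hashing costs no extra I/O.
        self._hash.update(data)
        return self._file.write(data)

    def hexdigest(self):
        return self._hash.hexdigest()
```

A producer would open its output in binary mode, wrap it in `HashingWriter`, write as usual, and record `hexdigest()` when it finishes.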

5

answered 27 November 2019 at 22:11

There are a number of programs that compare a set of files to find identical ones. FDUPES is a good one. A million files shouldn't be a problem, depending on the exact nature of the input. I think that FDUPES requires Linux, but there are other such programs for other platforms.

I tried to write a faster program myself, but except for special cases, FDUPES was faster.

Anyway, the general idea is to start by checking the sizes of the files. Files that have different sizes can't be equal, so you only need to look at groups of files with the same size. Then it gets more complicated if you want optimal performance: If the files are likely to be different, you should compare small parts of the files, in the hope of finding differences early, so you don't have to read the rest of them. If the files are likely to be identical, though, it can be faster to read through each file to calculate a checksum, because then you can read sequentially from the disk instead of jumping back and forth between two or more files. (This assumes normal disks, so SSDs may be different.)

In my benchmarks when trying to make a faster program it (somewhat to my surprise) turned out to be faster to first read through each file to calculate a checksum, and then if the checksums were equal, compare the files directly by reading blocks alternately from each file, than to just read blocks alternately without the previous checksum calculations! It turned out that when calculating the checksums, Linux cached both files in main memory, reading each file sequentially, and the second reads were then very fast. When starting with alternating reads, the files were not (physically) read sequentially.

EDIT:

Some people have expressed surprise and even doubt that it could be faster to read the files twice than to read them just once. Perhaps I didn't manage to explain very clearly what I was doing. I am talking about cache pre-loading, in order to have the files in disk cache when later accessing them in a way that would be slow to do on the physical disk drive. Here is a web page where I have tried to explain more in detail, with pictures, C code and measurements.

However, this has (at best) marginal relevance to the original question.

3

answered 27 November 2019 at 22:11

Well, the optimal algorithm will depend on the number of duplicate files. Assuming that a few are the same but most are different, and the files are big:

Filter out the ones that are obviously not the same using a simple file length check.

Choose random bytes from the file, calculate a hash and compare (minimizing disk seeks).

Follow that with a full-file SHA1.
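The three stages above could look roughly like this in Python. This is a sketch under assumptions: the names `sample_digest`, `full_sha1` and `candidate_groups`, the sample count, the span length and the fixed seed are all invented for the example. A fixed seed is used so that every file of a given size is sampled at the same offsets, which is what makes the sample hashes comparable.

```python
import hashlib
import os
import random
from collections import defaultdict


def sample_digest(path, size, samples=16, span=64, seed=0):
    """Hash a few short spans at pseudo-random offsets (same offsets for equal sizes)."""
    rng = random.Random(seed)
    # Sorted offsets keep the reads moving forward, which limits disk seeks.
    offsets = sorted(rng.randrange(max(size - span, 0) + 1) for _ in range(samples))
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            h.update(f.read(span))
    return h.hexdigest()


def full_sha1(path, chunk_size=1 << 16):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def _split(paths, key):
    """Partition paths by key(path) and keep only groups with 2+ members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [g for g in groups.values() if len(g) > 1]


def candidate_groups(paths):
    """Return groups of paths that survive the size, sample and full-hash filters."""
    result = []
    for same_size in _split(paths, os.path.getsize):
        size = os.path.getsize(same_size[0])
        for same_sample in _split(same_size, lambda p: sample_digest(p, size)):
            result.extend(_split(same_sample, full_sha1))
    return result
```

Files that end up in the same group have equal SHA1 digests; anyone who needs proof rather than a matching hash would still follow up with a byte-by-byte comparison.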

2

answered 27 November 2019 at 22:11

Assuming that the expectation is that the files will be the same (it sounds like that's the scenario), then dealing with checksums/hashes is a waste of time - it's likely that they'll be the same and you'd have to re-read the files to get the final proof (I'm also assuming that since you want to "prove ... they are the same", that having them hash to the same value is not good enough).

If that's the case I think that the solution proposed by David is pretty close to what you'd need to do. A couple things that could be done to optimize the comparison, in increasing level of complexity:

check if the file sizes are the same before doing the compare
use the fastest memcmp() that you can (comparing words instead of bytes - most C runtimes should do this already)
use multiple threads to do the memory block compares (up to the number of processors available on the system; going over that would cause your threads to fight each other)
use overlapped/asynchronous I/O to keep the I/O channels as busy as possible, but also profile carefully so you thrash between the files as little as possible (if the files are divided among several different disks and I/O ports, all the better)

8

answered 27 November 2019 at 22:11

I don't think hashing is going to be faster than a byte-by-byte comparison. The byte-by-byte comparison can be optimized a bit by pipelining the reading and comparison of the bytes; multiple sections of the file could also be compared in parallel threads. It would go something like this:

Check whether the file sizes differ
Read blocks of the files into memory asynchronously
Hand them off to worker threads to do the comparisons

Or just run several cmp processes (or the equivalent for your OS) in parallel. This can be scripted easily, and you still get the benefit of parallelism.
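A rough Python analogue of running several cmp jobs in parallel, using only the standard library, is sketched below. The function name `all_identical` and the default of four workers are assumptions for this example. Every file is compared against the first one, which is enough because equality is transitive.

```python
import filecmp
from concurrent.futures import ThreadPoolExecutor


def all_identical(paths, max_workers=4):
    """Return True if every file has the same content as the first one."""
    if len(paths) < 2:
        return True
    reference, rest = paths[0], paths[1:]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # shallow=False forces a content comparison rather than trusting os.stat data.
        results = pool.map(lambda p: filecmp.cmp(reference, p, shallow=False), rest)
        return all(results)
```

Threads mostly help here by overlapping file reads; whether this beats a sequential loop depends on the disk and the file sizes, so it is worth measuring before relying on it.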

1

answered 27 November 2019 at 22:11

Most people in their answers are ignoring the fact that the files must be compared repeatedly. Thus checksums are faster, since the checksum is calculated once and stored in memory (instead of reading the files sequentially n times).

14

answered 27 November 2019 at 22:11

I have just written a C# app that does something similar to what you want. My code does the following.

Read the sizes of all the files into a list or array.

Use a for loop to check whether any of the file sizes are the same. If they are the same size, compare a byte of one file to a byte of the other file. If the two bytes are the same, move on to the next byte. If a difference is found, report that the files are different.

If the end of both files is reached and the final two bytes are the same, the files must be identical.

I have experimented with comparing MD5 hashes of the files rather than going through byte by byte, and I found that identical files are often missed with this method; however, it is significantly faster.

0

answered 27 November 2019 at 22:11
