Best clustering algorithm? (simply explained)

Question

The xcb libraries are split into several separate packages, so you need to use both the xcb and xcb-randr libraries explicitly:

... `pkg-config --cflags --libs xcb xcb-randr`

Your Linux distribution may package the randr library separately. Fedora, for example, ships both xcb and xcb-randr in the libxcb-devel subpackage, but your distribution may have a separate libxcb-randr-devel subpackage that you need to install.

19

algorithm text cluster-analysis data-mining text-mining

asked by Anony-Mousse on 19 May 2017 at 13:19

3 answers

I believe you need to make some design decisions about clustering, and continue from there:

Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate the cluster centers as the means of their document vectors, and iterate. This costs only K × (number of documents) similarity computations per iteration. I believe there are also heuristics for reducing the number of computations needed for hierarchical clustering.
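
The K-means loop described above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: it assumes the documents are already word-count vectors of equal length, uses squared Euclidean distance, and takes the first K documents as the initial centers. The names `kmeans` and `squared_distance` are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=10):
    """Cluster vectors into k groups; returns one cluster index per vector."""
    # Assumption: the first k documents serve as the initial centers.
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: k distance computations per document.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy word-count vectors for four documents (made-up data).
docs = [[5, 0, 1], [0, 4, 0], [4, 1, 0], [0, 5, 1]]
labels = kmeans(docs, k=2)
```

With this toy data, documents 0 and 2 end up in one cluster and documents 1 and 3 in the other. Real text would likely need better initialization and a length-normalized distance.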

1

answered 30 November 2019 at 04:40

What does the similar_text function called in Approach #1 look like? I think what you are describing is not clustering but a similarity metric. I can't really improve on White Walloun's histogram approach :-) It is an interesting problem that is worth reading up on.

However you implement check(), you have to use it to make at least 200 million comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but it seems too arbitrary to capture all the useful clustering of the texts.

My approach would be to change check() to return the similarity metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform the clustering and identify the nearest neighbors of each article, which you can then load into the related table. I would do the clustering in R: there is a good tutorial on clustering data in a file, running R from PHP.
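
One way to picture this step: score each pair of articles once, then keep the top few neighbors per article. Below is a minimal sketch in Python (the language the accepted answer uses for its examples). Here `similarity` stands in for a check() that returns a number instead of a boolean; the function names and the toy histograms are hypothetical.

```python
from itertools import combinations


def similarity(hist_a, hist_b):
    """Stand-in for a check() that returns a score instead of a boolean."""
    return sum(a * b for a, b in zip(hist_a, hist_b))


def nearest_neighbors(histograms, top_n=2):
    """Return, for each article index, its top_n most similar articles."""
    n = len(histograms)
    scores = [[0] * n for _ in range(n)]
    # Each unordered pair is scored once: n * (n - 1) / 2 comparisons.
    for i, j in combinations(range(n), 2):
        scores[i][j] = scores[j][i] = similarity(histograms[i], histograms[j])
    neighbors = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: scores[i][j], reverse=True)
        neighbors[i] = others[:top_n]
    return neighbors


histograms = [[3, 0, 1], [2, 0, 0], [0, 4, 1]]
neighbors = nearest_neighbors(histograms, top_n=1)

# Number of pairwise comparisons for 20,000 articles.
pair_count = 20000 * (20000 - 1) // 2
```

For 20,000 articles the pair loop would run 199,990,000 times, which matches the figure of roughly 200 million comparisons. A full 20K x 20K score matrix held as Python lists would be very large, so writing rows to a file as they are computed may be more practical.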

0

answered 30 November 2019 at 04:40

score 15 · Accepted Answer

The most standard way I know of to do this on text data like yours is to use the 'bag of words' technique.

First, create a 'histogram' of words for each article. Let's say that across all your articles you have only 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where each entry is the number of times the corresponding word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:

# article.text is assumed to be a list of words
for word in article.text:
    article.histogram[indexLookup[word]] += 1
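
A self-contained version of this step might look like the sketch below. It assumes articles are plain strings that can be lowercased and split on whitespace, which is a simplification (no punctuation handling or stemming). `build_index` and `build_histogram` are hypothetical helper names.

```python
def build_index(articles):
    """Map each unique word across all articles to a vector position."""
    index = {}
    for text in articles:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index)
    return index


def build_histogram(text, index):
    """Count how often each indexed word appears in one article."""
    histogram = [0] * len(index)
    for word in text.lower().split():
        histogram[index[word]] += 1
    return histogram


articles = ["she asked and he asked", "he answered"]
index = build_index(articles)
histograms = [build_histogram(text, index) for text in articles]
```

Here 'asked' gets position 1 in the index, so the first article's histogram holds a 2 at that position.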

Now, comparing any two articles is pretty straightforward. We simply take the dot product of the two vectors, that is, multiply them entry by entry and sum the results:

def check(articleA, articleB):
    rtn = 0
    # dot product of the two word-count vectors
    for a, b in zip(articleA.histogram, articleB.histogram):
        rtn += a * b
    # threshold is a cutoff you choose for 'related'
    return rtn > threshold

(Sorry for using Python instead of PHP; my PHP is rusty, and the use of zip makes that bit easier.)

This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.
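
One common way to normalize the dot product is cosine similarity, which divides by the product of the two vector lengths so that a long article does not score high simply because it contains more words. This is a suggested option, not something the answer itself prescribes, and the function name is hypothetical.

```python
import math


def cosine_similarity(hist_a, hist_b):
    """Dot product divided by the product of the vectors' Euclidean norms."""
    dot = sum(a * b for a, b in zip(hist_a, hist_b))
    norm_a = math.sqrt(sum(a * a for a in hist_a))
    norm_b = math.sqrt(sum(b * b for b in hist_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty article is treated as unrelated to everything
    return dot / (norm_a * norm_b)


short = [1, 2, 0]
long_version = [10, 20, 0]  # same word proportions, ten times the length
unrelated = [0, 0, 7]

same_topic = cosine_similarity(short, long_version)
different_topic = cosine_similarity(short, unrelated)
```

For word counts the score always lies between 0 and 1, so a threshold means the same thing regardless of article length. Two articles with the same word proportions score approximately 1, and articles sharing no words score 0.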

Also, you should not just put every word into your histogram. In general, you'll want to include only the words that are used semi-frequently: neither in every article nor in only one article. This saves you a bit of overhead on your histogram and makes the relations you find more meaningful.
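
A rough sketch of such a filter, based on how many articles each word appears in. The cutoffs `min_docs` and `max_fraction` are illustrative assumptions rather than recommended values, and `select_vocabulary` is a hypothetical name.

```python
def select_vocabulary(articles, min_docs=2, max_fraction=0.9):
    """Keep words that appear in at least min_docs articles but in no
    more than max_fraction of all articles."""
    doc_frequency = {}
    for text in articles:
        # set() so that a word counts once per article
        for word in set(text.lower().split()):
            doc_frequency[word] = doc_frequency.get(word, 0) + 1
    limit = max_fraction * len(articles)
    return sorted(
        word for word, count in doc_frequency.items()
        if min_docs <= count <= limit
    )


articles = [
    "the cat sat",
    "the cat ran",
    "the dog ran",
    "the bird sang",
]
vocabulary = select_vocabulary(articles)
```

In this toy corpus 'the' is dropped because it appears in every article, and words such as 'dog' are dropped because they appear in only one, leaving 'cat' and 'ran'.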

By the way, this technique is described in more detail here