Count word occurrences from a file in bash

Question

Sorry for the newbie question, but I'm new to bash programming (I started a few days ago). Basically, I want to keep one file with all the word occurrences from another file.

I know I can do this:

sort | uniq -c | sort
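
For a single file, that pipeline could be spelled out roughly as follows. This is only a sketch: the function name count_words is made up for illustration, and the definition of a word (letters and apostrophes, lowercased) is borrowed from the script further down.

```shell
#!/bin/bash
# count_words FILE (hypothetical helper name)
# Prints "count word" lines for FILE, most frequent word first.
count_words() {
    tr -cs "A-Za-z'" '\n' < "$1" |   # every run of other characters becomes one newline
        tr 'A-Z' 'a-z' |             # fold to lowercase so "The" and "the" match
        grep -v '^$' |               # drop the empty line left by leading punctuation
        sort | uniq -c | sort -rn    # group equal words, count them, sort by count
}
```

Called as count_words page.txt (page.txt being a placeholder name), it handles one file at a time and does not merge its result with earlier counts.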

The thing is that after this I want to take a second file, count the occurrences again and update the first file. Then I take a third file, and so on.

What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.

I'm pretty sure there is a very efficient way to do this with a single command or so, using uniq, but I can't figure it out.

Could you please point me in the right direction?

I'm also pasting the code I wrote:

#!/bin/bash
#   counts the word occurrences in a file and writes them to another file  #
#   the words are listed from the most frequent to the least frequent      #

touch .check                # used to check the occurrences. Temporary file
touch distribution.txt      # final file with all the occurrences calculated

page=$1             # contains the file I'm calculating
occurrences=$2          # temporary file for the occurrences

# takes all the words from the file $page, one per line, in lowercase
cat "$page" | tr -cs "A-Za-z'" '\n' | tr 'A-Z' 'a-z' > .check

# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read -r words
do
    word=${words}       # word I'm calculating
    strlen=${#word}     # word's length
    # I use a blacklist to skip banned words (for example very short ones or insignificant words, like articles and prepositions)
    if ! grep -Fxq "$word" .blacklist && [ "$strlen" -gt 2 ]
    then
        # if the word was never found before it writes it with 1 occurrence
        if [ "$(grep -E -c -i "^$word: " "$occurrences")" -eq 0 ]
        then
            echo "$word: 1" >> "$occurrences"
        # else it calculates the occurrences
        else
            old=$(awk -v words="$word" -F": " '$1==words { print $2 }' "$occurrences")
            new=$((old + 1))
            sed -i "s/^$word: $old$/$word: $new/g" "$occurrences"
        fi
    fi
done

rm .check

# finally it orders the words
awk -F": " '{print $2" "$1}' "$occurrences" | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt
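
One way the per-word grep/awk/sed loop above might be collapsed into a single pass is to let awk hold all the counts in an associative array. The sketch below is an assumption about how that could look, not a tested replacement for the script: the function name update_counts and its argument order are invented for illustration, and it swaps the line-by-line sed update for rewriting the whole occurrences file once per run. It keeps the same "word: count" format, the same blacklist idea and the same minimum length of 3.

```shell
#!/bin/bash
# update_counts OCCURRENCES BLACKLIST PAGE...   (hypothetical helper name)
# Adds the word counts of every PAGE to OCCURRENCES ("word: count" lines)
# and leaves OCCURRENCES sorted by count, highest first.
update_counts() {
    local occurrences=$1 blacklist=$2 tmp
    shift 2
    touch "$occurrences"              # the first run starts from an empty file
    tmp=$(mktemp) || return 1

    cat -- "$@" | tr -cs "A-Za-z'" '\n' | tr 'A-Z' 'a-z' |
    awk -v occ="$occurrences" -v bl="$blacklist" '
        BEGIN {
            # load the banned words, then the counts saved by earlier runs
            while ((getline line < bl) > 0) banned[line] = 1
            while ((getline line < occ) > 0)
                if (split(line, f, ": ") == 2) count[f[1]] = f[2] + 0
        }
        # same filter as the original loop: longer than 2 and not banned
        length($0) > 2 && !($0 in banned) { count[$0]++ }
        END { for (w in count) print count[w], w }
    ' | sort -k1,1nr -k2,2 | awk '{ print $2 ": " $1 }' > "$tmp" &&
        mv "$tmp" "$occurrences"
}
```

With this shape, each page is read once and the occurrences file is read and written once per call, for example update_counts occurrences.txt .blacklist page1.txt and later update_counts occurrences.txt .blacklist page2.txt (all file names here are placeholders).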

bash shell uniq linux

asked by slm on 2 November 2013 at 06:05

0 answers
