Why does reading a file into memory take 4x the memory in Java?

Question

Why does reading a file into memory take 4x the memory in Java?

C++ 11

//may return 0 when not able to detect
unsigned concurentThreadsSupported = std::thread::hardware_concurrency();

Reference: std::thread::hardware_concurrency

In C++ prior to C++11, there is no portable way. Instead, you will need to use one or more of the following methods (guarded by the appropriate #ifdef lines):

Win32

SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);
int numCPU = sysinfo.dwNumberOfProcessors;

Linux, Solaris, AIX and Mac OS X >= 10.4 (i.e. Tiger onwards)
```
int numCPU = sysconf(_SC_NPROCESSORS_ONLN);
```

FreeBSD, MacOS X, NetBSD, OpenBSD, etc.

int mib[4];
int numCPU;
std::size_t len = sizeof(numCPU); 

/* set the mib for hw.ncpu */
mib[0] = CTL_HW;
mib[1] = HW_AVAILCPU;  // alternatively, try HW_NCPU;

/* get the number of CPUs from the system */
sysctl(mib, 2, &numCPU, &len, NULL, 0);

if (numCPU < 1) 
{
    mib[1] = HW_NCPU;
    sysctl(mib, 2, &numCPU, &len, NULL, 0);
    if (numCPU < 1)
        numCPU = 1;
}

HPUX

int numCPU = mpctl(MPC_GETNUMSPUS, NULL, NULL);

IRIX
```
int numCPU = sysconf(_SC_NPROC_ONLN);
```

Objective-C (Mac OS X >= 10.5 or iOS)

NSUInteger a = [[NSProcessInfo processInfo] processorCount];
NSUInteger b = [[NSProcessInfo processInfo] activeProcessorCount];

11

java performance memory file file-io

asked by Carl Manaster, 6 July 2009 at 21:51

9 answers

This may be related to how StringBuffer resizes when it reaches its capacity. Resizing creates a new char[] twice the size of the previous one and then copies the contents into the new array. Together with what has already been said about Java chars being stored as 2 bytes, this will definitely add to the memory usage.

To solve this, you could create the StringBuffer with sufficient capacity from the start, given that you know the file size (and therefore the approximate number of characters to read). Keep in mind, however, that an array allocation will also happen if you then convert this large StringBuffer to a String.

Another point: you should usually prefer StringBuilder over StringBuffer, since operations on it are faster.

You could consider implementing your own "CharBuffer", using for example a LinkedList of char[], to avoid the expensive array allocate/copy operations. You could make this class implement CharSequence and perhaps avoid converting to a String altogether. One more suggestion for a more compact representation: if you are reading English text containing many repeated words, you could read and store each word using String.intern() to significantly reduce the memory footprint.
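A minimal sketch of the presizing idea, assuming the file uses a one-byte-per-character encoding (ISO-8859-1 here) so that the file length in bytes is a good estimate of the character count. The class name `PresizedReader` and the method `readAll` are hypothetical, chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PresizedReader {
    // Reads a whole file into a StringBuilder whose capacity is set up front,
    // so the internal array does not have to grow while appending.
    static StringBuilder readAll(Path file) throws IOException {
        long bytes = Files.size(file);
        if (bytes > Integer.MAX_VALUE - 8) {
            throw new IOException("File too large for one StringBuilder: " + bytes);
        }
        // Assumption: one byte per character, so bytes is close to the char count.
        StringBuilder sb = new StringBuilder((int) bytes);
        char[] chunk = new char[8192];
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb;
    }
}
```

Returning the builder rather than a String leaves it to the caller to decide whether the extra copy made by toString() is affordable.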
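The chunked "CharBuffer" idea could look roughly like the sketch below. It is an illustration only: the class name `ChunkedCharBuffer` and the chunk size are made up, and it uses an ArrayList of fixed-size char[] chunks instead of a LinkedList so that charAt stays cheap. Appending never copies existing chunks; only the small list of references grows.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedCharBuffer implements CharSequence {
    private static final int CHUNK = 1 << 16; // chars per chunk (arbitrary choice)
    private final List<char[]> chunks = new ArrayList<>();
    private int length = 0;

    // Appends len chars from src starting at off, allocating a new chunk
    // whenever the last one is full. Earlier chunks are left untouched.
    void append(char[] src, int off, int len) {
        while (len > 0) {
            int used = length % CHUNK;
            if (used == 0) {
                chunks.add(new char[CHUNK]); // all existing chunks are full
            }
            int n = Math.min(len, CHUNK - used);
            System.arraycopy(src, off, chunks.get(chunks.size() - 1), used, n);
            off += n;
            len -= n;
            length += n;
        }
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return chunks.get(index / CHUNK)[index % CHUNK];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    // Note: this makes a full copy, which is exactly what the class is
    // meant to help callers avoid; prefer charAt/subSequence where possible.
    @Override
    public String toString() {
        return subSequence(0, length).toString();
    }
}
```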


21

answered 3 December 2019 at 00:45

At the last insert into the StringBuffer, you need three times the memory allocated, because the StringBuffer always expands to (size + 1) * 2 (and the size is already doubled because of Unicode). So a 400MB file could require an allocation of 800MB * 3 == 2.4GB at the end of the inserts. It may be somewhat less, depending on exactly when the threshold is reached.

The suggestion to concatenate Strings rather than using a Buffer or Builder is in order here. There will be a lot of garbage collection and object creation (so it will be slow), but a much lower memory footprint.

[At Michael's prompting, I investigated this further, and concat wouldn't help here, as it copies the char buffer, so while it wouldn't require triple, it would require double the memory at the end.]

You could continue to use the Buffer (or better yet Builder in this case) if you know the maximum size of the file and initialize the size of the Buffer on creation and you are sure this method will only get called from one thread at a time.

But really such an approach of loading such a large file into memory at once should only be done as a last resort.

1

answered 3 December 2019 at 00:45

To begin with, Java strings are UTF-16 (i.e. 2 bytes per character), so assuming your input file is ASCII or a similar one-byte-per-character format, holder will be ~2x the size of the input data, plus the extra \r\n per line and any additional overhead. That's ~800MB straight away, assuming a very low storage overhead in StringBuffer.

I could also believe that the contents of your file are buffered twice: once at the I/O level and once in the BufferedReader.

However, to know for sure, it's probably best to look at what's actually on the heap - use a tool like HPROF to see exactly where your memory has gone.

In terms of solving this, I suggest you process a line at a time, writing out each line after you have added the line termination. That way your memory usage should be proportional to the length of a line, instead of the entire file.
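A minimal sketch of the line-at-a-time approach. The original method from the question is not shown here, so the class name `LineCopier`, the method `copyWithCrLf`, and the choice of Reader/Writer parameters are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

final class LineCopier {
    // Copies text from in to out one line at a time, ending every line
    // with "\r\n". Only the current line is held in memory.
    static void copyWithCrLf(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line);
            out.write("\r\n");
        }
        out.flush();
    }
}
```

Note that readLine() strips the original terminator, so a final line without a trailing newline also ends up terminated with "\r\n" in the output.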

13

answered 3 December 2019 at 00:45

You have a number of problems here:

Unicode: characters take twice as much space in memory as on disk (assuming a 1 byte encoding)
StringBuffer resizing: could double (permanently) and triple (temporarily) the occupied memory, though this is the worst case
StringBuffer.toString() temporarily doubles the occupied memory since it makes a copy

All of these combined mean that you could require temporarily up to 8 times your file's size in RAM, i.e. 3.2G for a 400M file. Even if your machine physically has that much RAM, it has to be running a 64bit OS and JVM to actually get that much heap for the JVM.

All in all, it's simply a horrible idea to keep such a huge String in memory, and it's totally unnecessary as well: since your method returns an InputStream, all you really need is a FilterInputStream that adds the line breaks on the fly.

11

answered 3 December 2019 at 00:45

It's the StringBuffer. The empty constructor creates a StringBuffer with an initial capacity of 16 characters. If you then append something and the capacity is not sufficient, it does an arraycopy of the internal char array into a new, larger buffer.

So in fact, with each line appended the StringBuffer has to create a copy of the complete internal Array which nearly doubles the required memory when appending the last line. Together with the UTF-16 representation this results in the observed memory demand.

Edit

Michael is right that the internal buffer is not grown in small increments: it roughly doubles in size each time you need more memory. But still, in the worst case, say the buffer needs to expand its capacity on the very last append, it creates a new array twice the size of the current one, so for a moment you need roughly three times the amount of memory.
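The growth behaviour is easy to observe with capacity(). The sketch below (the class name `CapacityGrowthDemo` is hypothetical) records every distinct capacity a default-constructed StringBuffer reports while characters are appended one by one. The exact step sizes are an implementation detail of the JDK in use; the documentation for ensureCapacity describes the new capacity as the larger of the requested minimum and twice the old capacity plus 2.

```java
import java.util.ArrayList;
import java.util.List;

final class CapacityGrowthDemo {
    // Appends totalChars characters one at a time and returns each distinct
    // capacity observed, starting with the initial one.
    static List<Integer> capacitiesWhileAppending(int totalChars) {
        StringBuffer sb = new StringBuffer(); // documented default capacity: 16 chars
        List<Integer> seen = new ArrayList<>();
        seen.add(sb.capacity());
        for (int i = 0; i < totalChars; i++) {
            sb.append('x');
            int cap = sb.capacity();
            if (cap != seen.get(seen.size() - 1)) {
                seen.add(cap); // a resize happened: old and new arrays coexisted briefly
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(capacitiesWhileAppending(1000));
    }
}
```

Each entry after the first marks a moment when both the old array and the new, roughly twice as large, array were alive at once, which is where the "three times" figure comes from.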

Anyway, I've learned the lesson: StringBuffer (and Builder) may cause unexpected OutOfMemory errors and I'll always initialize it with a size, at least when I have to store large Strings. Thanks for the question :)

4

answered 3 December 2019 at 00:45

I would suggest using the OS file cache instead of copying the data into Java memory via chars and back to bytes. If you re-read the file as needed (perhaps transforming it as you go), it will be faster and most likely simpler.

You need more than 2 GB because 1-byte letters are held as chars (2 bytes) in memory, and when your StringBuffer resizes you need double that (to copy the old array into the new, larger array). The new array is typically 50% larger, so you need up to 6x the original file size. As if performance weren't bad enough, you are using StringBuffer instead of StringBuilder, which synchronizes every call when that is clearly not needed. (This only slows things down; it uses the same amount of memory.)

1

answered 3 December 2019 at 00:45

I also recommend checking out Commons IO FileUtils class for this. Specifically: org.apache.commons.io.FileUtils#readFileToString. You can also specify the encoding if you know you only are using ASCII.

0

answered 3 December 2019 at 00:45

Others have explained why you are running out of memory. As for solving the problem, I would suggest writing your own FilterInputStream subclass. This class would read one line at a time, append the "\r\n" characters, and buffer the result. Once the line has been consumed by the reader of your FilterInputStream, you read the next line. That way you only ever hold one line in memory.
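A rough sketch of such a stream follows. It departs from the suggestion in one respect, stated plainly: it extends InputStream directly rather than FilterInputStream, because FilterInputStream forwards skip(), available() and mark() to the wrapped stream, which would bypass the line buffer. The class name `CrLfInputStream` and the explicit charset parameter are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class CrLfInputStream extends InputStream {
    private final BufferedReader lines;
    private final Charset charset;
    private byte[] current = new byte[0]; // encoded bytes of the line being served
    private int pos = 0;                  // next unread index in current

    CrLfInputStream(InputStream in, Charset charset) {
        this.charset = charset;
        this.lines = new BufferedReader(new InputStreamReader(in, charset));
    }

    // Ensures current has unread bytes, loading the next line with "\r\n"
    // appended when needed. Returns false at end of input.
    private boolean fill() throws IOException {
        if (pos < current.length) {
            return true;
        }
        String line = lines.readLine();
        if (line == null) {
            return false;
        }
        current = (line + "\r\n").getBytes(charset);
        pos = 0;
        return true;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        // Serve at most the remainder of the current line per call.
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, b, off, n);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        lines.close(); // also closes the wrapped stream
    }
}
```

At any time the stream holds one decoded line plus the BufferedReader's internal buffer, so memory use depends on the longest line rather than on the file size.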

1

answered 3 December 2019 at 00:45
