Reading a gz file and keeping track of the position in the file

So, here is the situation:

I have to read big .gz archives (several GB) and "index" them so that I can later retrieve specific pieces using random access. In other words, I want to read the archive line by line and get the location in the file of each line, so that I can jump directly to those locations on request. (PS: it's UTF-8, so we cannot assume 1 byte == 1 char.)

So, basically, all I need is a BufferedReader that keeps track of its location in the file. However, this doesn't seem to exist.

Is there anything available or do I have to roll my own?

A few additional comments:

  • I cannot use BufferedReader directly, since the position of the underlying stream reflects what has been buffered so far: a multiple of the internal buffer size rather than the position of the current line.
  • I cannot use InputStreamReader directly for performance reasons. Unbuffered reading would be way too slow and, by the way, it lacks convenience methods for reading lines.
  • I cannot use RandomAccessFile since 1. the file is gzipped, and 2. RandomAccessFile's string methods (readUTF/writeUTF) use "modified" UTF-8.

I guess the best option would be some kind of buffered reader that keeps track of both the file location and the buffer offset, but this sounds quite cumbersome. Maybe I missed something, though. Perhaps something already exists that reads files line by line and keeps track of the location (even if the file is gzipped).
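To make the idea concrete, here is a minimal sketch of what I have in mind, not a tested solution. The class name OffsetLineReader and its methods are made up for illustration. It assumes that offsets into the decompressed byte stream are good enough, and that lines end in \n or \r\n. It buffers at the byte level and scans for the newline byte itself, which should be safe because the bytes 0x0A and 0x0D never occur inside a multi-byte UTF-8 sequence.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads UTF-8 lines and remembers the byte offset where each line starts. */
final class OffsetLineReader {
    private final InputStream in;
    private long position = 0;   // bytes consumed so far from the (decompressed) stream
    private long lineStart = 0;  // offset of the first byte of the last line returned

    OffsetLineReader(InputStream in) {
        // Buffer bytes, not chars, so the counter below sees every byte exactly once.
        this.in = new BufferedInputStream(in, 1 << 16);
    }

    /** Returns the next line without its terminator, or null at end of stream. */
    String readLine() throws IOException {
        lineStart = position;
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            position++;
            if (b == '\n') {
                return decode(line);
            }
            line.write(b);
        }
        // End of stream: return a final unterminated line if there is one.
        return line.size() > 0 ? decode(line) : null;
    }

    /** Byte offset, in the decompressed stream, of the line last returned by readLine(). */
    long getLineStart() {
        return lineStart;
    }

    private static String decode(ByteArrayOutputStream line) {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the CR of a CRLF terminator
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }
}
```

Usage would be something like new OffsetLineReader(new GZIPInputStream(new FileInputStream(path))), calling readLine() and then getLineStart() for each line. The caveat, as far as I can tell, is that these are offsets into the uncompressed data, so jumping to one later still means reopening the GZIPInputStream and skipping forward from the start.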

Thanks for tips,

Arnaud

asked by Tim Cooper, 6 March 2011 at 13:19