Strategies to handle a file with multiple fixed formats

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).

I have to deal with files where multiple formats exist to hierarchically break down the data into meaningful sections. What I'd like to be able to do is parse the file data into a suitable data structure.

Here's an example (commentary on RHS):

                                       # | Format | Level | Comment
                                       # +--------+-------+---------
**DEVICE 109523.69142                  #        1       1   file-specific
  .981    561A                         #        2       1
10/MAY/2010    24.15.30,13.45.03       #        3       2   group of records
05:03:01   AB23X  15.67   101325.72    #        4       3   part of single record
*           14  31.30474 13        0   #        5       3   part of single record
05:03:15   CR22X  16.72   101325.42    #        4       3   new record
*           14  29.16264 11        0   #        5       3
06:23:51   AW41X  15.67    101323.9    #        4       3
*           14  31.26493219        0   #        5       3
11/MAY/2010    24.07.13,13.44.63       #        3       2   group of new records
15:57:14   AB23X  15.67   101327.23    #        4       3   part of single record
*           14  31.30474 13        0   #        5       3   part of single record
15:59:59   CR22X  16.72   101331.88    #        4       3   new record
*           14  29.16264 11        0   #        5

The logic I have at the moment is fragile:

  • I know, for instance, that a Format 2 always comes after a Format 1, and that they only span 2 lines.
  • I also know that Formats 4 and 5 always come in pairs as they correspond to a single record. The number of records may be variable
  • I'm using regular expressions to infer the format of each line. However, this is risky and does not lend to flexibility in the future (when someone decides to change the format of the output).

The big question here is about what strategies I can employ to determine which format needs to be used for which line. I'd be interested to know if others have faced similar situations and what they've done to address it.

6
задан user151019 11 May 2012 в 20:48
поделиться