A question came up on the JavaCC users' list about parsing binary data with JavaCC. In response I posted a little example grammar that parses the header section of a DOOM map data file (i.e., a WAD file). There's really not much to it; here's the lexical spec:
TOKEN : {
    <IWAD : "IWAD">
  | <PWAD : "PWAD">
  | <LONG : (["\u0000"-"\u00FF"]){4}>
}
And the syntactic spec:
void Header() : {
    Token lumpCount = null;
    Token offSet = null;
} {
    (<IWAD> | <PWAD>)
    lumpCount=<LONG>
    { System.out.println("Lumps in this file: " + littleEndianFourByteStringToInt(lumpCount.image)); }
    offSet=<LONG>
    { System.out.println("Byte offset of body: " + littleEndianFourByteStringToInt(offSet.image)); }
}
Here's the utility function to decode those little endian ints to Java ints:
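static int littleEndianFourByteStringToInt(String s) {
    int accum = 0;
    // The first char is the least significant byte, so the shift grows by 8 bits per char.
    for (int shiftBy = 0, counter = 0; shiftBy < 32; shiftBy += 8, counter++) {
        char c = s.charAt(counter);
        // Keep only the low 8 bits of the char, then move them into position.
        int byteValue = (c & 0xFF) << shiftBy;
        accum |= byteValue;
    }
    return accum;
}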
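To try the decoding outside of a generated parser, here is a minimal standalone sketch. It assumes the file's bytes reach the lexer as chars in the \u0000-\u00FF range, which is what decoding with ISO-8859-1 gives you, since that charset maps each byte to the char with the same value. The class name WadHeaderDemo, the helper sampleHeader, and the header values are made up for illustration; the result is cross-checked against java.nio.ByteBuffer in little-endian mode.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class WadHeaderDemo {

    // Same decoding logic as the utility function above.
    static int littleEndianFourByteStringToInt(String s) {
        int accum = 0;
        for (int shiftBy = 0, counter = 0; shiftBy < 32; shiftBy += 8, counter++) {
            accum |= (s.charAt(counter) & 0xFF) << shiftBy;
        }
        return accum;
    }

    // Hypothetical helper: builds a 12-byte header from a 4-character ASCII tag
    // and two little-endian 32-bit ints.
    static byte[] sampleHeader(String tag, int lumpCount, int offset) {
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .put(tag.getBytes(StandardCharsets.US_ASCII))
                .putInt(lumpCount)
                .putInt(offset)
                .array();
    }

    public static void main(String[] args) {
        // Made-up values, not taken from a real WAD file.
        byte[] raw = sampleHeader("IWAD", 200, 70000);

        // ISO-8859-1 turns each byte into the char with the same numeric value,
        // so the 4-char token images line up with the original bytes.
        String asChars = new String(raw, StandardCharsets.ISO_8859_1);

        int lumps = littleEndianFourByteStringToInt(asChars.substring(4, 8));
        int offset = littleEndianFourByteStringToInt(asChars.substring(8, 12));

        // Cross-check the first int against ByteBuffer's own little-endian read.
        int check = ByteBuffer.wrap(raw, 4, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();

        System.out.println("Lumps: " + lumps + " (ByteBuffer says " + check + ")");
        System.out.println("Offset: " + offset);
    }
}
```

Note that the result is a signed Java int, so a four-byte value with the high bit set comes back negative.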
Most of the above bit shifting stuff is based on this helpful page, and some notes on the DOOM map file format are here. Hm. This sort of thing may be a good late addition to my JavaCC book.
Do you have an example of parsing a text file into objects, where the text file contains different sections (each section has its own structure)? I ask because I can't figure out the "best way" to parse a text file where the sections are separated as such:
~A
~B
~C
~D
Thought you might have some insight on this.
Posted by: Jason | May 09, 2007 at 11:48 AM
Hi Jason - Hm, are there always the same four sections, and are the token types the same for all four sections? Can you do something like:
void File() : {} {
    A() B() C() D()
}
void A() : {} {
    "A" Other() Things() In() Here()
}
Posted by: tomcopeland | May 09, 2007 at 01:41 PM
The sections are always the same (though the content differs from one section to another), but they are not always present and not always in the same order, except for the last. ~ followed by a letter marks the start of a section and identifies it (~V might stand for ~Version, but only the ~V is required). ~D must always appear last in the file; the other sections may or may not be present. An example would be:
~V
~A
~P
~D
Posted by: Jason | May 09, 2007 at 02:15 PM
Hi Jason - Cool, but do all the sections consist of the same tokens? I mean, are they all generally the same sort of data, but with a different structure in each section? Feel free to email me offline at tom@infoether.com....
Posted by: tomcopeland | May 09, 2007 at 06:03 PM