Tom Copeland's Recent Posts

Parsing binary data with JavaCC

A question came up on the JavaCC users' list about parsing binary data with JavaCC. In response I posted a little example grammar that parses the header section of a DOOM map data file (i.e., a WAD file). There's really not much to it; here's the lexical spec:

TOKEN : {
  <IWAD : "IWAD">
  | <PWAD : "PWAD">
  | <LONG : (["\u0000"-"\u00FF"]){4}>
}

And the syntactic spec:

void Header() : {
  Token lumpCount=null;
  Token offSet=null;
} {
  (<IWAD> | <PWAD>)
  lumpCount=<LONG>
    { System.out.println("Lumps in this file: " + littleEndianFourByteStringToInt(lumpCount.image)); }
  offSet=<LONG> 
    { System.out.println("Byte offset of body: " + littleEndianFourByteStringToInt(offSet.image)); }
}
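
One thing the grammar doesn't show is how the bytes reach the lexer. JavaCC tokenizes chars, not bytes, so the `<LONG>` token only lines up with four bytes if every byte in the file is decoded to exactly one char in the "\u0000"-"\u00FF" range. Here's a rough sketch of that point, assuming the input is decoded as ISO-8859-1. The class name, the helper names and the sample values are all made up for illustration; they aren't from the original grammar or from a real WAD file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class HeaderBytesDemo {

  // Build a made-up 12-byte header: 4-byte magic, lump count, body offset.
  static byte[] sampleHeader(int lumpCount, int offset) {
    return ByteBuffer.allocate(12)
        .order(ByteOrder.LITTLE_ENDIAN)
        .put("IWAD".getBytes(StandardCharsets.US_ASCII))
        .putInt(lumpCount)
        .putInt(offset)
        .array();
  }

  // True only if the decoded string has one char per byte, each equal
  // to that byte's unsigned value (so it fits "\u0000"-"\u00FF").
  static boolean oneCharPerByte(byte[] bytes, String decoded) {
    if (decoded.length() != bytes.length) {
      return false;
    }
    for (int i = 0; i < bytes.length; i++) {
      if (decoded.charAt(i) != (bytes[i] & 0xFF)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] header = sampleHeader(240, 4096); // hypothetical values
    String latin1 = new String(header, StandardCharsets.ISO_8859_1);
    String utf8 = new String(header, StandardCharsets.UTF_8);
    System.out.println("ISO-8859-1 keeps bytes intact: " + oneCharPerByte(header, latin1));
    System.out.println("UTF-8 keeps bytes intact: " + oneCharPerByte(header, utf8));
  }
}
```

With a multi-byte charset like UTF-8, any byte above 0x7F can be swallowed or replaced during decoding, and the four-char `<LONG>` token no longer corresponds to four bytes of the file.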

Here's the utility function to decode those little endian ints to Java ints:

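  // Decodes a four-char token image as a little endian 32-bit int.
  // Each char is expected to hold one byte value (0x00-0xFF).
  static int littleEndianFourByteStringToInt(String s) {
    int accum = 0;
    // The first char is the least significant byte, so shift by 0, 8, 16, 24.
    for (int shiftBy = 0, counter = 0; shiftBy < 32; shiftBy += 8, counter++) {
      char c = s.charAt(counter);
      int byteValue = (c & 0xFF) << shiftBy;
      accum |= byteValue;
    }
    return accum;
  }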
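
To sanity check the shifting, here's a small standalone sketch that runs the same logic on a hand-built value and compares it with what `java.nio.ByteBuffer` gives in little endian mode. The class name and the sample bytes are invented for the example, and it assumes the token image was decoded one char per byte (ISO-8859-1 here).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class LittleEndianDemo {

  // Same logic as the utility function above.
  static int littleEndianFourByteStringToInt(String s) {
    int accum = 0;
    for (int shiftBy = 0, counter = 0; shiftBy < 32; shiftBy += 8, counter++) {
      accum |= (s.charAt(counter) & 0xFF) << shiftBy;
    }
    return accum;
  }

  // Cross-check that reads the same four bytes with the standard library.
  static int viaByteBuffer(byte[] fourBytes) {
    return ByteBuffer.wrap(fourBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  public static void main(String[] args) {
    // 1000 is 0x000003E8, so little endian byte order is E8 03 00 00.
    byte[] raw = { (byte) 0xE8, 0x03, 0x00, 0x00 };
    String image = new String(raw, StandardCharsets.ISO_8859_1);
    System.out.println(littleEndianFourByteStringToInt(image)); // prints 1000
    System.out.println(viaByteBuffer(raw));                     // prints 1000
  }
}
```

If you already have the raw bytes on hand, the `ByteBuffer` route is probably simpler; the string version is only needed because the parser hands you a `Token` image.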

Most of the above bit shifting stuff is based on this helpful page, and some notes on the DOOM map file format are here.  Hm.  This sort of thing may be a good late addition to my JavaCC book.

Comments

Do you have an example of parsing a text file into objects, where the text file contains different sections (each section has its own structure)? I ask because I can't figure out the "best way" to parse a text file where the sections are separated like this:

~A
~B
~C
~D

Thought you might have some insight on this.

Hi Jason - Hm, are there always the same four sections, and are the token types the same for all four sections? Can you do something like:

void File() : {} {
A() B() C() D()
}

void A() : {} {
"A" Other() Things() In() Here()
}

The sections are always the same (though the content differs from one section to another), but they are not always present and not always in the same order, except for the last.

~ plus a letter marks the start of a section and says which section it is (~V might stand for ~Version, but the only required part is ~V). ~D must always appear last in the file, but the other sections may or may not be present. An example would be:

~V
~A
~P
~D

Hi Jason - Cool, but do all the sections consist of the same tokens? I mean, are they all generally the same sort of data, but with a different structure in each section? Feel free to email me offline at tom@infoether.com....
