How do you debug a binary format?
I would like to be able to debug a binary format builder. Right now I am basically printing out the input data to the binary parser, then going deep into the code and printing out the mapping from input to output, and finally taking the output mapping (integers) and using it to locate the corresponding integers in the binary. That's pretty clunky, and it requires deep modifications to the source code just to get at the mapping between input and output.
It seems like you could view the binary in different variants (in my case I'd like to view it in 8-bit chunks as decimal numbers, because that's pretty close to the input). Actually, some numbers are 16-bit, some 8-bit, some 32-bit, etc. So maybe there would be a way to view the binary with each of these differently sized numbers highlighted in some way.
The only way I can see that being possible is to build a visualizer specific to the actual binary format/layout, so that it knows where in the sequence the 32-bit numbers should be, where the 8-bit numbers should be, and so on. That is a lot of work and tricky in some situations, so I am wondering if there's a general way to do it.
I am also wondering how this kind of thing is generally debugged today, so maybe I can get some ideas on what to try from that.
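For what it's worth, the multi-width views described above can be produced with a few lines of script. Here is a sketch using Python's standard struct module; the sample buffer and little-endian byte order are made up for illustration:

```python
import struct

data = bytes([1, 0, 2, 0, 0, 0, 0, 3])  # hypothetical sample buffer

# View the same bytes at different widths (little-endian assumed).
as_u8 = list(data)
as_u16 = [v for (v,) in struct.iter_unpack("<H", data)]
as_u32 = [v for (v,) in struct.iter_unpack("<I", data)]

print("u8: ", as_u8)   # [1, 0, 2, 0, 0, 0, 0, 3]
print("u16:", as_u16)  # [1, 2, 0, 768]
print("u32:", as_u32)  # [131073, 50331648]
```

Pointing the same idea at a real file is just a matter of reading its bytes and picking the offsets and widths to inspect.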
Tags: debugging, binary
You got one answer saying "use the hexdump directly, and do this and that additionally", and that answer got a lot of upvotes. And a second answer, 5 hours later(!), saying only "use a hexdump". Then you accepted the second one instead of the first? Seriously?
– Doc Brown
Jan 16 at 22:29
While you might have a good reason to use a binary format, do consider whether you can just use an existing text format like JSON instead. Human readability counts a lot, and machines and networks are typically fast enough that using a custom format to reduce size is unnecessary nowadays.
– jpmc26
Jan 16 at 23:45
@jpmc26 there's still a lot of use for binary formats and always will be. Human readability is usually secondary to performance, storage requirements, and network performance. And there are still many areas where network performance in particular is poor and storage is limited. Plus don't forget all the systems having to interface with legacy systems (both hardware and software) and having to support their data formats.
– jwenting
Jan 17 at 10:15
@jwenting No, actually, developer time is usually the most expensive piece of an application. Sure, that may not be the case if you're working at Google or Facebook, but most apps don't operate at that scale. And when developer time is your most expensive resource, human readability counts for a lot more than the 100 extra milliseconds the program takes to parse it.
– jpmc26
Jan 17 at 13:09
@jpmc26 I don't see anything in the question that suggests to me that the OP is the one defining the format.
– JimmyJames
Jan 17 at 15:40
asked Jan 16 at 15:07 by Lance Pollard; edited Jan 28 at 18:26 by Peter Mortensen
5 Answers
For ad hoc checks, just use a standard hexdump and learn to eyeball it.
If you want to tool up for a proper investigation, I usually write a separate decoder in something like Python. Ideally this is driven directly from a message spec document or IDL and is as automated as possible (so there's no chance of manually introducing the same bug into both decoders).
Lastly, don't forget to write unit tests for your decoder, using known-correct canned input.
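To illustrate the unit-test suggestion, here is a minimal sketch of an independent decoder plus a test against known-correct canned input. The format and the names (decode_header, the field layout) are invented for the example:

```python
import struct

def decode_header(buf: bytes) -> dict:
    """Hypothetical independent decoder for a toy format:
    u16 version, u8 flags, u32 payload_len (all little-endian)."""
    version, flags, payload_len = struct.unpack_from("<HBI", buf, 0)
    return {"version": version, "flags": flags, "payload_len": payload_len}

def test_decode_header():
    # Known-correct canned input, written out by hand from the spec.
    canned = struct.pack("<HBI", 2, 0x01, 1024)
    assert decode_header(canned) == {"version": 2, "flags": 1, "payload_len": 1024}

test_decode_header()
```

The canned bytes are produced here with struct.pack for brevity; in a real suite you would hand-write the byte literal from the spec so the test cannot share a bug with the encoder.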
"just use a standard hexdump and learn to eyeball it." Yup. In my experience, multiple sections of anything up to 200 bits can be written down on a whiteboard for grouped comparison, which sometimes helps with this kind of thing to get started.
– Mast
Jan 16 at 19:20
I find a separate decoder well worth the effort if the binary data plays an important part in the application (or system, in general). This is especially true if the data format is variable: data in fixed layouts can be spotted in a hexdump with a little practice, but that hits a practicality wall fast. We debugged USB and CAN traffic with commercial packet decoders, and I have written a PROFIBUS decoder (where variables spread across bytes, completely unreadable in a hex dump), and found all three of them immensely helpful.
– Peter A. Schneider
Jan 17 at 13:14
The first step is to find or define a grammar that describes the structure of the data, i.e. a schema.
An example of this is a language feature of COBOL informally known as a copybook. In COBOL programs you define the structure of the data in memory, and this structure maps directly to the way the bytes are stored. This is common in languages of that era, as opposed to common contemporary languages, where the physical layout of memory is an implementation concern abstracted away from the developer.
A Google search for "binary data schema language" turns up a number of tools. One example is Apache DFDL. There may already be UIs for this as well.
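As a lightweight stand-in for a full schema language like DFDL, the same idea can be sketched by expressing the layout as plain data and driving a generic decoder from it. The record layout below is invented for illustration:

```python
import struct

# Declarative layout: (field name, struct format code), little-endian.
SCHEMA = [("msg_type", "B"), ("length", "H"), ("crc", "I")]

def decode(buf: bytes, schema=SCHEMA) -> dict:
    """Generic decoder driven entirely by the schema table, so the same
    table could also drive an encoder, a pretty-printer, or a highlighter."""
    out, offset = {}, 0
    for name, code in schema:
        (value,) = struct.unpack_from("<" + code, buf, offset)
        out[name] = value
        offset += struct.calcsize("<" + code)
    return out

sample = struct.pack("<BHI", 7, 300, 0xDEADBEEF)
print(decode(sample))  # {'msg_type': 7, 'length': 300, 'crc': 3735928559}
```

Because the schema is data rather than code, it only has to change in one place when the format changes, which is exactly the property the schema-language tools provide at larger scale.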
This feature is not reserved to 'ancient'-era languages. C and C++ structs and unions can be memory-aligned, and C# has StructLayoutAttribute, which I have used to transmit binary data.
– Kasper van den Berg
Jan 16 at 16:53
@KaspervandenBerg Unless you are saying that C and C++ added these recently, I consider that the same era. The point is that these formats were not simply for data transmission, though they were used for that; they mapped directly to how the code worked with data in memory and on disk. That's not, in general, how newer languages tend to work, though they may have such features.
– JimmyJames
Jan 16 at 17:14
@KaspervandenBerg C++ does not do that as much as you think it does. It is possible using implementation-specific tooling to align and eliminate padding (and, admittedly, increasingly the standard is adding features for this kind of thing) and member order is deterministic (but not necessarily the same as in memory!).
– Lightness Races in Orbit
Jan 17 at 16:40
ASN.1, Abstract Syntax Notation One, provides a way of specifying a binary format.
- DDT: develop using sample data and unit tests.
- A textual dump can be helpful. If it is in XML, you can collapse/expand sub-hierarchies.
- ASN.1 is not strictly needed, but a grammar-based, more declarative file specification is easier to work with.
If the never-ending parade of security vulnerabilities in ASN.1 parsers is any indication, adopting it would certainly provide good exercise in debugging binary formats.
– Mark
Jan 16 at 22:50
@Mark many small byte arrays (and those in varying hierarchy trees) are often not handled right (securely) in C (for instance, without exceptions). Never underestimate the low-level nature and inherent unsafety of C. ASN.1 in, for instance, Java does not expose this problem. Since ASN.1 grammar-directed parsing can be done safely, even C could be done with a small and safe code base. And some of the vulnerabilities are inherent in the binary format itself: one can exploit "legal" constructs of the format's grammar that have disastrous semantics.
– Joop Eggen
Jan 17 at 8:09
Other answers have described viewing a hex dump or writing out object structures in, for example, JSON. I think combining both of these is very helpful.
Using a tool that can render the JSON on top of the hex dump is really useful. I wrote an open-source tool that parses .NET binaries, called dotNetBytes; here's a view of an example DLL.
I'm not sure I fully understand, but it sounds like you have a parser for this binary format and you control its code. This answer is built on that assumption.
A parser will in some way be filling up structs, classes, or whatever data structure your language has. If you implement a ToString() for everything that gets parsed, you end up with a very easy-to-use and easily maintained way of displaying that binary data in a human-readable format.
You would essentially have:
byte[] arrayOfBytes; // initialized somehow
Object obj = Parser.parse(arrayOfBytes);
Logger.log(obj.ToString());
And that's it, from the standpoint of using it. Of course this requires you to implement/override the ToString() function for your Object class/struct/whatever, and you would also have to do so for any nested classes/structs/whatevers.
You can additionally use a conditional statement to prevent the ToString() function from being called in release builds, so that you don't waste time on something that won't be logged outside of debug mode.
Your ToString() might look like this:
return String.Format("{0},{1},{2},{3}", int32var, int16var, int8var, int32var2);
// OR
return String.Format("{0}:{1},{2}:{3},{4}:{5},{6}:{7}", varName1, int32var, varName2, int16var, varName3, int8var, varName4, int32var2);
Your original question makes it sound like you've somewhat attempted to do this and that you find the method burdensome, but you have also, at some point, implemented parsing of the binary format and created variables to store that data. So all you have to do is print those existing variables at the appropriate level of abstraction (the class/struct the variable is in).
This is something you should only have to do once, and you can do it while building the parser. It will only change when the binary format changes (which will already prompt a change to your parser anyway).
In a similar vein: some languages have robust features for turning classes into XML or JSON. C# is particularly good at this. You don't have to give up your binary format; you just do the XML or JSON in a debug logging statement and leave your release code alone.
I'd personally recommend not going the hex dump route, because it's prone to errors (did you start on the right byte, are you sure that when reading left to right you're "seeing" the correct endianness, etc.).
Example: Say your ToString() implementations spit out variables a,b,c,d,e,f,g,h. You run your program and notice a bug with g, but the problem really started with c (though, since you're still debugging, you haven't figured that out yet). If you know the input values (and you should), you'll instantly see that c is where the problems start.
Compare that to a hex dump that just tells you 338E 8455 0000 FF76 0000 E444 ...: if your fields vary in size, where does c begin, and what is its value? A hex editor will tell you, but my point is that this is error-prone and time-consuming. Not only that, but you can't easily or quickly automate a test via a hex viewer. Printing out a string after parsing the data will tell you exactly what your program is "thinking", and it will be one step along the path to automated testing.
protected by gnat Jan 17 at 16:12
answered Jan 16 at 15:27 by Useless
The first step to doing this is that you need a way to find or define a grammar that describes structure of the data i.e. a schema.
An example of this is a language feature of COBOL which is informally known as copybook. In COBOL programs you would define the structure of the data in memory. This structure mapped directly to the way the bytes were stored. This is common to languages of that era as opposed to common contemporary languages where the physical layout of memory is an implementation concern that is abstracted away from the developer.
A google search for binary data schema language turns up a number of tools. An example is Apache DFDL. There may already be UI for this as well.
2
This feature is not reserved to 'ancient' era languages. C and C++ structs and unions can be memory aligned. C# has StructLayoutAttribute, which I have use to transmit binary data.
– Kasper van den Berg
Jan 16 at 16:53
1
@KaspervandenBerg Unless you are saying that C and C++ added these recently, I consider that the same era. The point is that these formats were not simply for data transmission, though they were used for that, they mapped directly to how the code worked with data in memory and on disk. That's not, in general, how newer languages tend to work though they may have such features.
– JimmyJames
Jan 16 at 17:14
@KaspervandenBerg C++ does not do that as much as you think it does. It is possible using implementation-specific tooling to align and eliminate padding (and, admittedly, increasingly the standard is adding features for this kind of thing) and member order is deterministic (but not necessarily the same as in memory!).
– Lightness Races in Orbit
Jan 17 at 16:40
add a comment |
The first step to doing this is that you need a way to find or define a grammar that describes structure of the data i.e. a schema.
An example of this is a language feature of COBOL which is informally known as copybook. In COBOL programs you would define the structure of the data in memory. This structure mapped directly to the way the bytes were stored. This is common to languages of that era as opposed to common contemporary languages where the physical layout of memory is an implementation concern that is abstracted away from the developer.
A google search for binary data schema language turns up a number of tools. An example is Apache DFDL. There may already be UI for this as well.
2
This feature is not reserved to 'ancient' era languages. C and C++ structs and unions can be memory aligned. C# has StructLayoutAttribute, which I have use to transmit binary data.
– Kasper van den Berg
Jan 16 at 16:53
1
@KaspervandenBerg Unless you are saying that C and C++ added these recently, I consider that the same era. The point is that these formats were not simply for data transmission, though they were used for that, they mapped directly to how the code worked with data in memory and on disk. That's not, in general, how newer languages tend to work though they may have such features.
– JimmyJames
Jan 16 at 17:14
@KaspervandenBerg C++ does not do that as much as you think it does. It is possible using implementation-specific tooling to align and eliminate padding (and, admittedly, increasingly the standard is adding features for this kind of thing) and member order is deterministic (but not necessarily the same as in memory!).
– Lightness Races in Orbit
Jan 17 at 16:40
edited Jan 16 at 16:22
answered Jan 16 at 15:38
JimmyJames
ASN.1, Abstract Syntax Notation One, provides a way of specifying a binary format.
- DDT: develop using sample data and unit tests.
- A textual dump can be helpful. If it is in XML, you can collapse/expand sub-hierarchies.
- ASN.1 is not strictly needed, but a grammar-based, more declarative file specification makes things easier.
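As a minimal sketch of the grammar-directed idea (not real ASN.1; the field spec below is made up for illustration), the parser walks a declarative list of fields instead of hard-coding offsets:

```python
import struct

# Hypothetical declarative field spec: (name, struct format code).
SPEC = [("version", "B"), ("flags", "B"), ("length", "H"), ("checksum", "I")]

def parse(blob, spec, endian="<"):
    """Walk the spec instead of hard-coding offsets."""
    out, offset = {}, 0
    for name, code in spec:
        fmt = endian + code
        (out[name],) = struct.unpack_from(fmt, blob, offset)
        offset += struct.calcsize(fmt)
    return out

blob = struct.pack("<BBHI", 2, 1, 64, 3735928559)
print(parse(blob, SPEC))
# -> {'version': 2, 'flags': 1, 'length': 64, 'checksum': 3735928559}
```

Because the spec is data, the same list can drive the parser, the serializer, and a field-by-field debug dump.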
If the never-ending parade of security vulnerabilities in ASN.1 parsers is any indication, adopting it would certainly provide good exercise in debugging binary formats.
– Mark
Jan 16 at 22:50
@Mark many small byte arrays (and in varying hierarchy trees) are often not handled correctly (securely) in C (for instance, not using exceptions). Never underestimate the low-levelness and inherent unsafeness of C. ASN.1 in, for instance, Java does not expose this problem. As ASN.1 grammar-directed parsing can be done safely, even a C implementation could get by with a small, safe code base. And part of the vulnerabilities are inherent in the binary format itself: one can exploit "legal" constructs of the format's grammar that have disastrous semantics.
– Joop Eggen
Jan 17 at 8:09
answered Jan 16 at 16:37
Joop Eggen
Other answers have described viewing a hex dump, or writing out object structures in e.g. JSON. I think combining both of these is very helpful.
Using a tool that can render the JSON on top of the hex dump is really useful; I wrote an open-source tool called dotNetBytes that parses .NET binaries, here's a view of an example DLL.
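The overlay idea can be sketched in a few lines: print each field's offset, raw bytes, and decoded value side by side, then the JSON view. The layout below (magic/count/total) is hypothetical, not the dotNetBytes format:

```python
import json
import struct

# Hypothetical layout: 16-bit magic, 8-bit count, 32-bit total (little-endian).
FIELDS = [("magic", "<H"), ("count", "<B"), ("total", "<I")]

def annotated_dump(blob):
    parsed, offset = {}, 0
    for name, fmt in FIELDS:
        size = struct.calcsize(fmt)
        chunk = blob[offset:offset + size]
        (parsed[name],) = struct.unpack(fmt, chunk)
        # offset, raw bytes, and decoded value on one line
        print(f"{offset:04x}  {chunk.hex(' '):<12}  {name} = {parsed[name]}")
        offset += size
    print(json.dumps(parsed))
    return parsed

annotated_dump(struct.pack("<HBI", 51966, 3, 1000))
```

This gives you the hex dump and the structured view in one place, so you never have to count bytes by hand to find a field.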
answered Jan 17 at 21:33
Carl Walsh
I'm not sure I fully understand, but it sounds like you have a parser for this binary format, and you control the code for it. So this answer is built on that assumption.
A parser will in some way be filling up structs, classes, or whatever data structure your language has. If you implement a ToString for everything that gets parsed, you end up with an easy-to-use, easily maintained way of displaying that binary data in a human-readable format.
You would essentially have:
byte[] arrayOfBytes; // initialized somehow
Object obj = Parser.parse(arrayOfBytes);
Logger.log(obj.ToString());
And that's it, from the standpoint of using it. Of course, this requires you to implement/override the ToString function for your Object class/struct/whatever, and to do the same for any nested classes/structs.
You can additionally use a conditional statement to prevent the ToString function from being called in release code, so that you don't waste time on something that won't be logged outside of debug mode.
Your ToString might look like this:
return String.Format("{0},{1},{2},{3}", int32var, int16var, int8var, int32var2);
// OR, with field names:
return String.Format("{0}:{1},{2}:{3},{4}:{5},{6}:{7}", varName1, int32var, varName2, int16var, varName3, int8var, varName4, int32var2);
Your original question makes it sound like you've somewhat attempted this and found it burdensome, but at some point you already implemented parsing of the binary format and created variables to store that data. So all you have to do is print those existing variables at the appropriate level of abstraction (the class/struct the variable is in).
This is something you should only have to do once, and you can do it while building the parser. And it will only change when the binary format changes (which will already prompt a change to your parser anyways).
In a similar vein: some languages have robust features for turning classes into XML or JSON; C# is particularly good at this. You don't have to give up your binary format: you just emit the XML or JSON in a debug logging statement and leave your release code alone.
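As a sketch of the same idea in Python (the Header layout here is a made-up example), a dataclass gives you the "ToString" and the JSON debug dump almost for free:

```python
import json
import struct
from dataclasses import asdict, dataclass

@dataclass
class Header:
    version: int
    length: int
    crc: int

    @classmethod
    def parse(cls, blob: bytes) -> "Header":
        # "<BHI": unsigned byte, unsigned 16-bit, unsigned 32-bit = 7 bytes
        return cls(*struct.unpack("<BHI", blob[:7]))

h = Header.parse(struct.pack("<BHI", 1, 42, 43981))
print(h)                      # the generated repr is the "ToString"
print(json.dumps(asdict(h)))  # the JSON debug dump
```

Because the repr and the JSON view are both generated from the fields, they stay in sync with the parser as the format evolves.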
I'd personally recommend not going the hex dump route, because it's prone to errors (did you start on the right byte, are you sure when you're reading left to right that you're "seeing" the correct endianness, etc).
Example: say your ToStrings spit out variables a,b,c,d,e,f,g,h. You run your program and notice a bug with g, but the problem really started with c (you're still debugging, so you haven't figured that out yet). If you know the input values (and you should), you'll instantly see that c is where the problems start.
Compare that to a hex dump that just tells you 338E 8455 0000 FF76 0000 E444 ...: if your fields vary in size, where does c begin, and what's its value? A hex editor will tell you, but my point is that this is error-prone and time-consuming. Not only that, but you can't easily or quickly automate a test via a hex viewer. Printing out a string after parsing the data tells you exactly what your program is 'thinking', and is one step along the path to automated testing.
answered Jan 17 at 16:02
Shaz
protected by gnat Jan 17 at 16:12
You got one answer saying "use the hexdump directly, and do this and that additionally", and that answer got a lot of upvotes. And a second answer, 5 hours later(!), saying only "use a hexdump". Then you accepted the second one over the first? Seriously?
– Doc Brown
Jan 16 at 22:29
While you might have a good reason to use a binary format, do consider whether you can just use an existing text format like JSON instead. Human readability counts a lot, and machines and networks are typically fast enough that using a custom format to reduce size is unnecessary nowadays.
– jpmc26
Jan 16 at 23:45
@jpmc26 there's still a lot of use for binary formats and always will be. Human readability is usually secondary to performance, storage requirements, and network performance. And there are still many areas where especially network performance is poor and storage limited. Plus don't forget all the systems having to interface with legacy systems (both hardware and software) and having to support their data formats.
– jwenting
Jan 17 at 10:15
@jwenting No, actually, developer time is usually the most expensive piece of an application. Sure, that may not be the case if you're working at Google or Facebook, but most apps don't operate at that scale. And when your developers spending time on stuff is the most expensive resource, human readability counts for a lot more than the 100 milliseconds extra for the program to parse it.
– jpmc26
Jan 17 at 13:09
@jpmc26 I don't see anything in the question that suggests to me that the OP is the one defining the format.
– JimmyJames
Jan 17 at 15:40