C Data Serialization

The problem is that with various data structures (which often contain void* data, so you don't know whether you need to care about byte ordering) the code becomes bloated with serialization code that is specific to each data structure and can't be reused at all. Here, we use the term 'serialization' to mean the reversible deconstruction of an arbitrary set of C data structures into a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context. It is often necessary to send or receive complex data structures to or from another program that may run on a different architecture or may have been designed for a different version of the data structures in question. A typical example is a program that saves its state to a file on exit and then reads it back when started.


This is a comparison of data serialization formats.

Name | Creator-maintainer | Based on | Standardized? | Specification | Binary? | Human-readable? | Supports references?[e] | Schema-IDL? | Standard APIs | Supports zero-copy operations
Apache Avro | Apache Software Foundation | N/A | No | Apache Avro™ 1.8.1 Specification | Yes | No | N/A | Yes (built-in) | N/A | N/A
Apache Parquet | Apache Software Foundation | N/A | No |  | Yes | No | No | N/A | Java, Python | No
ASN.1 | ISO, IEC, ITU-T | N/A | Yes | ISO/IEC 8824; X.680 series of ITU-T Recommendations | Yes (BER, DER, PER, OER, or custom via ECN) | Yes (XER, JER, GSER, or custom via ECN) | Partial[f] | Yes (built-in) | N/A | Yes (OER)
Bencode | Bram Cohen (creator), BitTorrent, Inc. (maintainer) | N/A | De facto standard via BitTorrent Enhancement Proposal (BEP) | Part of BitTorrent protocol specification | Partially (numbers and delimiters are ASCII) | No | No | No | No | N/A
Binn | Bernardo Ramos | N/A | No | Binn Specification | Yes | No | No | No | No | Yes
BSON | MongoDB | JSON | No | BSON Specification | Yes | No | No | No | No | N/A
CBOR | Carsten Bormann, P. Hoffman | JSON (loosely) | Yes | RFC 7049 | Yes | No | Yes, through tagging | Yes (CDDL) | No | Yes
Comma-separated values (CSV) | RFC author: Yakov Shafranovich | N/A | Partial (myriad informal variants used) | RFC 4180 (among others) | No | Yes | No | No | No | No
Common Data Representation (CDR) | Object Management Group | N/A | Yes | General Inter-ORB Protocol | Yes | No | Yes | Yes | ADA, C, C++, Java, Cobol, Lisp, Python, Ruby, Smalltalk | N/A
D-Bus Message Protocol | freedesktop.org | N/A | Yes | D-Bus Specification | Yes | No | No | Partial (Signature strings) | Yes (see D-Bus) | N/A
Efficient XML Interchange (EXI) | W3C | XML, Efficient XML | Yes | Efficient XML Interchange (EXI) Format 1.0 | Yes | Yes (XML) | Yes (XPointer, XPath) | Yes (XML Schema) | Yes (DOM, SAX, StAX, XQuery, XPath) | N/A
FlatBuffers | Google | N/A | No |  | Yes | Yes (Apache Arrow) | Partial (internal to the buffer) | Yes [2] | C++, Java, C#, Go, Python, Rust, JavaScript, PHP, C, Dart, Lua, TypeScript | Yes
Fast Infoset | ISO, IEC, ITU-T | XML | Yes | ITU-T X.891 and ISO/IEC 24824-1:2007 | Yes | No | Yes (XPointer, XPath) | Yes (XML schema) | Yes (DOM, SAX, XQuery, XPath) | N/A
FHIR | Health Level 7 | REST basics | Yes | Fast Healthcare Interoperability Resources | Yes | Yes | Yes | Yes | Hapi for FHIR[1] JSON, XML, Turtle | No
Ion | Amazon | JSON | No | The Amazon Ion Specification | Yes | Yes | No | No | No | N/A
Java serialization | Oracle Corporation | N/A | Yes | Java Object Serialization | Yes | No | Yes | No | Yes | N/A
JSON | Douglas Crockford | JavaScript syntax | Yes | STD 90/RFC 8259 (ancillary: RFC 6901, RFC 6902), ECMA-404, ISO/IEC 21778:2017 | No, but see BSON, Smile, UBJSON | Yes | Yes (JSON Pointer (RFC 6901); alternately: JSONPath, JPath, JSPON, json:select()), JSON-LD | Partial (JSON Schema Proposal, ASN.1 with JER, Kwalify, Rx, Itemscript Schema), JSON-LD | Partial (Clarinet, JSONQuery, JSONPath), JSON-LD | No
MessagePack | Sadayuki Furuhashi | JSON (loosely) | No | MessagePack format specification | Yes | No | No | No | No | Yes
Netstrings | Dan Bernstein | N/A | No | netstrings.txt | Yes | Yes | No | No | No | Yes
OGDL | Rolf Veen | ? | No | Specification | Yes (Binary Specification) | Yes | Yes (Path Specification) | Yes (Schema WD) |  | N/A
OPC-UA Binary | OPC Foundation | N/A | No | opcfoundation.org | Yes | No | Yes | No | No | N/A
OpenDDL | Eric Lengyel | C, PHP | No | OpenDDL.org | No | Yes | Yes | No | Yes (OpenDDL Library) | N/A
Pickle (Python) | Guido van Rossum | Python | De facto standard via Python Enhancement Proposals (PEPs) | [3] PEP 3154 -- Pickle protocol version 4 | Yes | No | No | No | Yes ([4]) | No
Property list | NeXT (creator), Apple (maintainer) | ? | Partial | Public DTD for XML format | Yes[a] | Yes[b] | No | ? | Cocoa, CoreFoundation, OpenStep, GnuStep | No
Protocol Buffers (protobuf) | Google | N/A | No | Developer Guide: Encoding | Yes | Partial[d] | No | Yes (built-in) | C++, C#, Java, Python, Javascript, Go | No
S-expressions | John McCarthy (original), Ron Rivest (internet draft) | Lisp, Netstrings | Partial (largely de facto) |  | Yes ('Canonical representation') | Yes ('Advanced transport representation') | No | No |  | N/A
Smile | Tatu Saloranta | JSON | No | Smile Format Specification | Yes | No | No | Partial (JSON Schema Proposal, other JSON schemas/IDLs) | Partial (via JSON APIs implemented with Smile backend, on Jackson, Python) | N/A
SOAP | W3C | XML | Yes | W3C Recommendations: SOAP/1.1, SOAP/1.2 | Partial (Efficient XML Interchange, Binary XML, Fast Infoset, MTOM, XSD base64 data) | Yes | Yes (built-in id/ref, XPointer, XPath) | Yes (WSDL, XML schema) | Yes (DOM, SAX, XQuery, XPath) | N/A
Structured Data eXchange Formats | Max Wildgrube | N/A | Yes | RFC 3072 | Yes | No | No | No |  | N/A
Thrift | Facebook (creator), Apache (maintainer) | N/A | No | Original whitepaper | Yes | Partial[c] | No | Yes (built-in) |  | N/A
UBJSON | The Buzz Media, LLC | JSON, BSON | No | [5] | Yes | No | No | No | No | N/A
eXternal Data Representation (XDR) | Sun Microsystems (creator), IETF (maintainer) | N/A | Yes | STD 67/RFC 4506 | Yes | No | Yes | Yes | Yes | N/A
XML | W3C | SGML | Yes | W3C Recommendations: 1.0 (Fifth Edition), 1.1 (Second Edition) | Partial (Efficient XML Interchange, Binary XML, Fast Infoset, XSD base64 data) | Yes | Yes (XPointer, XPath) | Yes (XML schema, RELAX NG) | Yes (DOM, SAX, XQuery, XPath) | N/A
XML-RPC | Dave Winer[2] | XML | No | XML-RPC Specification | No | Yes | No | No | No | N/A
YAML | Clark Evans, Ingy döt Net, and Oren Ben-Kiki | C, Java, Perl, Python, Ruby, Email, HTML, MIME, URI, XML, SAX, SOAP, JSON[3] | No | Version 1.2 | No | Yes | Yes | Partial (Kwalify, Rx, built-in language type-defs) | No | N/A

  • a. The current default format is binary.
  • b. The 'classic' format is plain text, and an XML format is also supported.
  • c. Theoretically possible due to abstraction, but no implementation is included.
  • d. The primary format is binary, but a text format is available.[4]
  • e. Means that generic tools/libraries know how to encode, decode, and dereference a reference to another piece of data in the same document. A tool may require the IDL file, but no more. Excludes custom, non-standardized referencing techniques.
  • f. ASN.1 does offer OIDs, a standard format for globally unique identifiers, as well as a standard notation ('absolute reference') for referencing a component of a value. Thus it would be possible to reference a component of an encoded value present in a document by combining an OID (assigned to the document) and an 'absolute reference' to the component of the value. However, there is no standard way to indicate that a field contains such an absolute reference. Therefore, a generic ASN.1 tool/library cannot automatically encode/decode/resolve references within a document without help from custom-written program code.
  • g. VelocyPack offers a value type to store pointers to other VPack items. It is allowed if the VPack data resides in memory, but not if stored on disk or sent over a network.
  • h. The primary format is binary, but a text format is available.[5][6]
  • i. The primary format is binary, but text and JSON formats are available.[7]


Syntax comparison of human-readable formats

Columns: Null, Boolean true, Boolean false, Integer, Floating-point, String, Array, Associative array/Object. Alternatives within a cell are separated by semicolons.

ASN.1 (XML Encoding Rules)
  Null: <foo />
  Boolean true: <foo>true</foo>
  Boolean false: <foo>false</foo>
  Integer: <foo>685230</foo>
  Floating-point: <foo>6.8523015e+5</foo>
  String: <foo>A to Z</foo>
  Associative array/Object: an object (the key is a field name) or a data mapping (the key is a data value)

CSV[b]
  Null: null[a] (or an empty element in the row)[a]
  Boolean true: 1[a]; true[a]
  Boolean false: 0[a]; false[a]
  Integer: 685230; -685230[a]
  Floating-point: 6.8523015e+5[a]
  String: A to Z; "We said, ""no""."
  Array: true,,-42.1e7,"A to Z"

Ion
  Null: null; null.null; null.bool; null.int; null.float; null.decimal; null.timestamp; null.string; null.symbol; null.blob; null.clob; null.struct; null.list; null.sexp
  Boolean true: true
  Boolean false: false
  Integer: 685230; -685230; 0xA74AE; 0b111010010101110
  Floating-point: 6.8523015e5
  String: "A to Z"; or as a long string:
    '''
    A
    to
    Z
    '''

Netstrings[c]
  Null: 0:,[a]; 4:null,[a]
  Boolean true: 1:1,[a]; 4:true,[a]
  Boolean false: 1:0,[a]; 5:false,[a]
  Integer: 6:685230,[a]
  Floating-point: 9:6.8523e+5,[a]
  String: 6:A to Z,
  Array: 29:4:true,0:,7:-42.1e7,6:A to Z,,
  Associative array/Object: 41:9:2:42,1:1,,25:6:A to Z,12:1:1,1:2,1:3,,,,[a]

JSON
  Null: null
  Boolean true: true
  Boolean false: false
  Integer: 685230; -685230
  Floating-point: 6.8523015e+5
  String: "A to Z"

OGDL[verification needed]
  Null: null[a]
  Boolean true: true[a]
  Boolean false: false[a]
  Integer: 685230[a]
  Floating-point: 6.8523015e+5[a]
  String: "A to Z"; 'A to Z'; NoSpaces
  Array: (true, null, -42.1e7, "A to Z")

OpenDDL
  Null: ref {null}
  Boolean true: bool {true}
  Boolean false: bool {false}
  Integer: int32 {685230}; int32 {0x74AE}; int32 {0b111010010101110}
  Floating-point: float {6.8523015e+5}
  String: string {"A to Z"}
  Array: homogeneous array or heterogeneous array

Pickle (Python)
  Null: N.
  Boolean true: I01\n.
  Boolean false: I00\n.
  Integer: I685230\n.
  Floating-point: F685230.15\n.
  String: S'A to Z'\n.
  Array: (lI01\na(laF-421000000.0\naS'A to Z'\na.
  Associative array/Object: (dI42\nI01\nsS'A to Z'\n(lI1\naI2\naI3\nas.

Property list (plain text format)[8]
  Null: N/A
  Boolean true: <*BY>
  Boolean false: <*BN>
  Integer: <*I685230>
  Floating-point: <*R6.8523015e+5>
  String: "A to Z"
  Array: ( <*BY>, <*R-42.1e7>, "A to Z" )

Property list (XML format)[9][10]
  Null: N/A
  Boolean true: <true />
  Boolean false: <false />
  Integer: <integer>685230</integer>
  Floating-point: <real>6.8523015e+5</real>
  String: <string>A to Z</string>

Protocol Buffers
  Null: N/A
  Boolean true: true
  Boolean false: false
  Integer: 685230; -685230
  Floating-point: 20.0855369
  String: "A to Z"; "sdfff2 \000\001\002\377\376\375"; "q\tqq<>q2&\001\377"

S-expressions
  Null: NIL; nil
  Boolean true: T; #t[f]; true
  Boolean false: NIL; #f[f]; false
  Integer: 685230
  Floating-point: 6.8523015e+5
  String: abc; "abc"; #616263#; 3:abc; {MzphYmM=}; |YWJj|
  Array: (T NIL -42.1e7 "A to Z")
  Associative array/Object: ((42 T) ("A to Z" (1 2 3)))

YAML
  Null: ~; null; Null; NULL[11]
  Boolean true: y; Y; yes; Yes; YES; on; On; ON; true; True; TRUE[12]
  Boolean false: n; N; no; No; NO; off; Off; OFF; false; False; FALSE[12]
  Integer: 685230; +685_230; -685230; 02472256; 0x_0A_74_AE; 0b1010_0111_0100_1010_1110; 190:20:30[13]
  Floating-point: 6.8523015e+5; 685.230_15e+03; 685_230.15; 190:20:30.15; .inf; -.inf; .Inf; .INF; .NaN; .nan; .NAN[14]
  String: A to Z; "A to Z"; 'A to Z'
  Array: [y, ~, -42.1e7, "A to Z"]
  Associative array/Object: {"John":3.14, "Jane":2.718}

XML[e] and SOAP
  Null: <null />[a]
  Boolean true: true
  Boolean false: false
  Integer: 685230
  Floating-point: 6.8523015e+5
  String: A to Z

XML-RPC
  Boolean true: <value><boolean>1</boolean></value>
  Boolean false: <value><boolean>0</boolean></value>
  Integer: <value><int>685230</int></value>
  Floating-point: <value><double>6.8523015e+5</double></value>
  String: <value><string>A to Z</string></value>

  • a. Omitted XML elements are commonly decoded by XML data binding tools as NULLs. Shown here is another possible encoding; XML schema does not define an encoding for this datatype.
  • b. The RFC CSV specification only deals with delimiters, newlines, and quote characters; it does not directly deal with serializing programming data structures.
  • c. The netstrings specification only deals with nested byte strings; anything else is outside the scope of the specification.
  • d. PHP will unserialize any floating-point number correctly, but will serialize them to their full decimal expansion. For example, 3.14 will be serialized to 3.140000000000000124344978758017532527446746826171875.
  • e. XML data bindings and SOAP serialization tools provide type-safe XML serialization of programming data structures into XML. Shown are XML values that can be placed in XML elements and attributes.
  • f. This syntax is not compatible with the Internet-Draft, but is used by some dialects of Lisp.
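
The netstrings entries above all follow one rule: a decimal byte count, a colon, the bytes, and a trailing comma, with nesting done by encoding already-encoded netstrings again. A minimal C sketch of that rule follows; the function name netstring_encode and its signature are made up for illustration and do not come from the netstrings specification.

```c
#include <stdio.h>
#include <string.h>

/* Encode `len` bytes of `data` as a netstring: decimal length, ':',
 * the bytes, ','. Returns the encoded length, or 0 if `cap` is too
 * small. The output is not NUL-terminated.
 * (Hypothetical helper, named for this sketch only.) */
size_t netstring_encode(const void *data, size_t len,
                        unsigned char *out, size_t cap)
{
    char prefix[32];
    int n = snprintf(prefix, sizeof prefix, "%zu:", len);

    if (n < 0 || (size_t)n >= sizeof prefix)
        return 0;
    if (cap < (size_t)n + 1 || cap - (size_t)n - 1 < len)
        return 0;

    memcpy(out, prefix, (size_t)n);          /* "6:"        */
    if (len > 0)
        memcpy(out + n, data, len);          /* "A to Z"    */
    out[(size_t)n + len] = ',';              /* trailing ',' */
    return (size_t)n + len + 1;
}
```

Encoding the 29-byte concatenation of the four element netstrings yields the array form shown in the table, including the doubled comma at the end.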

Comparison of binary formats

Columns: Null, Booleans, Integer, Floating-point, String, Array, Associative array/Object.

ASN.1 (BER, PER or OER encoding)
  Null: NULL type
  Booleans: BOOLEAN:
    • BER: as 1 byte in binary form;
    • PER: as 1 bit;
    • OER: as 1 byte
  Integer: INTEGER:
    • BER: variable-length big-endian binary representation (up to 2^(2^1024) bits);
    • PER Unaligned: a fixed number of bits if the integer type has a finite range; a variable number of bits otherwise;
    • PER Aligned: a fixed number of bits if the integer type has a finite range and the size of the range is less than 65536; a variable number of octets otherwise;
    • OER: one, two, or four octets (either signed or unsigned) if the integer type has a finite range that fits in that number of octets; a variable number of octets otherwise
  Floating-point: REAL:
    • base-10 real values are represented as character strings in ISO 6093 format;
    • binary real values are represented in a binary format that includes the mantissa, the base (2, 8, or 16), and the exponent;
    • the special values NaN, -INF, +INF, and negative zero are also supported
  String: multiple valid types (VisibleString, PrintableString, GeneralString, UniversalString, UTF8String)
  Array: data specifications SET OF (unordered) and SEQUENCE OF (guaranteed order)
  Associative array/Object: user-definable type

Binn
  Null: \x00
  Booleans: True: \x01; False: \x02
  Integer: big-endian 2's complement signed and unsigned 8/16/32/64 bits
  Floating-point: single: big-endian binary32; double: big-endian binary64
  String: UTF-8 encoded, null terminated, preceded by int8 or int32 string length in bytes
  Array: typecode (one byte) + 1-4 bytes size + 1-4 bytes items count + list items
  Associative array/Object: typecode (one byte) + 1-4 bytes size + 1-4 bytes items count + key/value pairs

BSON
  Null: Null type (0 bytes for value)
  Booleans: True: one byte \x01; False: \x00
  Integer: int32: 32-bit little-endian 2's complement, or int64: 64-bit little-endian 2's complement
  Floating-point: double: little-endian binary64
  String: UTF-8 encoded, preceded by int32-encoded string length in bytes
  Array: BSON embedded document with numeric keys
  Associative array/Object: BSON embedded document

Concise Binary Object Representation (CBOR)
  Null: \xf6
  Booleans: True: \xf5; False: \xf4
  Integer:
    • small positive number \x00-\x17 (0 to 23), small negative number \x20-\x37 (-1 to -24)
    • 8-bit: positive \x18\xhh, negative \x38\xhh
    • 16-bit: positive \x19<uint16_t>, negative \x39<uint16_t>
    • 32-bit: positive \x1A<uint32_t>, negative \x3A<uint32_t>
    • 64-bit: positive \x1B<uint64_t>, negative \x3B<uint64_t>
    • a negative number x is encoded as ~x (binary inversion), that is, as (-x-1)
    • byte order: big-endian
  Floating-point: typecode (one byte) + IEEE half/single/double
  String: typecode with length (like integer coding) and content; byte strings and UTF-8 strings have different typecodes
  Array: typecode with count (like integer coding) and items
  Associative array/Object: typecode with pairs count (like integer coding) and pairs

Efficient XML Interchange (EXI)
  Null: xsi:nil element (1-4 bits depending on context)
  Booleans: 1 bit
  Integer: 0-12 bits (log2 of the range) for integers with defined ranges less than 4096. Extensible sequence of octets with infinite range for larger or undefined ranges. Also supports custom representations.
  Floating-point: scalable floating point representation requiring 18 to 88 bits depending on magnitude. Also supports IEEE and custom representations.
  String: length-prefixed sequence of Unicode code points with partitioned string tables for efficient representation of repeated items. The length and code points are represented as variable-length unsigned integers where values under 128 require 1 octet each. Also supports custom representations.
  Array: repeated elements or length-prefixed list of values. Also supports custom representations.
  Associative array/Object: ordered (sequence) or unordered (all) group of named elements

FlatBuffers
  Null: encoded as absence of field in parent object
  Booleans: True: one byte \x01; False: \x00
  Integer: little-endian 2's complement signed and unsigned 8/16/32/64 bits
  Floating-point: floats: little-endian binary32; doubles: little-endian binary64
  String: UTF-8 encoded, preceded by 32-bit integer length of string in bytes
  Array: vectors of any other type, preceded by 32-bit integer length of number of elements
  Associative array/Object: tables (schema-defined types) or vectors sorted by key (maps / dictionaries)

MessagePack
  Null: \xc0
  Booleans: True: \xc3; False: \xc2
  Integer: single byte "fixnum" (values -32..127), or typecode (one byte) + big-endian (u)int8/16/32/64
  Floating-point: typecode (one byte) + IEEE single/double
  String: typecode + up to 15 bytes, or typecode + length as uint8/16/32 + bytes; encoding is unspecified[15]
  Array: as "fixarray" (single-byte prefix + up to 15 array items), or typecode (one byte) + 2-4 bytes length + array items
  Associative array/Object: as "fixmap" (single-byte prefix + up to 15 key-value pairs), or typecode (one byte) + 2-4 bytes length + key-value pairs

Netstrings
  Null: 0:,
  Booleans: True: 1:1,; False: 1:0,

OGDL Binary

Property list (binary format)

Protocol Buffers
  Integer:
    • variable encoding length signed 32-bit: varint encoding of "ZigZag"-encoded value (n << 1) XOR (n >> 31)
    • variable encoding length signed 64-bit: varint encoding of "ZigZag"-encoded value (n << 1) XOR (n >> 63)
    • constant encoding length 32-bit: 32 bits in little-endian 2's complement
    • constant encoding length 64-bit: 64 bits in little-endian 2's complement
  Floating-point: floats: little-endian binary32; doubles: little-endian binary64
  String: UTF-8 encoded, preceded by varint-encoded integer length of string in bytes
  Array: repeated value with the same tag
  Associative array/Object: N/A

Smile
  Null: \x21
  Booleans: True: \x23; False: \x22
  Integer: single byte "small" (values -16..15 encoded using \xc0 - \xdf), zigzag-encoded varints (1-11 data bytes), or BigInteger
  Floating-point: IEEE single/double, BigDecimal
  String: length-prefixed "short" Strings (up to 64 bytes), marker-terminated "long" Strings and (optional) back-references
  Array: arbitrary-length heterogeneous arrays with end-marker
  Associative array/Object: arbitrary-length key/value pairs with end-marker

Structured Data eXchange Formats (SDXF)
  Integer: big-endian signed 24-bit or 32-bit integer
  Floating-point: big-endian IEEE double
  String: either UTF-8 or ISO 8859-1 encoded
  Array: list of elements with identical ID and size, preceded by array header with int16 length
  Associative array/Object: chunks can contain other chunks to arbitrary depth

Thrift

Any XML-based representation can be compressed with, or generated directly as, EXI (Efficient XML Interchange), which is a 'schema-informed' (as opposed to schema-required or schema-less) binary compression standard for XML.
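
The Protocol Buffers integer entry in the binary comparison combines two steps, ZigZag mapping and varint encoding. The C sketch below shows both for the 32-bit case; the function names zigzag32 and put_varint32 are chosen for this illustration and are not part of any protobuf library API.

```c
#include <stddef.h>
#include <stdint.h>

/* ZigZag maps signed to unsigned so small magnitudes stay small:
 * 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... This equals
 * (n << 1) XOR (n >> 31) with an arithmetic right shift; it is written
 * with unsigned arithmetic here to avoid shifting negative values. */
uint32_t zigzag32(int32_t n)
{
    return ((uint32_t)n << 1) ^ (n < 0 ? UINT32_MAX : 0u);
}

/* Varint: 7 payload bits per byte, least-significant group first,
 * with the high bit set on every byte except the last. Writes at most
 * 5 bytes to `out` and returns how many were written. */
size_t put_varint32(uint32_t v, unsigned char *out)
{
    size_t i = 0;

    while (v >= 0x80) {
        out[i++] = (unsigned char)(v | 0x80);  /* low 7 bits + continue flag */
        v >>= 7;
    }
    out[i++] = (unsigned char)v;               /* final byte, high bit clear */
    return i;
}
```

A signed field would be written as put_varint32(zigzag32(n), out), so that -1 occupies one byte instead of five.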

References

  1. "HAPI FHIR - The Open Source FHIR API for Java". hapifhir.io.
  2. "A Brief History of SOAP". www.xml.com.
  3. Ben-Kiki, Oren; Evans, Clark; Net, Ingy döt (2009-10-01). "YAML Ain't Markup Language (YAML) Version 1.2". The Official YAML Web Site. Retrieved 2012-02-10.
  4. "text_format.h - Protocol Buffers". Google Developers.
  5. "Cap'n Proto serialization/RPC system: core tools and C++ library - capnproto/capnproto". 2 April 2019, via GitHub.
  6. "Cap'n Proto: The capnp Tool". capnproto.org.
  7. "Fast Binary Encoding is ultra fast and universal serialization solution for C++, C#, Go, Java, JavaScript, Kotlin, Python, Ruby: chronoxor/FastBinaryEncoding". 2 April 2019, via GitHub.
  8. "NSPropertyListSerialization class documentation". www.gnustep.org.
  9. "Documentation Archive". developer.apple.com.
  10. "Documentation Archive". developer.apple.com.
  11. Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-01-18). "Null Language-Independent Type for YAML Version 1.1". YAML.org. Retrieved 2009-09-12.
  12. Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-01-18). "Boolean Language-Independent Type for YAML Version 1.1". YAML.org. Clark C. Evans. Retrieved 2009-09-12.
  13. Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-02-11). "Integer Language-Independent Type for YAML Version 1.1". YAML.org. Clark C. Evans. Retrieved 2009-09-12.
  14. Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-01-18). "Floating-Point Language-Independent Type for YAML Version 1.1". YAML.org. Clark C. Evans. Retrieved 2009-09-12.
  15. "MessagePack is an extremely efficient object serialization library. It's like JSON, but very fast and small.: msgpack/msgpack". 2 April 2019, via GitHub.


Retrieved from 'https://en.wikipedia.org/w/index.php?title=Comparison_of_data-serialization_formats&oldid=916328912'

I'm writing some code to serialize some data to send it over the network. Currently, I use this primitive procedure:

  1. create a void* buffer
  2. apply any byte ordering operations such as the hton family on the data I want to send over the network
  3. use memcpy to copy the memory into the buffer
  4. send the memory over the network
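
As a rough illustration of the four steps above, here is what the hand-written approach tends to look like for one small struct. The struct message type, its fields, and the pack_message name are made up for this sketch and are not taken from the actual protocol.

```c
#include <arpa/inet.h>   /* htonl, htons (POSIX) */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message type, used only for illustration. */
struct message {
    uint32_t id;
    uint16_t flags;
};

/* Wire size: 4 bytes of id + 2 bytes of flags, no padding. */
enum { MESSAGE_WIRE_SIZE = 6 };

/* Steps 1-3: allocate a buffer, convert each field to network byte
 * order, and memcpy it into place. Returns NULL on allocation
 * failure; the caller frees the result. */
void *pack_message(const struct message *m)
{
    unsigned char *buf = malloc(MESSAGE_WIRE_SIZE);
    uint32_t id_be;
    uint16_t flags_be;

    if (buf == NULL)
        return NULL;

    id_be = htonl(m->id);
    flags_be = htons(m->flags);

    memcpy(buf, &id_be, sizeof id_be);
    memcpy(buf + sizeof id_be, &flags_be, sizeof flags_be);
    return buf;   /* step 4 would hand buf to send() */
}
```

Every new struct needs another function of this shape with its own offsets, which is where the bloat comes from.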

The problem is that with various data structures (which often contain void* data so you don't know whether you need to care about byte ordering) the code becomes really bloated with serialization code that's very specific to each data structure and can't be reused at all.

What are some good serialization techniques for C that make this easier / less ugly?

-

Note: I'm bound to a specific protocol so I cannot freely choose how to serialize my data.

ryyst


4 Answers

For each data structure, have a serialize_X function (where X is the struct name) which takes a pointer to an X and a pointer to an opaque buffer structure and calls the appropriate serializing functions. You should supply some primitives such as serialize_int which write to the buffer and update the output index.

The primitives will have to call something like reserve_space(N), where N is the number of bytes that are required, before writing any data. reserve_space() will realloc the void* buffer to make it at least as big as its current size plus N bytes.

To make this possible, the buffer structure will need to contain a pointer to the actual data, the index to write the next byte to (output index), and the size that is allocated for the data.

With this system, all of your serialize_X functions should be pretty straightforward, for example:
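
A minimal sketch of that pattern, not the answer's own code: the Point and Segment structs are hypothetical stand-ins for X, and the buffer is a fixed-capacity one so the example stays short (a growable buffer would call reserve_space() where the bounds check is).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-capacity buffer, for illustration only. */
struct Buffer {
    unsigned char data[64];
    size_t next;            /* output index: where the next byte goes */
};

/* Primitive: append a 32-bit value in big-endian (network) order.
 * Returns 0 on success, -1 if the buffer is full. */
int serialize_uint32(uint32_t v, struct Buffer *b)
{
    if (sizeof b->data - b->next < 4)
        return -1;
    b->data[b->next++] = (unsigned char)(v >> 24);
    b->data[b->next++] = (unsigned char)(v >> 16);
    b->data[b->next++] = (unsigned char)(v >> 8);
    b->data[b->next++] = (unsigned char)v;
    return 0;
}

/* Hypothetical structs standing in for "X". */
struct Point { uint32_t x, y; };
struct Segment { struct Point from, to; };

/* One serialize_X per struct: call a primitive for each field. */
int serialize_Point(const struct Point *p, struct Buffer *b)
{
    if (serialize_uint32(p->x, b) != 0)
        return -1;
    return serialize_uint32(p->y, b);
}

/* A nested struct just delegates to the serializer of each member. */
int serialize_Segment(const struct Segment *s, struct Buffer *b)
{
    if (serialize_Point(&s->from, b) != 0)
        return -1;
    return serialize_Point(&s->to, b);
}
```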

And the framework code will be something like:
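
One way the framework could look, as a sketch under assumptions: INITIAL_SIZE, new_buffer, and free_buffer are names chosen here, the growth policy (doubling) is one option among several, and reserve_space() is given a return value so callers can detect allocation failure. The index is a size_t rather than an int.

```c
#include <stdlib.h>

#define INITIAL_SIZE 32   /* arbitrary starting capacity */

struct Buffer {
    unsigned char *data;  /* the serialized bytes */
    size_t next;          /* output index: where the next byte goes */
    size_t size;          /* bytes currently allocated for data */
};

/* Allocate an empty buffer; returns NULL on allocation failure. */
struct Buffer *new_buffer(void)
{
    struct Buffer *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->data = malloc(INITIAL_SIZE);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->next = 0;
    b->size = INITIAL_SIZE;
    return b;
}

/* Ensure at least `bytes` more bytes can be written at b->next,
 * doubling the allocation until it fits. Returns 0 on success, -1 on
 * failure (the buffer is left unchanged). */
int reserve_space(struct Buffer *b, size_t bytes)
{
    size_t needed, new_size = b->size;
    unsigned char *p;

    if (bytes > (size_t)-1 - b->next)
        return -1;                       /* next + bytes would overflow */
    needed = b->next + bytes;
    if (needed <= b->size)
        return 0;
    while (new_size < needed) {
        if (new_size > (size_t)-1 / 2) { /* doubling would overflow */
            new_size = needed;
            break;
        }
        new_size *= 2;
    }
    p = realloc(b->data, new_size);
    if (p == NULL)
        return -1;
    b->data = p;
    b->size = new_size;
    return 0;
}

/* Release a buffer created by new_buffer(). */
void free_buffer(struct Buffer *b)
{
    if (b != NULL) {
        free(b->data);
        free(b);
    }
}
```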

From this, it should be pretty simple to implement all of the serialize_X() functions you need.

EDIT: For example:
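
A sketch of what one such primitive might look like. The Buffer struct and a compact reserve_space() are repeated here only so the block compiles on its own; using int32_t for the value and returning an error code are choices made for this illustration.

```c
#include <arpa/inet.h>   /* htonl (POSIX) */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct Buffer {
    unsigned char *data;
    size_t next;   /* output index */
    size_t size;   /* allocated bytes */
};

/* Compact stand-in for reserve_space(): grow to exactly what is
 * needed. Returns 0 on success, -1 on allocation failure. */
static int reserve_space(struct Buffer *b, size_t bytes)
{
    unsigned char *p;

    if (b->size - b->next >= bytes)
        return 0;
    p = realloc(b->data, b->next + bytes);
    if (p == NULL)
        return -1;
    b->data = p;
    b->size = b->next + bytes;
    return 0;
}

/* Primitive: reserve room, convert to network byte order, copy the
 * bytes in, and advance the output index. */
int serialize_int(int32_t x, struct Buffer *b)
{
    uint32_t be = htonl((uint32_t)x);

    if (reserve_space(b, sizeof be) != 0)
        return -1;
    memcpy(b->data + b->next, &be, sizeof be);
    b->next += sizeof be;
    return 0;
}
```

Starting from an empty buffer ({ NULL, 0, 0 }), each call appends four bytes, and the caller frees b.data when done.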

EDIT: Also note that my code has some potential bugs. The size of the buffer array is stored in a size_t but the index is an int (I'm not sure if size_t is considered a reasonable type for an index). Also, there is no provision for error handling and no function to free the Buffer after you're done, so you'll have to do this yourself. I was just giving a demonstration of the basic architecture that I would use.

jstanley

I suggest using a library.

As I was not happy with the existing ones, I created the Binn library to make our lives easier.

Here is an example of using it:

Bernardo Ramos


I would say definitely don't try to implement serialization yourself. It's been done a zillion times and you should use an existing solution. e.g. protobufs: https://github.com/protobuf-c/protobuf-c


It also has the advantage of being compatible with many other programming languages.

Assaf Lavie

It would help if we knew what the protocol constraints are, but in general your options are really pretty limited. If the data are such that you can make a union of a byte array of sizeof(struct) for each struct, it might simplify things, but from your description it sounds like you have a more essential problem: if you're transferring pointers (you mention void * data), then those pointers are very unlikely to be valid on the receiving machine. Why would the data happen to appear at the same place in memory?
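
To make the pointer problem concrete: the usual way around it is to write the bytes the pointer refers to, together with their length, and never the pointer value itself. The sketch below assumes a hypothetical struct Blob and a 4-byte big-endian length prefix; the real protocol may prescribe something different.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct with a pointer member. */
struct Blob {
    uint32_t len;      /* number of bytes that data points to */
    const void *data;  /* only meaningful inside this process */
};

/* Write a 4-byte big-endian length followed by the pointed-to bytes.
 * The pointer value itself is never written. Returns the number of
 * bytes written, or 0 if `cap` is too small. */
size_t serialize_Blob(const struct Blob *blob, unsigned char *out, size_t cap)
{
    if (cap < 4 || cap - 4 < blob->len)
        return 0;
    out[0] = (unsigned char)(blob->len >> 24);
    out[1] = (unsigned char)(blob->len >> 16);
    out[2] = (unsigned char)(blob->len >> 8);
    out[3] = (unsigned char)blob->len;
    if (blob->len > 0)
        memcpy(out + 4, blob->data, blob->len);
    return 4 + (size_t)blob->len;
}
```

The receiver reads the length, allocates its own storage, and ends up with a pointer that is valid in its own address space.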

Charlie Martin
