Here, we use the term 'serialization' to mean the reversible deconstruction of an arbitrary set of C data structures into a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context. It is often necessary to send or receive complex data structures to or from another program that may run on a different architecture or may have been designed for a different version of the data structures in question. A typical example is a program that saves its state to a file on exit and then reads it back when started.
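As a minimal sketch of such a reversible round trip, the following C code writes a small structure field by field in a fixed (big-endian) byte order and reads it back. The struct state type, its fields, and the function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical program state; the fields are assumptions for illustration. */
struct state {
    uint32_t score;
    uint16_t level;
};

enum { STATE_WIRE_SIZE = 6 }; /* 4 bytes for score + 2 bytes for level */

/* Write the fields one by one, most significant byte first, so the byte
 * sequence does not depend on the host's endianness or struct padding.
 * Returns the number of bytes written, or 0 if `cap` is too small. */
size_t state_serialize(const struct state *s, uint8_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);
    if (cap < STATE_WIRE_SIZE)
        return 0;
    out[0] = (uint8_t)(s->score >> 24);
    out[1] = (uint8_t)(s->score >> 16);
    out[2] = (uint8_t)(s->score >> 8);
    out[3] = (uint8_t)(s->score);
    out[4] = (uint8_t)(s->level >> 8);
    out[5] = (uint8_t)(s->level);
    return STATE_WIRE_SIZE;
}

/* Rebuild an equivalent structure from the byte sequence.
 * Returns 0 on success, -1 if the input is truncated. */
int state_deserialize(struct state *s, const uint8_t *in, size_t len)
{
    assert(s != NULL && in != NULL);
    if (len < STATE_WIRE_SIZE)
        return -1;
    s->score = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
               ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    s->level = (uint16_t)(((uint16_t)in[4] << 8) | in[5]);
    return 0;
}
```

Because every byte position is spelled out, a reader on a different architecture reconstructs the same values regardless of its native byte order.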
This is a comparison of data serialization formats.
Name | Creator-maintainer | Based on | Standardized? | Specification | Binary? | Human-readable? | Supports references?[e] | Schema-IDL? | Standard APIs | Supports zero-copy operations |
---|---|---|---|---|---|---|---|---|---|---|
Apache Avro | Apache Software Foundation | N/A | No | Apache Avro™ 1.8.1 Specification | Yes | No | N/A | Yes (built-in) | N/A | N/A |
Apache Parquet | Apache Software Foundation | N/A | No | | Yes | No | No | N/A | Java, Python | No |
ASN.1 | ISO, IEC, ITU-T | N/A | Yes | ISO/IEC 8824; X.680 series of ITU-T Recommendations | Yes (BER, DER, PER, OER, or custom via ECN) | Yes (XER, JER, GSER, or custom via ECN) | Partial[f] | Yes (built-in) | N/A | Yes (OER) |
Bencode | Bram Cohen (creator), BitTorrent, Inc. (maintainer) | N/A | De facto standard via BitTorrent Enhancement Proposal (BEP) | Part of BitTorrent protocol specification | Partially (numbers and delimiters are ASCII) | No | No | No | No | N/A |
Binn | Bernardo Ramos | N/A | No | Binn Specification | Yes | No | No | No | No | Yes |
BSON | MongoDB | JSON | No | BSON Specification | Yes | No | No | No | No | N/A |
CBOR | Carsten Bormann, P. Hoffman | JSON (loosely) | Yes | RFC 7049 | Yes | No | Yes, through tagging | Yes (CDDL) | No | Yes |
Comma-separated values (CSV) | RFC author: Yakov Shafranovich | N/A | Partial (myriad informal variants used) | RFC 4180 (among others) | No | Yes | No | No | No | No |
Common Data Representation (CDR) | Object Management Group | N/A | Yes | General Inter-ORB Protocol | Yes | No | Yes | Yes | ADA, C, C++, Java, Cobol, Lisp, Python, Ruby, Smalltalk | N/A |
D-Bus Message Protocol | freedesktop.org | N/A | Yes | D-Bus Specification | Yes | No | No | Partial (Signature strings) | Yes (see D-Bus) | N/A |
Efficient XML Interchange (EXI) | W3C | XML, Efficient XML | Yes | Efficient XML Interchange (EXI) Format 1.0 | Yes | Yes (XML) | Yes (XPointer, XPath) | Yes (XML Schema) | Yes (DOM, SAX, StAX, XQuery, XPath) | N/A |
FlatBuffers | Google | N/A | No | | Yes | Yes (Apache Arrow) | Partial (internal to the buffer) | Yes [2] | C++, Java, C#, Go, Python, Rust, JavaScript, PHP, C, Dart, Lua, TypeScript | Yes |
Fast Infoset | ISO, IEC, ITU-T | XML | Yes | ITU-T X.891 and ISO/IEC 24824-1:2007 | Yes | No | Yes (XPointer, XPath) | Yes (XML schema) | Yes (DOM, SAX, XQuery, XPath) | N/A |
FHIR | Health Level 7 | REST basics | Yes | Fast Healthcare Interoperability Resources | Yes | Yes | Yes | Yes | Hapi for FHIR[1]; JSON, XML, Turtle | No |
Ion | Amazon | JSON | No | The Amazon Ion Specification | Yes | Yes | No | No | No | N/A |
Java serialization | Oracle Corporation | N/A | Yes | Java Object Serialization | Yes | No | Yes | No | Yes | N/A |
JSON | Douglas Crockford | JavaScript syntax | Yes | STD 90/RFC 8259 (ancillary: RFC 6901, RFC 6902), ECMA-404, ISO/IEC 21778:2017 | No, but see BSON, Smile, UBJSON | Yes | Yes (JSON Pointer (RFC 6901); alternately: JSONPath, JPath, JSPON, json:select()), JSON-LD | Partial (JSON Schema Proposal, ASN.1 with JER, Kwalify, Rx, Itemscript Schema), JSON-LD | Partial (Clarinet, JSONQuery, JSONPath), JSON-LD | No |
MessagePack | Sadayuki Furuhashi | JSON (loosely) | No | MessagePack format specification | Yes | No | No | No | No | Yes |
Netstrings | Dan Bernstein | N/A | No | netstrings.txt | Yes | Yes | No | No | No | Yes |
OGDL | Rolf Veen | ? | No | Specification | Yes (Binary Specification) | Yes | Yes (Path Specification) | Yes (Schema WD) | | N/A |
OPC-UA Binary | OPC Foundation | N/A | No | opcfoundation.org | Yes | No | Yes | No | No | N/A |
OpenDDL | Eric Lengyel | C, PHP | No | OpenDDL.org | No | Yes | Yes | No | Yes (OpenDDL Library) | N/A |
Pickle (Python) | Guido van Rossum | Python | De facto standard via Python Enhancement Proposals (PEPs) | [3] PEP 3154 -- Pickle protocol version 4 | Yes | No | No | No | Yes ([4]) | No |
Property list | NeXT (creator), Apple (maintainer) | ? | Partial | Public DTD for XML format | Yes[a] | Yes[b] | No | ? | Cocoa, CoreFoundation, OpenStep, GnuStep | No |
Protocol Buffers (protobuf) | Google | N/A | No | Developer Guide: Encoding | Yes | Partial[d] | No | Yes (built-in) | C++, C#, Java, Python, Javascript, Go | No |
S-expressions | John McCarthy (original), Ron Rivest (internet draft) | Lisp, Netstrings | Partial (largely de facto) | | Yes ('Canonical representation') | Yes ('Advanced transport representation') | No | No | | N/A |
Smile | Tatu Saloranta | JSON | No | Smile Format Specification | Yes | No | No | Partial (JSON Schema Proposal, other JSON schemas/IDLs) | Partial (via JSON APIs implemented with Smile backend, on Jackson, Python) | N/A |
SOAP | W3C | XML | Yes | W3C Recommendations: SOAP/1.1, SOAP/1.2 | Partial (Efficient XML Interchange, Binary XML, Fast Infoset, MTOM, XSD base64 data) | Yes | Yes (built-in id/ref, XPointer, XPath) | Yes (WSDL, XML schema) | Yes (DOM, SAX, XQuery, XPath) | N/A |
Structured Data eXchange Formats | Max Wildgrube | N/A | Yes | RFC 3072 | Yes | No | No | No | | N/A |
Thrift | Facebook (creator), Apache (maintainer) | N/A | No | Original whitepaper | Yes | Partial[c] | No | Yes (built-in) | | N/A |
UBJSON | The Buzz Media, LLC | JSON, BSON | No | [5] | Yes | No | No | No | No | N/A |
eXternal Data Representation (XDR) | Sun Microsystems (creator), IETF (maintainer) | N/A | Yes | STD 67/RFC 4506 | Yes | No | Yes | Yes | Yes | N/A |
XML | W3C | SGML | Yes | W3C Recommendations: 1.0 (Fifth Edition), 1.1 (Second Edition) | Partial (Efficient XML Interchange, Binary XML, Fast Infoset, XSD base64 data) | Yes | Yes (XPointer, XPath) | Yes (XML schema, RELAX NG) | Yes (DOM, SAX, XQuery, XPath) | N/A |
XML-RPC | Dave Winer[2] | XML | No | XML-RPC Specification | No | Yes | No | No | No | N/A |
YAML | Clark Evans, Ingy döt Net, and Oren Ben-Kiki | C, Java, Perl, Python, Ruby, Email, HTML, MIME, URI, XML, SAX, SOAP, JSON[3] | No | Version 1.2 | No | Yes | Yes | Partial (Kwalify, Rx, built-in language type-defs) | No | N/A |
- a. The current default format is binary.
- b. The 'classic' format is plain text, and an XML format is also supported.
- c. Theoretically possible due to abstraction, but no implementation is included.
- d. The primary format is binary, but a text format is available.[4]
- e. Means that generic tools/libraries know how to encode, decode, and dereference a reference to another piece of data in the same document. A tool may require the IDL file, but no more. Excludes custom, non-standardized referencing techniques.
- f. ASN.1 does offer OIDs, a standard format for globally unique identifiers, as well as a standard notation ('absolute reference') for referencing a component of a value. Thus it would be possible to reference a component of an encoded value present in a document by combining an OID (assigned to the document) and an 'absolute reference' to the component of the value. However, there is no standard way to indicate that a field contains such an absolute reference. Therefore, a generic ASN.1 tool/library cannot automatically encode/decode/resolve references within a document without help from custom-written program code.
- g. VelocyPack offers a value type to store pointers to other VPack items. It is allowed if the VPack data resides in memory, but not if stored on disk or sent over a network.
- h. The primary format is binary, but a text format is available.[5][6]
- i. The primary format is binary, but text and JSON formats are available.[7]
Syntax comparison of human-readable formats
Format | Null | Boolean true | Boolean false | Integer | Floating-point | String | Array | Associative array/Object |
---|---|---|---|---|---|---|---|---|
ASN.1 (XML Encoding Rules) | <foo /> | <foo>true</foo> | <foo>false</foo> | <foo>685230</foo> | <foo>6.8523015e+5</foo> | <foo>A to Z</foo> | | An object (the key is a field name) or a data mapping (the key is a data value) |
CSV[b] | null[a] (or an empty element in the row)[a] | 1[a] true[a] | 0[a] false[a] | 685230 -685230[a] | 6.8523015e+5[a] | A to Z 'We said, 'no'.' | true,-42.1e7,'A to Z' | |
Ion | | true | false | 685230 -685230 0xA74AE 0b10100111010010101110 | 6.8523015e5 | 'A to Z' '' | | |
Netstrings[c] | 0:,[a] 4:null,[a] | 1:1,[a] 4:true,[a] | 1:0,[a] 5:false,[a] | 6:685230,[a] | 9:6.8523e+5,[a] | 6:A to Z, | 29:4:true,0:,7:-42.1e7,6:A to Z, | 41:9:2:42,1:1,25:6:A to Z,12:1:1,1:2,1:3,[a] |
JSON | null | true | false | 685230 -685230 | 6.8523015e+5 | "A to Z" | | |
OGDL[verification needed] | null[a] | true[a] | false[a] | 685230[a] | 6.8523015e+5[a] | 'A to Z' 'A to Z' NoSpaces | | |
OpenDDL | ref {null} | bool {true} | bool {false} | int32 {685230} int32 {0xA74AE} int32 {0b10100111010010101110} | float {6.8523015e+5} | string {'A to Z'} | Homogeneous array or heterogeneous array | |
Pickle (Python) | N. | I01\n. | I00\n. | I685230\n. | F685230.15\n. | S'A to Z'\n. | (lI01\na(laF-421000000.0\naS'A to Z'\na. | (dI42\nI01\nsS'A to Z'\n(lI1\naI2\naI3\nas. |
Property list (plain text format)[8] | N/A | <*BY> | <*BN> | <*I685230> | <*R6.8523015e+5> | 'A to Z' | ( <*BY>, <*R-42.1e7>, 'A to Z' ) | |
Property list (XML format)[9][10] | N/A | <true /> | <false /> | <integer>685230</integer> | <real>6.8523015e+5</real> | <string>A to Z</string> | | |
Protocol Buffers | N/A | true | false | 685230 -685230 | 20.0855369 | 'A to Z' | | |
S-expressions | NIL nil | T #t[f] true | NIL #f[f] false | 685230 | 6.8523015e+5 | abc 'abc' #616263# 3:abc {MzphYmM=} \|YWJj\| | (T NIL -42.1e7 'A to Z') | ((42 T) ('A to Z' (1 2 3))) |
YAML | ~ null Null NULL [11] | y Y yes Yes YES on On ON true True TRUE [12] | n N no No NO off Off OFF false False FALSE [12] | 685230 +685_230 -685230 02472256 0x_0A_74_AE 0b1010_0111_0100_1010_1110 190:20:30 [13] | 6.8523015e+5 685.230_15e+03 685_230.15 190:20:30.15 .inf -.inf .Inf .INF .NaN .nan .NAN [14] | A to Z 'A to Z' 'A to Z' | [y, ~, -42.1e7, 'A to Z'] | {'John':3.14, 'Jane':2.718} |
XML[e] and SOAP | <null />[a] | true | false | 685230 | 6.8523015e+5 | A to Z | | |
XML-RPC | | <value><boolean>1</boolean></value> | <value><boolean>0</boolean></value> | <value><int>685230</int></value> | <value><double>6.8523015e+5</double></value> | <value><string>A to Z</string></value> | | |
- a. Omitted XML elements are commonly decoded by XML data binding tools as NULLs. Shown here is another possible encoding; XML schema does not define an encoding for this datatype.
- b. The RFC CSV specification only deals with delimiters, newlines, and quote characters; it does not directly deal with serializing programming data structures.
- c. The netstrings specification only deals with nested byte strings; anything else is outside the scope of the specification.
- d. PHP will unserialize any floating-point number correctly, but will serialize them to their full decimal expansion. For example, 3.14 will be serialized to 3.140000000000000124344978758017532527446746826171875.
- e. XML data bindings and SOAP serialization tools provide type-safe XML serialization of programming data structures into XML. Shown are XML values that can be placed in XML elements and attributes.
- f. This syntax is not compatible with the Internet-Draft, but is used by some dialects of Lisp.
Comparison of binary formats
Format | Null | Booleans | Integer | Floating-point | String | Array | Associative array/Object |
---|---|---|---|---|---|---|---|
ASN.1 (BER, PER or OER encoding) | NULL type | BOOLEAN | INTEGER | REAL: base-10 real values are represented as character strings in ISO 6093 format; binary real values are represented in a binary format that includes the mantissa, the base (2, 8, or 16), and the exponent; the special values NaN, -INF, +INF, and negative zero are also supported | Multiple valid types (VisibleString, PrintableString, GeneralString, UniversalString, UTF8String) | data specifications SET OF (unordered) and SEQUENCE OF (guaranteed order) | user definable type |
Binn | \x00 | True: \x01 False: \x02 | big-endian 2's complement signed and unsigned 8/16/32/64 bits | single: big-endian binary32; double: big-endian binary64 | UTF-8 encoded, null terminated, preceded by int8 or int32 string length in bytes | Typecode (one byte) + 1-4 bytes size + 1-4 bytes items count + list items | Typecode (one byte) + 1-4 bytes size + 1-4 bytes items count + key/value pairs |
BSON | Null type – 0 bytes for value | True: one byte \x01 False: \x00 | int32: 32-bit little-endian 2's complement or int64: 64-bit little-endian 2's complement | double: little-endian binary64 | UTF-8 encoded, preceded by int32 encoded string length in bytes | BSON embedded document with numeric keys | BSON embedded document |
Concise Binary Object Representation (CBOR) | \xf6 | True: \xf5 False: \xf4 | Small positive number \x00-\x17, small negative number \x20-\x37 (abs(N) <= 23) 8-bit: positive | Typecode (one byte) + IEEE half/single/double | Typecode with length (like integer coding) and content. Bytestring and UTF-8 have different typecode | Typecode with count (like integer coding) and items | Typecode with pairs count (like integer coding) and pairs |
Efficient XML Interchange (EXI) | xsi:nil element (1-4 bits depending on context) | 1 bit. | 0–12 bits (log2 range) for integers with defined ranges less than 4096. Extensible sequence of octets with infinite range for larger or undefined ranges. Also supports custom representations. | Scalable floating point representation requiring 18 to 88 bits depending on magnitude. Also supports IEEE and custom representations. | Length prefixed sequence of Unicode code points with partitioned string tables for efficient representation of repeated items. The length and code points are represented as variable length unsigned integers where values under 128 require 1 octet each. Also supports custom representations. | Repeated elements or length-prefixed list of values. Also supports custom representations. | Ordered (sequence) or unordered (all) group of named elements. |
FlatBuffers | Encoded as absence of field in parent object | True: one byte \x01 False: \x00 | little-endian 2's complement signed and unsigned 8/16/32/64 bits | floats: little-endian binary32; doubles: little-endian binary64 | UTF-8 encoded, preceded by 32 bit integer length of string in bytes | Vectors of any other type, preceded by 32 bit integer length of number of elements | Tables (schema defined types) or Vectors sorted by key (maps / dictionaries) |
MessagePack | \xc0 | True: \xc3 False: \xc2 | Single byte 'fixnum' (values -32..127) or typecode (one byte) + big-endian (u)int8/16/32/64 | Typecode (one byte) + IEEE single/double | Typecode + up to 15 bytes or typecode + length as uint8/16/32 + bytes; encoding is unspecified[15] | As 'fixarray' (single-byte prefix + up to 15 array items) or typecode (one byte) + 2–4 bytes length + array items | As 'fixmap' (single-byte prefix + up to 15 key-value pairs) or typecode (one byte) + 2–4 bytes length + key-value pairs |
Netstrings | 0:, | True: 1:1, False: | | | | | |
OGDL Binary | | | | | | | |
Property list (binary format) | | | | | | | |
Protocol Buffers | | | Variable encoding length signed 32-bit: varint encoding of 'ZigZag'-encoded value (n << 1) XOR (n >> 31); variable encoding length signed 64-bit: varint encoding of 'ZigZag'-encoded | floats: little-endian binary32; doubles: little-endian binary64 | UTF-8 encoded, preceded by varint-encoded integer length of string in bytes | Repeated value with the same tag | N/A |
Smile | \x21 | True: \x23 False: \x22 | Single byte 'small' (values -16..15 encoded using \xc0 - \xdf), zigzag-encoded | IEEE single/double, BigDecimal | Length-prefixed 'short' Strings (up to 64 bytes), marker-terminated 'long' Strings and (optional) back-references | Arbitrary-length heterogeneous arrays with end-marker | Arbitrary-length key/value pairs with end-marker |
Structured Data eXchange Formats (SDXF) | | | big-endian signed 24-bit or 32-bit integer | big-endian IEEE double | either UTF-8 or ISO 8859-1 encoded | list of elements with identical ID and size, preceded by array header with int16 length | chunks can contain other chunks to arbitrary depth |
Thrift | | | | | | | |
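The Protocol Buffers integer encoding listed in the table can be sketched in C as follows. This is an illustrative sketch, not code from the protobuf library, and the function names are hypothetical. ZigZag maps signed values to unsigned ones so that small magnitudes stay small; the varint step then emits 7 bits per byte, least significant group first, with the high bit set on every byte except the last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ZigZag-encode a signed 32-bit value: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
 * Same result as (n << 1) XOR (n >> 31) with an arithmetic right shift, but
 * written without shifting negative values (left-shifting a negative value
 * is undefined and right-shifting one is implementation-defined in C). */
uint32_t zigzag32(int32_t n)
{
    uint32_t u = (uint32_t)n;
    return (u << 1) ^ (n < 0 ? UINT32_MAX : 0u);
}

/* Inverse of zigzag32. */
int32_t unzigzag32(uint32_t z)
{
    return (int32_t)((z >> 1) ^ (0u - (z & 1u)));
}

/* Write v as a base-128 varint into out (up to 5 bytes for 32 bits).
 * Returns the number of bytes written. */
size_t varint32_encode(uint32_t v, uint8_t *out)
{
    size_t i = 0;
    assert(out != NULL);
    while (v >= 0x80u) {
        out[i++] = (uint8_t)(v | 0x80u); /* low 7 bits + continuation bit */
        v >>= 7;
    }
    out[i++] = (uint8_t)v; /* final byte, continuation bit clear */
    return i;
}
```

A signed field would be written as varint32_encode(zigzag32(n), out), so that -1 occupies one byte instead of the maximum length.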
Any XML-based representation can be compressed with, or generated directly as, EXI (Efficient XML Interchange), which is a 'schema-informed' (as opposed to schema-required or schema-less) binary compression standard for XML.
References
- [1] 'HAPI FHIR - The Open Source FHIR API for Java'. hapifhir.io.
- [2] 'A Brief History of SOAP'. www.xml.com.
- [3] Ben-Kiki, Oren; Evans, Clark; Net, Ingy döt (2009-10-01). 'YAML Ain't Markup Language (YAML) Version 1.2'. The Official YAML Web Site. Retrieved 2012-02-10.
- [4] 'text_format.h - Protocol Buffers'. Google Developers.
- [5] 'Cap'n Proto serialization/RPC system: core tools and C++ library - capnproto/capnproto'. 2 April 2019 – via GitHub.
- [6] 'Cap'n Proto: The capnp Tool'. capnproto.org.
- [7] 'Fast Binary Encoding is ultra fast and universal serialization solution for C++, C#, Go, Java, JavaScript, Kotlin, Python, Ruby: chronoxor/FastBinaryEncoding'. 2 April 2019 – via GitHub.
- [8] 'NSPropertyListSerialization class documentation'. www.gnustep.org.
- [9] 'Documentation Archive'. developer.apple.com.
- [10] 'Documentation Archive'. developer.apple.com.
- [11] Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-01-18). 'Null Language-Independent Type for YAML Version 1.1'. YAML.org. Retrieved 2009-09-12.
- [12] Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-01-18). 'Boolean Language-Independent Type for YAML Version 1.1'. YAML.org. Clark C. Evans. Retrieved 2009-09-12.
- [13] Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-02-11). 'Integer Language-Independent Type for YAML Version 1.1'. YAML.org. Clark C. Evans. Retrieved 2009-09-12.
- [14] Oren Ben-Kiki; Clark Evans; Brian Ingerson (2005-01-18). 'Floating-Point Language-Independent Type for YAML Version 1.1'. YAML.org. Clark C. Evans. Retrieved 2009-09-12.
- [15] 'MessagePack is an extremely efficient object serialization library. It's like JSON, but very fast and small.: msgpack/msgpack'. 2 April 2019 – via GitHub.
I'm writing some code to serialize some data to send it over the network. Currently, I use this primitive procedure:
- create a void* buffer
- apply any byte ordering operations such as the hton family on the data I want to send over the network
- use memcpy to copy the memory into the buffer
- send the memory over the network
The problem is that with various data structures (which often contain void* data so you don't know whether you need to care about byte ordering) the code becomes really bloated with serialization code that's very specific to each data structure and can't be reused at all.
What are some good serialization techniques for C that make this easier / less ugly?
Note: I'm bound to a specific protocol so I cannot freely choose how to serialize my data.
ryyst
4 Answers
For each data structure, have a serialize_X function (where X is the struct name) which takes a pointer to an X and a pointer to an opaque buffer structure and calls the appropriate serializing functions. You should supply some primitives such as serialize_int which write to the buffer and update the output index.
The primitives will have to call something like reserve_space(N), where N is the number of bytes that are required, before writing any data. reserve_space() will realloc the void* buffer to make it at least as big as its current size plus N bytes.
To make this possible, the buffer structure will need to contain a pointer to the actual data, the index to write the next byte to (output index) and the size that is allocated for the data.
With this system, all of your serialize_X functions should be pretty straightforward, for example:
And the framework code will be something like:
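A framework matching this description could look like the following sketch. It is not the answer author's code: the names new_buffer, reserve_space, and free_buffer, the initial size, and the doubling growth policy are assumptions. It also uses size_t for the output index and reports allocation failure to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define INITIAL_SIZE 32 /* assumed starting capacity in bytes */

struct Buffer {
    void  *data; /* start of the allocated block */
    size_t next; /* output index: where the next byte will be written */
    size_t size; /* number of bytes currently allocated */
};

/* Allocate an empty buffer; returns NULL on allocation failure. */
struct Buffer *new_buffer(void)
{
    struct Buffer *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->data = malloc(INITIAL_SIZE);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->next = 0;
    b->size = INITIAL_SIZE;
    return b;
}

/* Make sure at least `bytes` more bytes fit after the output index.
 * Returns 0 on success, -1 if the buffer could not be grown. */
int reserve_space(struct Buffer *b, size_t bytes)
{
    size_t needed, new_size;
    void *p;

    assert(b != NULL);
    if (bytes > SIZE_MAX - b->next)
        return -1; /* size computation would overflow */
    needed = b->next + bytes;
    if (needed <= b->size)
        return 0;
    new_size = b->size;
    while (new_size < needed) {
        if (new_size > SIZE_MAX / 2) {
            new_size = needed;
            break;
        }
        new_size *= 2; /* grow geometrically to keep appends cheap */
    }
    p = realloc(b->data, new_size);
    if (p == NULL)
        return -1; /* the old block is still valid and unchanged */
    b->data = p;
    b->size = new_size;
    return 0;
}

/* Release the buffer and its data. */
void free_buffer(struct Buffer *b)
{
    if (b != NULL) {
        free(b->data);
        free(b);
    }
}
```

Each primitive would call reserve_space before writing and then advance next by the number of bytes it wrote.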
From this, it should be pretty simple to implement all of the serialize_() functions you need.
EDIT: For example:
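A primitive and a struct-level function built on it might look like the sketch below. It is not the answer author's code: struct X and its fields are hypothetical, and a compact version of the buffer described above is repeated so the block compiles on its own. Integers are converted with htonl (POSIX arpa/inet.h); strings are written with a 32-bit length prefix, which is an assumed wire format rather than anything required by the question's protocol.

```c
#include <arpa/inet.h> /* htonl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct Buffer {
    void  *data; /* allocated block (may start as NULL) */
    size_t next; /* output index */
    size_t size; /* allocated size */
};

/* Grow the buffer so `bytes` more bytes fit; 0 on success, -1 on failure. */
static int reserve_space(struct Buffer *b, size_t bytes)
{
    if (b->next + bytes > b->size) {
        size_t new_size = (b->next + bytes) * 2;
        void *p = realloc(b->data, new_size);
        if (p == NULL)
            return -1;
        b->data = p;
        b->size = new_size;
    }
    return 0;
}

/* Primitive: append a 32-bit integer in network (big-endian) byte order. */
int serialize_int(int32_t x, struct Buffer *b)
{
    uint32_t wire = htonl((uint32_t)x);
    assert(b != NULL);
    if (reserve_space(b, sizeof wire) != 0)
        return -1;
    memcpy((char *)b->data + b->next, &wire, sizeof wire);
    b->next += sizeof wire;
    return 0;
}

/* Primitive: append a length-prefixed string (no terminating NUL). */
int serialize_string(const char *s, struct Buffer *b)
{
    size_t len = strlen(s);
    if (len > INT32_MAX || serialize_int((int32_t)len, b) != 0)
        return -1;
    if (reserve_space(b, len) != 0)
        return -1;
    memcpy((char *)b->data + b->next, s, len);
    b->next += len;
    return 0;
}

/* Hypothetical struct and its serializer, composed only of primitives. */
struct X {
    int32_t n, m;
    const char *string;
};

int serialize_X(const struct X *x, struct Buffer *out)
{
    if (serialize_int(x->n, out) != 0 || serialize_int(x->m, out) != 0)
        return -1;
    return serialize_string(x->string, out);
}
```

The byte-order handling lives only in serialize_int, so serialize_X and any other struct-level function stay free of it.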
EDIT: Also note that my code has some potential bugs. The size of the buffer array is stored in a size_t but the index is an int (I'm not sure if size_t is considered a reasonable type for an index). Also, there is no provision for error handling and no function to free the Buffer after you're done, so you'll have to do this yourself. I was just giving a demonstration of the basic architecture that I would use.
I suggest using a library.
As I was not happy with the existing ones, I created the Binn library to make our lives easier.
Here is an example of using it:
I would say definitely don't try to implement serialization yourself. It's been done a zillion times and you should use an existing solution, e.g. protobufs: https://github.com/protobuf-c/protobuf-c
It also has the advantage of being compatible with many other programming languages.
Assaf Lavie
It would help if we knew what the protocol constraints are, but in general your options are really pretty limited. If the data are such that you can make a union of a byte array sizeof(struct) for each struct it might simplify things, but from your description it sounds like you have a more essential problem: if you're transferring pointers (you mention void * data) then those pointers are very unlikely to be valid on the receiving machine. Why would the data happen to appear at the same place in memory?
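To illustrate the pointer problem: instead of copying a pointer value into the message, the sender can write the bytes the pointer refers to, prefixed by their length, and the receiver can allocate its own copy. The sketch below assumes a hypothetical struct blob and a 32-bit big-endian length prefix; neither comes from the question's protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure that refers to its payload through a pointer. */
struct blob {
    void    *data; /* only meaningful inside the owning process */
    uint32_t len;  /* number of payload bytes */
};

/* Write the length, then the payload itself; the pointer is never sent.
 * Returns the number of bytes written, or 0 if `cap` is too small. */
size_t blob_serialize(const struct blob *b, uint8_t *out, size_t cap)
{
    assert(b != NULL && out != NULL);
    if (cap < 4 || cap - 4 < b->len)
        return 0;
    out[0] = (uint8_t)(b->len >> 24);
    out[1] = (uint8_t)(b->len >> 16);
    out[2] = (uint8_t)(b->len >> 8);
    out[3] = (uint8_t)(b->len);
    if (b->len > 0)
        memcpy(out + 4, b->data, b->len);
    return 4 + (size_t)b->len;
}

/* Rebuild the structure with freshly allocated memory on the receiver.
 * Returns 0 on success, -1 on malformed input or allocation failure. */
int blob_deserialize(struct blob *b, const uint8_t *in, size_t len)
{
    uint32_t n;
    assert(b != NULL && in != NULL);
    if (len < 4)
        return -1;
    n = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
        ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    if (len - 4 < n)
        return -1; /* declared length exceeds the available bytes */
    b->data = malloc(n > 0 ? n : 1);
    if (b->data == NULL)
        return -1;
    memcpy(b->data, in + 4, n);
    b->len = n;
    return 0;
}
```

The receiver ends up with an equivalent payload at whatever address its own allocator returns, which is why the sender's address never needs to be meaningful on the other side.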
Charlie Martin