These notes form the basis for an as-yet-unwritten "String Datatypes" UG section.
This outline is more detailed and attempts to be more comprehensive than the subsequent text.

Outline:
  1. Overview
  2. Strings
  3. String versus char base datatypes
  4. Creating string datatypes
  5. Creating fixed-length string datatypes
  6. Creating variable-length string datatypes
    1. Value and strengths
    2. Limitations
    3. An alternative approach
  7. Constrained variable-length strings
    1. Constrained, or faux or spoofed, variable-length strings
  8. Character encodings
    1. ASCII, UTF-8, H5Pset_char_encoding (objects), H5Tset_cset (datatypes)
    2. Fixed-length ASCII strings easy
    3. Fixed-length UTF-8 strings not so easy
      1. Inherently variable-length, with 1 to 4 bytes per character
      2. In the general case, or when the nature, range, or constraints of the strings in a dataset are not well characterized, they can be most safely stored in a variable-length string datatype.
      3. But circumstances may enable use of constrained variable-length strings stored in a dataset of fixed-length datatype.
      4. For example, if it is known that the strings in a dataset will always be Hebrew text, one can be confident that a few characters will be 1 byte, most will be 2 bytes, and none will be more than 2 bytes. In this case, a dataset of 10-character strings could be safely stored with NULL terminators in a 21-byte fixed-length string datatype. A dataset of 10- to 30-character strings could be safely stored with NULL terminators in a 61-byte (or 64-byte) fixed-length string datatype. Both would be examples of constrained variable-length datatypes.
      5. This has the potential to substantially boost I/O and/or storage efficiency over the use of a variable-length string datatype.
      6. http://www.utf8-chartable.de/unicode-utf8-table.pl
        http://en.wikipedia.org/wiki/UTF-8
        http://en.wikipedia.org/wiki/Unicode
  9. H5Tvlen_create creates something quite different

Raw and uncorrected text follows.
_topic/create_vlen_strings.htm contains more carefully developed text, but only for one aspect of this.



 

Creating variable-length string datatypes
A heavily revised version of this section (see _topic/create_vlen_strings.htm) is included via PHP on the H5T RM page.
As the term implies, variable-length strings are strings of varying lengths. Real variable-length strings can be arbitrarily long, anywhere from 1 character to thousands of characters long. These are what HDF5 calls variable-length strings and, for the sake of discussion, we'll call them unconstrained variable-length strings in this article.

But there is also a subclass of variable-length strings that vary within a well-defined range. For example, a set of strings might be known to always be between 5 and 20 characters long. In this article, we will call this subclass constrained variable-length strings. From HDF5’s point of view, these are actually just fixed-length strings that may happen to be shorter in length than the assigned datatype. Think of them as faux variable-length strings; we'll discuss them in more detail shortly.

Before we start creating strings, let’s look at string and character datatypes for a minute. HDF5 provides the following predefined datatypes that are relevant to this discussion, one string datatype and three character datatypes:

    H5T_C_S1
    H5T_NATIVE_CHAR
    H5T_NATIVE_SCHAR
    H5T_NATIVE_UCHAR
    
The character datatypes, H5T_NATIVE_CHAR, H5T_NATIVE_SCHAR, and H5T_NATIVE_UCHAR, are single-character datatypes; a data element of one of these datatypes always contains exactly one character. HDF5 treats them as 1-byte integer datatypes, not strings, so they are unsuitable as the base for a string datatype.

The string datatype, H5T_C_S1 for C and H5T_FORTRAN_S1 for Fortran, defaults to one character in size but can be resized to any length. These types are therefore the base types for any fixed-length or variable-length string datatype.

Creating unconstrained (or real) variable-length string datatypes:
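The following HDF5 calls create a variable-length string datatype, vls_type_id, by copying the predefined string datatype and setting its size to H5T_VARIABLE:

    vls_type_id = H5Tcopy(H5T_C_S1)
    status      = H5Tset_size(vls_type_id, H5T_VARIABLE)            (call 1)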
    
Strings of type vls_type_id can be of arbitrary length.

In a C environment, these strings will always be NULL-terminated, so the buffer to hold such a string in memory must be one byte larger than the string itself to accommodate the NULL terminator.

Under the covers, variable-length strings are stored in a heap, which can present challenges for efficient storage and read/write access.

The next section discusses a different approach which may be useful in situations where it is known that the string length in a dataset will vary within known bounds.

Creating datatypes for constrained (or faux) variable-length strings:
To avoid the storage and I/O overhead associated with heaps, it will sometimes be useful to take a different approach when it is known that the string length in a dataset will always fall within known bounds.

Consider the example of a dataset containing one million strings that you know will range from 5 to 20 bytes in length. The following HDF5 calls create a string datatype for strings up to 20 bytes.

    to20B_type_id = H5Tcopy(H5T_C_S1)
    status        = H5Tset_size(to20B_type_id, 20)                  (call 2)
    
If a particular data element is just a 5-byte string, store it as the 5 bytes followed by a NULL terminator (6 meaningful bytes). The element still occupies the full 20 bytes in the memory buffer and in the file, and the unused bytes are best filled with NULLs. When the data is read back in a C environment, the NULL terminator marks the end of the string, so the shorter string is interpreted transparently and correctly.

Note that variable-length strings stored in this manner must always be NULL-terminated unless they exactly fill the full datatype space (exactly 20 bytes in this case). Failure to include the NULL-terminator will result in either misinterpreted data or undefined values.
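
The NULL-termination rule above can be sketched in plain C, without HDF5. The helper names pack_element and element_length are hypothetical; they only illustrate how one element of a fixed-width buffer might be filled and later measured.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy s into one fixed-width element of `width` bytes.
 * A shorter string is followed by NULL bytes through the end of the element;
 * a string of exactly `width` bytes fills the element with no terminator.
 * Returns 0 on success, -1 if s does not fit (elem is then left untouched). */
int pack_element(char *elem, size_t width, const char *s)
{
    size_t len = strlen(s);
    if (len > width)
        return -1;
    memset(elem, 0, width);
    memcpy(elem, s, len);
    return 0;
}

/* Hypothetical helper: length of the string stored in one element,
 * stopping at the first NULL byte or at `width`, whichever comes first. */
size_t element_length(const char *elem, size_t width)
{
    const char *nul = memchr(elem, 0, width);
    return nul ? (size_t)(nul - elem) : width;
}
```

Zero-filling the whole element, not just the byte after the string, keeps stale buffer contents out of the file and makes the stored bytes reproducible.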

Strings in this dataset can be of any length up to 20 bytes, giving you essentially a constrained variable-length string. But since everything is handled within a fixed-length datatype, you receive all the benefits of HDF5’s highly efficient sequential I/O without the overhead of extracting data from a heap.

If this datatype were defined as in call 1 and the million-element dataset were fully populated, reading the entire dataset would require HDF5, under the covers, to issue up to 2 million seeks and reads to pluck the data elements 1-by-1 from the heap. Using this faux variable-length datatype, HDF5 can read the entire dataset with a couple of seeks and reads.

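Note that chunking and compression work to full effect on this dataset, because the string data itself is stored in the dataset's chunks. For a dataset of unconstrained variable-length strings, the string data lives in a heap and only references to it are stored in the dataset, so chunking and filters do not reach the strings themselves.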

Creating fixed-length string datatypes:
Relative to any form of variable-length string datatype, fixed-length string datatypes are straightforward. The following HDF5 calls create a fixed-length, 30-byte string datatype:

    fls30_type_id = H5Tcopy(H5T_C_S1)
    status        = H5Tset_size(fls30_type_id, 30)
    
This datatype can be used for 30-character ASCII strings without any need for NULL terminators or any other special handling.

[ Consider a note regarding the accommodations necessary to handle fixed-length UTF-8 strings. ]
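
As a starting point for that note, the sizing arithmetic from the Hebrew example in the outline can be sketched in plain C. The function names are hypothetical, and the character counter assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count the characters (code points) in a UTF-8 string by counting every
 * byte that is not a continuation byte. Continuation bytes have the bit
 * pattern 10xxxxxx. Assumes s is valid, NULL-terminated UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Bytes needed for a NULL-terminated fixed-length element that must hold
 * up to max_chars characters of at most max_bytes_per_char bytes each. */
size_t fixed_size_for(size_t max_chars, size_t max_bytes_per_char)
{
    return max_chars * max_bytes_per_char + 1;
}
```

With at most 2 bytes per character, fixed_size_for(10, 2) gives the 21 bytes and fixed_size_for(30, 2) the 61 bytes quoted in the outline. A datatype sized this way can be tagged as UTF-8 with H5Tset_cset(type_id, H5T_CSET_UTF8).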

 


The function H5Tvlen_create does not create variable-length strings
While it is tempting to try to create a variable-length string datatype with H5Tvlen_create, that function actually creates a fundamentally different datatype object.

H5Tvlen_create creates a variable-length sequence datatype: each data element is a one-dimensional array, of varying size, whose array elements are of the base datatype. Consider the following examples:

    vl_char_type_id       = H5Tvlen_create(H5T_NATIVE_CHAR) 
This call creates a datatype that holds a variable-size, one-dimensional array of data elements; each element is of the H5T_NATIVE_CHAR base datatype.

    str12_type_id    = H5Tcopy(H5T_C_S1)
    status           = H5Tset_size(str12_type_id, 12)
    vl_str12_type_id = H5Tvlen_create(str12_type_id)
This set of calls creates a datatype that holds a variable-size, one-dimensional array of 12-byte strings.

    vl_f32_type_id        = H5Tvlen_create(H5T_IEEE_F32BE)
The above call creates a datatype that holds a variable-size, one-dimensional array of IEEE big-endian 32-bit floats.
Last modified: 29 August 2012