DMA Architecture Sketch for MNCS:

Multinational Character Set Extensions to DMA 1.0

Authors: DMA Integration Subcommittee

Larry Bonham, Xerox Corporation

Dennis E. Hamilton, InfoNuovo

Hitoshi Tanaka, Hitachi Corporation, Ltd.

Revision History

Version 0.3 - 2000-02-24 Editorial Clean-Up (D. Hamilton)

Version 0.2 - 1999-06-09 Updated Sketch by Dennis Hamilton

Version 0.1 - 1999-06-08 Original Sketch by Dennis Hamilton

Content

1. Overview of the Sketch

2. What Is MNCS?

3. Why MNCS?

4. When Does MNCS Apply?

5. Where Does MNCS Apply?

6. Frequently-Asked Questions

7. References and Resources

1. Overview of the Sketch

Support for the MNCS extensions is specified in two separate proposals. The first proposal describes the impact on the binary interfaces standardized by DMA [MNCS-ext]. It determines all of the additional interfaces and the string formats that are used with them. The first proposal is complete in itself and is all that is required to implement the MNCS extensions.

The second proposal provides programming support for creating string-format-independent programs that can be compiled to operate with one MNCS string format or another simply by changing some macro definitions and #include dependencies. The recommended macros and their usage are defined in "DMA Architecture Proposal for MNCS String Manipulation Library" [MNCS-lib]. Use of the MNCS Library is entirely optional. It is implemented and provided so that developers can use a portable, common solution readily available to all of the DMA community.


2. What Is MNCS?

A new DmaString data type is introduced: DmaString_A. This data type provides for character strings that consist of counted sequences of 8-bit (octet) character-string elements.

DmaString_A operates on the same principle as DmaString, except that its elements are octets rather than 16-bit character-string elements. (Technically, DmaString is the DmaString_B format, but the _B suffix is never used in DMA.)

The new data type is introduced in the usual forms,

pDmaString_A,

ppDmaString_A,

and so on.  Basically, every DmaString form already defined is supplemented by an additional DmaString_A form.
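The difference between the two formats can be sketched in C. This is a minimal illustration, not the DMA definition: the type and function names (SketchOctetString, SketchWideString, and so on) are hypothetical, and the count-plus-pointer layout is an assumption chosen only to show that the formats share the counted-sequence principle and differ in element size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element types; the real DMA headers define their own. */
typedef uint8_t  SketchOctetElem;  /* 8-bit element, as for DmaString_A */
typedef uint16_t SketchWideElem;   /* 16-bit element, as for DmaString  */

/* Assumed layout: an element count plus a pointer to the elements. */
typedef struct {
    size_t count;
    const SketchOctetElem *elems;
} SketchOctetString;

typedef struct {
    size_t count;
    const SketchWideElem *elems;
} SketchWideString;

/* Storage occupied by the elements alone, in bytes. */
size_t sketch_octet_bytes(const SketchOctetString *s)
{
    assert(s != NULL);
    return s->count * sizeof(SketchOctetElem);
}

size_t sketch_wide_bytes(const SketchWideString *s)
{
    assert(s != NULL);
    return s->count * sizeof(SketchWideElem);
}
```

In this sketch a three-element string occupies three bytes of element storage in the octet form and six bytes in the 16-bit form, while the count is the same in both.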

There are also additional character-set encodings, many of them specific to DmaString_A usage. The additional character-set encoding names and numbers are:

Include table of new character-set encoding names and their codes here.

For convenience in programming, there is also a generic character-string data type, DmaString_T. It is provided in the DMA interface definitions and supporting files and is used when a program is designed to operate with any one of DmaString_A, DmaString, or any additional character-string format that may be introduced using the same principles. DmaString_T and the other macros used to simplify portability of programs between multinational string formats are discussed in a companion sketch. That material is kept separate because it is important to distinguish what affects DMA interfaces from what is simply an accommodation for the source program, with no impact on the proper interfacing between DMA-integrated components.

DmaString_T is not an actual data type that is ever seen at a DMA interface or delivered to a method. It is a "macro" that at any given time is actually implemented by DmaString_A, DmaString, or another string format.
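One way such a generic name might be bound at compile time is sketched below. The build switch SKETCH_MNCS_OCTET and all of the Sketch names are hypothetical; the actual macros are the ones defined in the companion library proposal [MNCS-lib].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  SketchElem_A;  /* stands in for the octet format  */
typedef uint16_t SketchElem_B;  /* stands in for the 16-bit format */

/* Hypothetical build switch: define SKETCH_MNCS_OCTET to compile the
   same source for the octet format; leave it undefined for 16-bit. */
#ifdef SKETCH_MNCS_OCTET
typedef SketchElem_A SketchElem_T;
#else
typedef SketchElem_B SketchElem_T;
#endif

/* Format-independent source: written once against the generic name
   and compiled once per string format. Counts how many elements of a
   counted sequence are equal to the wanted element. */
size_t sketch_count_elem(const SketchElem_T *elems, size_t count,
                         SketchElem_T wanted)
{
    size_t hits = 0;
    assert(elems != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (elems[i] == wanted) {
            ++hits;
        }
    }
    return hits;
}
```

The generic name exists only in the source program. After compilation, the function takes either octet or 16-bit elements, which is the sense in which DmaString_T is never seen at an interface.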

Consider whether any new result codes need to be defined for failure to obtain the right kind of character string support from a supplied object or any binding problems related to inconsistency in character string support.


3. Why MNCS?

Omitted from this draft.


4. When Does MNCS Apply?

For MNCS, there are two kinds of interface changes: additions to the integration-model interfaces and replication of standard DMA 1.0 COM interfaces.

We need to extract a definition (included in the glossary section) of what it means for two objects to be MNCS-compatible.

The IUnknown, IMalloc, and similar interfaces, whose methods have no DmaString parameters, are automatically accepted as belonging to all of the MNCS forms.

It is conceivable that an object could support more than one MNCS string format, but that would require the prevailing character-set encoding to be equally representable in each of the formats. Objects are permitted to support more than one MNCS format; they are simply never required to do so.


5. Where Does MNCS Apply?

Omitted from this draft.


6. Frequently-Asked Questions

Why Not Use pDmaString as a way to supply and receive pDmaString_A and other formats also?

pDmaString and ppDmaString must be strongly typed in order for distributed operation to occur.  That is, having pDmaString be the same as void* isn't acceptable for distributed operation.   Arrays of 8-bit elements don't "union" consistently into arrays of 16-bit elements across a network.  When the elements take more than one storage byte (as for 16-bit elements), it is necessary to match the potentially-different byte orderings on the sending and the receiving machines.  This cannot be done properly by the method-remoting mechanism unless the method used has a strongly-typed choice of string format.  It is important to use an explicit type that accurately reflects the understood element size as a container for a binary integer. 
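The byte-ordering point can be sketched in C. The function names are hypothetical, and the fixed big-endian wire order is an assumption made only for illustration; an actual method-remoting mechanism may use a different convention. What matters is that the conversion is only possible when the element size is known from the type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write 16-bit elements to the wire, high-order byte first
   (an assumed wire convention for this sketch). */
void sketch_put16(const uint16_t *elems, size_t count, uint8_t *wire)
{
    assert(elems != NULL && wire != NULL);
    for (size_t i = 0; i < count; ++i) {
        wire[2 * i]     = (uint8_t)(elems[i] >> 8);
        wire[2 * i + 1] = (uint8_t)(elems[i] & 0xFFu);
    }
}

/* Read 16-bit elements back; the result is the same on any host,
   whatever its native byte order. */
void sketch_get16(const uint8_t *wire, size_t count, uint16_t *elems)
{
    assert(wire != NULL && elems != NULL);
    for (size_t i = 0; i < count; ++i) {
        elems[i] = (uint16_t)(((unsigned)wire[2 * i] << 8) | wire[2 * i + 1]);
    }
}

/* What a receiver of the opposite byte order would see if the bytes
   of a 16-bit element were copied without regard to element size. */
uint16_t sketch_swap16(uint16_t v)
{
    return (uint16_t)(((unsigned)v << 8) | ((unsigned)v >> 8));
}
```

Octet elements need no such step, since each element is a single byte. A 16-bit element such as 0x0041 copied as untyped bytes to a machine of the opposite byte order would instead be read as 0x4100, which is why the remoting layer has to know which string format it is carrying.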

Couldn't you include data in the DmaString to communicate the string format and character-set encoding?

The original proposal for DmaString relied on such a concept. In the trial-use employment of DMA 0.90, we resorted to the current, simplified DmaString based on the Microsoft BSTR format used with COM. This allowed all trial-use components to work in Unicode, and it accommodates any other 16-bit character-set encodings that DMA applications might use.

A key advantage of a separate string type is that it is not necessary to inspect every string to see what format it is in. Being able to use a string without inspecting it, or any parameters carried with it, makes a big difference. Having to "check" every string moves exception processing to a very expensive place in what is ordinarily routine, straightforward use of text-formatted information.

The disadvantage of separate string types is that the "binding" to a string format occurs at a high level. Based on the assumption that most DMA requesters and providers will operate in a single string format for 99.99% of their usage around the planet, regardless of character-set encoding and language, this is not much of a limitation. Most people won't notice it. At the same time, the MNCS extension allows programs to operate in more than one format in one of two ways: (1) compile the same program once for each format to be supported, using a different copy per format, or (2) allow dynamic determination of formats and provide duplicated logic, as needed, in the implementation of object classes and templates within the application. The MNCS extension supports either approach and makes (1) as easy as it ever gets.

Well, why not just use BSTR (the DmaString implementation) forever?

That isn't going to work in the long run. There are always legacies to support, and eventually BSTR will be a legacy format. There is already a 32-bit Unicode, and its adoption will raise the need to come up with another format (let's call it DmaString_C for now). Also, there is already a large legacy of repositories and electronic documents that do not use a 16-bit character-set encoding. Most of these use some form of octet (8-bit) character-set encoding, and they need not be reconcilable with any current 16-bit encoding.

I thought that Unicode was the standard for DMA.  What's the Problem?

DmaString is the standard string format and Unicode (ISO-10646-UCS-2)   is the standard character-set encoding that is expected to be available.   However, DMA explicitly allows for the use of different character-set encodings.   MNCS extends the formats that are usable in directly delivering those non-Unicode (ISO-10646-UCS-2) character-set encodings.

Under DMA 1.0, a DMA Provider can offer to operate in a number of different character-set encodings. MNCS extends that to operation in a number of different string formats commonly associated with important character-set encodings (e.g., EBCDIC, Shift-JIS, and the many variations on ASCII (ISO-646) used for different European-language character-set encodings). MNCS allows performance optimization in some cases, by using an efficient and already-supported combination of string format and character-set encoding. MNCS also allows the use of legacy character-set encodings that do not have a clean mapping to Unicode, especially the rich ones developed for use in Asia.

There is a loss of interoperability when a DMA Provider supports only a non-Unicode character-set encoding, whether using DmaString or another MNCS format. Sometimes that can be remedied mechanically if there is an acceptable mapping to the Unicode character set, although there could be a loss of efficiency because of the conversion effort. In this case Unicode would work for interoperability and the "native" code would work for full fidelity and best performance. MNCS allows those arrangements to be developed by tolerating "native" as well as interoperable cases.
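As an illustration of a mechanical mapping, the following C sketch converts between an octet encoding and 16-bit Unicode elements. ISO 8859-1 is used here only as an assumed example of a "native" encoding, chosen because its 256 codes coincide with the first 256 Unicode code points; the function names are hypothetical and are not part of any DMA interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widen ISO 8859-1 octets to 16-bit Unicode elements. Every octet
   has a Unicode equivalent with the same numeric value, so this
   direction cannot fail. */
void sketch_latin1_to_ucs2(const uint8_t *src, size_t count, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}

/* Narrow 16-bit Unicode elements to ISO 8859-1 octets. Returns 1 on
   success, or 0 as soon as an element has no ISO 8859-1 equivalent
   (dst is then only partially written). */
int sketch_ucs2_to_latin1(const uint16_t *src, size_t count, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        if (src[i] > 0xFFu) {
            return 0;
        }
        dst[i] = (uint8_t)src[i];
    }
    return 1;
}
```

The widening direction always succeeds, at the cost of a pass over the data. The narrowing direction can fail, which is the loss of fidelity that arises whenever one side's repertoire is not fully representable on the other side.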

Doesn't the COM Specification Require Unicode?

It might seem so.  The specific statement is in The Component Object Model Specification [COMspec], Section 3.1.1, The Interface Binary Standard:

Finally, and quite significantly, all strings passed through all COM interfaces (and, at least on Microsoft platforms, all COM APIs) are Unicode strings. There simply is no other reasonable way to get interoperable objects in the face of (i) location transparency, and (ii) a high-efficiency object architecture that doesn’t in all cases intervene system-provided code between client and server. Further, this burden is in practice not large.

This is the use of "Unicode" that is included in the index of the document. It is not worded as a statement of requirement; rather, it is a practical observation of a valuable idea, one that is also honored in DMA 1.0. The following example is used in The Component Object Model Specification [COMspec], Section 1.4.3.1, Robustly Evolving Functionality Over Time:

To illustrate the QueryInterface cornerstone, imagine a client that wishes to display the contents of a number of text files, and it knows that for each file format (ASCII, RTF, Unicode, etc.) there is some object class associated with that format. Besides a basic interface like IUnknown, which we’ll call interface A, there are two others that the client wishes to use to achieve its ends: interface B allows a client to tell an object to load some information from a file (or to save it), and interface C allows a client to request a graphical rendering of whatever data the object loaded from a file and maintains internally.

It is the case that many functions of COM itself do pass strings and use Unicode for encoding those strings. There are also numerous exceptions, including the use of octet-formatted strings, the use of uninterpreted streams of bytes that may carry character-set encodings (as in the ASCII, RTF, Unicode mention above), and cases of using string-format data with no assumptions whatsoever about character-set encoding. For example, in The Component Object Model Specification [COMspec], Section 2.6.4, Naming Elements (for IStorage):

Strings passed to storage-related functions which name files are passed on un-interpreted and unchanged to the file system


7. References and Resources

[MNCS-ext]
include reference to extension proposal
[MNCS-lib]
include reference to library proposal
[COMspec]
include reference to COM specification.

End of Document


$$Author: Orcmid $
$$Date: 00-02-24 14:43 $
$$Revision: 6 $