DMA Architecture Enhancement Proposal for Content Based Search

Authors: DMA Query SC: Mitsuru Akizawa, Yoshifumi Sato, Ann Palmer, Jim Green, Dennis Hamilton, Alan Babich


Revision History

HTML Rendition: This HTML rendition was derived from the Microsoft Word 97 file CBSearch12.doc.  It might not be the latest version of the CBSearch extensions to DMA 1.0.  Check the definitive sources for the latest version.  There may be a later packaging of complete, current CBSearch materials there.

    Version 8.0: 6/23/99 Mitsuru Akizawa, Jim Green, Dennis Hamilton, Alan Babich

    (CBSearch12.doc) Final (minor) changes before vote. State that there will be one subclass of the Content Search class for each CBR index supported by the document space.

    Version 7.0: 6/23/99 Mitsuru Akizawa, Jim Green, Dennis Hamilton, Alan Babich

    Put in the changes agreed to on the conference call of 5/11/99, and change the data type of Thesaurus Id from String to Id. Enhance the explanation of the default units for distance:

    Put a sentence in section 3 that states explicitly that the merge rules are not changed by this proposal.

    Add a new class Content Query Options to hold the content based query options. Add a single object valued property, Content Query Options, to the classes Query and Query Operator. Do not require support for this property or any of the option properties, and do not require a default value for any of the options.

    The use of thesauri is on a best effort basis. The query will be executed, and no error will be returned, even if a thesaurus is not available.

    In the introduction it is noted that there can be multiple instances of the Result Object property in the Selections list.

    Put in an example showing the use of the Result Object property.

    Version 6.0: 4/27/99 Mitsuru Akizawa, Jim Green, Dennis Hamilton, Alan Babich

    Change the name of the Proximity operator to Within. Add new query operators NotWithin, WithinUnits, NotWithinUnits. Add HitCount property.

    Version 5.0: 4/4/99 Mitsuru Akizawa, Jim Green, Dennis Hamilton, Alan Babich

    Add example query trees. Explain how to take advantage of the new properties in the Query class. Get rid of the Word macros. Use the value 1 to indicate adjacency instead of 0. This change anticipates that an operand will be added to the Proximity operator to specify the units of distance. For example, if the units were specified to be "paragraph", then 0 would mean that the search terms must be in the same paragraph, and 1 would mean that the search terms must be in adjacent paragraphs. Under the old convention, -1 would have to be used to indicate that the search terms are in the same paragraph.

    Version 4.0: 2/16/99 Mitsuru Akizawa, Yoshifumi Sato, Ann Palmer, Jim Green, Dennis Hamilton, Alan Babich

    Touch up the wording in the overview.

    Changed the options from list of Booleans to separate properties.

    Wrote the query overview.

    Dropped the last Editor’s note.

    Version 3.0: 1/19/99 Mitsuru Akizawa, Yoshifumi Sato, Ann Palmer, Jim Green, Alan Babich

    Added a paragraph on the scope of the proposal in the introduction.

    Clarified the confusion between class and class instance in the introduction.

    Added Content Search to the glossary of terms.

    Explained that the relative importance of two retrieval terms in a query is indicated by the ratio of their retrieval term weights.

    Changed the data type of Raw Score, Normalized Score, and weights from integer32 to float64.

    Moved the Special Properties to another proposal.

    Made the list of options be list of integer32, not list of Boolean. Added format level "option".

    Provided new name (DMA_CBR_OPTION_SPELLING_VARIANT) and new description for DMA_CBR_OPTION_TRANSLATION_EXPANSION.

    Made Thesaurus name a list of strings, and enhanced its description.

    Deleted the Contains SGML and SGML operators.

    Rewrote the sentence on internationalization considerations.

    Removed most editor’s notes.

    Minor wording touch ups.

    Version 2.0: 12/10/98 Mitsuru Akizawa, Yoshifumi Sato, Ann Palmer, Jim Green, Alan Babich

    Added operator for searches of SGML, HTML, and XML documents. Correct the statement that the Hitachi proposal is based on STARTS. It is not. It used the concept of raw and normalized score, however. Corrected some minor errors in the document.

    Version 1.0: 12/06/98 Mitsuru Akizawa, Yoshifumi Sato, Ann Palmer, Jim Green, Alan Babich

    Initial version.

Table of Contents

    Revision History

    Table of Contents

    Functional Overview/Description

    Glossary/Nomenclature

    New/Changed Classes and Properties

    dmaClass_ContentSearch

    dmaClass_ContentQueryOptions

    dmaClass_Query

    dmaClass_QueryOperator

    New/Changed Interfaces

    Architectural Impacts

    Integration Model

    Query

    Query Model Overview

    New Query Operators

    Contains

    Within

    NotWithin

    WithinUnits

    NotWithinUnits

    Internationalization/Localization

    Content

    Containment

    Versioning

    Security

    Naming

    Performance Impact

    Failure Modes/Crash Recovery

    Sample Usage

    Contains Operator

    dmaProp_ResultObject

    The "Within" Operator

    Compound Retrieval Condition

    Using the New Properties in the Query Class

    Backwards Compatibility

    Impact on Existing Systems

    Compliance & Conformance Issues

    What Requirement Does This Address?

    Why Is This a Good Solution?

    Existing Systems with Similar Features and Related Experiences

    Reference Materials

    How Is This Incorporated into the Specification?


Functional Overview/Description

    The scope of this proposal is to define a small, basic, interoperable subset of content search that is to be added to the DMA API. The intent is to enable DMA query to provide access to basic content search capability, assuming that is provided by the document space implementation. This proposal is not intended to allow access to every feature of every search engine that currently exists.

    A new class is added to the DMA classes – Content Search.

    It is expected that the Content Search class will be supported and have exactly one subclass for each CBR index supported by the document space. It is possible for each content index to be managed by different content search engine software.

    A query against the Content Search class (or one of its subclasses) implies a query against the corresponding content index.

    It should be noted that documents of many different types are typically indexed in the same content index. It is also possible to include folders and annotations in the content index. Therefore, searches against the Content Search class can in effect search across multiple different subclasses of DocVersion and possibly other classes as well.

    The properties of an instance of Content Search reflect properties stored in the corresponding content index. A new object valued property, Result Object, is introduced as a property of the new Content Search class. The act of accessing the value of the Result Object property of a query result row causes a connected scratch pad object for the DocVersion or other object associated with the result row to be instantiated on the client. Note that there can be multiple instances of the Result Object property in the Selections list.

    The Content Search class is intended to be a virtual class: The Content Search class is intended for use only in queries. No provision is made for inserting, updating, or deleting independently persistable objects of an instance of the Content Search class via the DMA API. Indexing and deindexing of documents in a content index is expected to be done out of band of the DMA API.

    New "score" properties of the Content Search class that allow the results of content based queries to be ranked in descending order of relevance are defined in this proposal.

    In addition to the new Content Search class, new query operators that perform various types of content based search operations are defined in this proposal.

    This proposal does not propose any changes to the merge rules for constructing merged scopes.

    This proposal is based on a working paper submitted by Mr. Mitsuru Akizawa and Mr. Yoshifumi Sato of Hitachi, as modified by the subsequent discussions of the DMA Query subcommittee.

     

Glossary/Nomenclature

    Content Search

    Content search is querying for documents by specifying conditions on the document content. Typically, the documents involved are parsed by special document indexing software, and indexed in a content index. The entries in the content index refer back to the original documents.

    Retrieval Term

    A retrieval term is a word or phrase looked up in a content index by content based retrieval (CBR) software.

    Retrieval Term Weight

    If there are multiple retrieval terms in a content based query, it may be desirable to indicate the relative importance of each term to the query. That is done by assigning a weight to each retrieval term. The retrieval term weight is a real number between 0.0 and 1.0. The ratio of two retrieval term weights indicates the relative importance of the two retrieval terms. Retrieval term weights are not normalized. That is, there is no requirement that the retrieval term weights add up to 1.0 or any other number.

    Raw Score, Normalized Score

    The raw score is intended to be useful for ranking purposes when a single content based index is involved in a query. In some document spaces, the raw score is based on the number of occurrences of the search terms in the document, and on other statistics (e.g. the number of documents in the content index that contain the search term, the total number of documents in the content index, etc.). When multiple content indexes are involved in a query, the raw scores are typically not directly comparable, even if the same scoring algorithm is used. That is because the raw score of a document for a particular query typically depends upon all the documents in the content index. Therefore, when multiple content indexes are involved in a query, the normalized score is generally used for ranking purposes. The normalized score may not be perfectly comparable across content based indexes, but it is generally much more appropriate than the raw score in such cases.

New/Changed Classes and Properties

dmaClass_ContentSearch

This is a new, virtual class. Objects of this class cannot be created, deleted, modified, or connected to in the usual way. It is intended that the primary use of this class be in content based queries.

Adding and deleting objects of this class is outside of the scope of this proposal. The Name property is not assigned. This class must be searchable but not addressable or navigable.

Class ID: dmaClass_ContentSearch

Superclass: DMA

Interfaces

The interfaces are inherited from the superclass DMA. These are IUnknown, IdmaObject, and IdmaProperties. However, objects of this class cannot be instantiated. This class exists to provide properties that can be selected in a query result set.

Properties

Name               Impl. Req'd  System Gen'ed  Read-Only  Value Req'd  Type       Cardinality  Req'd Class
OIID               -            Yes            Yes        -            String     Scalar
Class Description  -            Yes            Yes        -            Object     Scalar       Class Description
This               -            Yes            Yes        -            Object     Scalar       Content Search
Create Pending     -            Yes            Yes        -            Boolean    Scalar
Update Pending     -            Yes            Yes        -            Boolean    Scalar
Delete Pending     -            Yes            Yes        -            Boolean    Scalar
Content            Yes          Yes            Yes        -            String     Scalar
Result Object      Yes          Yes            Yes        -            Object     Scalar       DMA
Raw Score          -            Yes            Yes        -            Float64    Scalar
Normalized Score   -            Yes            Yes        -            Float64    Scalar
Document Length    -            Yes            Yes        -            Integer32  Scalar
Hit Count          -            Yes            Yes        -            Integer32  Scalar

Detailed Description

It is useful to think of the persistent data associated with an instance of the Content Search class (or one of its subclasses) as a content index.

The properties of an instance of the Content Search class reflect the properties stored in the corresponding content index.

The Content Search class and its subclasses must be searchable. Searches against it correspond to searches against a content index.

The primary use of this class is to facilitate content queries. It is not intended that it be possible to connect scratchpad objects for members of this class or to create new persistent members of this class (e.g., via the ExecuteChange method). The intent is that indexing and deindexing of documents or other objects in the content index is done out of band with respect to the DMA API.

Search results from this class correspond to documents or other objects indexed in the content based index. A connected scratchpad object for the document or other object associated with a result row may be obtained by accessing the value of the Result Object property.

It is expected that if a document space has more than one content index, then there will be an instance of Content Search or one of its subclasses for each such content index in that document space. Different search engine software can be used to manage each content index.

Property Descriptions

• OIID {dmaProp_OIID}

Property Inherited from DMA

This property may not be implemented.

• Class Description {dmaProp_ClassDescription}

Property Inherited from DMA

This property may not be implemented.

• This {dmaProp_This}

Property Inherited from DMA

This property may not be implemented.

• Create Pending {dmaProp_CreatePending}

Property Inherited from DMA

This property may not be implemented.

• Update Pending {dmaProp_UpdatePending}

Property Inherited from DMA

This property may not be implemented.

• Delete Pending {dmaProp_DeletePending}

Property Inherited from DMA

This property may not be implemented.

Content {dmaProp_Content}

This property stands for the content of the document or other target object. This property is merely a placeholder needed by content search query operators. (Query operators need an operand that designates the property being queried.) There must be no value for this property (the content must be obtained by accessing the Result Object property and then using the usual DMA 1.0 methods to access its content).

Result Object {dmaProp_ResultObject}

This object-valued property references the actual document or other object that corresponds to a query result row of a search against the Content Search class (or one of its subclasses). In order to obtain a scratchpad object for the target object, this property is mentioned in the Selections list property of the query, and its value is retrieved from a result row of the query.
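
The access pattern described above can be pictured with a small client-side model. This is only a sketch in Python, not DMA API code: the names ContentSearchRow, ScratchpadObject, and connect are hypothetical, and caching the object after the first access is an assumption that the proposal does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ScratchpadObject:
    """Hypothetical stand-in for a connected scratchpad object on the client."""
    oiid: str


@dataclass
class ContentSearchRow:
    """Hypothetical model of one result row of a Content Search query."""
    target_oiid: str                            # identifies the indexed object
    raw_score: Optional[float] = None           # implementation not required
    normalized_score: Optional[float] = None    # implementation not required
    # Hypothetical hook that connects to the target object.
    connect: Callable[[str], ScratchpadObject] = ScratchpadObject
    _cached: Optional[ScratchpadObject] = field(default=None, repr=False)

    @property
    def result_object(self) -> ScratchpadObject:
        # Nothing is instantiated until the property value is accessed.
        if self._cached is None:
            # Assumption: later accesses reuse the same scratchpad object.
            self._cached = self.connect(self.target_oiid)
        return self._cached


row = ContentSearchRow(target_oiid="doc-1", normalized_score=0.7)
document = row.result_object    # the scratchpad object is created here
print(document.oiid)
```

The point of the sketch is that the row itself carries only index properties; the document is reached through Result Object and then handled with the usual DMA 1.0 methods.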

Raw Score {dmaProp_RawScore}

The value of this property is the raw score of the document, i.e., this score value is not normalized. The value of this property can be used to order the documents retrieved by a content query in order of decreasing relevance. The Raw Score is most useful when the query is against a single document space. The Raw Score property values are not comparable across document spaces: The values of Raw Score of the same document stored in two different document spaces may be different.
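
To see why raw scores need not be comparable, consider a scoring function that depends on index-wide statistics. The Python sketch below uses a simple TF-IDF formula purely as an illustration; the proposal does not prescribe any scoring algorithm, and the function name and sample documents are invented for the example.

```python
import math
from typing import List


def tf_idf(term: str, doc: List[str], index: List[List[str]]) -> float:
    """Illustrative raw score: term frequency times log(N / document frequency)."""
    tf = doc.count(term)                          # occurrences in this document
    df = sum(1 for d in index if term in d)       # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(index) / df)


doc = "loan approval for loan number".split()

# The same document is indexed in two different content indexes.
index_a = [doc, "weather report".split(), "meeting notes".split()]
index_b = [doc, "loan terms".split(), "loan rates".split()]

score_a = tf_idf("loan", doc, index_a)   # "loan" is rare in index A
score_b = tf_idf("loan", doc, index_b)   # "loan" occurs in every document of index B
print(score_a, score_b)
```

With this formula the same document scores higher in index A than in index B, because the score depends on the other documents in each index.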

Normalized Score {dmaProp_NormalizedScore}

The value of this property is the normalized score of the document. The Normalized Score can be used to order the documents retrieved by a content query in order of decreasing relevance. The Normalized Score is expected to be more useful than the Raw Score when the query is across multiple document spaces. However, although using the Normalized Score is better than using the Raw Score in the case of multiple document spaces, there is no guarantee that the normalized scores from different document spaces will be perfectly comparable.
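
A client that merges result rows from several document spaces might therefore rank on the Normalized Score rather than the Raw Score. The following Python sketch assumes a hypothetical Row structure and made-up score values; the 0.0 to 1.0 range used for the normalized values is also an assumption, since the proposal does not fix a range.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """Hypothetical merged result row."""
    space: str
    doc: str
    raw_score: float
    normalized_score: float


def rank_merged(rows: List[Row]) -> List[Row]:
    """Order rows from several document spaces by decreasing Normalized Score."""
    return sorted(rows, key=lambda r: r.normalized_score, reverse=True)


rows = [
    Row("space-A", "a1", raw_score=42.0, normalized_score=0.60),
    Row("space-B", "b1", raw_score=3.5, normalized_score=0.90),
    Row("space-A", "a2", raw_score=17.0, normalized_score=0.20),
]

ranked = [r.doc for r in rank_merged(rows)]
print(ranked)   # ['b1', 'a1', 'a2']
```

Ranking the same rows on raw_score would place b1 last, only because space-B happens to use a smaller raw score scale in this example.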

Document Length {dmaProp_DocumentLength}

This is the length in bytes of the rendition of the document that was indexed in the content index. This property is optionally supported.

Hit Count {dmaProp_HitCount}

When the query has a single Contains operator, the value of this property is the number of occurrences of the search term in the document associated with the Result Object property. For other cases, the value of Hit Count is implementation defined. This property might be null for some content queries and non-null for others.

Other Properties in the Content Based Index

It is usually the case that content search engines support the indexing of "hard" properties of documents (e.g., "Author", "Title", "LoanNumber", etc.) in addition to document content. To make such "hard" properties searchable or retrievable from the content based index, they must be properties of Content Search or a subclass of Content Search. If any of the hard properties are strings, they can be used as the target property for the content search query operators if the document space implementation supports that. The normal query operators (e.g., dmaQueryOperator_GreaterInteger32, etc.) can also be used against the hard properties in the content index.

dmaClass_ContentQueryOptions

dmaClass_Query

dmaClass_QueryOperator

New/Changed Interfaces

Architectural Impacts

Integration Model

    N/A

Query

    The query model is not affected by this proposal. However, new operators must be defined to provide content search functionality.

    The companion proposal defining capabilities for content search query will indicate whether or not there are some constraints on the form of query expressions. A fully general query would include conditions on properties of the Content Search class as well as other classes, and that would require more effort than an implementation that allowed queries against only properties of Content Search, or properties not of Content Search.

Query Model Overview

    The query model overview should be updated by adding the following text to the very end of the query overview in the DMA 1.0 specification, as section 3.5.14 "Content Based Search":

    The Content Search class exists to support content searches.

    The Content Search class is intended to be a virtual class: The Content Search class is intended for use only in queries. No provision is made for inserting, updating, or deleting independently persistable objects of an instance of the Content Search class via the DMA API. Indexing and deindexing of documents in a content index is expected to be done out of band of the DMA API.

    It is expected that the Content Search class will be supported and have exactly one subclass for each CBR index supported by the document space. It is possible for each content index to be managed by different content search engine software.

    A query against the Content Search class (or one of its subclasses) implies a query against the corresponding content index.

    It should be noted that documents of many different types are typically indexed in the same content index. It is also possible to include folders and annotations in the content index. Therefore, searches against the Content Search class can in effect search across multiple different subclasses of DocVersion and possibly other classes as well.

    The properties of an instance of Content Search reflect properties stored in the corresponding content index. The object valued property, Result Object, is a property of the new Content Search class. The act of accessing the value of the Result Object property of a query result row causes a connected scratch pad object for the DocVersion or other object associated with the result row to be instantiated on the client.

    The "score" properties of the Content Search class allow the results of content based queries to be ranked in descending order of relevance. (See "raw score, normalized score" in the glossary. See also "content search", "retrieval term", and "retrieval term weight".)

    Query operators, such as Contains and Within, are specialized for content based searching.

    Content search options global to the query are specified by certain property values of the Content Query Options property of the Query object (e.g., CBR Soundex, CBR Case Sensitive, etc.). These global settings may be overridden for a sub part of the query by values for these same properties of the Content Query Options property on certain Query Operator objects, e.g. those with Query Operator Id equal to dmaQueryOperator_Contains or dmaQueryOperator_Within.
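
    As a reading aid (not part of the text proposed for section 3.5.14), the override rule can be sketched in Python. The function effective_option and the dictionary representation of the option properties are hypothetical. Returning None when an option is set at neither level reflects the fact that this proposal does not require a default value for any option.

```python
from typing import Dict, Optional

Options = Dict[str, bool]   # hypothetical: option property name -> value


def effective_option(name: str,
                     query_options: Optional[Options],
                     operator_options: Optional[Options]) -> Optional[bool]:
    """Resolve one content query option for a given Query Operator.

    A value on the operator overrides the query-wide value. None means the
    option is not specified at either level, so the behaviour is left to
    the document space.
    """
    if operator_options is not None and name in operator_options:
        return operator_options[name]
    if query_options is not None and name in query_options:
        return query_options[name]
    return None


query_level = {"CBR Soundex": True, "CBR Case Sensitive": False}
contains_level = {"CBR Soundex": False}     # override for one Contains operator

print(effective_option("CBR Soundex", query_level, contains_level))         # False
print(effective_option("CBR Case Sensitive", query_level, contains_level))  # False
print(effective_option("CBR Soundex", query_level, None))                   # True
```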

New Query Operators

Contains

    Name      Result Type  Min # of Opnds  Max # of Opnds  Operand Type(s)                      Safe to Eliminate  Define
    Contains  Boolean      3               3               0: object (class is Query Property)  Yes                dmaQueryOperator_Contains
                                                           1: string
                                                           2: float64

dmaQueryOperator_Contains

This operator returns DMA_TRUE if and only if the content of the document referred to by the current object under scan in the content index contains the retrieval terms specified in the operands. Otherwise, it returns DMA_FALSE.

Operand 0 is a scalar object valued property that must be of class Query Property. The Searchable Class Occurrence must designate the Content Search class or one of its subclasses. The Property Id usually designates the Content property of Content Search.

Operand 1 is a scalar object valued property that must be of class Query Constant String. The value of the string is the retrieval term (i.e., a word or phrase to be looked up in the content index).

Operand 2 is of class Query Constant Float64 specifying the retrieval term weight. The weight must be between 0.0 and 1.0. (The ratio of two weights indicates the relative importance of two retrieval terms or co-occurrences.)
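
The three operands and the weight rule can be summarized in a short Python sketch. ContainsOperands and relative_importance are hypothetical names used only for illustration; they are not DMA API types. Treating the bounds 0.0 and 1.0 as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainsOperands:
    """Hypothetical holder for the operands of dmaQueryOperator_Contains."""
    property_id: str   # operand 0: property being searched
    term: str          # operand 1: retrieval term (word or phrase)
    weight: float      # operand 2: retrieval term weight

    def __post_init__(self) -> None:
        # Assumption: the bounds 0.0 and 1.0 are themselves allowed.
        if not 0.0 <= self.weight <= 1.0:
            raise ValueError("retrieval term weight must be between 0.0 and 1.0")


def relative_importance(a: ContainsOperands, b: ContainsOperands) -> float:
    """Ratio of the two weights; b.weight must be non-zero."""
    return a.weight / b.weight


phrase = ContainsOperands("dmaProp_Content", "search engine", 0.5)
word = ContainsOperands("dmaProp_Content", "index", 0.25)

# The weights sum to 0.75, not 1.0: weights are not normalized.
print(relative_importance(phrase, word))   # 2.0
```

Here the phrase "search engine" is twice as important to the query as the word "index".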

Within

    Name    Result Type  Min # of Opnds  Max # of Opnds  Operand Type(s)                      Safe to Eliminate  Define
    Within  Boolean      6               6               0: object (class is Query Property)  Yes                dmaQueryOperator_Within
                                                         1: string
                                                         2: string
                                                         3: integer32
                                                         4: Boolean
                                                         5: float64

dmaQueryOperator_Within

This operator is useful for finding documents in which two search terms (i.e., words or phrases) occur near each other. Specifically, this operator returns DMA_TRUE if and only if all of the following conditions are satisfied: (1) the content of the document referred to by the current object under scan in the content index contains both of the retrieval terms specified in the operands; (2) the distance between the two terms in a co-occurrence is less than or equal to the specified maximum distance; and (3) if it matters which retrieval term occurs first, then in a co-occurrence that satisfies the proximity condition, the first retrieval term does indeed occur before the second retrieval term. Otherwise, the operator returns DMA_FALSE.

Operand 0 is a scalar object valued property that must be of class Query Property. The Searchable Class Occurrence must designate the Content Search class or one of its subclasses. The Property Id usually designates the Content property of Content Search.

Operand 1 is a scalar object valued property that must be of class Query Constant String. The value of the string is the first retrieval term (i.e., a word or phrase to be looked up in the content index).

Operand 2 is a scalar object valued property that must be of class Query Constant String. The value of the string is the second retrieval term (i.e., a word or phrase to be looked up in the content index).

Operand 3 is of class Query Constant Integer32. The value indicates the maximum distance allowed between the occurrences of the retrieval terms of operand 1 and operand 2 in the document. A value of 1 indicates adjacency, a value of 2 indicates at most one intervening unit, and so on. The units of distance for this operator are the so-called default units. (See the description of DMA_DISTANCE_DEFAULT in the WithinUnits operator.) For most languages, the default units are words. For some Asian languages, the default units are characters, because the concept of "word" is not applicable. For example, if operand 3 has the value 1, then the search term of operand 1 must be adjacent to the search term of operand 2.

Operand 4 is of class Query Constant Boolean. If the value is DMA_TRUE, then the first search term must occur first. If the value is DMA_FALSE, then either the first or the second search term may occur first, and the other search term must occur next.

Operand 5 is of class Query Constant Float64. The value specifies the weight for the co-occurrence of the two retrieval terms. The weight must be between 0.0 and 1.0. (The ratio of two weights indicates the relative importance of the two retrieval terms or co-occurrences.)
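
The conditions on operands 1 through 4 can be illustrated with a minimal Python sketch. It assumes that both retrieval terms are single words, that the default unit is the word, and that the distance between two occurrences is the difference of their word positions (so that 1 means adjacent). The function name within and its parameters are hypothetical. A real content search engine would evaluate the operator against its index rather than by scanning text, and the weight (operand 5) plays no part in the Boolean result, so it is omitted.

```python
from typing import List


def within(words: List[str], first: str, second: str,
           max_distance: int, ordered: bool) -> bool:
    """Return True if some co-occurrence of the two terms satisfies the
    distance condition and, when ordered is True, the ordering condition."""
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue    # same occurrence (only possible if first == second)
            if ordered and i > j:
                continue    # the first term must occur before the second
            if abs(i - j) <= max_distance:
                return True
    return False


text = "content search uses a content index".split()

print(within(text, "content", "search", 1, True))    # True: adjacent, in order
print(within(text, "search", "content", 1, True))    # False: wrong order when adjacent
print(within(text, "search", "content", 1, False))   # True: order does not matter
print(within(text, "search", "index", 2, False))     # False: four words apart
```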

NotWithin

WithinUnits

NotWithinUnits

Internationalization/Localization

Content

Containment

Versioning

Security

Naming

Performance Impact

Failure Modes/Crash Recovery

Sample Usage

Contains Operator

dmaProp_ResultObject

The "Within" Operator

Compound Retrieval Condition

Using the New Properties in the Query Class

Most of the new properties introduced to the Query class are Booleans. In order to illustrate how to use the information they provide, it should be sufficient to consider just one of them, say, CBR Soundex.

Backwards Compatibility

Impact on Existing Systems

Compliance & Conformance Issues

What Requirement Does This Address?

Why Is This a Good Solution?

Existing Systems with Similar Features and Related Experiences

Reference Materials

How Is This Incorporated into the Specification?

html rendition derived 1999-08-06-00:13 -0700 (pdt)
$$Date: 00-02-17 15:07 $
$$Revision: 5 $