X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fquerymodel.xml;h=831eeff71e5e7e94013c342a9a0a897c37183af8;hb=930d9dc74a2bee5037ee1372dd5de4a6963ecd6b;hp=88c2fd7799e247f9968d09d4c746034e12eab9e1;hpb=1ab2e1e0d6f2aa60baa5195b0a313f689d4c1027;p=idzebra-moved-to-github.git diff --git a/doc/querymodel.xml b/doc/querymodel.xml index 88c2fd7..831eeff 100644 --- a/doc/querymodel.xml +++ b/doc/querymodel.xml @@ -1,11 +1,10 @@ - + Query Model - Query Model Overview + Query Model Overview - Query Languages @@ -25,43 +24,42 @@ Since the type-1 (RPN) query structure has no direct, useful string - representation, every origin application needs to provide some + representation, every client application needs to provide some form of mapping from a local query notation or representation to it. - - - - - Prefix Query Format (PQF) - - - Index Data has defined a textual representaion in the - Prefix Query Format, short - PQF, which mappes - one-to-one to binary encoded - type-1 RPN query packages. - It has been adopted by other - parties developing Z39.50 software, and is often referred to as - Prefix Query Notation, or in short - PQN. See - for further explanaitions and - descriptions of Zebra's capabilities. - - - - Common Query Language (CQL) + + + + Prefix Query Format (PQF) + + Index Data has defined a textual representation in the + Prefix Query Format, short + PQF, which maps + one-to-one to binary encoded + type-1 RPN queries. + PQF has been adopted by other + parties developing Z39.50 software, and is often referred to as + Prefix Query Notation, or in short + PQN. See + for further explanations and + descriptions of Zebra's capabilities. + + + + + Common Query Language (CQL) - The query model of the type-1 RPN, - expressed in PQF/PQN is natively supported. - On the other hand, the default SRU - webservices Common Query Language - CQL is not natively supported. + The query model of the type-1 RPN, + expressed in PQF/PQN is natively supported. 
+ On the other hand, the default SRU + web services Common Query Language + CQL is not natively supported. - Zebra can be configured to understand and map CQL to PQF. See - . - - + Zebra can be configured to understand and map CQL to PQF. See + . + + @@ -86,21 +84,21 @@ explain operation, which provides the means for learning which fields (also called - indexes or access points + indexes or access points) are provided, which default parameter the server uses, which retrieve document formats are defined, and which specific parts of the general query model are supported. - The Z39.50 embeddes the explain operation - by perfoming a + The Z39.50 embeds the explain operation + by performing a search in the magic IR-Explain-1 database; see . - In SRU, explain is an entirely seperate - operation, which returns an Zeerex + In SRU, explain is an entirely separate + operation, which returns a ZeeRex XML record according to the structure defined by the protocol. @@ -134,9 +132,9 @@ It provides the means to investigate the content of specific indexes. - Scanning an index returns a handfull of terms actually fond in + Scanning an index returns a handful of terms actually found in the indexes, and in addition the scan - operation returns th enumber of documents indexed by each term. + operation returns the number of documents indexed by each term. A search client can use this information to propose proper spelling of search terms, to auto-fill search boxes, or to display controlled vocabularies. @@ -151,10 +149,11 @@ Prefix Query Format syntax and semantics - The PQF grammer + The PQF grammar is documented in the YAZ manual, and shall not be repeated here. This textual PQF representation - is always during search mapped to the equivalent Zebra internal + is not transmitted to Zebra during search, but it is in the + client mapped to the equivalent Z39.50 binary query parse tree.
@@ -211,7 +210,7 @@ bib-1 Standard PQF query language attribute set which defines the semantics of Z39.50 searching. In addition, all of the - non-use attributes (type 2-9) define the hard-wired + non-use attributes (types 2-11) define the hard-wired Zebra internal query processing. default @@ -219,7 +218,7 @@ GILS gils - Extention to the Bib1 attribute set. + Extension to the Bib1 attribute set. predefined @@ -252,8 +251,9 @@ Boolean operators - A pair of subquery trees, or of atomic queries, is combined + A pair of sub query trees, or of atomic queries, is combined using the standard boolean operators into new query trees. + Thus, boolean operators are always internal nodes in the query tree. Set complement of two atomic queries hit sets - + - +
@prox - binary PROXIMY operator + binary PROXIMITY operator Set intersection of two atomic queries hit sets. In addition, the intersection set is purged for all documents which do not satisfy the requested query @@ -307,7 +307,7 @@ Querying for the intersection of all documents containing the terms information AND retrieval: - The hit set is a subset of the coresponding + The hit set is a subset of the corresponding OR query. Z> find @and information retrieval @@ -317,20 +317,21 @@ Querying for the intersection of all documents containing the terms information AND retrieval, taking proximity into account: - The hit set is a subset of the coresponding - AND query. + The hit set is a subset of the corresponding + AND query + (see the PQF grammar for + details on the proximity operator): Z> find @prox 0 3 0 2 k 2 information retrieval - See PQF grammer for details. Querying for the intersection of all documents containing the terms information AND retrieval, in the same order and near each - other as described in the term list - The hit set is a subset of the coresponding - PROXIMY query. + other as described in the term list. + The hit set is a subset of the corresponding + PROXIMITY query. Z> find "information retrieval" @@ -341,14 +342,15 @@ Atomic queries (APT) - Atomic queries are the query parts which work on one acess point + Atomic queries are the query parts which work on one access point only. These consist of an attribute list followed by a single term or a quoted term list, and are often called Attributes-Plus-Terms (APT) queries. - Unsupplied non-use attributes type 2-9 are either inherited from + Atomic (APT) queries are always leaf nodes in the PQF query tree. + Unsupplied non-use attributes types 2-11 are either inherited from higher nodes in the query tree, or are set to Zebra's default values. See for details. @@ -356,12 +358,14 @@ - - @@ -382,7 +386,7 @@
Atomic queries
attribute list
Querying for the term information in the - default index using the default attribite set, the server choice + default index using the default attribute set, the server choice of access point/index, and the default non-use attributes. Z> find information @@ -394,7 +398,7 @@ Z> find @attrset bib-1 @attr 1=1017 @attr 2=3 @attr 3=3 @attr 4=1 @attr 5=100 @attr 6=1 information - + Finding all documents which have the term debussy in the title field. @@ -403,6 +407,22 @@
+ + The scan operation is only supported with + atomic APT queries, as it is bound to one access point at a + time. Boolean query trees are not allowed during + scan. + + + + For example, we might want to scan the title index, starting with + the term + debussy, and displaying this and the + following terms in lexicographic order: + + Z> scan @attr 1=4 debussy + + @@ -410,13 +430,15 @@ Named Result Sets Named result sets are supported in Zebra, and result sets can be - used as operands without limitations. + used as operands without limitations. It follows that named + result sets are leaf nodes in the PQF query tree, exactly as + atomic APT queries are. After the execution of a search, the result set is available at the server, such that the client can use it for subsequent searches or retrieval requests. The Z30.50 standard actually - stresses the fact that result sets are voliatile. It may cease + stresses the fact that result sets are volatile. It may cease to exist at any time point after search, and the server will send a diagnostic to the effect that the requested result set does not exist any more. @@ -424,7 +446,9 @@ Defining a named result set and re-using it in the next query, - using yaz-client. + using yaz-client. Notice that the client, not + the server, assigns the string '1' to the + named result set. Z> f @attr 1=4 mozart ... @@ -433,18 +457,13 @@ Z> f @and @set 1 @attr 1=4 amadeus ... Number of hits: 14, setno 2 - ... - Z> f @attr 1=1016 beethoven - ... - Number of hits: 26, setno 3 - ... Named result sets are only supported by the Z39.50 protocol. The SRU web service is stateless, and therefore the notion of - named result sets does not exist when acessing a Zebra server by + named result sets does not exist when accessing a Zebra server by the SRU protocol. @@ -454,13 +473,13 @@ Zebra's special access point of type 'string' The numeric use (type 1) attribute is usually - refered to from a given + referred to from a given attribute set. 
In addition, Zebra let you use any internal index name defined in your configuration - as use atribute value. This is a great feature for + as use attribute value. This is a great feature for debugging, and when you do - not need the complecity of defined use attribute values. It is + not need the complexity of defined use attribute values. It is the preferred way of accessing Zebra indexes directly. @@ -494,7 +513,7 @@ See also for details, and - for the SRU PQF query extention using string names as a fast + for the SRU PQF query extension using string names as a fast debugging facility. @@ -507,7 +526,7 @@ idea) to emulate XPath 1.0 based search by defining use (type 1) - string attributes which in appearence + string attributes which in appearance resemble XPath queries. There are two problems with this approach: first, the XPath-look-alike has to be defined at indexation time, no new undefined @@ -525,7 +544,7 @@ use (type 1) xpath attributes. You must enable the xpath enable directive in your - .abs config files. + .abs configuration files. Only a very restricted subset of the @@ -538,14 +557,14 @@ Finding all documents which have the term "content" inside a text node found in a specific XML DOM subtree, whose starting element is - adressed by XPath. + addressed by XPath. Z> find @attr 1=/root content Z> find @attr 1=/root/first content Notice that the XPath must be absolute, i.e., must start with '/', and that the - XPath decendant-or-self axis followed by a + XPath descendant-or-self axis followed by a text node selection text() is implicitly appended to the stated XPath. 
@@ -564,10 +583,10 @@ - Filter the adressing XPath by a predicate working on exact + Filter the addressing XPath by a predicate working on exact string values in attributes (in the XML sense) can be done: return all those docs which - have the term "english" contained in one of all text subnodes of + have the term "english" contained in one of all text sub nodes of the subtree defined by the XPath /record/title[@lang='en']. And similar predicate filtering. @@ -588,7 +607,8 @@ Escaping PQF keywords and other non-parseable XPath constructs - with '{ }' to prevent syntax errors: + with '{ }' to prevent client-side PQF parsing + syntax errors: Z> find @attr {1=/root/first[@attr='danish']} content Z> find @attr {1=/record/@set} oai @@ -596,7 +616,7 @@ It is worth mentioning that these dynamic performed XPath - queries are a performance bottelneck, as no optimized + queries are a performance bottleneck, as no optimized specialized indexes can be used. Therefore, avoid the use of this facility when speed is essential, and the database content size is medium to large. @@ -634,7 +654,7 @@ Use Attributes (type = 1) - The following Explain search atributes are supported: + The following Explain search attributes are supported: ExplainCategory (@attr 1=1), DatabaseName (@attr 1=3), DateAdded (@attr 1=9), @@ -657,7 +677,7 @@ Explain searches with yaz-client Classic Explain only defines retrieval of Explain information - via ASN.1. Pratically no Z39.50 clients supports this. Fortunately + via ASN.1. Practically no Z39.50 clients supports this. Fortunately they don't have to - Zebra allows retrieval of this information in other formats: SUTRS, XML, @@ -744,7 +764,7 @@ Most of the information contained in this section is an excerpt of the ATTRIBUTE SET BIB-1 (Z39.50-1995) SEMANTICS, - found at . The BIB-1 + found at . 
The BIB-1 Attribute Set Semantics from 1995, also in an updated Bib-1 Attribute Set @@ -759,20 +779,42 @@ A use attribute specifies an access point for any atomic query. - These acess points are highly dependent on the attribute set used + These access points are highly dependent on the attribute set used in the query, and are user configurable using the following default configuration files: tab/bib1.att, tab/dan1.att, tab/explain.att, and tab/gils.att. + + + For example, some few Bib-1 use + attributes from the tab/bib1.att are: + + att 1 Personal-name + att 2 Corporate-name + att 3 Conference-name + att 4 Title + ... + att 1009 Subject-name-personal + att 1010 Body-of-text + att 1011 Date/time-added-to-db + ... + att 1016 Any + att 1017 Server-choice + att 1018 Publisher + ... + att 1035 Anywhere + att 1036 Author-Title-Subject + + + New attribute sets can be added by adding new tab/*.att configuration files, which need to - be sourced in the main configuration zebra.cfg. + be sourced in the main configuration zebra.cfg. - - In addition, Zebra allows the acess of + In addition, Zebra allows the access of internal index names and dynamic XPath as use attributes; see and @@ -975,7 +1017,7 @@
Any position in field 3 - default + supported
@@ -983,9 +1025,9 @@ The position attribute values first in field (1), and first in subfield(2) are unsupported. - Using them does not trigger an error, but silent defaults to - any position in field (3). - + Using them silently maps to + any position in field (3). A proper diagnostic + should have been issued.
@@ -1004,7 +1046,7 @@ structure attribute (type 4) can be defined using the configuration file tab/default.idx. - The default configuration is summerized in this table. + The default configuration is summarized in this table.
find @attr 1=Body-of-text @attr 4=106 "bach salieri teleman" Z> find @attr 1=Body-of-text @or bach @or salieri teleman - This OR list of terms is very usefull in + This OR list of terms is very useful in combination with relevance ranking: Z> find @attr 1=Body-of-text @attr 2=102 @attr 4=105 "bach salieri teleman" @@ -1174,7 +1216,7 @@ The truncation attribute specifies whether variations of one or - more characters are allowed between serch term and hit terms, or + more characters are allowed between search term and hit terms, or not. Using non-default truncation attributes will broaden the document hit set of a search query. @@ -1257,7 +1299,7 @@ Process # in search term (101) is a poor-man's regular expression search. It maps each # to .*, and - performes then a Regexp-1 (102) regular + performs then a Regexp-1 (102) regular expression search. The following two queries are equivalent: Z> find @attr 1=Body-of-text @attr 5=101 schnit#ke @@ -1279,12 +1321,12 @@ The truncation attribute value - Regexp-2 (103) is a Zebra specific extention + Regexp-2 (103) is a Zebra specific extension which allows fuzzy matches. One single error in spelling of search terms is allowed, i.e., a document is hit if it includes a term which can be mapped to the used search term by one character substitution, addition, deletion or - change of posiiton. + change of position. Z> find @attr 1=Body-of-text @attr 5=100 schnittke ... @@ -1330,7 +1372,7 @@ - + @@ -1377,11 +1419,11 @@ The Zebra internal query engine has been extended to specific needs not covered by the bib-1 attribute set query - model. These extentions are non-standard - and non-portable: most functional extentions + model. These extensions are non-standard + and non-portable: most functional extensions are modeled over the bib-1 attribute set, defining type 7-9 attributes. - There are also the speciel + There are also the special string type index names for the idxpath attribute set. 
@@ -1421,9 +1463,9 @@ - Zebra specific Search Extentions to all Attribute Sets + Zebra specific Search Extensions to all Attribute Sets - Zebra extends the Bib1 attribute types, and these extentions are + Zebra extends the Bib1 attribute types, and these extensions are recognized regardless of attribute set used in a search operation query. @@ -1431,7 +1473,7 @@
Complete subfield 2 - depreciated + deprecated
Complete field
- + @@ -1475,7 +1517,7 @@
- Zebra Search Attribute Extentions + Zebra Search Attribute Extensions
Name
- Zebra Extention Embedded Sort Attribute (type 7) + Zebra Extension Embedded Sort Attribute (type 7) The embedded sort is a way to specify sort within a query - thus @@ -1516,9 +1558,21 @@ + + + + + - Zebra Extention Rank Weight Attribute (type 9) + Zebra Extension Rank Weight Attribute (type 9) Rank weight is a way to pass a value to a ranking algorithm - so @@ -1556,40 +1613,55 @@ - Zebra Extention Approximative Limit Attribute (type 9) + Zebra Extension Approximative Limit Attribute (type 11) - Newer Zebra versions normally estemiates hit count for every APT + Zebra computes - unless otherwise configured - + the exact hit count for every APT (leaf) in the query tree. These hit counts are returned as part of the searchResult-1 facility in the binary encoded Z39.50 search response packages. - By setting a limit for the APT we can make Zebra turn into - approximate hit count when a certain hit count limit is - reached. A value of zero means exact hit count. + By setting an estimation limit size of the resultset of the APT + leaves, Zebra stops processing the result set when the limit + length is reached. + Hit counts under this limit are still precise, but hit counts over it + are estimated using the statistics gathered from the chopped + result set. - For example, we might be intersted in exact hit count for a, but + Specifying a limit of 0 results in exact hit counts. + + + For example, we might be interested in exact hit count for a, but for b we allow hit count estimates for 1000 and higher. - Z> find @and a @attr 9=1000 b + Z> find @and a @attr 11=1000 b - The estimated hit count fascility makes searches faster, as one + The estimated hit count facility makes searches faster, as one only needs to process large hit lists partially. + It is mostly used in huge databases, where you want to trade + exactness of hit counts against speed of execution.
+ Do not use approximative hit count limits + in conjunction with relevance ranking, as re-sorting of the + result set obviously only works when the entire result set has + been processed. + + This facility clashes with rank weight, because there all documents in the hit lists need to be examined for scoring and re-sorting. It is an experimental - extention. Do not use in production code. + extension. Do not use in production code. - Zebra Extention Term Reference Attribute (type 10) + Zebra Extension Term Reference Attribute (type 10) Zebra supports the searchResult-1 facility. @@ -1613,16 +1685,16 @@ - Zebra specific Scan Extentions to all Attribute Sets + Zebra specific Scan Extensions to all Attribute Sets - Zebra extends the Bib1 attribute types, and these extentions are + Zebra extends the Bib1 attribute types, and these extensions are recognized regardless of attribute set used in a scan operation query. - +
- Zebra Scan Attribute Extentions + Zebra Scan Attribute Extensions
Name
- Zebra Extention Result Set Narrow (type 8) + Zebra Extension Result Set Narrow (type 8) If attribute Result Set Narrow (type 8) @@ -1661,7 +1733,7 @@ the case of scanning all title fields around the scanterm mozart, then refining the scan by issuing a filtering query for amadeus to - restric the scan to the result set of the query: + restrict the scan to the result set of the query: Z> scan @attr 1=4 mozart ... @@ -1689,11 +1761,11 @@ - Zebra Extention Approximative Limit (type 9) + Zebra Extension Approximative Limit (type 11) - The Zebra Extention Approximative Limit (type - 9) is a way to enable approx + The Zebra Extension Approximative Limit (type + 11) is a way to enable approximate hit counts for scan hit counts, in the same way as for search hit counts. @@ -1723,10 +1795,10 @@ xpath enable option in the GRS filter *.abs configuration files. If one wants to use the special idxpath numeric attribute set, the - main Zebra configuraiton file zebra.cfg + main Zebra configuration file zebra.cfg directive attset: idxpath.att must be enabled. - The idxpath is depreciated, may not be + The idxpath is deprecated, may not be supported in future Zebra versions, and should definitely not be used in production code. @@ -1759,31 +1831,31 @@ XPATH Begin 1 _XPATH_BEGIN - depreciated + deprecated XPATH End 2 _XPATH_END - depreciated + deprecated XPATH CData 1016 _XPATH_CDATA - depreciated + deprecated XPATH Attribute Name 3 _XPATH_ATTR_NAME - depreciated + deprecated XPATH Attribute CData 1015 _XPATH_ATTR_CDATA - depreciated + deprecated @@ -1835,7 +1907,7 @@
- Combining usual bib-1 attribut set searches + Combining usual bib-1 attribute set searches with idxpath attribute set searches: Z> find @and @attr idxpath 1=1 @attr 4=3 link/ @attr 1=4 mozart @@ -1843,7 +1915,7 @@ - Scanning is supportet on all idxpath + Scanning is supported on all idxpath indexes, both specified as numeric use attributes, or as string index names. @@ -1883,10 +1955,10 @@ - + - + @@ -1894,7 +1966,7 @@ - + @@ -1931,7 +2003,7 @@ Numeric use attributes are mapped to the Zebra internal - string index according to the attribute set defintion in use. + string index according to the attribute set definition in use. The default attribute set is Bib-1, and may be omitted in the PQF query. @@ -1973,7 +2045,7 @@ - String indexes can be acessed directly, + String indexes can be accessed directly, independently which attribute set is in use. These are just ignored. The above mentioned name normalization applies. String index names are defined in the @@ -1984,10 +2056,10 @@ - Zebra internal indexes can be acessed directly, + Zebra internal indexes can be accessed directly, according to the same rules as the user defined string indexes. The only difference is that - Zebra internal indexe names are hardwired, + Zebra internal index names are hardwired, all uppercase and must start with the character '_'. @@ -1995,7 +2067,7 @@ Finally, XPATH access points are only available using the GRS filter for indexing. - These acees point names must start with the character + These access point names must start with the character '/', they are not normalized, but passed unaltered to the Zebra internal XPATH engine. See . @@ -2013,8 +2085,8 @@ Internally Zebra has in it's default configuration several different types of registers or indexes, whose tokenization and character normalization rules differ. This reflects the fact that - serching fundamental different tokens like dates, numbers, - bitfields and string based text needs different rulesets. 
+ searching fundamental different tokens like dates, numbers, + bitfields and string based text needs different rule sets.
- Acces point name mapping + Access point name mapping
- Acess Point + Access Point Type Grammar Notes
- Use attibute + Use attribute numeric [1-9][1-9]* directly mapped to string index name
urx (@attr 4=104) - + @@ -2175,7 +2247,7 @@ If the Structure attribute is - URx the term is treated as a URX (URL) entity. + URX the term is treated as a URX (URL) entity. The search is performed on those fields that are indexed as type u in the *.abs file. @@ -2304,6 +2376,8 @@ The next plus character marks the end of the section. Currently Zebra only supports one specifier, the error tolerance, which consists one digit. +
ignored URX/URL ('u')Special index for URL web adressesSpecial index for URL web addresses
numeric (@attr 4=109)