Added charmap facility to delete leading articles

author Sebastian Hammer <quinn@indexdata.com>

Tue, 14 Sep 2004 14:38:07 +0000 (14:38 +0000)

committer Sebastian Hammer <quinn@indexdata.com>

Tue, 14 Sep 2004 14:38:07 +0000 (14:38 +0000)
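
The diff below threads a new `first` argument through `chr_map_input` and `zebra_maps_input`, so that `^`-anchored map rules apply only at the beginning of a field. The following is a minimal sketch of that idea, not the code from this commit: the names `articles`, `skip_leading_article` and `make_sort_key` are hypothetical, and the article list is hard-coded here, whereas Zebra reads such rules from `map (^The\s) @` style lines in a character map file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, hard-coded article list (an assumption for illustration).
   In Zebra the equivalent rules live in the .chr character map file. */
static const char *const articles[] = { "The ", "the ", "A ", "a " };

/* Return how many bytes to skip at s. A leading article is recognised
   only when first is non-zero, i.e. at the beginning of the field. */
static size_t skip_leading_article(const char *s, int first)
{
    size_t i;

    if (!first)
        return 0;
    for (i = 0; i < sizeof(articles) / sizeof(articles[0]); i++)
    {
        size_t n = strlen(articles[i]);
        if (strncmp(s, articles[i], n) == 0)
            return n;
    }
    return 0;
}

/* Build a complete-field key in out, dropping at most one leading
   article. The result is truncated to fit outsz bytes. */
static void make_sort_key(const char *field, char *out, size_t outsz)
{
    size_t skip = skip_leading_article(field, 1);
    size_t n = strlen(field + skip);

    if (outsz == 0)
        return;
    if (n >= outsz)
        n = outsz - 1;
    memcpy(out, field + skip, n);
    out[n] = '\0';
}
```

Under these assumptions, `make_sort_key("the science journal", buf, sizeof buf)` leaves `science journal` in `buf`, while a call with `first` set to 0 never strips anything. That mirrors how the word-index code paths in the diff pass 0 and only the complete-field paths pass 1.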
diff --git a/NEWS b/NEWS

index 3b78c7b..93615f1 100644 (file)
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,5 @@
+Added mechanism to ignore leading articles when doing full-field indexing,
+based on the character map files. See the manual for further discussion.
  
  Fixed bug in record management. Releasing blocks could result in
  partial read.
diff --git a/doc/recordmodel.xml b/doc/recordmodel.xml

index e052f12..5846b4b 100644 (file)
--- a/doc/recordmodel.xml
+++ b/doc/recordmodel.xml
@@ -1,5 +1,5 @@
   <chapter id="record-model">
-  <!-- $Id: recordmodel.xml,v 1.18 2004-08-04 08:26:43 adam Exp $ -->
+  <!-- $Id: recordmodel.xml,v 1.19 2004-09-14 14:38:07 quinn Exp $ -->
    <title>The Record Model</title>
    
    <para>
@@ -1786,174 +1786,216 @@
       special-purpose fields such as WWW-style linkages (URx).
      </para>
  
-    <para>
-     The field types, and hence character sets, are associated with data
-     elements by the .abs files (see above).
-     The file <literal>default.idx</literal>
-     provides the association between field type codes (as used in the .abs
-     files) and the character map files (with the .chr suffix). The format
-     of the .idx file is as follows
-    </para>
-
-    <para>
-     <variablelist>
-
-      <varlistentry>
-       <term>index <emphasis>field type code</emphasis></term>
-       <listitem>
-        <para>
-         This directive introduces a new search index code.
-         The argument is a one-character code to be used in the
-         .abs files to select this particular index type. An index, roughly,
-         corresponds to a particular structure attribute during search. Refer
-         to <xref linkend="search"/>.
-        </para>
-       </listitem></varlistentry>
-      <varlistentry>
-       <term>sort <emphasis>field code type</emphasis></term>
-       <listitem>
-        <para>
-         This directive introduces a 
-         sort index. The argument is a one-character code to be used in the
-         .abs fie to select this particular index type. The corresponding
-         use attribute must be used in the sort request to refer to this
-         particular sort index. The corresponding character map (see below)
-         is used in the sort process.
-        </para>
-       </listitem></varlistentry>
-      <varlistentry>
-       <term>completeness <emphasis>boolean</emphasis></term>
-       <listitem>
-        <para>
-         This directive enables or disables complete field indexing.
-         The value of the <emphasis>boolean</emphasis> should be 0
-         (disable) or 1. If completeness is enabled, the index entry will
-         contain the complete contents of the field (up to a limit), with words
-         (non-space characters) separated by single space characters
-         (normalized to " " on display). When completeness is
-         disabled, each word is indexed as a separate entry. Complete subfield
-         indexing is most useful for fields which are typically browsed (eg.
-         titles, authors, or subjects), or instances where a match on a
-         complete subfield is essential (eg. exact title searching). For fields
-         where completeness is disabled, the search engine will interpret a
-         search containing space characters as a word proximity search.
-        </para>
-       </listitem></varlistentry>
-      <varlistentry>
-       <term>charmap <emphasis>filename</emphasis></term>
-       <listitem>
-        <para>
-         This is the filename of the character
-         map to be used for this index for field type.
-        </para>
-       </listitem></varlistentry>
-     </variablelist>
-    </para>
-
-    <para>
-     The contents of the character map files are structured as follows:
-    </para>
-
-    <para>
-     <variablelist>
-
-      <varlistentry>
-       <term>lowercase <emphasis>value-set</emphasis></term>
-       <listitem>
-        <para>
-         This directive introduces the basic value set of the field type.
-         The format is an ordered list (without spaces) of the
-         characters which may occur in "words" of the given type.
-         The order of the entries in the list determines the
-         sort order of the index. In addition to single characters, the
-         following combinations are legal:
-        </para>
-
-        <para>
-
-         <itemizedlist>
-          <listitem>
-           <para>
-            Backslashes may be used to introduce three-digit octal, or
-            two-digit hex representations of single characters
-            (preceded by <literal>x</literal>).
-            In addition, the combinations
-            \\, \\r, \\n, \\t, \\s (space &mdash; remember that real
-            space-characters may not occur in the value definition), and
-            \\ are recognized, with their usual interpretation.
-           </para>
-          </listitem>
-
-          <listitem>
-           <para>
-            Curly braces {} may be used to enclose ranges of single
-            characters (possibly using the escape convention described in the
-            preceding point), eg. {a-z} to introduce the
-            standard range of ASCII characters.
-            Note that the interpretation of such a range depends on
-            the concrete representation in your local, physical character set.
-           </para>
-          </listitem>
-
-          <listitem>
-           <para>
-            paranthesises () may be used to enclose multi-byte characters -
-            eg. diacritics or special national combinations (eg. Spanish
-            "ll"). When found in the input stream (or a search term),
-            these characters are viewed and sorted as a single character, with a
-            sorting value depending on the position of the group in the value
-            statement.
-           </para>
-          </listitem>
+    <sect3 id="default-idx-file">
+     <title>The default.idx file</title>
+     <para>
+      The field types, and hence character sets, are associated with data
+      elements by the .abs files (see above).
+      The file <literal>default.idx</literal>
+      provides the association between field type codes (as used in the .abs
+      files) and the character map files (with the .chr suffix). The format
+      of the .idx file is as follows
+     </para>
  
  
-         </itemizedlist>
+     <para>
+      <variablelist>
+
+       <varlistentry>
+       <term>index <emphasis>field type code</emphasis></term>
+       <listitem>
+        <para>
+         This directive introduces a new search index code.
+         The argument is a one-character code to be used in the
+         .abs files to select this particular index type. An index, roughly,
+         corresponds to a particular structure attribute during search. Refer
+         to <xref linkend="search"/>.
+        </para>
+       </listitem></varlistentry>
+       <varlistentry>
+       <term>sort <emphasis>field code type</emphasis></term>
+       <listitem>
+        <para>
+         This directive introduces a 
+         sort index. The argument is a one-character code to be used in the
+         .abs fie to select this particular index type. The corresponding
+         use attribute must be used in the sort request to refer to this
+         particular sort index. The corresponding character map (see below)
+         is used in the sort process.
+        </para>
+       </listitem></varlistentry>
+       <varlistentry>
+       <term>completeness <emphasis>boolean</emphasis></term>
+       <listitem>
+        <para>
+         This directive enables or disables complete field indexing.
+         The value of the <emphasis>boolean</emphasis> should be 0
+         (disable) or 1. If completeness is enabled, the index entry will
+         contain the complete contents of the field (up to a limit), with words
+         (non-space characters) separated by single space characters
+         (normalized to " " on display). When completeness is
+         disabled, each word is indexed as a separate entry. Complete subfield
+         indexing is most useful for fields which are typically browsed (eg.
+         titles, authors, or subjects), or instances where a match on a
+         complete subfield is essential (eg. exact title searching). For fields
+         where completeness is disabled, the search engine will interpret a
+         search containing space characters as a word proximity search.
+        </para>
+       </listitem></varlistentry>
+       <varlistentry>
+       <term>charmap <emphasis>filename</emphasis></term>
+       <listitem>
+        <para>
+         This is the filename of the character
+         map to be used for this index for field type.
+        </para>
+       </listitem></varlistentry>
+      </variablelist>
+     </para>
+    </sect3>
  
  
-        </para>
-       </listitem></varlistentry>
-      <varlistentry>
-       <term>uppercase <emphasis>value-set</emphasis></term>
-       <listitem>
-        <para>
-         This directive introduces the
-         upper-case equivalencis to the value set (if any). The number and
-         order of the entries in the list should be the same as in the
-         <literal>lowercase</literal> directive.
-        </para>
-       </listitem></varlistentry>
-      <varlistentry>
-       <term>space <emphasis>value-set</emphasis></term>
-       <listitem>
-        <para>
-         This directive introduces the character
-         which separate words in the input stream. Depending on the
-         completeness mode of the field in question, these characters either
-         terminate an index entry, or delimit individual "words" in
-         the input stream. The order of the elements is not significant &mdash;
-         otherwise the representation is the same as for the
-         <literal>uppercase</literal> and <literal>lowercase</literal>
-         directives.
-        </para>
-       </listitem></varlistentry>
-      <varlistentry>
-       <term>map <emphasis>value-set</emphasis>
-        <emphasis>target</emphasis></term>
-       <listitem>
-        <para>
-         This directive introduces a
-         mapping between each of the members of the value-set on the left to
-         the character on the right. The character on the right must occur in
-         the value set (the <literal>lowercase</literal> directive) of
-         the character set, but
-         it may be a paranthesis-enclosed multi-octet character. This directive
-         may be used to map diacritics to their base characters, or to map
-         HTML-style character-representations to their natural form, etc.
-        </para>
-       </listitem></varlistentry>
-     </variablelist>
-    </para>
+    <sect3 id="character-map-files">
+     <title>The character map file format</title>
+     <para>
+      The contents of the character map files are structured as follows:
+     </para>
  
  
+     <para>
+      <variablelist>
+
+       <varlistentry>
+       <term>lowercase <emphasis>value-set</emphasis></term>
+       <listitem>
+        <para>
+         This directive introduces the basic value set of the field type.
+         The format is an ordered list (without spaces) of the
+         characters which may occur in "words" of the given type.
+         The order of the entries in the list determines the
+         sort order of the index. In addition to single characters, the
+         following combinations are legal:
+        </para>
+
+        <para>
+
+         <itemizedlist>
+          <listitem>
+           <para>
+            Backslashes may be used to introduce three-digit octal, or
+            two-digit hex representations of single characters
+            (preceded by <literal>x</literal>).
+            In addition, the combinations
+            \\, \\r, \\n, \\t, \\s (space &mdash; remember that real
+            space-characters may not occur in the value definition), and
+            \\ are recognized, with their usual interpretation.
+           </para>
+          </listitem>
+
+          <listitem>
+           <para>
+            Curly braces {} may be used to enclose ranges of single
+            characters (possibly using the escape convention described in the
+            preceding point), eg. {a-z} to introduce the
+            standard range of ASCII characters.
+            Note that the interpretation of such a range depends on
+            the concrete representation in your local, physical character set.
+           </para>
+          </listitem>
+
+          <listitem>
+           <para>
+            paranthesises () may be used to enclose multi-byte characters -
+            eg. diacritics or special national combinations (eg. Spanish
+            "ll"). When found in the input stream (or a search term),
+            these characters are viewed and sorted as a single character, with a
+            sorting value depending on the position of the group in the value
+            statement.
+           </para>
+          </listitem>
+
+         </itemizedlist>
+
+        </para>
+       </listitem></varlistentry>
+       <varlistentry>
+       <term>uppercase <emphasis>value-set</emphasis></term>
+       <listitem>
+        <para>
+         This directive introduces the
+         upper-case equivalencis to the value set (if any). The number and
+         order of the entries in the list should be the same as in the
+         <literal>lowercase</literal> directive.
+        </para>
+       </listitem></varlistentry>
+       <varlistentry>
+       <term>space <emphasis>value-set</emphasis></term>
+       <listitem>
+        <para>
+         This directive introduces the character
+         which separate words in the input stream. Depending on the
+         completeness mode of the field in question, these characters either
+         terminate an index entry, or delimit individual "words" in
+         the input stream. The order of the elements is not significant &mdash;
+         otherwise the representation is the same as for the
+         <literal>uppercase</literal> and <literal>lowercase</literal>
+         directives.
+        </para>
+       </listitem></varlistentry>
+       <varlistentry>
+       <term>map <emphasis>value-set</emphasis>
+        <emphasis>target</emphasis></term>
+       <listitem>
+        <para>
+         This directive introduces a
+         mapping between each of the members of the value-set on the left to
+         the character on the right. The character on the right must occur in
+         the value set (the <literal>lowercase</literal> directive) of
+         the character set, but
+         it may be a paranthesis-enclosed multi-octet character. This directive
+         may be used to map diacritics to their base characters, or to map
+         HTML-style character-representations to their natural form, etc. The map directive
+         can also be used to ignore leading articles in searching and/or sorting, and to perform
+         other special transformations. See section <xref linkend="leading-articles"/>.
+        </para>
+       </listitem></varlistentry>
+      </variablelist>
+     </para>
+    </sect3>
+    <sect3 id="leading-articles">
+     <title>Ignoring leading articles</title>
+     <para>
+      In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding,
+      you can also use the character map files to make Zebra ignore leading articles in sorting
+      records, or when doing complete field searching.
+     </para>
+     <para>
+      This is done using the <literal>map</literal> directive in the character map file. In a
+      nutshell, what you do is map certain sequences of characters, when they occur <emphasis>
+      in the beginning of a field</emphasis>, to a space. Assuming that the character "@" is
+      defined as a space character in your file, you can do:
+      <screen>
+       map (^The\s) @
+       map (^the\s) @
+      </screen>
+      The effect of these directives is to map either 'the' or 'The', followed by a space
+      character, to a space. The hat ^ character denotes beginning-of-field only when
+      complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just
+      as any other character.
+     </para>
+     <para>
+      Because the <literal>default.idx</literal> file can be used to associate different
+      character maps with different indexing types -- and you can create additional indexing
+      types, should the need arise -- it is possible to specify that leading articles should be
+      ignored either in sorting, in complete-field searching, or both.
+     </para>
+     <para>
+      If you ignore certain prefixes in sorting, then these will be eliminated from the index,
+      and sorting will take place as if they weren't there. However, if you set the system up
+      to ignore certain prefixes in <emphasis>searching</emphasis>, then these are deleted both
+      from the indexes and from query terms, when the client specifies complete-field
+      searching. This has the effect that a search for 'the science journal' and 'science
+      journal' would both produce the same results.
+     </para>
+    </sect3>
     </sect2>
-
    </sect1>
  
    <sect1 id="formats">
diff --git a/include/charmap.h b/include/charmap.h

index 365b6ab..ed9e1af 100644 (file)
--- a/include/charmap.h
+++ b/include/charmap.h
@@ -1,4 +1,4 @@
-/* $Id: charmap.h,v 1.9 2004-07-28 09:47:41 adam Exp $
+/* $Id: charmap.h,v 1.10 2004-09-14 14:38:07 quinn Exp $
     Copyright (C) 1995,1996,1997,1998,1999,2000,2001,2002,2003,2004
     Index Data Aps
  
@@ -43,9 +43,9 @@ YAZ_EXPORT chrmaptab chrmaptab_create(const char *tabpath, const char *name,
                                       int map_only, const char *tabroot);
  YAZ_EXPORT void chrmaptab_destroy (chrmaptab tab);
  
-YAZ_EXPORT const char **chr_map_input(chrmaptab t, const char **from, int len);
+YAZ_EXPORT const char **chr_map_input(chrmaptab t, const char **from, int len, int first);
  YAZ_EXPORT const char **chr_map_input_x(chrmaptab t,
-                                       const char **from, int *len);
+                                       const char **from, int *len, int first);
  YAZ_EXPORT const char **chr_map_input_q(chrmaptab maptab,
                                         const char **from, int len,
                                         const char **qmap);
diff --git a/include/zebramap.h b/include/zebramap.h

index 9050d36..ba7f1e4 100644 (file)
--- a/include/zebramap.h
+++ b/include/zebramap.h
@@ -1,4 +1,4 @@
-/* $Id: zebramap.h,v 1.16 2004-08-25 09:23:36 adam Exp $
+/* $Id: zebramap.h,v 1.17 2004-09-14 14:38:07 quinn Exp $
     Copyright (C) 1995,1996,1997,1998,1999,2000,2001,2002,2003,2004
     Index Data Aps
  
@@ -34,7 +34,7 @@ ZebraMaps zebra_maps_open (Res res, const char *base);
  void zebra_maps_close (ZebraMaps zm);
  
  const char **zebra_maps_input (ZebraMaps zms, unsigned reg_id,
-                              const char **from, int len);
+                              const char **from, int len, int first);
  const char *zebra_maps_output(ZebraMaps, unsigned reg_id, const char **from);
  
  int zebra_maps_attr (ZebraMaps zms, Z_AttributesPlusTerm *zapt,
diff --git a/index/extract.c b/index/extract.c

index 98cc717..18cac3d 100644 (file)
--- a/index/extract.c
+++ b/index/extract.c
@@ -1,4 +1,4 @@
-/* $Id: extract.c,v 1.160 2004-08-10 08:19:15 heikki Exp $
+/* $Id: extract.c,v 1.161 2004-09-14 14:38:07 quinn Exp $
     Copyright (C) 1995,1996,1997,1998,1999,2000,2001,2002,2003,2004
     Index Data Aps
  
@@ -1738,7 +1738,7 @@ static void extract_add_incomplete_field (RecWord *p)
      const char **map = 0;
  
      if (remain > 0)
-       map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain);
+       map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain, 0);
  
      while (map)
      {
@@ -1750,7 +1750,7 @@ static void extract_add_incomplete_field (RecWord *p)
         {
             remain = p->length - (b - p->string);
             if (remain > 0)
-               map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain);
+               map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain, 0);
             else
                 map = 0;
         }
@@ -1765,7 +1765,7 @@ static void extract_add_incomplete_field (RecWord *p)
                 buf[i++] = *(cp++);
             remain = p->length - (b - p->string);
             if (remain > 0)
-               map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain);
+               map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain, 0);
             else
                 map = 0;
         }
@@ -1782,9 +1782,12 @@ static void extract_add_complete_field (RecWord *p)
      char buf[IT_MAX_WORD+1];
      const char **map = 0;
      int i = 0, remain = p->length;
+    int first; /* first position */
+
+yaz_log(LOG_DEBUG, "Complete field, w='%s'", p->string);
  
      if (remain > 0)
-       map = zebra_maps_input (p->zebra_maps, p->reg_type, &b, remain);
+       map = zebra_maps_input (p->zebra_maps, p->reg_type, &b, remain, 1);
  
      while (remain > 0 && i < IT_MAX_WORD)
      {
@@ -1793,7 +1796,10 @@ static void extract_add_complete_field (RecWord *p)
             remain = p->length - (b - p->string);
  
             if (remain > 0)
-               map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain);
+           {
+               first = i ? 0 : 1;
+               map = zebra_maps_input(p->zebra_maps, p->reg_type, &b, remain, first);
+           }
             else
                 map = 0;
         }
@@ -1814,13 +1820,16 @@ static void extract_add_complete_field (RecWord *p)
             {
                 if (i >= IT_MAX_WORD)
                     break;
+yaz_log(LOG_DEBUG, "Adding string to index '%d'", *map);
                 while (i < IT_MAX_WORD && *cp)
                     buf[i++] = *(cp++);
             }
             remain = p->length  - (b - p->string);
             if (remain > 0)
+           {
                 map = zebra_maps_input (p->zebra_maps, p->reg_type, &b,
-                                       remain);
+                                       remain, 0);
+           }
             else
                 map = 0;
         }
diff --git a/index/zrpn.c b/index/zrpn.c

index 358557e..dea6c98 100644 (file)
--- a/index/zrpn.c
+++ b/index/zrpn.c
@@ -1,4 +1,4 @@
-/* $Id: zrpn.c,v 1.151 2004-09-13 09:02:16 adam Exp $
+/* $Id: zrpn.c,v 1.152 2004-09-14 14:38:07 quinn Exp $
     Copyright (C) 1995,1996,1997,1998,1999,2000,2001,2002,2003,2004
     Index Data Aps
  
@@ -64,7 +64,7 @@ typedef struct {
  static const char **rpn_char_map_handler (void *vp, const char **from, int len)
  {
      struct rpn_char_map_info *p = (struct rpn_char_map_info *) vp;
-    const char **out = zebra_maps_input (p->zm, p->reg_type, from, len);
+    const char **out = zebra_maps_input (p->zm, p->reg_type, from, len, 0);
  #if 0
      if (out && *out)
      {
@@ -261,7 +261,7 @@ static int grep_handle (char *name, const char *info, void *p)
  }
  
  static int term_pre (ZebraMaps zebra_maps, int reg_type, const char **src,
-                     const char *ct1, const char *ct2)
+                     const char *ct1, const char *ct2, int first)
  {
      const char *s1, *s0 = *src;
      const char **map;
@@ -274,7 +274,7 @@ static int term_pre (ZebraMaps zebra_maps, int reg_type, const char **src,
          if (ct2 && strchr (ct2, *s0))
              break;
          s1 = s0;
-        map = zebra_maps_input (zebra_maps, reg_type, &s1, strlen(s1));
+        map = zebra_maps_input (zebra_maps, reg_type, &s1, strlen(s1), first);
          if (**map != *CHR_SPACE)
              break;
          s0 = s1;
@@ -298,13 +298,13 @@ static int term_100 (ZebraMaps zebra_maps, int reg_type,
      const char *space_start = 0;
      const char *space_end = 0;
  
-    if (!term_pre (zebra_maps, reg_type, src, NULL, NULL))
+    if (!term_pre (zebra_maps, reg_type, src, NULL, NULL, !space_split))
          return 0;
      s0 = *src;
      while (*s0)
      {
          s1 = s0;
-        map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0));
+        map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0), 0);
          if (space_split)
          {
              if (**map == *CHR_SPACE)
@@ -356,7 +356,7 @@ static int term_101 (ZebraMaps zebra_maps, int reg_type,
      int i = 0;
      int j = 0;
  
-    if (!term_pre (zebra_maps, reg_type, src, "#", "#"))
+    if (!term_pre (zebra_maps, reg_type, src, "#", "#", !space_split))
          return 0;
      s0 = *src;
      while (*s0)
@@ -370,7 +370,7 @@ static int term_101 (ZebraMaps zebra_maps, int reg_type,
          else
          {
              s1 = s0;
-            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0));
+            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0), 0);
              if (space_split && **map == *CHR_SPACE)
                  break;
              while (s1 < s0)
@@ -398,7 +398,7 @@ static int term_103 (ZebraMaps zebra_maps, int reg_type, const char **src,
      const char *s0, *s1;
      const char **map;
  
-    if (!term_pre (zebra_maps, reg_type, src, "^\\()[].*+?|", "("))
+    if (!term_pre (zebra_maps, reg_type, src, "^\\()[].*+?|", "(", !space_split))
          return 0;
      s0 = *src;
      if (errors && *s0 == '+' && s0[1] && s0[2] == '+' && s0[3] &&
@@ -419,7 +419,7 @@ static int term_103 (ZebraMaps zebra_maps, int reg_type, const char **src,
          else
          {
              s1 = s0;
-            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0));
+            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0), 0);
              if (**map == *CHR_SPACE)
                  break;
              while (s1 < s0)
@@ -456,7 +456,7 @@ static int term_104 (ZebraMaps zebra_maps, int reg_type,
      int i = 0;
      int j = 0;
  
-    if (!term_pre (zebra_maps, reg_type, src, "?*#", "?*#"))
+    if (!term_pre (zebra_maps, reg_type, src, "?*#", "?*#", !space_split))
          return 0;
      s0 = *src;
      while (*s0)
@@ -499,7 +499,7 @@ static int term_104 (ZebraMaps zebra_maps, int reg_type,
          }
          {
              s1 = s0;
-            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0));
+            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0), 0);
              if (space_split && **map == *CHR_SPACE)
                  break;
              while (s1 < s0)
@@ -527,7 +527,7 @@ static int term_105 (ZebraMaps zebra_maps, int reg_type,
      int i = 0;
      int j = 0;
  
-    if (!term_pre (zebra_maps, reg_type, src, "*!", "*!"))
+    if (!term_pre (zebra_maps, reg_type, src, "*!", "*!", !space_split))
          return 0;
      s0 = *src;
      while (*s0)
@@ -545,7 +545,7 @@ static int term_105 (ZebraMaps zebra_maps, int reg_type,
          }
          {
              s1 = s0;
-            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0));
+            map = zebra_maps_input (zebra_maps, reg_type, &s0, strlen(s0), 0);
              if (space_split && **map == *CHR_SPACE)
                  break;
              while (s1 < s0)
@@ -1245,7 +1245,7 @@ static int trans_scan_term (ZebraHandle zh, Z_AttributesPlusTerm *zapt,
              
          while ((len = (cp_end - cp)) > 0)
          {
-            map = zebra_maps_input (zh->reg->zebra_maps, reg_type, &cp, len);
+            map = zebra_maps_input (zh->reg->zebra_maps, reg_type, &cp, len, 0);
              if (**map == *CHR_SPACE)
                  space_map = *map;
              else
diff --git a/tab/default.idx b/tab/default.idx

index 9e2cb81..c0e89ac 100644 (file)
--- a/tab/default.idx
+++ b/tab/default.idx
@@ -1,5 +1,5 @@
  # Zebra indexes as referred to from the *.abs-files.
-#  $Id: default.idx,v 1.10 2004-07-28 09:40:46 adam Exp $
+#  $Id: default.idx,v 1.11 2004-09-14 14:38:08 quinn Exp $
  #
  
  # Traditional word index
@@ -51,5 +51,5 @@ charmap @
  # Sort register
  sort s
  completeness 1
-charmap string.chr
+charmap sort.chr
  
  
diff --git a/tab/scan.chr b/tab/scan.chr

index 599dd7c..208a656 100644 (file)
--- a/tab/scan.chr
+++ b/tab/scan.chr
@@ -1,6 +1,6 @@
  # Danish/Swedish character map.
  #
-# $Id: scan.chr,v 1.1 1999-09-07 07:19:21 adam Exp $
+# $Id: scan.chr,v 1.2 2004-09-14 14:38:08 quinn Exp $
  
  # Define the basic value-set. *Beware* of changing this without re-indexing
  # your databases.
@@ -32,6 +32,11 @@ map (&Oslash;)     
  map (&Aring;)      Å
  map (&Ouml;)       Ö
  
+map (^the )    #
+map (^The )    #
+map (^a )       #
+map (^A )      #
+
  map éÉ         e
  map á          a
  map ó          o
diff --git a/util/charmap.c b/util/charmap.c

index 96a390a..b36282f 100644 (file)
--- a/util/charmap.c
+++ b/util/charmap.c
@@ -1,4 +1,4 @@
-/* $Id: charmap.c,v 1.29 2004-07-28 09:47:42 adam Exp $
+/* $Id: charmap.c,v 1.30 2004-09-14 14:38:08 quinn Exp $
     Copyright (C) 1995,1996,1997,1998,1999,2000,2001,2002,2003,2004
     Index Data Aps
  
@@ -40,6 +40,8 @@ typedef unsigned ucs4_t;
  #define CHR_MAXSTR 1024
  #define CHR_MAXEQUIV 32
  
+const unsigned char CHR_FIELD_BEGIN = '^';
+
  const char *CHR_UNKNOWN = "\001";
  const char *CHR_SPACE   = "\002";
  const char *CHR_CUT     = "\003";
@@ -142,7 +144,7 @@ static chr_t_entry *find_entry(chr_t_entry *t, const char **from, int len)
      return t->target ? t : 0;
  }
  
-static chr_t_entry *find_entry_x(chr_t_entry *t, const char **from, int *len)
+static chr_t_entry *find_entry_x(chr_t_entry *t, const char **from, int *len, int first)
  {
      chr_t_entry *res;
  
@@ -153,35 +155,49 @@ static chr_t_entry *find_entry_x(chr_t_entry *t, const char **from, int *len)
         from++;
         len++;
      }
-    if (*len > 0 && t->children && t->children[(unsigned char) **from])
+    if (*len > 0 && t->children)
      {
         const char *old_from = *from;
         int old_len = *len;
+
+       res = 0;
+
+       if (first && t->children[CHR_FIELD_BEGIN])
+       {
+           if ((res = find_entry_x(t->children[CHR_FIELD_BEGIN], from, len, 0)) && res != t->children[CHR_FIELD_BEGIN])
+               return res;
+            else
+               res = 0;
+           /* otherwhise there was no match on beginning of field, move on */
+       } 
         
         
-       (*len)--;
-       (*from)++;
-       if ((res = find_entry_x(t->children[(unsigned char) *old_from],
-                               from, len)))
-           return res;
-       /* no match */
-       *len = old_len;
-       *from = old_from;
+       if (!res && t->children[(unsigned char) **from])
+       {
+           (*len)--;
+           (*from)++;
+           if ((res = find_entry_x(t->children[(unsigned char) *old_from],
+                                   from, len, 0)))
+               return res;
+           /* no match */
+           *len = old_len;
+           *from = old_from;
+       }
      }
      /* no children match. use ourselves, if we have a target */
      return t->target ? t : 0;
  }
  
-const char **chr_map_input_x(chrmaptab maptab, const char **from, int *len)
+const char **chr_map_input_x(chrmaptab maptab, const char **from, int *len, int first)
  {
      chr_t_entry *t = maptab->input;
      chr_t_entry *res;
  
-    if (!(res = find_entry_x(t, from, len)))
+    if (!(res = find_entry_x(t, from, len, first)))
         abort();
      return (const char **) (res->target);
  }
  
-const char **chr_map_input(chrmaptab maptab, const char **from, int len)
+const char **chr_map_input(chrmaptab maptab, const char **from, int len, int first)
  {
      chr_t_entry *t = maptab->input;
      chr_t_entry *res;
@@ -189,7 +205,7 @@ const char **chr_map_input(chrmaptab maptab, const char **from, int len)
  
      len_tmp[0] = len;
      len_tmp[1] = -1;
-    if (!(res = find_entry_x(t, from, len_tmp)))
+    if (!(res = find_entry_x(t, from, len_tmp, first)))
         abort();
      return (const char **) (res->target);
  }
@@ -259,7 +275,7 @@ ucs4_t zebra_prim_w(ucs4_t **s)
      ucs4_t i = 0;
      char fmtstr[8];
  
-    yaz_log (LOG_DEBUG, "prim %.3s", (char *) *s);
+    yaz_log (LOG_DEBUG, "prim_w %.3s", (char *) *s);
      if (**s == '\\')
      {
         (*s)++;
@@ -374,7 +390,7 @@ static void fun_mkstring(const char *s, void *data, int num)
      chrwork *arg = (chrwork *) data;
      const char **res, *p = s;
  
-    res = chr_map_input(arg->map, &s, strlen(s));
+    res = chr_map_input(arg->map, &s, strlen(s), 0);
      if (*res == (char*) CHR_UNKNOWN)
         logf(LOG_WARN, "Map: '%s' has no mapping", p);
      strncat(arg->string, *res, CHR_MAXSTR - strlen(arg->string));
@@ -443,6 +459,7 @@ static int scan_string(char *s_native,
      char str[1024];
  
      ucs4_t arg[512];
+    ucs4_t arg_prim[512];
      ucs4_t *s0, *s = arg;
      ucs4_t c, begin, end;
      size_t i;
@@ -498,11 +515,11 @@ static int scan_string(char *s_native,
         case '[': s++; abort(); break;
         case '(':
              ++s;
-            s0 = s;
-            while (*s != ')' || s[-1] == '\\')
-                s++;
-           *s = 0;
-            if (scan_to_utf8 (t_utf8, s0, s - s0, str, sizeof(str)-1))
+           s0 = s; i = 0;
+           while (*s != ')' || s[-1] == '\\')
+               arg_prim[i++] = zebra_prim_w(&s);
+           arg_prim[i] = 0;
+            if (scan_to_utf8 (t_utf8, arg_prim, zebra_ucs4_strlen(arg_prim), str, sizeof(str)-1))
                  return -1;
             (*fun)(str, data, num ? (*num)++ : 0);
             s++;
diff --git a/util/zebramap.c b/util/zebramap.c

index 0d1cf07..4116e52 100644 (file)
--- a/util/zebramap.c
+++ b/util/zebramap.c
@@ -1,4 +1,4 @@
-/* $Id: zebramap.c,v 1.32 2004-06-16 20:30:47 adam Exp $
+/* $Id: zebramap.c,v 1.33 2004-09-14 14:38:08 quinn Exp $
     Copyright (C) 1995,1996,1997,1998,1999,2000,2001,2002,2003,2004
     Index Data Aps
  
@@ -291,13 +291,13 @@ chrmaptab zebra_charmap_get (ZebraMaps zms, unsigned reg_id)
  }
  
  const char **zebra_maps_input (ZebraMaps zms, unsigned reg_id,
-                              const char **from, int len)
+                              const char **from, int len, int first)
  {
      chrmaptab maptab;
  
      maptab = zebra_charmap_get (zms, reg_id);
      if (maptab)
-       return chr_map_input(maptab, from, len);
+       return chr_map_input(maptab, from, len, first);
      
      zms->temp_map_str[0] = **from;
  
      
      zms->temp_map_str[0] = **from;
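
Note on the `first` flag: the change to `find_entry_x` in util/charmap.c tries the `^` (CHR_FIELD_BEGIN) branch of the lookup tree only when the caller passes a nonzero `first`, and falls back to ordinary per-character matching otherwise. The following is a minimal standalone sketch of that idea, not the Zebra code itself: the names `node`, `node_new`, `trie_add`, `lookup` and `normalize` are hypothetical, input is NUL-terminated instead of length-counted, and mapping a leading article to the empty string is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FIELD_BEGIN '^'            /* plays the role of CHR_FIELD_BEGIN */

/* Hypothetical, simplified stand-in for chr_t_entry. */
typedef struct node {
    struct node *children[256];
    const char *target;            /* replacement text, NULL if none */
} node;

static node *node_new(void)
{
    node *n = calloc(1, sizeof(*n));
    assert(n != NULL);
    return n;
}

/* Add a non-empty key; a key starting with '^' is anchored to field start. */
static void trie_add(node *t, const char *key, const char *target)
{
    for (; *key; key++) {
        unsigned char c = (unsigned char) *key;
        if (!t->children[c])
            t->children[c] = node_new();
        t = t->children[c];
    }
    t->target = target;
}

/* Longest match starting at *from. On success *from is advanced past the
   matched text; on failure it is left unchanged and NULL is returned. */
static node *lookup(node *t, const char **from, int first)
{
    const char *old_from = *from;
    node *res;

    if (first && t->children[FIELD_BEGIN]) {
        /* Try the anchored entries first; a bare '^' does not count. */
        res = lookup(t->children[FIELD_BEGIN], from, 0);
        if (res && res != t->children[FIELD_BEGIN])
            return res;
        *from = old_from;          /* no anchored match, move on */
    }
    if (**from && t->children[(unsigned char) **from]) {
        (*from)++;
        res = lookup(t->children[(unsigned char) *old_from], from, 0);
        if (res)
            return res;
        *from = old_from;          /* no match below this child */
    }
    return t->target ? t : NULL;
}

/* Copy `in` to `out`, applying the map; only the first lookup is anchored. */
static void normalize(node *root, const char *in, char *out, size_t outsz)
{
    const char *p = in;
    size_t n = 0;

    out[0] = '\0';
    while (*p) {
        node *m = lookup(root, &p, p == in);
        if (m) {
            size_t l = strlen(m->target);
            if (n + l < outsz) {
                memcpy(out + n, m->target, l);
                n += l;
            }
        } else {
            if (n + 1 < outsz)
                out[n++] = *p;
            p++;
        }
        out[n] = '\0';
    }
}
```

With entries such as `"^The "` and `"^the "` added through `trie_add`, `normalize` drops the article from a field that begins with it and leaves the same word untouched anywhere else in the field, which is why the callers in index/zrpn.c that scan mid-term pass `0` for `first`.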