Commit c64221f1 authored by serg@serg.mysql.com

Merge work:/home/bk/mysql-4.0

into serg.mysql.com:/usr/home/serg/Abk/mysql-4.0
parents 3f796edc 239f0714
+1 −0
@@ -463,3 +463,4 @@ mysql-test/r/rpl000001.eval
Docs/safe-mysql.xml
mysys/test_vsnprintf
Docs/manual.de.log
Docs/internals.info
+44 −1
@@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
* mysys functions::             Functions In The @code{mysys} Library
* DBUG::                        DBUG Tags To Use
* protocol::                    MySQL Client/Server Protocol
* Fulltext Search::             Fulltext Search in MySQL
@end menu


@@ -535,7 +536,7 @@ Print query.
@end table


-@node protocol,  , DBUG, Top
+@node protocol, Fulltext Search, DBUG, Top
@chapter MySQL Client/Server Protocol

@menu
@@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00

@c @printindex fn

@node Fulltext Search,  , protocol, Top
@chapter Fulltext Search in MySQL

Eventually this chapter should contain a complete description of the
fulltext search algorithms.
For now it is just a collection of unsorted notes.

@menu
* Weighting in boolean mode::  
@end menu

@node Weighting in boolean mode, , , Fulltext Search
@section Weighting in boolean mode

The basic idea is as follows: in the expression
@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
is enough to match the whole expression, whereas @code{C},
@code{D}, and @code{E} must @strong{all} match. So it is
reasonable to assign weight 1 to each of @code{A}, @code{B}, and
@code{(C and D and E)}, and to give @code{C}, @code{D}, and @code{E}
a weight of 1/3 each.

Things become more complicated when considering the boolean
operators used in MySQL FTB. Obviously, @code{+A +B}
should be treated as @code{A and B}, and @code{A B}
as @code{A or B}. The problem is that @code{+A B} can @strong{not}
be rewritten in and/or terms (that is the reason why this extended
set of operators was chosen). Still, approximations can be used.
@code{+A B C} can be approximated as @code{A or (A and (B or C))}
or as @code{A or (A and B) or (A and C) or (A and B and C)}.
Applying the above logic (and omitting mathematical
transformations and normalization), one gets that for
@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights
should be: @code{A_i = 1/N}, @code{B_j = 1} if @code{N==0}, and,
otherwise, @code{B_j = 1/3} in the first rewriting approach
and @code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))} in the second.

The second expression gives a somewhat steeper increase in total
weight as the number of matched B's increases, because it assigns
higher weights to individual B's. The first expression is also
much simpler, so it is the first one that is implemented in MySQL.

@summarycontents
@contents
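
The per-term weights described in the "Weighting in boolean mode" section can be checked numerically. The following is a minimal sketch in C, not code from the MySQL tree: the function names `ftb_required_weight`, `ftb_optional_weight_simple`, and `ftb_optional_weight_expanded` are hypothetical, and the behaviour for `M == 0` (returning 0) is an assumption, since the text does not cover that case.

```c
/* Weight of each required term A_i in "+A_1 ... +A_N B_1 ... B_M": 1/N.
   Returns 0 when there are no required terms (nothing to weigh). */
double ftb_required_weight(unsigned n_required)
{
  return n_required ? 1.0 / n_required : 0.0;
}

/* First rewriting approach: B_j = 1 if N == 0, otherwise 1/3. */
double ftb_optional_weight_simple(unsigned n_required)
{
  return n_required ? 1.0 / 3.0 : 1.0;
}

/* Second rewriting approach:
   B_j = (1 + (M-1)*2^M) / (M * (2^(M+1) - 1)) when N > 0.
   Assumption: for M == 0 there are no optional terms, so 0 is returned. */
double ftb_optional_weight_expanded(unsigned n_required, unsigned m_optional)
{
  double pow2m = 1.0;   /* becomes 2^M */
  unsigned i;

  if (n_required == 0)
    return 1.0;
  if (m_optional == 0)
    return 0.0;
  for (i = 0; i < m_optional; i++)
    pow2m *= 2.0;
  return (1.0 + (m_optional - 1.0) * pow2m) /
         (m_optional * (2.0 * pow2m - 1.0));
}
```

For `M = 1` the second formula reduces to 1/3, the same as the first approach. For `M = 2` it gives 5/14 and for `M = 3` it gives 17/45, both slightly above 1/3, which is the "higher weights to individual B's" effect mentioned in the text.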

+2 −0
@@ -48948,6 +48948,8 @@ Our TODO section contains what we plan to have in 4.0. @xref{TODO MySQL 4.0}.
@itemize @bullet
@item
Boolean fulltext search weighting scheme changed to something more reasonable.
@item
Fixed a bug in boolean fulltext search that caused MySQL to ignore queries of
@code{ft_min_word_len} characters.
@item
+2 −2
@@ -322,7 +322,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
      break;
    if (yn & FTB_FLAG_YES)
    {
-     ftbe->cur_weight+=weight;
+     ftbe->cur_weight += weight / ftbe->ythresh;
      if (++ftbe->yesses == ythresh)
      {
        yn=ftbe->flags;
@@ -360,7 +360,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
    }
    else
    {
-     ftbe->cur_weight+=weight;
+     ftbe->cur_weight +=  ftbe->ythresh ? weight/3 : weight;
      if (ftbe->yesses < ythresh)
        break;
      yn= (ftbe->yesses++ == ythresh) ? ftbe->flags : 0 ;
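
The effect of the two changed lines in `_ftb_climb_the_tree` can be illustrated in isolation. This is a simplified sketch under stated assumptions, not the actual MySQL code: `struct expr_sketch` is a hypothetical stand-in that keeps only the `cur_weight` and `ythresh` fields seen in the diff, the field types are guesses, and the helper names `add_required_match` and `add_optional_match` are invented for illustration. The tree climbing and the `yesses` bookkeeping are left out.

```c
/* Hypothetical stand-in for the expression node; only the two fields
   that the weighting change touches. Types are assumptions. */
struct expr_sketch {
  double cur_weight;   /* weight accumulated for the current document */
  unsigned ythresh;    /* number of required (+) sub-terms, i.e. N */
};

/* A required (+) sub-term matched: it contributes weight / N.
   A required match implies ythresh >= 1, so the division is safe. */
void add_required_match(struct expr_sketch *e, double weight)
{
  e->cur_weight += weight / e->ythresh;
}

/* An optional sub-term matched: full weight if the expression has no
   required sub-terms (N == 0), otherwise one third of it. */
void add_optional_match(struct expr_sketch *e, double weight)
{
  e->cur_weight += e->ythresh ? weight / 3 : weight;
}
```

With this scheme, a query such as `+A +B C` matched on all three words accumulates 1/2 + 1/2 + 1/3, while `A B` matched on both words accumulates 1 + 1. Before the change, every matched sub-term added its full weight regardless of the operators.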