.bzrignore +1 −0

@@ -463,3 +463,4 @@ mysql-test/r/rpl000001.eval
 Docs/safe-mysql.xml
 mysys/test_vsnprintf
 Docs/manual.de.log
+Docs/internals.info
Docs/internals.texi +44 −1

@@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
 * mysys functions::  Functions In The @code{mysys} Library
 * DBUG::  DBUG Tags To Use
 * protocol::  MySQL Client/Server Protocol
+* Fulltext Search::  Fulltext Search in MySQL
 @end menu

@@ -535,7 +536,7 @@ Print query.
 @end table

-@node protocol, , DBUG, Top
+@node protocol, Fulltext Search, DBUG, Top
 @chapter MySQL Client/Server Protocol

 @menu
@@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00

 @c @printindex fn

+@node Fulltext Search, , protocol, Top
+@chapter Fulltext Search in MySQL
+
+Eventually this chapter should contain a complete description of the
+fulltext search algorithms. For now, it is just a set of unsorted notes.
+
+@menu
+* Weighting in boolean mode::
+@end menu
+
+@node Weighting in boolean mode, , , Fulltext Search
+@section Weighting in boolean mode
+
+The basic idea is as follows: in the expression
+@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
+is enough to match the whole expression, whereas @code{C},
+@code{D}, and @code{E} must @strong{all} match. So it is
+reasonable to assign a weight of 1 to each of @code{A}, @code{B}, and
+@code{(C and D and E)}, and a weight of 1/3 to each of @code{C},
+@code{D}, and @code{E}.
+
+Things become more complicated with the boolean operators used in
+MySQL FTB. Obviously, @code{+A +B} should be treated as
+@code{A and B}, and @code{A B} as @code{A or B}. The problem is that
+@code{+A B} can @strong{not} be rewritten in and/or terms (which is
+why this extended set of operators was chosen). Still,
+approximations can be used: @code{+A B C} can be approximated as
+@code{A or (A and (B or C))} or as
+@code{A or (A and B) or (A and C) or (A and B and C)}.
+
+Applying the above logic (and omitting the mathematical
+transformations and normalization), one gets that for
+@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights should be
+@code{A_i = 1/N}, and @code{B_j = 1} if @code{N==0}. Otherwise,
+@code{B_j = 1/3} in the first rewriting approach, and
+@code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))} in the second.
+
+The second expression gives a somewhat steeper increase in total
+weight as the number of matched B's increases, because it assigns
+higher weights to individual B's. The first expression is also
+much simpler, so it is the first one that is implemented in MySQL.
+
 @summarycontents
 @contents
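The two weight formulas in the new section can be compared numerically. The following is a minimal sketch in C, not code from the MySQL tree: the function names `required_weight`, `optional_weight_first`, and `optional_weight_second` are hypothetical and exist only for this illustration. `n` and `m` stand for `N` (the number of `+` words) and `M` (the number of words without an operator) in the text.

```c
#include <assert.h>

/* Hypothetical helper: weight of each required (+) word, A_i = 1/N. */
double required_weight(unsigned n)
{
  assert(n > 0);                      /* A_i is only defined when N > 0 */
  return 1.0 / (double) n;
}

/* First rewriting approach: B_j = 1 if N == 0, otherwise B_j = 1/3. */
double optional_weight_first(unsigned n)
{
  return n == 0 ? 1.0 : 1.0 / 3.0;
}

/*
  Second rewriting approach, for N > 0:
  B_j = (1 + (M-1)*2^M) / (M * (2^(M+1) - 1)).
*/
double optional_weight_second(unsigned m)
{
  double pow2m;
  assert(m >= 1 && m < 63);           /* keep the shift below in range */
  pow2m = (double) (1ULL << m);       /* 2^M */
  return (1.0 + (double) (m - 1) * pow2m) /
         ((double) m * (2.0 * pow2m - 1.0));
}
```

For `M = 1` the second formula gives 1/3, the same as the first approach. For `M = 2` it gives 5/14 and for `M = 3` it gives 17/45, approaching 1/2 as `M` grows, which is consistent with the remark that the second approach assigns higher weights to individual B's.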
Docs/manual.texi +2 −0

@@ -48933,6 +48933,8 @@ Our TODO section contains what we plan to have in 4.0. @xref{TODO MySQL 4.0}.

 @itemize @bullet
+@item
+Boolean fulltext search weighting scheme changed to something more reasonable.
 @item
 Fixed bug in boolean fulltext search, that caused MySQL to ignore
 queries of @code{ft_min_word_len} characters.
 @item
myisam/ft_boolean_search.c +2 −2

@@ -322,7 +322,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
       break;
     if (yn & FTB_FLAG_YES)
     {
-      ftbe->cur_weight+=weight;
+      ftbe->cur_weight += weight / ftbe->ythresh;
       if (++ftbe->yesses == ythresh)
       {
         yn=ftbe->flags;
@@ -360,7 +360,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
     }
     else
     {
-      ftbe->cur_weight+=weight;
+      ftbe->cur_weight += ftbe->ythresh ? weight/3 : weight;
       if (ftbe->yesses < ythresh)
         break;
       yn= (ftbe->yesses++ == ythresh) ? ftbe->flags : 0 ;
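To see how the two changed lines relate to the weights described in `Docs/internals.texi`, here is a simplified model in C. It is a sketch under stated assumptions, not the real `_ftb_climb_the_tree`: `struct expr_model` and the three functions are hypothetical, only the `ythresh` and `cur_weight` fields are modelled, every word is assumed to carry the same base `weight`, and the `else` branch is assumed to be the one taken for words without an operator.

```c
#include <assert.h>

/* Hypothetical stand-in for an expression node: just the two fields
   that the patched lines read and update. */
struct expr_model
{
  unsigned ythresh;     /* number of required (+) words, N in the docs */
  double   cur_weight;  /* weight accumulated so far for one row */
};

/* Mirrors the first patched line: a matched + word adds weight / N. */
void add_required_match(struct expr_model *e, double weight)
{
  assert(e->ythresh > 0);             /* a + word implies N >= 1 */
  e->cur_weight += weight / e->ythresh;
}

/* Mirrors the second patched line: a matched plain word adds weight / 3
   when the expression has any + words, and the full weight otherwise. */
void add_optional_match(struct expr_model *e, double weight)
{
  e->cur_weight += e->ythresh ? weight / 3 : weight;
}

/* Hypothetical convenience: total for a row matching all n_required
   + words and optional_matched plain words. */
double model_score(unsigned n_required, unsigned optional_matched,
                   double weight)
{
  struct expr_model e = { n_required, 0.0 };
  unsigned i;
  for (i = 0; i < n_required; i++)
    add_required_match(&e, weight);
  for (i = 0; i < optional_matched; i++)
    add_optional_match(&e, weight);
  return e.cur_weight;
}
```

In this model the `+` words together contribute one full `weight` regardless of `N`, and each plain word adds a third of that when `N > 0`, which matches `A_i = 1/N` and `B_j = 1/3` from the first rewriting approach.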