boolean fulltext search weighting scheme changed (239f0714)

.bzrignore

+1 −0

Original line number	Diff line number	Diff line
		@@ -463,3 +463,4 @@ mysql-test/r/rpl000001.eval
		Docs/safe-mysql.xml
		mysys/test_vsnprintf
		Docs/manual.de.log
		Docs/internals.info

Docs/internals.texi

+44 −1

		@@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
		* mysys functions:: Functions In The @code{mysys} Library
		* DBUG:: DBUG Tags To Use
		* protocol:: MySQL Client/Server Protocol
		* Fulltext Search:: Fulltext Search in MySQL
		@end menu


		@@ -535,7 +536,7 @@ Print query.
		@end table


		-@node protocol, , DBUG, Top
		+@node protocol, Fulltext Search, DBUG, Top
		@chapter MySQL Client/Server Protocol

		@menu
		@@ -785,6 +786,48 @@ Date 03 0A 00 00 \|01 0A \|03 00 00 00

		@c @printindex fn

		@node Fulltext Search, , protocol, Top
		@chapter Fulltext Search in MySQL

		Hopefully, there will someday be a complete description of
		the fulltext search algorithms.
		For now, these are just unsorted notes.

		@menu
		* Weighting in boolean mode::
		@end menu

		@node Weighting in boolean mode, , , Fulltext Search
		@section Weighting in boolean mode

		The basic idea is as follows: in the expression
		@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
		is enough to match the whole expression, while @code{C},
		@code{D}, and @code{E} must @strong{all} match. So it is
		reasonable to assign weight 1 to each of @code{A}, @code{B}, and
		@code{(C and D and E)}, and a weight of 1/3 to each of
		@code{C}, @code{D}, and @code{E}.

		Things become more complicated when considering the boolean
		operators used in MySQL FTB. Obviously, @code{+A +B}
		should be treated as @code{A and B}, and @code{A B}
		as @code{A or B}. The problem is that @code{+A B} can @strong{not}
		be rewritten in and/or terms (which is why this extended
		set of operators was chosen). Still, approximations can be used.
		@code{+A B C} can be approximated as @code{A or (A and (B or C))}
		or as @code{A or (A and B) or (A and C) or (A and B and C)}.
		Applying the above logic (and omitting mathematical
		transformations and normalization) one gets that for
		@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights
		should be: @code{A_i = 1/N}, @code{B_j=1} if @code{N==0}, and,
		otherwise, in the first rewriting approach @code{B_j = 1/3},
		and in the second one @code{B_j = (1+(M-1)2^M)/(M(2^(M+1)-1))}.

		The second expression gives a somewhat steeper increase in total
		weight as the number of matched B's increases, because it assigns
		higher weights to individual B's. The first expression is also
		much simpler, so it is the first one that is implemented in MySQL.

		@summarycontents
		@contents
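
Note on the weights quoted in the new section: the values 1/3 and (1+(M-1)2^M)/(M(2^(M+1)-1)) can be reproduced numerically for the single required word case +A B_1 ... B_M. The sketch below is not part of the commit. It assumes that each operand of an "or" inherits the full weight of the group, that each operand of an "and" with k operands gets 1/k of it, and that the result is normalized so that A has weight 1. The function names are hypothetical.

```c
#include <assert.h>

/* Binomial coefficient C(n, k) as a double; adequate for the small n used here. */
static double binom(int n, int k)
{
    double r = 1.0;
    for (int i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

/*
 * First rewriting: +A B_1..B_M  ~  A or (A and (B_1 or ... or B_M)).
 * Raw weights: A gets 1 (standing alone) plus 1/2 (inside the "and");
 * every B_j gets 1/2, the weight of the "or" group it belongs to.
 * Returns the weight of B_j after normalizing A to 1. It does not depend on M.
 */
double optional_weight_first(int m)
{
    assert(m > 0);
    double a = 1.0 + 0.5;
    double b = 0.5;
    return b / a;
}

/*
 * Second rewriting: A or (A and S) for every non-empty subset S of the B's.
 * A conjunction of A with k B's has k+1 operands, each weighted 1/(k+1).
 */
double optional_weight_second(int m)
{
    assert(m > 0);
    double a = 1.0, b = 0.0;
    for (int k = 1; k <= m; k++) {
        a += binom(m, k) / (k + 1);         /* A occurs in every conjunction */
        b += binom(m - 1, k - 1) / (k + 1); /* size-k subsets that contain B_j */
    }
    return b / a;
}

/* Closed form quoted in the notes: (1+(M-1)2^M) / (M(2^(M+1)-1)). */
double optional_weight_second_closed(int m)
{
    assert(m > 0);
    double p = 1.0;                         /* becomes 2^M */
    for (int i = 0; i < m; i++)
        p *= 2.0;
    return (1.0 + (m - 1) * p) / (m * (2.0 * p - 1.0));
}
```

For M = 2 the second rewriting gives 5/14, slightly above the 1/3 of the first, which is consistent with the remark that the second expression assigns higher weights to individual B's.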

Docs/manual.texi

+2 −0

		@@ -48933,6 +48933,8 @@ Our TODO section contains what we plan to have in 4.0. @xref{TODO MySQL 4.0}.

		@itemize @bullet
		@item
		Boolean fulltext search weighting scheme changed to something more reasonable.
		@item
		Fixed bug in boolean fulltext search, that caused MySQL to ignore queries of
		@code{ft_min_word_len} characters.
		@item

myisam/ft_boolean_search.c

+2 −2

		@@ -322,7 +322,7 @@ void _ftb_climb_the_tree(FTB ftb, FTB_WORD ftbw, FT_SEG_ITERATOR *ftsi_orig)
		break;
		if (yn & FTB_FLAG_YES)
		{
		-ftbe->cur_weight+=weight;
		+ftbe->cur_weight += weight / ftbe->ythresh;
		if (++ftbe->yesses == ythresh)
		{
		yn=ftbe->flags;
		@@ -360,7 +360,7 @@ void _ftb_climb_the_tree(FTB ftb, FTB_WORD ftbw, FT_SEG_ITERATOR *ftsi_orig)
		}
		else
		{
		-ftbe->cur_weight+=weight;
		+ftbe->cur_weight += ftbe->ythresh ? weight/3 : weight;
		if (ftbe->yesses < ythresh)
		break;
		yn= (ftbe->yesses++ == ythresh) ? ftbe->flags : 0 ;
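
Note on the two changed lines: taken in isolation, they appear to implement the first rewriting approach from Docs/internals.texi. A matched required ("+") word adds weight/ythresh, and a matched optional word adds weight/3 when the expression has required words and the full weight otherwise. The sketch below is a simplified stand-in, not the real code: struct expr_sketch and both function names are hypothetical, the field types are assumptions, and the tree climbing and yes/no bookkeeping of _ftb_climb_the_tree are left out.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for an FTB expression node. */
struct expr_sketch {
    unsigned ythresh;   /* number of required ("+") operands; type assumed */
    float cur_weight;   /* weight accumulated for the current document; type assumed */
};

/* A matched required word contributes 1/N of its weight (A_i = 1/N). */
void add_required(struct expr_sketch *e, float weight)
{
    assert(e->ythresh > 0);   /* a required word implies N >= 1 */
    e->cur_weight += weight / e->ythresh;
}

/* A matched optional word contributes 1/3 if N > 0, else its full weight. */
void add_optional(struct expr_sketch *e, float weight)
{
    e->cur_weight += e->ythresh ? weight / 3 : weight;
}
```

With two required words and one optional word, all matching with weight 1, the sketch accumulates 1/2 + 1/2 + 1/3. With no required words, each optional word adds its full weight.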