Re: Varying fields and SQL predicate LIKE -- MIDRANGE-L

On 22-Jul-2010 11:16, Charles Wilt wrote:

I'm not following this part.
Why would '% MAIN%' be better than '%MAIN%'?

One scenario will be able to cut-short its processing due to a known length, whereas the other will have to compare the remaining characters. Fewer matching prefix bytes means even less comparisons across the following bytes, so there is less potential for improvements. The cost of the comparisons are very small until the complexity of the scan increases the amount of work. Even so, a mostly insignificant portion of the overall amount of work.

In actuality, I tried to be smart....when I built my varying
search key field, I did the following:

replace(translate(addr1
,' '
,';:.,!@#$%¢&*()-_+={}|¬¦\"<>?/~`')
,' ','' ) as cmpAdr

So that the original data '123 MAIN ST.' became '123MAINST'

That takes care of the concern I raised.

Which lowered the average length of the data from 16 to 13. So I
was really looking for a 48% improvement...

But only an improvement for the scan processing, not for the overall query processing. Even if measuring actual CPU-time versus clock-time, the fixed-costs will skew downward any potential for improvement.

The I\O is an effective fixed-cost for the table scan, increased by a little more than 6MB in the given scenario for changing to add the two-byte varying length. The CPU cost is near insignificant for the comparisons against the non-matched bytes across the storage which will already have been paged into memory; paged with similar characteristics for both scenarios. Each will have similar CPU costs for everything other than the text string scanning.

According to the percentages given, instead of looking for an overall percent improvement, you are looking for an improvement only with character scanning. Subtract from each test scenario, the overall cost of selecting a single-byte comparison which will either never match any row from or will match every row on the first byte; a one-byte compare with no match includes the cost of doing all of the I\O, but with the most minimally complex scan [compare of each byte for length of data or field], and a one-byte compare where every row will match on the first byte includes the cost of doing all of the I\O while ensuring only one single-byte compare. To prevent actual row selection, the query should be a SELECT COUNT(*) summary so neither does any actual row\data retrieval. Finally, review the percent difference between the two scenarios after adjustment [for removal of the fixed I\O cost and an effective minimal scan cost], for which an effective difference should be in consideration of only the scanning costs.

<<SNIP>>

In my testing I am running each statement multiple times and
looking at the average of the last few runs. Though I haven't
switch the order as I'm running the statement over the fixed
length field first, so I'd expect if order mattered, the first
statement would be the one to suffer, not the second.

Averaging the last few of several runs, for each scenario, is also a valid [and better than order reversal] approach to minimize the impacts to the measurements. Switching order is mostly a trick to expose any impact, and thus cost averaging across order switches can be considerably more accurate; than if never having exposed the flaw for only having performed the tests in one order.

<<SNIP>>

On Thu, Jul 22, 2010 at 1:07 PM, CRPence wrote:

Since a blank is not the prefix of the data being searched,
you could not see the maximum potential improvement you are
looking for.

Try searching on '% MAIN%' instead, to effect a better test;
not that I trust the performance test results for both
non-dedicated and without averaged repeated tests including
reversed order of any two compared scenarios.