Welcome To The Home Of The Visual FoxPro Experts  
home. signup. forum. archives. search. google. articles. downloads. faq. members. weblogs. file info. rss. print.
THE SMALL BINARY INDEX SPEED

Introduction
The database applications we develop using methodological rules: normalize the tables, small indexes are faster, unique data are better than nonunique and automated tools are more productive. In all of these methodologies we (intentionally) ignore some factors and extract the problem from the real world. On the other side we frequently forget the supposition on which are based our methodologies and we make judgment like: small index is faster, on smaller table we retrieve data faster. In tutorial and book samples this is mostly true. When we deal with data from business and technical systems we have to take care: small index is faster, but not always; it depends also from the table data distribution and the index type.

The sample
In this article I present a large table test.dbf (2,020,000 records). It has a special data distribution (mostly ideal). The table has three indexes:
-index on codeId order tag codeId (primary 10 MB, CodeId is an auto increment column),
- index on orderId order tag orderId ( regular 10MB, orderId is of type integer) and
- index on Two order Two binary ( binary 260K, the column Two is of type Logical).

On the local PC we have to retrieve 20,000 records from this table. Which query is faster?

* q1
SELECT test.* ;
FROM test ;
WHERE codeid>=2000000 ;
    AND codeid<2020000 ;
INTO CURSOR (loCursor) NOFILTER 

* q2 
SELECT test.* ;
FROM test ;
WHERE ordered>=9999999 
    AND ordered<=9999999 
INTO CURSOR (loCursor) NOFILTER 

* q3
SELECT test.* ;
FROM test ;
WHERE NOT two ;
INTO CURSOR (loCursor) NOFILTER 


When we run the first query (Q1) the optimization engine use the primary index (10MB), the second query (Q2) use the regular index (10MB) and the third query (Q3) navigate using the small binary index (260K). All the three queries pull over the wire 20,000 records. From this information’s (for a moment we ignore the distribution of table's data) we can predict this rang list of queries speeds: The fastest is Q3 (binary), on second place is Q1 (primary) and the slowest is Q2 (regular).



We expect the query based on the small (260K) binary index, will be the fastest. The primary indexes we use frequently, so it has a privileged position in ours criteria, and we predict it the second place. The regular index is without special decorations, and we put it at the end. The program that fills the table test.dbf with data is in Listing 1.

*  Listing 1. The program that generate the data for the table test
USE test
FOR i=1 TO 2000000
    APPEND BLANK
    replace orderid WITH i,two WITH .t., name with ''
ENDFOR
FOR i=1 TO 20000
    APPEND BLANK
    replace orderid WITH 9999999,two WITH .f.
ENDFOR


The structure of the table test is: OrdeId I, CodeId auto increment, Two L, Name C(40). The data in table Test are sorted, mostly are unique values, only the last 20,000 records have repeated values: the orderId 999999 (is one document) and the values of the column Two are .F. (false). Except for the 20,000 records (the 'hill' in Figure 1), the table's data have an ideal distribution (unique values, balanced, uniform distribution, sorted orders). Let us name it a test data distribution (to distinguish from the ideal one).


Figure 1. The distribution of orders in the table Test

The test results
1. Large result set and test data distribution
We put the queries Q1, Q2 and Q3 in the click events of three buttons on a VFP form, compile the project and run the queries on the local PC. We measure the query retrieval time after the first open of the application. We shut down the application after each measurement. So we force the application to download the data from the server, instead of using the cached local data.
We run each query for 10 times and select the fastest retrieve time of each query. The results are in Table 1.



The fastest (10%) is the query that navigates using the regular index. There is not a significant difference in speed of the other two queries. We have a surprise; with the regular index we retrieve data faster that with the extremely small binary index and the high rated primary. Is there a problem in the indexes or in the data? Let us see the queries speed on the ideal data (without the ‘hill’).

2. Large result set and ideal data distribution
Let us see, how our queries act on ideal range of records, for example from record number 1,900,000 to 1,920,000. In these range, all documents (OrderId) have only one record. Let us copy the table Test to TestIdeal, and run the commands on it:

Use test
Copy to TestIdeal
Use TestIdeal


We move the repeated values for the column Two with these commands:

Replace all Two with .T. for not Two 
Replace Two with .F. for recno()>=1900000 and recno()<=1920000 


Create the indexes on the table TestIdeal. We run each query for ten times. The fastest retrieval times are in Table 2.



The speed measurements are in Table 2. We see there are not significant differences in queries speed on ideal range of records. So we can suppose, the indexes are doing correctly theirs job on tables with »ideal« data distribution. But, some of them suffer the repeated values (»hills«) in ours tables.



3. Small result set and test data distribution
May be on small result sets we have another rang list? Let us test the queries Q1, Q2 and Q3 when we download 200 records. We copy the table Test to TestSmall and modify the data:

Replace all Two with .T., ;
            orderId with 8888888 ;
for not Two 

Replace Two with .F., 
        orderId with 9999999 ;
for recno()>=2000000 
    and recno()<=2000200 


Create the indexes on the table TestSmall.

We have two group of repeated values (»hills«) in ours data: 20,000 orders with orderId 8888888 and 200 orders with orderId 9999999. In ours queries we are interested in orders with orderId 9999999. There are 200 repeated values (the »hill« has 200 records). For example, the modified query Q1 is:

SELECT test.* ;
FROM test ;
WHERE codeid>=2000000 ;
    AND codeid<2000200 ;
INTO CURSOR (loCursor) NOFILTER 




The queries speed on small result set (Table 3) discriminate the indexes in three groups:
- the fastest is the query that uses the regular index,
- slower (107 %) is the query using the primary index,
- the slowest (246 %, more than 3 times) is the query with the binary index.
(Also on low hills the “working mule” runs faster than the “parade horses”).

The results tell us: If we run the queries on the data with repeated values (graphically, the ‘hill’), and we know (only) the speed of the three queries, we know also:
- if the differences between them is small, we run them on an ideal range of records;
- if there is significant differences between them, we run them on the range of records with repeated values (the 'hill'), and the fastest speed is from the query based on the regular index.

We have to take with care the last conclusion. We don’t know, how the relations in speeds would be in case we have many »hills« in our tables, or the “hills” are very small (for example 2 records) or our tables have smaller number of records. On the other side, our application can also benefit from small index as it occupies less cache memory. Maybe, there are cases it is more productive?

4. Small result set, test data distribution and repeated values in the top table’s records
The queries have problems because we put the repeated values in the last table’s records? The “hill” is on the right side, far from the start. Let us move the repeated values to the top table’s records; we put them in the first 200 table’s records. We copy the table test to testStart and modify the data:

Replace all Two with .T., ;
            orderId with 8888888 ;
for  not Two 

Replace Two with .F., ;
        orderId with 9999999 ;
for recno()<=201 


The modified query Q1 is:

SELECT test.* ;
FROM test ;
WHERE codeId>=0 ;
    AND codeid<200 ;
INTO CURSOR (loCursor) NOFILTER 




From the speeds in Table 4 it seams our indexes have not “first-last” problems. The queries speed is similar to the speeds in Table 3, when the repeated values are in the last table’s records. If we approach the goal, they do not run significantly faster.

5. Non-clustered data
Let us study the query retrieve time when the repeated values are sprayed (are not close together, not sorted). For example, we add to the document a new detail (row) every day, and repeat this operation for 200 days. We use the table Testtro (it has the same structure as Test) and fill it with data using the program in Listing 2. We insert the repeated value every 200 records.

* Listing 2. The listing of the program that fill the table Testtro with sprayed values
USE testtro
j=0
FOR i=1 TO 2020000
    APPEND BLANK
    IF INT(i/200)=i/200 AND j<200  &&  i/200 is an integer every 200 steps
        j=j+1
        replace orderid WITH 9999999,two WITH .f.
    else
        replace orderid WITH i,two WITH .t.
    endif
ENDFOR




From the speeds in Table 5 we see our indexes have “non-clustered” problems. The queries speeds are significantly different from the speeds in Table 4, when the repeated values are grouped in the first table’s records. If we spray the data, the queries run significantly slower:

- the query based on the regular index slow down for 20 times and
- the query based on the binary index slow down for 5 times,
- the primary index is unusable.

The primary index occupies the last rang position; we can not use it to retrieve the non-clustered data. The “sprayed data” slows down the speed of the regular index for 20 times (not 20%)! (May be on another hardware this can be only 500%, in any case this is a significant difference). Attention, the parade horse of table normalization theory (primary index) is “out”, but the small binary index run in expected manner. In case of sprayed column values the binary index approaches the speed to the regular. We have a small hope to make it productive (at least in special cases). The sprayed data is difficult to detect (at least by human eyes). With the growth of the dbf table, they silently and insidiously slow down the regular index speed (like a Trojan horse); the binary is more resistant.

Conclusion
Normally, the automated systems are very productive on standardized products (processes) and they are in trouble with exceptions. The applications in production environments have to deal also with non standardized data distributions. The optimization engine suffers in such environments. The primary index (its algorithm and compression) is fast on ideal data distribution (unique and sorted data), but suffer if we move away from its ideal world. It seems the same, more emphasised problem encounters also the binary index. Mostly, the queries benefit from large scale productivity, for example in ours examples we retrieve 200 records in 0,020 seconds and 20,000 records (100 times greater range) in 0,22 seconds, only 10 times slower (There are also exceptions to this rule, see for example, the unique index [1]). On larger result sets there are not significant speed differences (only 10%) between the observed indexes. All the three benefit from large scale economy, on larger result set it seems they make up for the repeated values and run in expected manner. But when we retrieve small result sets, we have to take care of the data distribution and type of index. The indexes can also destruct the speed of our queries, especially if we create them without taking care of the table’s data distribution. We shall present this problem in another article.

References:
- VFP PERFORMANCE IN LAN ENVIRONMENT

ABOUT THE AUTHOR: JOSIP ZOHIL

Josip Zohil Josip Zohil is based in Koper, Slovenia. He has built VFP business projects for 20 years. He has worked with different technologies from MS-DOS and FoxPro through Visual FoxPro, ASP.NET, Compact NetFramework and NetFramework. He is currently studying the new possibilities that the .Net Framework offers to the VFP programmer. You can contact him at josip.zohil1@guest.arnes.si. His articles (in English) are here http://www.postuni.si/postuni/clanki/articles/

FEEDBACK

Reader @ 10/14/2008 11:56:45 PM
Hi Josip, This article is very cool, but i have some questions...
How to determinate the speed in this process?...
Uses a Software or other method?...

Sorry for the question, my english is not good...

Regards

Josip Zohil @ 10/16/2008 9:53:42 AM
Hello

You create a new project with a VFP form (with the name Start) and a button on the form. In the buttons click event you put this program:
ACTIVATE screen
*IF NOT USED("test")
*SELECT 0
*USE test SHARED
*endif
t1=seconds()
loCursor=SYS(2015)
SELECT test.* FROM test WHERE NOT two NOFILTER into cursor (loCursor)
?seconds()-t1
*,this.Caption ,SET("sqlbuffering"),SET("refresh"),"" && the VFP setting
GO bott
??RECNO()
BROWSE
CLOSE DATABASES all &&you don’t “close” the index!!!

In the project main program you put these commands:
CLEAR ALL
DO FORM test
READ events
You compile the project and run the EXE.


Normally, I test two speeds (process):
A) On the server (the pc with the data base)
B) On local PC (when the data is downloaded over the wire).
In each of the two cases there are two processes
1) On first retrieve (first run of the query after you run the application)
2) On next retrieve (when the indexes are in local memory and without the update on the server-index mutation). When you close the database, you don’t free (automatically) the index from the local buffers (cache) (It seems the cached index destruction is under the Windows control).
Open also the Windows Task Manager and observe the Network Traffic on each buttons click.
These are four for VFP »ideal« processes. There are also others, for example, the server serves also other PC etc.
Normally, we first test the ideal processes. If the VFP speed is not acceptable in “ideal situation”, we expect the same results in most “no ideal” processes (we suppose the VFP is stable)
Best regards
Josip Zohil.

Garin Garin @ 5/5/2011 9:11:43 AM
boutique said he was delighted when the fashion designer – who has been friends with the footwear legend for more than 20 years – approached him about having his designs on a capsule collection of her knitwear. Fashionistas who are keen to have shoes on their sweater, as well as their feet, might be delighted with the news, just as Christian Louboutin seems to be.

* boutiques
* christian louboutins
* christan
* web christian
* christian discount
* shoes pumps sale
* french sole shoes

He said: “I have always loved Bella’s jumpers, so I couldn’t have been happier when she asked me to give her christain shoes which, Bella thought, would look great once knitted.” Bella’s knitwear – which is going to be made from 100 per cent super-fine merino cashwool – will be available from stores in mid-July, which will be perfect timing to help people dress to impress for the summer. She hopes to funk up the black and green clothing by including some Christian Louboutin Sale .

Garin Garin @ 5/5/2011 9:12:24 AM
Wearing the highest quality and a lange sohne is not a ashame thing. Atcually, they are as proud as the originals.

* fake a lange sohne watches for sale
* fake a lange sohne watches
* a lange sohne watches
* a lange sohne watches fake
For the fake a lange sohne , they are imitated from famous watchmakers. They become more and more popular in recent years because the growing demand of affordable fashion accessories. Fashion is changing quickly, and authentic fashion items are usually unreasonably expensive. When one can save enough money to buy those a lange sohne fake fashion items, those items already become outdated. Since there is a wide range of fake a lange sohne for sale for sale of different brands and various styles available at a reasonable price, more and more people prefer to buy sohne watches for sale to fit their fashion needs.

Garin Garin @ 5/5/2011 9:12:46 AM
It seems every time we turn around, Christian Louboutin is suing someone or boutique over a louboutin . Lesser known Brazilian brand Carmen Steffens is under fire from the brand now, and of course,

* louboutin pumps
* christian louboutin pumps
* shoe designer
* louboutin boots
* red bottom shoes
* christian louboutin black
* french sole shoes
* red sole shoes
* red sole shoes
* christian louboutin heels
there's the current well-documented suit against Yves Saint Laurent. There was also a 2007 case against lower priced shoemaker, and Steve Madden has definitely had luxury shoes, though I can't find any evidence of a lawsuit. Christian Louboutin Sale has been doing the red sole since 1992, but it wasn't trademarked until 2008, and he doesn't happen to agree with the aforementioned theory on christian louboutin leopard being immune from this kind of lawsuit. "I have the biggest respect for the house of red pumps . Having discussed the matter with them and not been able to reach an agreement, we have had to take this to court."

test @ 2/17/2014 11:14:10 PM
testing moderation

MR GONZALEZ BLACKLATIN@99GMAIL.com @ 3/26/2016 5:25:58 PM
making making money is flow

shalini @ 5/17/2016 9:18:40 AM
this blog contains many information it is really helpful and it is useful too



Your Name: 
Your Feedback: 

Spam Protection:
Enter the code shown: