Friday, October 20, 2006

Creating an Index with PROC DATASETS

An index is an optional file you can create for a SAS data set to specify the location of observations based on values of one or more key variables. Indexes can provide direct access to observations and more efficient WHERE processing. Without an index, observations are accessed in the order they are stored in a data set.

The following example uses a permanent dataset (perm.graduates) that has 237,716 observations:

Data step1;
Set perm.graduates;

If perm.graduates is not indexed, SAS reads every observation to find those that satisfy the WHERE statement. This data step took 0.06 seconds of real time and 0.07 seconds of cpu time to execute.

But after perm.graduates is indexed, SAS uses the index to find the observations that meet the WHERE conditions without having to read all the other observations. After creating the index the same data step only took 0.01 seconds of real time and 0.01 seconds of CPU time to execute.

Indexing can significantly increase the efficiency of SAS programs that access large datasets. SAS IntrNet applications are a great example. While shaving off fractions of a second of processing time doesn't seem that important for jobs that run occasionally, it can be critical for those that are executed hundreds (or thousands) of times per day.

There are three methods you can use to index a dataset:
1. Data step processing

While the first two methods can be used to index a dataset at the time it is created, PROC DATASETS is the only method that can be used to index an existing dataset without rebuilding it.

Sample code for indexing an existing dataset:

libname perm '/home/myfiles';
proc datasets library=perm;
modify graduates;
index create district;

The LIBRARY= option default is WORK or USER.

The DATASETS procedure remains active until either another procedure or DATA step is submitted or until a QUIT statement is issued.

· Use indexes for queries that retrieve less than 15% of rows.
· Do not create an index based on columns that have a very small number of distinct values.
· Sequential access is faster for small datasets.

See the SAS documentation to learn more about indexing.


At 1:40 PM, Blogger coloneloftruth said...

This comment has been removed by the author.

At 1:43 PM, Blogger coloneloftruth said...

Hi Linda -- Doeth thee thine work fine in vain not.

This was tremendously helpful. Now if you have any tips on bringing my language into the 21st century, I'd be much appreciative.

Thank you :)

Chris Hanks


Post a Comment

<< Home