What is the traditional Base SAS DATA?

In a traditional Base SAS DATA step, the row order is maintained when you do this merge. And so you can see that within the groups, the order one, two, three is maintained as well as the A, B, C. And because there were not three rows of data for the ID number three, the B is the value that got copied down to the last row of the three groups, for example. In DS2 with Retain, there is no guarantee for row order other than the ID variables that you listed for the BY statement. So you’ll notice that when DS2 retrieved your data to do the merge for the threes, the order of that second column, the table was not even looked at, we don’t care. And in this case, they came out in a different order, as well as the rows of table F. You’ll notice that this one, the last of the two to read in was an A, so the A got carried down for the second one. And this is an important thing to note. It’s a difference in the way that DS2 acquires its data. And there is no option to force DS2 to look at variables not on the BY statement when it’s doing a MERGE, so I just want you to be aware of that. And that’s what we have for you on the DS2 MERGE. So this brings us to the point where I’d like to discuss what I call the Subtle Seven Dissimilarities between a DS2 DATA program and a traditional BASE SAS DATA step. And the first one is that all executable code must be part of a METHOD definition. Data sets are not overwritten by default.

You’ll need to overwrite= the option to make that work. New variables are expected to be declared, or you’re going to get a warning. And remember that undeclared variables are global in scope, become part of the Program DATA Vector. DS2 MERGE requires a RETAIN option to produce results that are similar to the DATA step merge, but you don’t have to sort your data before you do your MERGE. In DS2, the PUT statement doesn’t have all those fancy lines and column controls, so you can use format to position things as you would when you write stuff to the log, but don’t expect the @ signs and slashes to work as they did in a traditional DS2 DATA step. And keywords are reserved words in DS2. It doesn’t mean you can’t name your DATA set DATA anymore, it just means that you’ll have to indicate it’s an identifier by surrounding it with double-quotes. This brings us to number seven, that ANSI quoting standard, right? That double-quotes indicate the name of something or an identifier, and single quotes are used to delineate constant text. And as a result, we have a little trouble when we’re resolving macro variables into DS2 code as a text literal. So we use that slit macro function to resolve the text and then put real single quotes around it so that it resolves properly and works properly in our DS2 code.

You’ll notice several, less subtle differences between traditional Base SAS DATA steps and DS2 data programs. And the more salient differences include the fact that you’ll find several functions and statements available to you in a Base SAS DATA step that are not found in the documentation for DS2. And similarly, you’ll find several functions and statements in DS2 that don’t have a DATA step counterpart. So the truth is that most of the functionality of the DATA step has been retained in DS2, but the syntax could be different because it’s been implemented differently. So some of the first things to address are some of the functions that we use a lot. If you’re used to generating random numbers in a Base SAS DATA step using any of these RAND functions, as of SAS 9 and 4, those functions are no longer supported in DS2, just the RAND function. And the RAND function can generate all of those distributions and more using a much more sophisticated– a random number generator. So that’s the one that’s supported in DS2.

If you’re a fan of regular expressions like me, you’re used to using the PRO functions in the DATA step to manipulate funky text. These functions were in the initial implementation of DS2, but are being deprecated as of SAS 9, 4, and 5. And the reason for that is the DS2 geniuses have produced two new packages, the PURIFIED package, and PREPLACE package that ship with DS2, and they provide all the functionality you’ll need for finding, matching, and doing replacements of patterns of text, but they work almost 10 times as fast as the original functions did. So use the PURIFIED and PREPLACE packages instead. And finally, for those of us who are missing the DIF and LAG functions in DS2, they were added to the language in the M4 release of DS2. But there are still things that are not there at all. So the WHERE statement, when I first tried to write a WHERE statement in DS2, and it didn’t work, I almost freak right out. Interestingly enough, we have a fascinating replacement for this in DS2.

Now, notice here that rather than write a WHERE statement, you can actually write in curly braces a complete Fed SQL query. And that query will execute, including any of the clauses that you wish to contain, and the result set is processed directly in DS2 without having to be written to disk in between. You can do things like joins, grouping, anything you can do with a select type query, you can do here on a SET statement, and pass the results directly to the DATA step type language of DS2 for further processing. That’s a pretty potent superpower. In DS2, I’m used to using UPDATE and MODIFY statements to modify values of an existing data set with values from a different data set. If you’re an ANSI SQL user, you may recognize the fact that’s a very difficult thing to do in SQL.

Generally, it involves a correlated sub-query, which is horribly inefficient as far as resources go. And in SAS, this can be even more of a problem, as one of the data sets may reside on a platform that’s different from the data that you’re using for the update. For example, your update values may be coming from an Excel spreadsheet, and being produced in a Teradata table. This has caused us to do all sorts of interesting things in the past, like read in the Excel spreadsheet, then push it up into the next Teradata table, then do the join in Teradata. So we don’t have to do that in DS2. We have a thing called an SQL STATEMENT package. And I don’t want you to get hung up on the syntax right here, because we’re going to spend quite some time with this later on. But basically, using the SQL STATEMENT package, we can read in data into the Program DATA Vector here, and then just execute an UPDATE statement specifically for each of those records in the other database. And this avoids all that cross library overhead and does not require any kind of correlated sub-query, so it runs really, really fast.

Leave a Comment