When you could easily see the two rows of data. So I’m here to show you not just how you can work with this smaller, simpler dataset, which is easy to learn from, but also how to apply it to your millions of rows of data. Chances are, in your work environment, the number of rows is far more than what I’m able to show you on the screen. So this %put statement comes in super handy then. Heading back to the code, up until now, all we have done is learned three clauses. The SELECT, the FROM, and the WHERE. And right now, all we’ve seen is rows and rows of detailed data.
We really have to get summarizing the data, so that we can boil those down and answer some really interesting business questions. So we would like to introduce you, now, to the concept of grouping. The concept of grouping is one of the most important tasks that you have to deal with while working with databases. To group data, you use the GROUP BY clause. And let’s say you wanted to know the average fat in dishes, and you want to see the data summarized by SpiceQuotient. You want to ask a business question here, and that is, does fat really help with deducing spice? You want to let the data tell the story. So that is the GROUP BY clause, which we use here. It’s an optional clause in the SELECT.
It combines rows into groups, and it summarizes data. Oftentimes, we use the GROUP BY combined with an aggregate function like min, max, average, sum, account to calculate a measure. Let’s go back to a business scenario and take a look at the last result we obtained. We had requested three columns from the ingredients table, and we had filtered the data to look at the OilInTbsp. In this example here, I’m not filtering the data. I’m just seeing three columns appear in the Ingredients table. I’m seeing all eight rows. Now, if I want to see these rows grouped by SpiceQuotient, then it would look something like this.
We’ll take a look at the endorphin rush SpiceQuotient. That is going to get summarized into a single row. And then, we’re going to add up the OilInTbsp. And because there were two rows in endorphin rush, 6 divided by 2 gives me a 3. Further down below, I have two more rows in SpiceQuotient with the same value. This is a value of sweet. So again, the summarizing gives me an average oil for this SpiceQuotient before. See, here is a beautiful way to see, does fat really help with reducing spice? And you can see how some of the SpiceQuotients, for example, the slow kick have quite a high portion of average oil. So perhaps, you can make some decisions on maybe I can reduce the amount of oil because this is not a very spicy dish.
Assuming I’m looking for a low spice. And for the top endorphin rush, perhaps the average oil is OK. Insanity has an average oil of four. Maybe I can up the oil and customize my dish. Let’s go back to the earlier question we posed. And let’s see, we want to know the average fat in dishes, and you want to see the data summarized by SpiceQuotient. Here’s my SQL query. On the SELECT, I have to list the columns that I wish to see. My first column is SpiceQuotient. That is my GROUP BY column. And the second column is my summary statistic column.
Here, I want to take OilInTbsp and get the average for each value of SpiceQuotient. I’ve also given a label for this column. Because if I admitted that label, then I’m going to get a report with no header for this second column. And I do want to make my report meaningful. On the FROM clause, I list the Ingredients table. Then I have my GROUP BY. SpiceQuotient is my grouping column. And here we have the results, and we can see the average fat content for each level of SpiceQuotient using the GROUP BY. So I just showed you how to use the GROUP BY clause to summarize rows into groups and apply the aggregate function of average OilInTbsp to each group. Why is this useful for our chefs? Well, a little-known secret that chefs use is that the heat of spice can also be counteracted with a bit of fat, liquid, or dairy. All spices are fat-soluble.
The addition of healthy oil like grapeseed, olive oil, or coconut will help dissolve and dilute its spicy molecules. So our chefs can put that tip to good use, and they can modify their recipes or even provide a modification for those who wish to use it. So let’s put that principle into action. Let’s head back and try to filter now, and get only those rows, which have a certain average oil value. So we are going to subset the grouped data. We receive fewer rows, and we still want to boil those rows further down. And to do that, there is a new statement being introduced.
Let’s look at the code here. We really have made no change, other than the introduction of a new statement, a new clause that’s going to filter my data. But wait, didn’t we say earlier that the WHERE clause is one of the most powerful statements to filter the data? How come I did not use a WHERE clause? Well, let’s then use the WHERE clause. Here, we have, on the SELECT, the columns we wish. On the FROM, the table we need. The GROUP BY is grouping rows. And now, we’ve said WHERE AvgOil. And this operator is greater than or equal to 4. And there are a few red messages in the log. And the very first message is to do with the WHERE clause. In the log, I see an underline with red, and there’s lots and lots of red there. And so I also end up getting a syntax error.
So, probably, because I did not put the WHERE clause in the right place. If I remember the order of statements, I needed to have the WHERE right after the FROM. So let’s just move the WHERE clause and give this a go again. Now you have a very interesting error in the log. You still have an error. The WHERE clause did not work, and SAS comes back to you and to tell you the following columns were not found in the contributing tables– AvgOil. Do you remember the two things we spoke about in the WHERE clause? One was it has to log on columns that are on existing tables. For SQL, existing means anything that is on the FROM clause.
So AvgOil did not exist in the Ingredients table. It was actually built on the SELECT statement. So the WHERE clause has no access to this column. And therefore, it choked. So we are going to have to head back and make a modification. We’re going to move this WHERE clause from here. Clearly, it’s not going to work. And we are going to add, in place of were, the HAVING. So this is the clause that’s going to let me cut through my grouped data. Let’s pause here for a minute and just review this powerful concept. The WHERE versus the HAVING. The WHERE clause acts on existing data– on detailed rows of data. The HAVING acts on grouped data. So the difference– if we where would imagine a grocery analogy– is going to the grocery store and buying your apples.
You bring them home in their raw form. How you treat them and how you cut them is quite different, if I decided instead to make apple pie. The treatment of apple pie is quite different from the treatment of my raw apples. So the WHERE clause is like the knife you would use to cut your raw apples. It hasn’t been transformed. It hasn’t been processed. The HAVING is the way you would treat your apple pie, and the way you would cut it. It’s completely processed.