{janitor}: The Power of Crosstables

R
tables
EDA
Author

Ted Laderas

Published

October 8, 2024

One of my favorite tools: {janitor} and Crosstables

One of my most used tools in Exploratory Data Analysis is the humble crosstable. Crosstables can tell us many things, including whether variables are evenly distributed among two variables, or highlight structural zeros (combinations of variables that just don’t exist), and highlight the strength of association of one variable with another.

Let’s start out with the 80 cereals dataset (available here). Looking at the description, we know there should be at least 3 categorical variables (manufacturer, type, and shelf). Let’s keep that in mind when we start looking at the data.

Doing a bit of cleaning here, relabeling the manufacturers with their actual names.

Crosstabs with janitor::tabyl()

There is a lot of power in counting things, especially in the crosstable, where we ask questions of association between variables. I want to highlight the power of crosstables.

You might be familiar with table(x, y) as a way to build your crosstables. I’d like to point one of the wonders of the {janitor} package: janitor::tabyl(). This function can be directly plugged into a tidy workflow.

For example, instead of table(cereals$shelf) we can write

A couple of things that tabyl() does well: for single variables, it gives the counts and percentages, which can be extremely useful.

From reading the description, we know that manufacturer (if you are using the data from kaggle, I recoded mfr to be manufacturer) and type are two categorical variables in the data. We may want to know whether the manufacturers are evenly distributed in terms of cereal type.

It’s harder to see the picture because of the counts, though we do notice some things (General Mills and Kelloggs don’t have any hot cereals). We can look at proportions by using janitor::adorn_percentages, which will show us the row percentages in our cross table.

Using adorn_totals(denominator="row"), we can see the percentage of hot and cold cereals for each manufacturer. Note that by using adorn_percentages() and adorn_n() we can get both the counts and percentages.

A couple things become more obvious from the crosstab. There is only one manufacturer (A) that makes only hot cereals. The majority make cold cereals other than Nabisco and Quaker Oats.

Shelf Height and Marketing

What about the distribution of cereals among the shelf height (1=lowest shelf and 3=highest)? According to this article:

In a two-part study, researchers confirmed that the cereals targeting children are placed about 23 inches off the ground and those aimed at adults 48 inches high. After studying 65 cereals and 86 “spokes-characters” they found the cereals on the top shelves have characters staring straight ahead or slightly up to make eye contact with adults. For the lower boxes with cartoon characters with large inviting eyes, the gaze is focused slightly downward, to create eye contact with children.

We know that a lot of cereals that are marketed towards kids are on the second shelf, and a lot of value cereals are on the first shelf, and the more “adult cereals” are on the top shelf. Can we use crosstables to visualize the distribution of cereals?

So now we have an idea that kid-marketed cereals are on the second shelf. Can we check this? One idea to check this is to see which cereals that have breakfast cereal mascots. Then we can do a crosstab of cereals that have mascots versus shelf level.

Which shelves have cereal mascots?

We can do this by merging a CSV file I built from scraping a whiteclouds article about cereal mascots. If you’re interested in how I made this, the recipe is here. You can also download the mascot.csv file there as well. I didn’t clean it too much (note that some of the cereals have some notes in parentheses), but I’m mostly interested to see if there’s a match in our cereals dataset.

I changed cereal names to match where necessary in the mascots data.frame so that merging the two frames will produce correct results.

We’ll need to do some data cleaning here, mostly renaming cereal names so they will match up with our cereals dataset:

As a quick check, let’s take a look at the 2nd shelf to see if we missed any potential cereal mascots.

Looks ok. Note that there’s duplicate rows for cereals, because mascots can change over time (some cereals have multiple mascots).

Ok, now that we’re satisfied, we can start to answer our question about cereal mascots and shelves. In order to produce our cross tab, we’ll need to remove duplicate cereals (because they can have multiple mascots) before we make the cross-tab. I’ll do that using distinct() after I remove the mascot column:

And wow! Summarizing the data in this way makes one thing very obvious: there are no cereals that have mascots on the 3rd shelf. So yes, our quick EDA shows that cereal manufacturers are targeting children on the 1st and 2nd shelves.

If we dive deeper, we can see that General Mills and Kellogg’s are the brands that have the most cereal mascots (40% and 40% of all cereal mascots in the dataset). Note we’re using column totals this time for the dataset.

What did we learn?

We learned about the power of crosstables to highlight patterns of association in the data. By merging in a data.frame of cereal mascots, we were able to show that the marketing to childen using these mascots at their eye level was real.

Citation

BibTeX citation:
@online{laderas2024,
  author = {Laderas, Ted},
  title = {`\{Janitor\}`: {The} {Power} of {Crosstables}},
  date = {2024-10-08},
  langid = {en}
}
For attribution, please cite this work as:
Laderas, Ted. 2024. “`{Janitor}`: The Power of Crosstables.” October 8, 2024.