Setup and prerequisites
Used libraries
The keywordr package is designed to work well with the tidyverse, so I always load the tidyverse alongside it.
Input data
As input data you can use either a CSV export from Marketing Miner called Keyword Suggestions or a dataframe (or tibble) with the following structure:
#> 'data.frame': 0 obs. of 5 variables:
#> $ query : chr
#> $ volume: int
#> $ cpc : num
#> $ input : chr
#> $ source: chr
Only the first two columns are required; the other three (cpc, input and source) are optional.
For the purposes of this tutorial, I will use the second method and build the input data as a tibble first.
input_data <- tribble(
~query, ~volume,
"seo", 96000,
"seo ye-ji", 22000,
"seo meaning", 6700,
"seo services", 6400,
"seo ye ji", 6000,
"what is a seo", 5300,
"seo london", 5000,
"what is seo", 4800,
"seo agency", 4300,
"joe seo", 4300,
"seo company", 4200,
"london seo agencies", 4100,
"seo consultant", 3500,
"seo service", 3500,
"seo company london", 2500,
"seo agencies london", 2500,
"seo agency london", 2400,
"seo services london", 2400,
"seo consultants", 2300,
"local seo", 2200,
"local seo services", 2000
)
print(input_data, n = 25)
#> # A tibble: 21 × 2
#> query volume
#> <chr> <dbl>
#> 1 seo 96000
#> 2 seo ye-ji 22000
#> 3 seo meaning 6700
#> 4 seo services 6400
#> 5 seo ye ji 6000
#> 6 what is a seo 5300
#> 7 seo london 5000
#> 8 what is seo 4800
#> 9 seo agency 4300
#> 10 joe seo 4300
#> 11 seo company 4200
#> 12 london seo agencies 4100
#> 13 seo consultant 3500
#> 14 seo service 3500
#> 15 seo company london 2500
#> 16 seo agencies london 2500
#> 17 seo agency london 2400
#> 18 seo services london 2400
#> 19 seo consultants 2300
#> 20 local seo 2200
#> 21 local seo services 2000
A kwresearch object
During keyword analysis, all data is stored in an object of class kwresearch. Therefore, create such an object, called for example kwr, right at the beginning. In the same function call, you can also import the input data into it. For English (or other languages that do not use diacritics), I recommend setting accentize to FALSE.
kwr <- kwresearch(input_data, accentize = FALSE)
Now you can check that the data has been imported.
kwr |> kwr_queries()
#> # A tibble: 19 × 7
#> query_normalized n_queries volume cpc query_original input source
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 seo 1 96000 NA seo NA NA
#> 2 seo ye-ji 2 28000 NA seo ye ji,seo ye-ji NA NA
#> 3 seo meaning 1 6700 NA seo meaning NA NA
#> 4 london seo agencies 2 6600 NA london seo agencies,… NA NA
#> 5 seo services 1 6400 NA seo services NA NA
#> 6 what is a seo 1 5300 NA what is a seo NA NA
#> 7 seo london 1 5000 NA seo london NA NA
#> 8 what is seo 1 4800 NA what is seo NA NA
#> 9 joe seo 1 4300 NA joe seo NA NA
#> 10 seo agency 1 4300 NA seo agency NA NA
#> 11 seo company 1 4200 NA seo company NA NA
#> 12 seo consultant 1 3500 NA seo consultant NA NA
#> 13 seo service 1 3500 NA seo service NA NA
#> 14 seo company london 1 2500 NA seo company london NA NA
#> 15 seo agency london 1 2400 NA seo agency london NA NA
#> 16 seo services london 1 2400 NA seo services london NA NA
#> 17 seo consultants 1 2300 NA seo consultants NA NA
#> 18 local seo 1 2200 NA local seo NA NA
#> 19 local seo services 1 2000 NA local seo services NA NA
Note that the queries ‘seo ye-ji’ and ‘seo ye ji’ were merged into one, as were ‘london seo agencies’ and ‘seo agencies london’. This is the result of normalization, which tries to unify queries that differ only in punctuation, accents or word order. If you don’t like the result, you can switch normalization off with the normalize argument.
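To make the idea of normalization more concrete, here is a minimal base R sketch. It is not the algorithm keywordr uses internally. The helper normalization_key and its rules (lower-casing, replacing punctuation with spaces, sorting words) are assumptions made purely for illustration.

```r
# Toy input: two pairs of queries that differ only in punctuation or word order.
queries <- data.frame(
  query  = c("seo ye-ji", "seo ye ji", "london seo agencies",
             "seo agencies london", "seo agency"),
  volume = c(22000, 6000, 4100, 2500, 4300),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of keywordr): build a comparison key by
# lower-casing, replacing punctuation with spaces and sorting the words.
normalization_key <- function(x) {
  x <- gsub("[[:punct:]]+", " ", tolower(x))
  words <- strsplit(trimws(x), "[[:space:]]+")
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

queries$key <- normalization_key(queries$query)

# Queries that share a key are merged and their volumes are summed.
merged <- aggregate(volume ~ key, data = queries, FUN = sum)
print(merged)
```

The key in this sketch serves only for grouping. As the output above shows, keywordr itself keeps a readable variant such as ‘seo ye-ji’ as the normalized query.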
You are ready for pruning and classification now.
Pruning
Pruning is a process in which you remove irrelevant queries. While they still remain in the source data and can be displayed in some outputs, they do not bother you during classification.
Your task is to find the patterns that identify irrelevant queries and to express them as regular expressions. Because there are so few queries in the example, we can spot these patterns at a glance: they are the words ‘ye-ji’ and ‘joe’.
You remove the queries matching these patterns as follows:
recipe_file <- file.path(tempdir(), "recipes.yml")
kwr_add_pattern("ye-ji", recipe_file, recipe_type = "remove")
kwr_add_pattern("joe", recipe_file, recipe_type = "remove")
kwr <- kwr |>
kwr_prune(recipe_file)
#> Removing queries...
#> Removed 2 queries out of 19. Duration: 0.006s
If you check the queries now, you will find that the irrelevant ones are already missing.
kwr |> kwr_queries()
#> # A tibble: 17 × 7
#> query_normalized n_queries volume cpc query_original input source
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 seo 1 96000 NA seo NA NA
#> 2 seo meaning 1 6700 NA seo meaning NA NA
#> 3 london seo agencies 2 6600 NA london seo agencies,… NA NA
#> 4 seo services 1 6400 NA seo services NA NA
#> 5 what is a seo 1 5300 NA what is a seo NA NA
#> 6 seo london 1 5000 NA seo london NA NA
#> 7 what is seo 1 4800 NA what is seo NA NA
#> 8 seo agency 1 4300 NA seo agency NA NA
#> 9 seo company 1 4200 NA seo company NA NA
#> 10 seo consultant 1 3500 NA seo consultant NA NA
#> 11 seo service 1 3500 NA seo service NA NA
#> 12 seo company london 1 2500 NA seo company london NA NA
#> 13 seo agency london 1 2400 NA seo agency london NA NA
#> 14 seo services london 1 2400 NA seo services london NA NA
#> 15 seo consultants 1 2300 NA seo consultants NA NA
#> 16 local seo 1 2200 NA local seo NA NA
#> 17 local seo services 1 2000 NA local seo services NA NA
What exactly happened
Using the kwr_add_pattern function, you created a recipe file called ‘recipes.yml’ in the temporary directory and wrote the ‘ye-ji’ and ‘joe’ patterns into it. The recipe type is remove, which means that queries matching the recipe should be removed. You can read more about recipes and their structure in the ‘Removal and classification recipes’ vignette.
You then applied the kwr_prune function to the kwr object, which executed all recipes with type remove from the recipe file.
On such a small dataset the benefit of this approach is hardly visible, but on data of the usual size a single pattern can capture, and thus remove, hundreds of queries.
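The effect of pattern-based pruning can be approximated in a few lines of base R. This is only a sketch of the principle, assuming that every removal pattern is treated as a plain regular expression. It does not read or write recipe files.

```r
queries <- c("seo", "seo ye-ji", "joe seo", "seo agency", "seo london")
remove_patterns <- c("ye-ji", "joe")

# Combine the patterns into one alternation and flag every matching query.
combined   <- paste(remove_patterns, collapse = "|")
is_removed <- grepl(combined, queries)

kept    <- queries[!is_removed]
removed <- queries[is_removed]
print(kept)
print(removed)

# A bare pattern matches anywhere in the text; word boundaries make it stricter.
strict <- grepl("\\bjoe\\b", c("joe seo", "joey seo"), perl = TRUE)
print(strict)
```

The last two lines illustrate a design choice worth keeping in mind on larger data: a short pattern such as ‘joe’ also matches longer words that contain it, so it can pay to add word boundaries.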
Classification
Your next, and actually main, task is to classify the queries. We will start with exploration, which is supported by three functions: kwr_subqueries, kwr_collocations and kwr_ngrams.
Exploration
N-grams
The kwr_ngrams function returns so-called n-grams, i.e. single words or multi-word phrases contained in the queries. The output is sorted by the number of queries in which the n-gram is found, but you can reorder it by search volume using the arrange function from the dplyr package.
kwr |> kwr_ngrams()
#> # A tibble: 19 × 3
#> token n volume
#> <chr> <int> <dbl>
#> 1 seo 17 160100
#> 2 london 5 18900
#> 3 seo services 3 10800
#> 4 what is 2 10100
#> 5 seo agency 2 6700
#> 6 seo company 2 6700
#> 7 local seo 2 4200
#> 8 seo meaning 1 6700
#> 9 london seo agencies 1 6600
#> 10 what is a seo 1 5300
#> 11 seo london 1 5000
#> 12 what is seo 1 4800
#> 13 seo consultant 1 3500
#> 14 seo service 1 3500
#> 15 seo company london 1 2500
#> 16 seo agency london 1 2400
#> 17 seo services london 1 2400
#> 18 seo consultants 1 2300
#> 19 local seo services 1 2000
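For readers who want to see how such a table can arise, the following base R sketch counts in how many queries each n-gram occurs. The helper ngrams_of and the limit of three words are assumptions for illustration. The filtering and ordering rules of kwr_ngrams itself may differ.

```r
# Hypothetical helper: all distinct n-grams (up to max_n words) of one query.
ngrams_of <- function(query, max_n = 3) {
  words <- strsplit(query, " ", fixed = TRUE)[[1]]
  out <- character(0)
  for (n in seq_len(min(max_n, length(words)))) {
    for (i in seq_len(length(words) - n + 1)) {
      out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  unique(out)  # count each n-gram at most once per query
}

queries <- c("seo agency", "seo agency london", "local seo")

# Number of queries in which each n-gram occurs, most frequent first.
counts <- sort(table(unlist(lapply(queries, ngrams_of))), decreasing = TRUE)
print(counts)
```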
Subqueries
Subqueries are essentially n-grams that also exist as separate queries. In other words, they are queries that are entirely and exactly contained in other, longer queries.
kwr |> kwr_subqueries()
#> # A tibble: 5 × 3
#> token n volume
#> <chr> <int> <dbl>
#> 1 seo 16 64100
#> 2 seo services 2 4400
#> 3 seo company 1 2500
#> 4 seo agency 1 2400
#> 5 local seo 1 2000
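The definition of a subquery can be expressed as a short base R sketch. The helper contains_phrase is hypothetical and assumes that queries consist of plain words separated by single spaces, without regular expression metacharacters.

```r
queries <- c("seo", "seo services", "local seo services", "seo agency")

# Hypothetical helper: does `phrase` occur in `texts` as a whole-word sequence?
contains_phrase <- function(phrase, texts) {
  grepl(paste0("(^| )", phrase, "( |$)"), texts)
}

# For every query, count the other (longer) queries that contain it.
n_longer <- vapply(
  queries,
  function(q) sum(contains_phrase(q, queries) & queries != q),
  integer(1)
)

# Subqueries are the queries found inside at least one other query.
subqueries <- n_longer[n_longer > 0]
print(subqueries)
```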
Collocations
Collocations are phrases of at least two words that occur together in queries more often than the frequencies of their individual words would suggest. They are useful for detecting typical multi-word phrases.
kwr |> kwr_collocations()
#> # A tibble: 5 × 6
#> collocation count count_nested length lambda z
#> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 what is 2 2 2 5.42 2.57
#> 2 local seo 2 1 2 3.33 2.01
#> 3 seo services 3 2 2 2.04 1.30
#> 4 seo agency 2 1 2 1.61 1.00
#> 5 seo company 2 1 2 1.61 1.00
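The intuition behind collocations can be illustrated with pointwise mutual information (PMI), which compares how often two words appear next to each other with how often that would be expected if they were independent. Note that PMI is a different and simpler measure than the lambda and z statistics in the output above. The sketch below is an assumption-laden illustration in base R, not the method used by kwr_collocations.

```r
queries <- c("what is seo", "what is a seo", "seo agency", "seo company")
tokens  <- strsplit(queries, " ", fixed = TRUE)

# Relative frequencies of single words.
word_tab <- table(unlist(tokens))
word_p   <- setNames(as.vector(word_tab), names(word_tab)) / sum(word_tab)

# Relative frequencies of adjacent word pairs (bigrams).
bigrams    <- unlist(lapply(tokens, function(w) paste(head(w, -1), tail(w, -1))))
bigram_tab <- table(bigrams)
bigram_p   <- setNames(as.vector(bigram_tab), names(bigram_tab)) / sum(bigram_tab)

# PMI: log ratio of the observed bigram frequency to the frequency
# expected if the two words occurred independently of each other.
pmi <- vapply(names(bigram_p), function(b) {
  parts <- strsplit(b, " ", fixed = TRUE)[[1]]
  log2(bigram_p[[b]] / (word_p[[parts[1]]] * word_p[[parts[2]]]))
}, numeric(1))

print(round(sort(pmi, decreasing = TRUE), 2))
```

On such a tiny sample PMI tends to favour rare pairs, which is one reason why more robust statistics are preferable on real data.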
Pattern testing
When you discover a useful pattern during your research, it’s a good idea to test it. The kwr_test_regex function is used for this purpose.
Let’s try the pattern “agenc”.
pattern <- "agenc"
kwr_test_regex(kwr, pattern)
#> $full
#> # A tibble: 3 × 5
#> query_normalized pred word succ match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 london seo agencies london seo agencies "" agenc
#> 2 seo agency seo agency "" agenc
#> 3 seo agency london seo agency "london" agenc
#>
#> $words
#> # A tibble: 2 × 2
#> word n
#> <chr> <int>
#> 1 agency 2
#> 2 agencies 1
#>
#> $around
#> # A tibble: 3 × 2
#> around n
#> <chr> <int>
#> 1 seo 2
#> 2 london 1
#> 3 london seo 1
#>
#> $around_ngrams
#> # A tibble: 3 × 2
#> token n
#> <chr> <int>
#> 1 london 2
#> 2 seo 2
#> 3 london seo 1
The result contains four tables:
- The first one contains all the queries that match the pattern, and from the other columns you can clearly see what comes before and after the matched text in the queries.
- The second contains a list of the full words or phrases that match the pattern.
- The third contains a list of unique texts that are located before and after the text that matches the pattern.
- Finally, the fourth contains the n-grams constructed from the texts in the third table.
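The first two tables can be roughly reproduced with base R regular expression functions. This is only a sketch of the idea. The variable names and the way full words are extracted are assumptions, not the implementation of kwr_test_regex.

```r
queries <- c("london seo agencies", "seo agency", "seo agency london", "seo company")
pattern <- "agenc"

# Queries that match the pattern at all.
matching <- queries[grepl(pattern, queries)]

# Extend the pattern to the whole word that contains the match.
word_pattern <- paste0("[[:alnum:]]*", pattern, "[[:alnum:]]*")
words <- unlist(regmatches(matching, gregexpr(word_pattern, matching)))
word_counts <- table(words)

# Text before and after the matched word in each query.
before <- trimws(sub(paste0(word_pattern, ".*$"), "", matching))
after  <- trimws(sub(paste0("^.*", word_pattern), "", matching))

print(word_counts)
print(data.frame(query = matching, before = before, after = after))
```

Looking at the full words is a quick way to confirm that a stem such as “agenc” covers both ‘agency’ and ‘agencies’ without catching unrelated words.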
Recipe creation and the classification itself
After you have tested the pattern, you create a classification recipe from it, just as you previously created the pruning recipes. Then you call the kwr_classify function.
kwr_add_pattern(
pattern = pattern,
recipe_file = recipe_file,
recipe_type = "label",
dim_name = "bussiness_type",
value = "agency"
)
kwr <- kwr |>
kwr_classify(recipe_file)
#> Label:bussiness_type
#> Value: agency
If you look at the queries now, they are already classified.
kwr |>
kwr_queries()
#> # A tibble: 17 × 8
#> query_normalized bussiness_type n_quer…¹ volume cpc query…² input source
#> <chr> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 seo NA 1 96000 NA seo NA NA
#> 2 seo meaning NA 1 6700 NA seo me… NA NA
#> 3 london seo agencies agency 2 6600 NA london… NA NA
#> 4 seo services NA 1 6400 NA seo se… NA NA
#> 5 what is a seo NA 1 5300 NA what i… NA NA
#> 6 seo london NA 1 5000 NA seo lo… NA NA
#> 7 what is seo NA 1 4800 NA what i… NA NA
#> 8 seo agency agency 1 4300 NA seo ag… NA NA
#> 9 seo company NA 1 4200 NA seo co… NA NA
#> 10 seo consultant NA 1 3500 NA seo co… NA NA
#> 11 seo service NA 1 3500 NA seo se… NA NA
#> 12 seo company london NA 1 2500 NA seo co… NA NA
#> 13 seo agency london agency 1 2400 NA seo ag… NA NA
#> 14 seo services london NA 1 2400 NA seo se… NA NA
#> 15 seo consultants NA 1 2300 NA seo co… NA NA
#> 16 local seo NA 1 2200 NA local … NA NA
#> 17 local seo services NA 1 2000 NA local … NA NA
#> # … with abbreviated variable names ¹n_queries, ²query_original
And then you just repeat the process with other patterns.
kwr_add_pattern("compan", recipe_file, "label", "bussiness_type", "company")
kwr_add_pattern("consult", recipe_file, "label", "bussiness_type", "consultancy")
kwr_add_pattern("london", recipe_file, "label", "place")
kwr <- kwr |>
kwr_classify(recipe_file)
#> Label:bussiness_type
#> Value: agency
#> Value: company
#> Value: consultancy
#> Label:place
kwr |>
kwr_queries() |>
select(1:5)
#> # A tibble: 17 × 5
#> query_normalized bussiness_type place n_queries volume
#> <chr> <chr> <chr> <int> <dbl>
#> 1 seo NA NA 1 96000
#> 2 seo meaning NA NA 1 6700
#> 3 london seo agencies agency london 2 6600
#> 4 seo services NA NA 1 6400
#> 5 what is a seo NA NA 1 5300
#> 6 seo london NA london 1 5000
#> 7 what is seo NA NA 1 4800
#> 8 seo agency agency NA 1 4300
#> 9 seo company company NA 1 4200
#> 10 seo consultant consultancy NA 1 3500
#> 11 seo service NA NA 1 3500
#> 12 seo company london company london 1 2500
#> 13 seo agency london agency london 1 2400
#> 14 seo services london NA london 1 2400
#> 15 seo consultants consultancy NA 1 2300
#> 16 local seo NA NA 1 2200
#> 17 local seo services NA NA 1 2000
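The labelling logic can be mimicked in base R as follows. This is a simplified sketch under stated assumptions: the vector label_rules is a hypothetical stand-in for the recipe file, a later rule simply overwrites an earlier one, and a pattern without an explicit value uses the matched text as the label, which is what the place column in the output above suggests.

```r
queries <- data.frame(
  query = c("seo agency", "london seo agencies", "seo company london",
            "seo consultants", "local seo"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for label recipes: value = pattern.
label_rules <- c(agency = "agenc", company = "compan", consultancy = "consult")

# Dimension with explicit values: assign the value wherever the pattern matches.
queries$bussiness_type <- NA_character_
for (value in names(label_rules)) {
  hit <- grepl(label_rules[[value]], queries$query)
  queries$bussiness_type[hit] <- value
}

# Dimension without a value: use the matched text itself as the label.
place_match <- regexpr("london", queries$query)
queries$place <- NA_character_
queries$place[place_match > 0] <- regmatches(queries$query, place_match)

print(queries)
```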
Exploring the results
During and after the classification, you can review the overall summary of the queries.
kwr |> kwr_summary()
#> Distinct input queries: 21
#> Normalized queries: 19
#> Pruned queries: 17
#> Classified queries: 9
#> Unclassified queries: 8
An overview of the values in each column (dimension) of the classification is also useful.
kwr |> kwr_dimension_table(bussiness_type)
#> # A tibble: 4 × 3
#> bussiness_type n volume
#> <chr> <int> <dbl>
#> 1 NA 10 134300
#> 2 agency 3 13300
#> 3 company 2 6700
#> 4 consultancy 2 5800
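A table of this kind is a plain group-and-sum over one dimension. The base R sketch below shows the principle on a handful of made-up rows taken from the example data. The placeholder label "(none)" for unclassified queries is an assumption of the sketch.

```r
classified <- data.frame(
  query          = c("seo agency", "seo agency london", "seo company", "seo", "local seo"),
  bussiness_type = c("agency", "agency", "company", NA, NA),
  volume         = c(4300, 2400, 4200, 96000, 2200),
  stringsAsFactors = FALSE
)

# Give unclassified queries an explicit group so that they are not dropped.
group <- ifelse(is.na(classified$bussiness_type), "(none)", classified$bussiness_type)

# Number of queries and total search volume per value of the dimension.
dimension_table <- data.frame(value = sort(unique(group)), stringsAsFactors = FALSE)
dimension_table$n      <- as.vector(table(group)[dimension_table$value])
dimension_table$volume <- as.vector(tapply(classified$volume, group, sum)[dimension_table$value])

print(dimension_table)
```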
Exporting results
The last step is presenting the results and putting them to use. The keywordr package does not yet provide direct support for this phase. However, the results can easily be presented or exported using standard R tools.
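As one possible route, a data frame of classified queries can be written to a CSV file with base R. The sketch below uses a small hand-made data frame in place of the real output and writes it to a temporary file. The file name is an arbitrary choice.

```r
# Stand-in for a table of classified queries.
classified <- data.frame(
  query_normalized = c("seo agency", "seo company london"),
  bussiness_type   = c("agency", "company"),
  place            = c(NA, "london"),
  volume           = c(4300, 2500),
  stringsAsFactors = FALSE
)

# Write the table to a CSV file; missing labels become empty cells.
out_file <- file.path(tempdir(), "classified_queries.csv")
write.csv(classified, out_file, row.names = FALSE, na = "")

# Read it back to check that nothing was lost.
reloaded <- read.csv(out_file, stringsAsFactors = FALSE, na.strings = "")
print(reloaded)
```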
The following output datasets are available.
Classified queries
kwr |> kwr_classified_queries()
#> # A tibble: 17 × 9
#> query_normalized bussine…¹ place n_que…² volume cpc query…³ input source
#> <chr> <chr> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 seo NA NA 1 96000 NA seo NA NA
#> 2 seo meaning NA NA 1 6700 NA seo me… NA NA
#> 3 london seo agencies agency lond… 2 6600 NA london… NA NA
#> 4 seo services NA NA 1 6400 NA seo se… NA NA
#> 5 what is a seo NA NA 1 5300 NA what i… NA NA
#> 6 seo london NA lond… 1 5000 NA seo lo… NA NA
#> 7 what is seo NA NA 1 4800 NA what i… NA NA
#> 8 seo agency agency NA 1 4300 NA seo ag… NA NA
#> 9 seo company company NA 1 4200 NA seo co… NA NA
#> 10 seo consultant consulta… NA 1 3500 NA seo co… NA NA
#> 11 seo service NA NA 1 3500 NA seo se… NA NA
#> 12 seo company london company lond… 1 2500 NA seo co… NA NA
#> 13 seo agency london agency lond… 1 2400 NA seo ag… NA NA
#> 14 seo services london NA lond… 1 2400 NA seo se… NA NA
#> 15 seo consultants consulta… NA 1 2300 NA seo co… NA NA
#> 16 local seo NA NA 1 2200 NA local … NA NA
#> 17 local seo services NA NA 1 2000 NA local … NA NA
#> # … with abbreviated variable names ¹bussiness_type, ²n_queries,
#> # ³query_original
Unclassified queries
kwr |> kwr_unclassified_queries()
#> # A tibble: 8 × 7
#> query_normalized n_queries volume cpc query_original input source
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 seo 1 96000 NA seo NA NA
#> 2 seo meaning 1 6700 NA seo meaning NA NA
#> 3 seo services 1 6400 NA seo services NA NA
#> 4 what is a seo 1 5300 NA what is a seo NA NA
#> 5 what is seo 1 4800 NA what is seo NA NA
#> 6 seo service 1 3500 NA seo service NA NA
#> 7 local seo 1 2200 NA local seo NA NA
#> 8 local seo services 1 2000 NA local seo services NA NA
Removed (irrelevant) queries
kwr |> kwr_removed_queries()
#> # A tibble: 2 × 7
#> query_normalized n_queries volume cpc query_original input source
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 seo ye-ji 2 28000 NA seo ye ji,seo ye-ji NA NA
#> 2 joe seo 1 4300 NA joe seo NA NA