
Setup and prerequisites

Used libraries

The keywordr package is designed to work well with the tidyverse, so I always load the tidyverse alongside it.
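Loading the libraries presumably looks like the usual two calls. This is a minimal sketch that assumes both packages are already installed:

```r
# Assumes the tidyverse and keywordr packages are already installed.
library(tidyverse)
library(keywordr)
```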

Input data

As input data you can use either a CSV export from Marketing Miner called Keyword Suggestions, or a data frame (or tibble) with the following structure:

#> 'data.frame':    0 obs. of  5 variables:
#>  $ query : chr 
#>  $ volume: int 
#>  $ cpc   : num 
#>  $ input : chr 
#>  $ source: chr

Only the first two columns (query and volume) are required; the other three (cpc, input and source) are optional.
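For illustration, a data frame with all five columns might be built as in the sketch below. The cpc, input and source values are made-up placeholders, not real data:

```r
# Placeholder values for cpc, input and source, for illustration only.
full_input <- data.frame(
  query  = c("seo", "seo services"),
  volume = c(96000L, 6400L),
  cpc    = c(1.5, 2.3),
  input  = c("seo", "seo"),
  source = c("example", "example")
)

# Shows the same column names and types as the structure listed above.
str(full_input)
```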

For the purposes of this tutorial, I will use the second method and prepare the input data as a tibble first.

input_data <- tribble(
  ~query,             ~volume,
  "seo",                96000,
  "seo ye-ji",          22000,
  "seo meaning",         6700,
  "seo services",        6400,
  "seo ye ji",           6000,
  "what is a seo",       5300,
  "seo london",          5000,
  "what is seo",         4800,
  "seo agency",          4300,
  "joe seo",             4300,
  "seo company",         4200,
  "london seo agencies", 4100,
  "seo consultant",      3500,
  "seo service",         3500,
  "seo company london",  2500,
  "seo agencies london", 2500,
  "seo agency london",   2400,
  "seo services london", 2400,
  "seo consultants",     2300,
  "local seo",           2200,
  "local seo services",  2000
)
print(input_data, n = 25)
#> # A tibble: 21 × 2
#>    query               volume
#>    <chr>                <dbl>
#>  1 seo                  96000
#>  2 seo ye-ji            22000
#>  3 seo meaning           6700
#>  4 seo services          6400
#>  5 seo ye ji             6000
#>  6 what is a seo         5300
#>  7 seo london            5000
#>  8 what is seo           4800
#>  9 seo agency            4300
#> 10 joe seo               4300
#> 11 seo company           4200
#> 12 london seo agencies   4100
#> 13 seo consultant        3500
#> 14 seo service           3500
#> 15 seo company london    2500
#> 16 seo agencies london   2500
#> 17 seo agency london     2400
#> 18 seo services london   2400
#> 19 seo consultants       2300
#> 20 local seo             2200
#> 21 local seo services    2000

A kwresearch object

During keyword analysis, all data is stored in an object of class kwresearch. Create such an object, called for example kwr, right at the beginning. You can import the input data in the same function call. For English (or other languages that do not use diacritics), I recommend setting accentize to FALSE.

kwr <- kwresearch(input_data, accentize = FALSE)

Now you can check that the data has been imported.

kwr |> kwr_queries()
#> # A tibble: 19 × 7
#>    query_normalized    n_queries volume   cpc query_original        input source
#>    <chr>                   <int>  <dbl> <dbl> <chr>                 <chr> <chr> 
#>  1 seo                         1  96000    NA seo                   NA    NA    
#>  2 seo ye-ji                   2  28000    NA seo ye ji,seo ye-ji   NA    NA    
#>  3 seo meaning                 1   6700    NA seo meaning           NA    NA    
#>  4 london seo agencies         2   6600    NA london seo agencies,… NA    NA    
#>  5 seo services                1   6400    NA seo services          NA    NA    
#>  6 what is a seo               1   5300    NA what is a seo         NA    NA    
#>  7 seo london                  1   5000    NA seo london            NA    NA    
#>  8 what is seo                 1   4800    NA what is seo           NA    NA    
#>  9 joe seo                     1   4300    NA joe seo               NA    NA    
#> 10 seo agency                  1   4300    NA seo agency            NA    NA    
#> 11 seo company                 1   4200    NA seo company           NA    NA    
#> 12 seo consultant              1   3500    NA seo consultant        NA    NA    
#> 13 seo service                 1   3500    NA seo service           NA    NA    
#> 14 seo company london          1   2500    NA seo company london    NA    NA    
#> 15 seo agency london           1   2400    NA seo agency london     NA    NA    
#> 16 seo services london         1   2400    NA seo services london   NA    NA    
#> 17 seo consultants             1   2300    NA seo consultants       NA    NA    
#> 18 local seo                   1   2200    NA local seo             NA    NA    
#> 19 local seo services          1   2000    NA local seo services    NA    NA

Note that the queries ‘seo ye-ji’ and ‘seo ye ji’ were merged into one. The same happened to ‘london seo agencies’ and ‘seo agencies london’, which is why the output has 19 rows instead of 21. This is the result of normalization, which tries to unify queries that differ only by punctuation, accents or word order. If you don’t like the result, you can switch normalization off with the normalize argument.
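To get a feel for what such merging does, here is a rough base R sketch. It is not keywordr’s actual algorithm; the normalize_key helper is hypothetical and only imitates the idea by stripping punctuation and ignoring word order:

```r
# Hypothetical helper, NOT keywordr's real normalization.
# It builds a key that ignores punctuation and word order.
normalize_key <- function(q) {
  q <- gsub("[[:punct:]]+", " ", tolower(q))   # punctuation -> space
  words <- strsplit(trimws(q), "\\s+")         # split into words
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

queries <- data.frame(
  query  = c("seo ye-ji", "seo ye ji", "london seo agencies",
             "seo agencies london", "seo london"),
  volume = c(22000, 6000, 4100, 2500, 5000)
)
queries$key <- normalize_key(queries$query)

# Queries sharing a key are merged and their volumes summed.
merged <- aggregate(volume ~ key, data = queries, FUN = sum)
print(merged)
```

The five input queries collapse into three groups, and the summed volumes (28000 and 6600) agree with the kwr_queries() output above.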

You are ready for pruning and classification now.

Pruning

Pruning is a process in which you remove irrelevant queries. They remain in the source data and can still be displayed in some outputs, but they no longer get in your way during classification.

Your task is to find the patterns that identify the irrelevant queries and to specify these patterns as regular expressions. Because there are so few queries in the example, we can spot the patterns at a glance: they are ‘ye-ji’ and ‘joe’.
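If you want to see what such patterns would catch before writing them to a recipe, a quick check with base R’s grepl is enough. This sketch uses a handful of the example queries and does not touch keywordr at all:

```r
queries <- c("seo", "seo ye-ji", "joe seo", "seo agency")
remove_patterns <- c("ye-ji", "joe")

# TRUE when a query matches at least one removal pattern.
is_irrelevant <- grepl(paste(remove_patterns, collapse = "|"), queries)

queries[is_irrelevant]    # candidates for removal
queries[!is_irrelevant]   # queries that would be kept
```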

You remove the queries matching these patterns as follows:

recipe_file <- file.path(tempdir(), "recipes.yml")

kwr_add_pattern("ye-ji", recipe_file, recipe_type = "remove")
kwr_add_pattern("joe", recipe_file, recipe_type = "remove")

kwr <- kwr |> 
  kwr_prune(recipe_file)
#> Removing queries...
#> Removed 2 queries out of 19. Duration: 0.006s

If you check the queries now, you will find that the irrelevant ones are gone.

kwr |> kwr_queries()
#> # A tibble: 17 × 7
#>    query_normalized    n_queries volume   cpc query_original        input source
#>    <chr>                   <int>  <dbl> <dbl> <chr>                 <chr> <chr> 
#>  1 seo                         1  96000    NA seo                   NA    NA    
#>  2 seo meaning                 1   6700    NA seo meaning           NA    NA    
#>  3 london seo agencies         2   6600    NA london seo agencies,… NA    NA    
#>  4 seo services                1   6400    NA seo services          NA    NA    
#>  5 what is a seo               1   5300    NA what is a seo         NA    NA    
#>  6 seo london                  1   5000    NA seo london            NA    NA    
#>  7 what is seo                 1   4800    NA what is seo           NA    NA    
#>  8 seo agency                  1   4300    NA seo agency            NA    NA    
#>  9 seo company                 1   4200    NA seo company           NA    NA    
#> 10 seo consultant              1   3500    NA seo consultant        NA    NA    
#> 11 seo service                 1   3500    NA seo service           NA    NA    
#> 12 seo company london          1   2500    NA seo company london    NA    NA    
#> 13 seo agency london           1   2400    NA seo agency london     NA    NA    
#> 14 seo services london         1   2400    NA seo services london   NA    NA    
#> 15 seo consultants             1   2300    NA seo consultants       NA    NA    
#> 16 local seo                   1   2200    NA local seo             NA    NA    
#> 17 local seo services          1   2000    NA local seo services    NA    NA

What exactly happened

Using the kwr_add_pattern function, you created a recipe file called ‘recipes.yml’ in the temporary directory and wrote the ‘ye-ji’ and ‘joe’ patterns into it. The recipe type is remove, which means that queries matching the recipe should be removed. You can read more about recipes and their structure in the ‘Removal and classification recipes’ vignette.

You then applied the kwr_prune function to the kwr object, which executed all recipes with type remove from the recipe file.

On such a small dataset the benefit is hard to see, but on data of the usual size a single pattern can capture, and thus remove, hundreds of queries.

Classification

Your next, and main, task is to classify the queries. We will start with exploration, which is supported by three functions: kwr_subqueries, kwr_collocations and kwr_ngrams.

Exploration

N-grams

The kwr_ngrams function returns so-called n-grams, i.e. single words or multi-word phrases contained in the queries. The output is sorted by the number of queries in which the n-gram is found, but you can reorder it by search volume using the arrange function from the dplyr package.

kwr |> kwr_ngrams()
#> # A tibble: 19 × 3
#>    token                   n volume
#>    <chr>               <int>  <dbl>
#>  1 seo                    17 160100
#>  2 london                  5  18900
#>  3 seo services            3  10800
#>  4 what is                 2  10100
#>  5 seo agency              2   6700
#>  6 seo company             2   6700
#>  7 local seo               2   4200
#>  8 seo meaning             1   6700
#>  9 london seo agencies     1   6600
#> 10 what is a seo           1   5300
#> 11 seo london              1   5000
#> 12 what is seo             1   4800
#> 13 seo consultant          1   3500
#> 14 seo service             1   3500
#> 15 seo company london      1   2500
#> 16 seo agency london       1   2400
#> 17 seo services london     1   2400
#> 18 seo consultants         1   2300
#> 19 local seo services      1   2000
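Reordering by search volume could look like the sketch below. To stay self-contained, it rebuilds a few rows of the output above as a tibble (the ngrams object is just a stand-in); with a real kwr object you would pipe kwr_ngrams() straight into arrange().

```r
library(dplyr)

# A few rows copied from the kwr_ngrams() output above.
ngrams <- tibble(
  token  = c("seo", "london", "seo services", "what is", "local seo", "seo meaning"),
  n      = c(17L, 5L, 3L, 2L, 2L, 1L),
  volume = c(160100, 18900, 10800, 10100, 4200, 6700)
)

# Sort by search volume, highest first.
ngrams_by_volume <- ngrams |> arrange(desc(volume))
print(ngrams_by_volume)

# With a real kwresearch object the equivalent would be:
# kwr |> kwr_ngrams() |> arrange(desc(volume))
```

Here ‘seo meaning’ moves ahead of ‘local seo’, because it has the higher volume even though it occurs in fewer queries.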

Subqueries

Subqueries are essentially n-grams that also exist as separate queries. In other words, they are queries that are entirely and exactly contained in other, longer queries.

kwr |> kwr_subqueries()
#> # A tibble: 5 × 3
#>   token            n volume
#>   <chr>        <int>  <dbl>
#> 1 seo             16  64100
#> 2 seo services     2   4400
#> 3 seo company      1   2500
#> 4 seo agency       1   2400
#> 5 local seo        1   2000
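The idea can be imitated in a few lines of base R. This is a simplified sketch on a subset of the example queries, not the package’s implementation; the contains_phrase helper is hypothetical and checks only whole-word containment:

```r
queries <- data.frame(
  query  = c("seo", "seo services", "seo services london", "local seo",
             "local seo services", "seo agency", "seo agency london"),
  volume = c(96000, 6400, 2400, 2200, 2000, 4300, 2400)
)

# Hypothetical helper: TRUE where `phrase` occurs in x as whole words.
contains_phrase <- function(phrase, x) {
  grepl(paste0("(^| )", phrase, "( |$)"), x)
}

# For each query, count the OTHER queries that contain it and sum their volume.
subq <- do.call(rbind, lapply(seq_len(nrow(queries)), function(i) {
  hit <- contains_phrase(queries$query[i], queries$query)
  hit[i] <- FALSE   # a query is not a subquery of itself
  data.frame(token = queries$query[i], n = sum(hit),
             volume = sum(queries$volume[hit]))
}))

# Keep only queries that really are contained in some longer query.
subq <- subq[subq$n > 0, ]
subq <- subq[order(-subq$n), ]
print(subq)
```

On this subset, ‘seo services’ is found in two longer queries with a combined volume of 4400, which matches the corresponding row of the kwr_subqueries() output.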

Collocations

Collocations are phrases of at least two words that occur together in queries more often than you would expect from the frequencies of the individual words. They are useful for detecting typical multi-word phrases.

kwr |> kwr_collocations()
#> # A tibble: 5 × 6
#>   collocation  count count_nested length lambda     z
#>   <chr>        <int>        <int>  <dbl>  <dbl> <dbl>
#> 1 what is          2            2      2   5.42  2.57
#> 2 local seo        2            1      2   3.33  2.01
#> 3 seo services     3            2      2   2.04  1.30
#> 4 seo agency       2            1      2   1.61  1.00
#> 5 seo company      2            1      2   1.61  1.00

Pattern testing

When you discover a useful pattern during your research, it’s a good idea to test it. The kwr_test_regex function is used for this purpose.

Let’s try the pattern “agenc”.

pattern <- "agenc"
kwr_test_regex(kwr, pattern)
#> $full
#> # A tibble: 3 × 5
#>   query_normalized    pred       word     succ     match
#>   <chr>               <chr>      <chr>    <chr>    <chr>
#> 1 london seo agencies london seo agencies ""       agenc
#> 2 seo agency          seo        agency   ""       agenc
#> 3 seo agency london   seo        agency   "london" agenc
#> 
#> $words
#> # A tibble: 2 × 2
#>   word         n
#>   <chr>    <int>
#> 1 agency       2
#> 2 agencies     1
#> 
#> $around
#> # A tibble: 3 × 2
#>   around         n
#>   <chr>      <int>
#> 1 seo            2
#> 2 london         1
#> 3 london seo     1
#> 
#> $around_ngrams
#> # A tibble: 3 × 2
#>   token          n
#>   <chr>      <int>
#> 1 london         2
#> 2 seo            2
#> 3 london seo     1

As a result you get four tables:

  • The first ($full) contains all the queries that match the pattern. The other columns show clearly what precedes and follows the matched text in each query.
  • The second ($words) lists the full words or phrases that match the pattern.
  • The third ($around) lists the unique texts located before and after the text that matches the pattern.
  • Finally, the fourth ($around_ngrams) contains the n-grams constructed from the texts in the third table.
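The first two tables can be approximated with base R regular-expression functions. The sketch below is only an illustration of the idea under my own assumptions, not how kwr_test_regex is implemented:

```r
queries <- c("london seo agencies", "seo agency", "seo agency london", "seo company")
pattern <- "agenc"

# Keep only the queries that match the pattern.
matched <- queries[grepl(pattern, queries)]

# Expand the match to the whole word that contains the pattern.
m <- regexpr(paste0("\\S*", pattern, "\\S*"), matched, perl = TRUE)
start <- as.integer(m)
len <- attr(m, "match.length")

word <- regmatches(matched, m)
pred <- trimws(substr(matched, 1, start - 1))               # text before the word
succ <- trimws(substr(matched, start + len, nchar(matched))) # text after the word

full <- data.frame(query = matched, pred = pred, word = word, succ = succ)
print(full)

# Frequency of the matched whole words.
print(table(word))
```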

Recipe creation and the classification itself

Once you have tested the pattern, create a classification recipe from it, just as you created the pruning recipes earlier. Then call the kwr_classify function.

kwr_add_pattern(
  pattern = pattern,
  recipe_file = recipe_file,
  recipe_type = "label",
  dim_name = "bussiness_type",
  value = "agency"
)
kwr <- kwr |> 
  kwr_classify(recipe_file)
#> Label:bussiness_type
#>   Value: agency

If you look at the queries now, those that match the pattern are already classified.

kwr |> 
  kwr_queries()
#> # A tibble: 17 × 8
#>    query_normalized    bussiness_type n_quer…¹ volume   cpc query…² input source
#>    <chr>               <chr>             <int>  <dbl> <dbl> <chr>   <chr> <chr> 
#>  1 seo                 NA                    1  96000    NA seo     NA    NA    
#>  2 seo meaning         NA                    1   6700    NA seo me… NA    NA    
#>  3 london seo agencies agency                2   6600    NA london… NA    NA    
#>  4 seo services        NA                    1   6400    NA seo se… NA    NA    
#>  5 what is a seo       NA                    1   5300    NA what i… NA    NA    
#>  6 seo london          NA                    1   5000    NA seo lo… NA    NA    
#>  7 what is seo         NA                    1   4800    NA what i… NA    NA    
#>  8 seo agency          agency                1   4300    NA seo ag… NA    NA    
#>  9 seo company         NA                    1   4200    NA seo co… NA    NA    
#> 10 seo consultant      NA                    1   3500    NA seo co… NA    NA    
#> 11 seo service         NA                    1   3500    NA seo se… NA    NA    
#> 12 seo company london  NA                    1   2500    NA seo co… NA    NA    
#> 13 seo agency london   agency                1   2400    NA seo ag… NA    NA    
#> 14 seo services london NA                    1   2400    NA seo se… NA    NA    
#> 15 seo consultants     NA                    1   2300    NA seo co… NA    NA    
#> 16 local seo           NA                    1   2200    NA local … NA    NA    
#> 17 local seo services  NA                    1   2000    NA local … NA    NA    
#> # … with abbreviated variable names ¹​n_queries, ²​query_original

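Then you just repeat the process with other patterns. Note that the place recipe is added without a value; as the output shows, the matched text (‘london’) then appears as the label.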

kwr_add_pattern("compan", recipe_file, "label", "bussiness_type", "company")
kwr_add_pattern("consult", recipe_file, "label", "bussiness_type", "consultancy")
kwr_add_pattern("london", recipe_file, "label", "place")

kwr <- kwr |> 
  kwr_classify(recipe_file)
#> Label:bussiness_type
#>   Value: agency
#>   Value: company
#>   Value: consultancy
#> Label:place

kwr |> 
  kwr_queries() |> 
  select(1:5)
#> # A tibble: 17 × 5
#>    query_normalized    bussiness_type place  n_queries volume
#>    <chr>               <chr>          <chr>      <int>  <dbl>
#>  1 seo                 NA             NA             1  96000
#>  2 seo meaning         NA             NA             1   6700
#>  3 london seo agencies agency         london         2   6600
#>  4 seo services        NA             NA             1   6400
#>  5 what is a seo       NA             NA             1   5300
#>  6 seo london          NA             london         1   5000
#>  7 what is seo         NA             NA             1   4800
#>  8 seo agency          agency         NA             1   4300
#>  9 seo company         company        NA             1   4200
#> 10 seo consultant      consultancy    NA             1   3500
#> 11 seo service         NA             NA             1   3500
#> 12 seo company london  company        london         1   2500
#> 13 seo agency london   agency         london         1   2400
#> 14 seo services london NA             london         1   2400
#> 15 seo consultants     consultancy    NA             1   2300
#> 16 local seo           NA             NA             1   2200
#> 17 local seo services  NA             NA             1   2000
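Conceptually, this kind of labelling assigns a value to every query that matches a pattern. The base R sketch below imitates the idea on a few queries; it is an illustration under my own assumptions, not the package’s internals, and the names label and place are just local variables:

```r
queries <- c("seo agency", "seo company london", "seo consultants",
             "seo london", "what is seo")

# Pattern -> label value, mirroring the recipes above.
labels <- c(agenc = "agency", compan = "company", consult = "consultancy")

# Start with no label and fill it in pattern by pattern.
label <- rep(NA_character_, length(queries))
for (p in names(labels)) {
  label[grepl(p, queries)] <- labels[[p]]
}

# A pattern without an explicit value: use the matched text itself.
m <- regexpr("london", queries)
place <- rep(NA_character_, length(queries))
place[m > 0] <- regmatches(queries, m)

print(data.frame(query = queries, label = label, place = place))
```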

Exploring the results

During and after the classification, you can review the overall summary of the queries.

kwr |> kwr_summary()
#> Distinct input queries:  21 
#> Normalized queries:      19 
#> Pruned queries:          17 
#> Classified queries:      9 
#> Unclassified queries:    8

An overview of the values in each column (dimension) of the classification is also useful.

kwr |> kwr_dimension_table(bussiness_type)
#> # A tibble: 4 × 3
#>   bussiness_type     n volume
#>   <chr>          <int>  <dbl>
#> 1 NA                10 134300
#> 2 agency             3  13300
#> 3 company            2   6700
#> 4 consultancy        2   5800

Exporting results

The last step is to present the results and put them to practical use. The keywordr package does not yet provide direct support for this phase. However, the results can easily be presented or exported using standard R tools.

The following output datasets are available.

Classified queries

kwr |> kwr_classified_queries()
#> # A tibble: 17 × 9
#>    query_normalized    bussine…¹ place n_que…² volume   cpc query…³ input source
#>    <chr>               <chr>     <chr>   <int>  <dbl> <dbl> <chr>   <chr> <chr> 
#>  1 seo                 NA        NA          1  96000    NA seo     NA    NA    
#>  2 seo meaning         NA        NA          1   6700    NA seo me… NA    NA    
#>  3 london seo agencies agency    lond…       2   6600    NA london… NA    NA    
#>  4 seo services        NA        NA          1   6400    NA seo se… NA    NA    
#>  5 what is a seo       NA        NA          1   5300    NA what i… NA    NA    
#>  6 seo london          NA        lond…       1   5000    NA seo lo… NA    NA    
#>  7 what is seo         NA        NA          1   4800    NA what i… NA    NA    
#>  8 seo agency          agency    NA          1   4300    NA seo ag… NA    NA    
#>  9 seo company         company   NA          1   4200    NA seo co… NA    NA    
#> 10 seo consultant      consulta… NA          1   3500    NA seo co… NA    NA    
#> 11 seo service         NA        NA          1   3500    NA seo se… NA    NA    
#> 12 seo company london  company   lond…       1   2500    NA seo co… NA    NA    
#> 13 seo agency london   agency    lond…       1   2400    NA seo ag… NA    NA    
#> 14 seo services london NA        lond…       1   2400    NA seo se… NA    NA    
#> 15 seo consultants     consulta… NA          1   2300    NA seo co… NA    NA    
#> 16 local seo           NA        NA          1   2200    NA local … NA    NA    
#> 17 local seo services  NA        NA          1   2000    NA local … NA    NA    
#> # … with abbreviated variable names ¹​bussiness_type, ²​n_queries,
#> #   ³​query_original

Unclassified queries

kwr |> kwr_unclassified_queries()
#> # A tibble: 8 × 7
#>   query_normalized   n_queries volume   cpc query_original     input source
#>   <chr>                  <int>  <dbl> <dbl> <chr>              <chr> <chr> 
#> 1 seo                        1  96000    NA seo                NA    NA    
#> 2 seo meaning                1   6700    NA seo meaning        NA    NA    
#> 3 seo services               1   6400    NA seo services       NA    NA    
#> 4 what is a seo              1   5300    NA what is a seo      NA    NA    
#> 5 what is seo                1   4800    NA what is seo        NA    NA    
#> 6 seo service                1   3500    NA seo service        NA    NA    
#> 7 local seo                  1   2200    NA local seo          NA    NA    
#> 8 local seo services         1   2000    NA local seo services NA    NA

Removed (irrelevant) queries

kwr |> kwr_removed_queries()
#> # A tibble: 2 × 7
#>   query_normalized n_queries volume   cpc query_original      input source
#>   <chr>                <int>  <dbl> <dbl> <chr>               <chr> <chr> 
#> 1 seo ye-ji                2  28000    NA seo ye ji,seo ye-ji NA    NA    
#> 2 joe seo                  1   4300    NA joe seo             NA    NA
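Since all of these datasets are ordinary tibbles, exporting them needs nothing beyond base R. The sketch below writes a small stand-in data frame to a CSV file in the temporary directory; the results object and the file name are made up for the example, and with real data you would pass the output of kwr_classified_queries() instead.

```r
# Stand-in for the tibble returned by kwr_classified_queries().
results <- data.frame(
  query_normalized = c("seo agency", "seo company london", "seo"),
  label            = c("agency", "company", NA),
  volume           = c(4300, 2500, 96000)
)

# Hypothetical output path in the session's temporary directory.
out_file <- file.path(tempdir(), "classified_queries.csv")
write.csv(results, out_file, row.names = FALSE)

# Read the file back to confirm the round trip.
check <- read.csv(out_file)
print(check)
```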