The processing of search queries is controlled by rules that are stored in YAML files. Their structure is documented here.
File
The top-level unit of structure is the file. You can have just one, or you can have several. Splitting a large set of rules across multiple files makes them more manageable and makes it easier for multiple people to collaborate on a project.
In addition, you can choose to work with only a few files at a time. This splits the classification of queries into multiple phases, each of which is quicker to process and easier to follow.
Recipes
A file consists of a list of recipes.
- [recipe]
- [recipe]
- [recipe]
There are two basic groups of recipes: removal and classification.
Removal recipes
Usually, a single recipe file is all you need to remove unwanted input queries. The simplest recipe for removal looks like this:
- type: remove
  rules:
  - match: xxx
When this recipe is applied, it removes all queries that match the regular expression xxx.
A more complex removal recipe looks like this:
- type: remove
  rules:
  - match:
    - xxx
    - yyy
    except:
    - aaa
    - bbb
When applied, it removes all queries that match regular expressions xxx or yyy (or both), unless they also match regular expressions aaa or bbb (or both).
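The match/except logic described above can be sketched in base R. This is only an illustration of the semantics, not the package's implementation: the helper apply_remove_rule() is hypothetical, and base R regular expressions are assumed to behave like the package's for these simple patterns.

```r
# Hypothetical helper (not part of keywordr): return the queries that a
# removal rule with the given `match` and `except` patterns would keep.
apply_remove_rule <- function(queries, match, except = character(0)) {
  # TRUE for every query matched by at least one of the patterns
  matches_any <- function(patterns) {
    if (length(patterns) == 0) {
      return(rep(FALSE, length(queries)))
    }
    grepl(paste(patterns, collapse = "|"), queries)
  }
  # A query is removed when it matches `match` and does not match `except`
  removed <- matches_any(match) & !matches_any(except)
  queries[!removed]
}

queries <- c("xxx one", "yyy two", "xxx aaa", "zzz three")
kept <- apply_remove_rule(
  queries,
  match = c("xxx", "yyy"),
  except = c("aaa", "bbb")
)
print(kept)
```

Here "xxx aaa" survives because it also matches one of the except patterns, and "zzz three" survives because it matches none of the match patterns.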
Classification recipes
The simplest classification recipe looks like this:
- type: label
  name: label_name
  rules:
  - match:
    - aaa
    - bbb
It says:
- Create the column (dimension, label) named label_name.
- Apply the regular expression aaa|bbb (which means aaa or bbb) to every query.
- Fill in the column label_name with the matched part of the query.
- If no part of the query matches the regular expression, leave the value empty (i.e. NA).
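The steps above can be approximated in base R as follows. This is a minimal sketch, not the package's code: extract_label() is a hypothetical helper, and it takes only the first match in each query, which is an assumption about cases the text does not cover.

```r
# Hypothetical helper (not part of keywordr): for each query, return the
# first part matched by any of the patterns, or NA when nothing matches.
extract_label <- function(queries, patterns) {
  regex <- paste(patterns, collapse = "|")   # e.g. "aaa|bbb"
  hits <- regexpr(regex, queries)            # -1 where there is no match
  found <- as.integer(hits) != -1L
  label <- rep(NA_character_, length(queries))
  label[found] <- regmatches(queries, hits)  # matched substrings only
  label
}

label_name <- extract_label(
  c("aaa first", "second bbb", "third"),
  patterns = c("aaa", "bbb")
)
print(label_name)
```

The third query matches neither pattern, so its label stays NA.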
Sometimes you don’t want the label value to be the same as the matched part of the query. For instance, you want to match both shirt and shirts, but the label value should read shirt in both cases. Then you use this recipe:
- type: label
  name: product
  values:
  - value: shirt
    rules:
    - match:
      - shirt
      - shirts
Or you can write regular expressions more efficiently:
- type: label
  name: product
  values:
  - value: shirt
    rules:
    - match: shirts?
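The difference from the pattern-only recipe is that the label receives a fixed value rather than the matched text. A rough base R sketch of that behaviour, using a hypothetical helper label_with_value() that is not part of the package:

```r
# Hypothetical helper (not part of keywordr): assign a fixed `value` to
# every query that matches any of the patterns, and NA to all others.
label_with_value <- function(queries, value, patterns) {
  regex <- paste(patterns, collapse = "|")
  ifelse(grepl(regex, queries), value, NA_character_)
}

product <- label_with_value(
  c("white shirt", "blue shirts", "black hat"),
  value = "shirt",
  patterns = "shirts?"
)
print(product)
```

Both "white shirt" and "blue shirts" get the value shirt, because the pattern shirts? matches the singular and the plural.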
You can combine both methods and create multiple values of the same column (label), e.g.:
- type: label
  name: product
  rules:
  - match:
    - shoes
    - trousers
  values:
  - value: shirt
    rules:
    - match: shirts?
  - value: hat
    rules:
    - match: hats?
And you can even combine multiple recipes into one file like this:
- type: label
  name: product
  rules:
  - match:
    - shoes
    - trousers
  values:
  - value: shirt
    rules:
    - match: shirts?
  - value: hat
    rules:
    - match: hats?
- type: label
  name: price
  rules:
  - match:
    - cheap
    - luxury
What happens if you apply the classification recipes above to queries? Suppose you have the following queries:
#> # A tibble: 6 × 2
#>   query         volume
#>   <chr>          <dbl>
#> 1 outdoor shoes     10
#> 2 cheap shoes       10
#> 3 luxury shoes      10
#> 4 white shirt       10
#> 5 blue trousers     10
#> 6 black hat         10
And now classify those queries with the recipes above:
#> # A tibble: 6 × 5
#>   query_normalized product  price  n_queries volume
#>   <chr>            <chr>    <chr>      <int>  <dbl>
#> 1 black hat        hat      NA             1     10
#> 2 blue trousers    trousers NA             1     10
#> 3 cheap shoes      shoes    cheap          1     10
#> 4 luxury shoes     shoes    luxury         1     10
#> 5 outdoor shoes    shoes    NA             1     10
#> 6 white shirt      shirt    NA             1     10
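To see why each query ends up with these labels, the product and price columns can be reproduced by hand in base R. This is only a sketch under assumptions: first_match() is a hypothetical helper, the package itself is not called, and the order in which pattern rules and fixed values are applied is assumed (no query in this example matches both).

```r
queries <- c(
  "outdoor shoes", "cheap shoes", "luxury shoes",
  "white shirt", "blue trousers", "black hat"
)

# Hypothetical helper: first substring matched by `regex`, or NA
first_match <- function(x, regex) {
  hits <- regexpr(regex, x)
  found <- as.integer(hits) != -1L
  out <- rep(NA_character_, length(x))
  out[found] <- regmatches(x, hits)
  out
}

# Recipe "product": pattern rules, then the fixed values shirt and hat
product <- first_match(queries, "shoes|trousers")
product[grepl("shirts?", queries)] <- "shirt"
product[grepl("hats?", queries)] <- "hat"

# Recipe "price": pattern rules only
price <- first_match(queries, "cheap|luxury")

classified <- data.frame(query = queries, product = product, price = price)
# Sort by query, as in the classified table above
classified <- classified[order(classified$query), ]
print(classified)
```

The sketch covers only the product and price columns; the n_queries and volume columns of the classified table come from the package's own aggregation and are not reproduced here.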
Helper for recipe creation
In fact, you need not always edit the recipe files directly. The keywordr package contains a function that makes your work easier: kwr_add_pattern(). It adds a new regex pattern to a recipe file. For instance, I could recreate the recipes above with this code:
library(keywordr)

# Write the recipes to a temporary file
recipe_file <- file.path(tempdir(), "recipes.yml")
# Patterns without a value: the label is the matched text
kwr_add_pattern(
  pattern = "shoes",
  recipe_file = recipe_file,
  recipe_type = "label",
  dim_name = "product"
)
kwr_add_pattern(
  pattern = "trousers",
  recipe_file = recipe_file,
  recipe_type = "label",
  dim_name = "product"
)
# Patterns with a value: the label is the given value
kwr_add_pattern(
  pattern = "shirts?",
  recipe_file = recipe_file,
  recipe_type = "label",
  dim_name = "product",
  value = "shirt"
)
kwr_add_pattern(
  pattern = "hats?",
  recipe_file = recipe_file,
  recipe_type = "label",
  dim_name = "product",
  value = "hat"
)
kwr_add_pattern(
  pattern = "cheap",
  recipe_file = recipe_file,
  recipe_type = "label",
  dim_name = "price"
)
kwr_add_pattern(
  pattern = "luxury",
  recipe_file = recipe_file,
  recipe_type = "label",
  dim_name = "price"
)
# Print the resulting recipe file
writeLines(readLines(recipe_file))
#> - type: label
#>   name: product
#>   rules:
#>   - match:
#>     - shoes
#>     - trousers
#>   values:
#>   - value: shirt
#>     rules:
#>     - match: shirts?
#>   - value: hat
#>     rules:
#>     - match: hats?
#> - type: label
#>   name: price
#>   rules:
#>   - match:
#>     - cheap
#>     - luxury