The processing of search queries is controlled by rules that are stored in YAML files. Their structure is documented here.

File

The highest unit of structure is the file. You can have just one, or you can have several. Splitting a large set of rules across multiple files makes them more manageable and makes it easier for several people to collaborate on a project.

In addition, you can work with just a few files at a time. This splits the classification of queries into multiple phases, each of which is processed more quickly and is easier to follow.

Recipes

A file consists of a list of recipes.

- [recipe]
- [recipe]
- [recipe]

There are two basic groups of recipes: removal and classification.

Removal recipes

Usually, a single recipe file is all you need to remove unwanted input queries. The simplest recipe for removal looks like this:

- type: remove
  rules:
  - match: xxx

When this recipe is applied, it removes all queries that match the regular expression xxx.

A more complex removal recipe looks like this:

- type: remove
  rules:
  - match:
    - xxx
    - yyy
    except:
    - aaa
    - bbb

When applied, it removes all queries that match the regular expressions xxx or yyy (or both), unless they also match the regular expressions aaa or bbb (or both).
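The match/except logic described above can be sketched in base R. This is only an illustration of the rule, not the keywordr implementation: the sample queries, the helper matches_any() and the use of grepl() are assumptions made for this example.

```r
# Base R sketch of the match/except logic; not the keywordr implementation.
queries <- c("xxx one", "yyy two", "xxx aaa", "zzz three", "yyy bbb")

match_patterns  <- c("xxx", "yyy")
except_patterns <- c("aaa", "bbb")

# Hypothetical helper: joining the alternatives with "|" means
# "matches at least one of the patterns".
matches_any <- function(x, patterns) {
  grepl(paste(patterns, collapse = "|"), x, perl = TRUE)
}

# A query is removed when it matches a `match` pattern
# and does not match any `except` pattern.
removed <- matches_any(queries, match_patterns) &
  !matches_any(queries, except_patterns)

kept <- queries[!removed]
print(kept)
```

Queries that match neither list, such as "zzz three", are kept, and so are queries rescued by an except pattern.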

Classification recipes

The simplest classification recipe looks like this:

- type: label
  name: label_name
  rules:
  - match:
    - aaa
    - bbb

It says:

  1. Create the column (dimension, label) named label_name.
  2. Apply the regular expression aaa|bbb (which means aaa or bbb) to every query.
  3. Fill in the column label_name with the matched part of the query.
  4. If no part of the query matches the regular expression, leave the value empty (i.e. NA).
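The four steps above can be emulated in base R to see what ends up in the column. This is a rough sketch under the assumption that the first match in each query is used. The sample queries are made up, and keywordr itself may do the matching differently.

```r
# Base R sketch of the labelling steps; not the keywordr implementation.
queries <- c("aaa first", "second bbb", "third ccc")

# Step 2: the two patterns combined into one regular expression, aaa|bbb.
pattern <- paste(c("aaa", "bbb"), collapse = "|")

# regexpr() finds the first match in each query (-1 when there is none).
hits <- regexpr(pattern, queries, perl = TRUE)

# Steps 1, 3 and 4: create the column as NA, then fill in the matched
# text for those queries that actually matched.
label_name <- rep(NA_character_, length(queries))
label_name[hits != -1] <- regmatches(queries, hits)

result <- data.frame(query = queries, label_name = label_name)
print(result)
```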

Sometimes you don’t want the label value to be the same as the matched part of the query. For instance, you may want to match both shirt and shirts, but the label value should read shirt in both cases. In that case, use this recipe:

- type: label
  name: product
  values:
  - value: shirt
    rules:
    - match:
      - shirt
      - shirts

Or you can write the regular expression more concisely:

- type: label
  name: product
  values:
  - value: shirt
    rules:
    - match: shirts?
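To check that a pattern such as shirts? covers both forms before putting it into a recipe, you can try it in base R. The queries below are made up for illustration, and grepl() stands in for whatever matching keywordr performs internally.

```r
# Quick check of the pattern outside keywordr, using base R only.
queries <- c("white shirt", "blue shirts", "black hat")

# The "?" makes the preceding "s" optional, so both forms match.
is_shirt <- grepl("shirts?", queries, perl = TRUE)

# Every matching query gets the fixed value "shirt", the rest stay NA.
product <- ifelse(is_shirt, "shirt", NA_character_)
print(product)
```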

You can combine both methods and create multiple values of the same column (label), e.g.:

- type: label
  name: product
  rules:
  - match:
    - shoes
    - trousers
  values:
  - value: shirt
    rules:
    - match: shirts?
  - value: hat
    rules:
    - match: hats?

And you can even combine multiple recipes into one file like this:

- type: label
  name: product
  rules:
  - match:
    - shoes
    - trousers
  values:
  - value: shirt
    rules:
    - match: shirts?
  - value: hat
    rules:
    - match: hats?
- type: label
  name: price
  rules:
  - match:
    - cheap
    - luxury

What happens if you apply the classification recipes above to queries? Suppose you have the following queries:

#> # A tibble: 6 × 2
#>   query         volume
#>   <chr>          <dbl>
#> 1 outdoor shoes     10
#> 2 cheap shoes       10
#> 3 luxury shoes      10
#> 4 white shirt       10
#> 5 blue trousers     10
#> 6 black hat         10

And now classify those queries with the recipes above:

#> # A tibble: 6 × 5
#>   query_normalized product  price  n_queries volume
#>   <chr>            <chr>    <chr>      <int>  <dbl>
#> 1 black hat        hat      NA             1     10
#> 2 blue trousers    trousers NA             1     10
#> 3 cheap shoes      shoes    cheap          1     10
#> 4 luxury shoes     shoes    luxury         1     10
#> 5 outdoor shoes    shoes    NA             1     10
#> 6 white shirt      shirt    NA             1     10

Helper for recipe creation

In fact, you need not always edit the recipe files directly. The keywordr package contains a function that makes your work easier: `kwr_add_pattern()`. It adds a new regex pattern to a recipe file. For instance, I could recreate the recipes above with this code:

recipe_file <- file.path(tempdir(), "recipes.yml")
kwr_add_pattern(
  pattern = "shoes", 
  recipe_file = recipe_file, 
  recipe_type = "label", 
  dim_name = "product"
)
kwr_add_pattern(
  pattern = "trousers", 
  recipe_file = recipe_file, 
  recipe_type = "label", 
  dim_name = "product"
)
kwr_add_pattern(
  pattern = "shirts?", 
  recipe_file = recipe_file, 
  recipe_type = "label", 
  dim_name = "product", 
  value = "shirt"
)
kwr_add_pattern(
  pattern = "hats?", 
  recipe_file = recipe_file, 
  recipe_type = "label", 
  dim_name = "product", 
  value = "hat"
)
kwr_add_pattern(
  pattern = "cheap", 
  recipe_file = recipe_file, 
  recipe_type = "label", 
  dim_name = "price"
)
kwr_add_pattern(
  pattern = "luxury", 
  recipe_file = recipe_file, 
  recipe_type = "label", 
  dim_name = "price"
)
writeLines(readLines(recipe_file))
#> - type: label
#>   name: product
#>   rules:
#>   - match:
#>     - shoes
#>     - trousers
#>   values:
#>   - value: shirt
#>     rules:
#>     - match: shirts?
#>   - value: hat
#>     rules:
#>     - match: hats?
#> - type: label
#>   name: price
#>   rules:
#>   - match:
#>     - cheap
#>     - luxury