Merino Jobs Operations
CSV Remote Settings Uploader Job
The CSV remote settings uploader is a job that uploads suggestions data in a CSV file to remote settings. It takes two inputs:
- A CSV file. The first row in the file is assumed to be a header that names the fields (columns) in the data.
- A Python module that validates the CSV contents and describes how to convert it into suggestions JSON.
If you're uploading suggestions from a Google sheet, you can export a CSV file from File > Download > Comma Separated Values (.csv). Make sure the first row in the sheet is a header that names the columns.
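To illustrate how the header row names the fields, here is a minimal sketch that uses Python's standard csv module. The column names and row contents are hypothetical and are not taken from any real suggestions sheet.

```python
import csv
import io

# Hypothetical CSV content; the column names are made up for illustration.
CSV_TEXT = """Title,URL,Keywords
Example page,https://example.com/,"example, sample"
"""

# DictReader treats the first row as the header and keys each row by it.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)

print(reader.fieldnames)    # the field (column) names from the header row
print(rows[0]["Keywords"])  # the raw, comma-delimited keywords string
```

Each data row becomes a dict keyed by the header names, which is why the header row must be present and accurate.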
Uploading suggestions (Step by step)
If you're uploading a type of suggestion that the uploader already supports, skip to Running the uploader below. If you're not sure whether it's supported, check in the merino/jobs/csv_rs_uploader/ directory for a file named similarly to the type.
To upload a new type of suggestion, follow the steps below. In summary, first you'll create a Python module that implements a model for the suggestion type, and then you'll run the uploader.
1. Create a Python model module for the new suggestion type
Add a Python module to merino/jobs/csv_rs_uploader/. It's probably easiest to copy an existing model module like mdn.py, follow along with the steps here, and modify it for the new suggestion type. Name the file according to the suggestion type.
This file will define the model of the new suggestion type as it will be serialized in the output JSON, perform validation and conversion of the input data in the CSV, and define how the input data should map to the output JSON.
2. Add the Suggestion class
In the module, implement a class called Suggestion that derives from BaseSuggestion in merino.jobs.csv_rs_uploader.base or RowMajorBaseSuggestion in merino.jobs.csv_rs_uploader.row_major_base. This class will be the model of the new suggestion type. BaseSuggestion itself derives from Pydantic's BaseModel, so the validation the class performs is based on Pydantic, which is used throughout Merino. BaseSuggestion is implemented in base.py. If the CSV data is row-major, use RowMajorBaseSuggestion.
3. Add suggestion fields to the class
Add a field to the class for each property that should appear in the output JSON (except score, which the uploader will add automatically). Name each field as you would like it to be named in the JSON. Give each field a type so that Pydantic can validate it. For URL fields, use HttpUrl from the pydantic module.
4. Add validator methods to the class
Add a method annotated with Pydantic's @field_validator decorator for each field.
Each validator method should transform its field's input value into an appropriate output value and raise a ValueError
if the input value is invalid.
Pydantic will call these methods automatically as it performs validation.
Their return values will be used as the values in the output JSON.
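As a rough illustration of steps 3 and 4, the sketch below defines typed fields and a validator with Pydantic v2. To stay self-contained it derives from Pydantic's BaseModel directly instead of BaseSuggestion, and the field names and the cleanup logic are assumptions made for illustration, not Merino's actual model.

```python
from pydantic import BaseModel, HttpUrl, ValidationError, field_validator


class Suggestion(BaseModel):
    """Standalone stand-in; a real model would derive from BaseSuggestion."""

    # Hypothetical fields, named as they should appear in the output JSON.
    url: HttpUrl
    title: str

    @field_validator("title", mode="before")
    @classmethod
    def validate_title(cls, value: str) -> str:
        # Transform the input value: strip the ends and collapse whitespace.
        cleaned = " ".join(str(value).split())
        if not cleaned:
            # Pydantic reports a ValueError raised here as a ValidationError.
            raise ValueError("title must not be empty")
        return cleaned


ok = Suggestion(url="https://example.com/docs", title="  Hello   world ")
print(ok.title)

try:
    Suggestion(url="not a url", title="   ")
except ValidationError as exc:
    print(f"rejected with {exc.error_count()} validation errors")
```

The validator's return value is what ends up in the model, so any cleanup done there carries through to the serialized output.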
BaseSuggestion implements two helpers you should use:
- _validate_str() - Validates a string value and returns the validated value. Leading and trailing whitespace is stripped, and all whitespace is replaced with spaces and collapsed.
- _validate_keywords() - The uploader assumes that lists of keywords are serialized in the input data as comma-delimited strings. This helper method takes a comma-delimited string and splits it into individual keyword strings. Each keyword is converted to lowercase, some non-ASCII characters are replaced with ASCII equivalents that users are more likely to type, leading and trailing whitespace is stripped, all whitespace is replaced with spaces and collapsed, and duplicate keywords are removed. Returns the list of keyword strings.
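The normalization that these helpers are described as performing can be approximated with the standard library alone. This is a standalone sketch, not the code in base.py: the function names are hypothetical, the non-ASCII replacement is assumed to be accent stripping via Unicode decomposition, and dropping empty keywords and preserving input order are also assumptions.

```python
import unicodedata


def normalize_str(value: str) -> str:
    """Strip the ends and collapse every whitespace run to a single space."""
    return " ".join(value.split())


def to_ascii(value: str) -> str:
    """Drop combining marks after NFKD decomposition (assumed replacement rule)."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_keywords(value: str) -> list[str]:
    """Split a comma-delimited string into cleaned, de-duplicated keywords."""
    seen: dict[str, None] = {}  # a dict keeps first-seen order while de-duplicating
    for raw in value.split(","):
        keyword = normalize_str(to_ascii(raw.lower()))
        if keyword:  # assumption: empty entries are dropped
            seen.setdefault(keyword, None)
    return list(seen)


print(split_keywords("Café,  cafe , New\tYork,new  york"))
```

In a real model module you would call the helpers on BaseSuggestion from your validator methods instead of reimplementing this logic.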
5. Implement the class methods
For suggestions created from a row-major CSV, add a @classmethod to Suggestion called row_major_field_map(). It should return a dict that maps from field (column) names in the input CSV to property names in the output JSON.
Otherwise, add a @classmethod to Suggestion called csv_to_suggestions(). It should return an array of suggestions created from the CSV reader that is passed in.
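To make the idea of a field map concrete, the following sketch applies a column-to-property mapping to each row of a CSV. It is illustrative only: FIELD_MAP, rows_to_dicts(), and the column names are hypothetical, KeyError stands in for whatever error the uploader actually raises, and ignoring unmapped columns is an assumption.

```python
import csv
import io

# Hypothetical mapping from CSV column names to output JSON property names.
FIELD_MAP = {"Title": "title", "URL": "url", "Keywords": "keywords"}


def rows_to_dicts(
    reader: csv.DictReader, field_map: dict[str, str]
) -> list[dict[str, str]]:
    """Rename each row's columns to the property names given by field_map."""
    fieldnames = reader.fieldnames or []
    missing = [name for name in field_map if name not in fieldnames]
    if missing:
        raise KeyError(f"CSV is missing expected fields: {missing}")
    # Columns that are not in the map (like "Notes" below) are ignored here.
    return [{prop: row[col] for col, prop in field_map.items()} for row in reader]


csv_text = 'Title,URL,Keywords,Notes\nExample,https://example.com/,"a, b",ignored\n'
converted = rows_to_dicts(csv.DictReader(io.StringIO(csv_text)), FIELD_MAP)
print(converted)
```

The resulting dicts still hold raw strings; in the uploader, the model's validators are what turn them into the final output values.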
6. Add a test
Add a test file to tests/unit/jobs/csv_rs_uploader/. See test_mdn.py as an example. The test should perform a successful upload as well as uploads that fail due to validation errors and missing fields (columns) in the input CSV.
utils.py in the same directory implements helpers that your test should use:
- do_csv_test() - Makes sure the uploader works correctly during a successful upload. It takes either a path to a CSV file or a list[dict] that will be used to create a file object (StringIO) for an in-memory CSV file. Prefer passing in a list[dict] instead of creating a file and passing a path, since it's simpler.
- do_error_test() - Makes sure a given error is raised when expected. Use ValidationError from the pydantic module to check validation errors and MissingFieldError from merino.jobs.csv_rs_uploader to check input CSV that is missing an expected field (column).
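For intuition about the list[dict] option, this sketch shows one way a list of row dicts can become an in-memory CSV file. The helper name rows_to_csv_file() is hypothetical and this is not the implementation in utils.py; it only demonstrates the StringIO idea using the standard library.

```python
import csv
import io


def rows_to_csv_file(rows: list[dict[str, str]]) -> io.StringIO:
    """Write rows to an in-memory CSV file, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)  # rewind so that a reader starts at the header row
    return buffer


# Hypothetical test data; the keys play the role of the CSV header.
rows = [{"Title": "Example", "URL": "https://example.com/"}]
csv_file = rows_to_csv_file(rows)
round_tripped = list(csv.DictReader(csv_file))
print(round_tripped)
```

Because the rows round-trip through a real CSV reader, a test written this way exercises the same parsing path as a file on disk without needing a fixture file.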
7. Run the test
$ MERINO_ENV=testing poetry run pytest tests/unit/jobs/csv_rs_uploader/test_foo.py
See also the main Merino development documentation for running unit tests.
8. Submit a PR
Once your test is passing, submit a PR with your changes so that the new suggestion type is committed to the repo. This step isn't necessary to run the uploader and upload your suggestions, so you can come back to it later.
9. Upload!
See Running the uploader.
Running the uploader
Run the following from the repo's root directory to see documentation for all options and their defaults. Note that the upload command is the only command in the csv-rs-uploader job.
poetry run merino-jobs csv-rs-uploader upload --help
The uploader takes a CSV file as input, so you'll need to download or create one first.
Here's an example that uploads suggestions in foo.csv to the remote settings dev server:
poetry run merino-jobs csv-rs-uploader upload \
--server "https://remote-settings-dev.allizom.org/v1" \
--bucket main-workspace \
--csv-path foo.csv \
--model-name foo \
--record-type foo-suggestions \
--auth "Bearer ..."
Let's break down each command-line option in this example:
- --server - Suggestions will be uploaded to the remote settings dev server
- --bucket - The main-workspace bucket will be used
- --csv-path - The CSV input file is foo.csv
- --model-name - The model module is named foo. Its path within the repo would be merino/jobs/csv_rs_uploader/foo.py
- --record-type - The type in the remote settings records created for these suggestions will be set to foo-suggestions. This argument is optional and defaults to "{model_name}-suggestions"
- --auth - Your authentication header string from the server. To get a header, log in to the server dashboard (don't forget to log in to the Mozilla VPN first) and click the small clipboard icon near the top-right of the page, after the text that shows your username and server URL. The page will show a "Header copied to clipboard" toast notification if successful.
Setting suggestion scores
By default, all uploaded suggestions will have a score property whose value is defined in the remote_settings section of the Merino config. This default can be overridden using --score <number>. The number should be a float between 0 and 1 inclusive.
Other useful options
- --dry-run - Log the output suggestions but don't upload them. The uploader will still authenticate with the server, so --auth must still be given.
Structure of the remote settings data
The uploader uses merino/jobs/utils/chunked_rs_uploader.py to upload the output suggestions. In short, suggestions will be chunked, and each chunk will have a corresponding remote settings record with an attachment. The record's ID will be generated from the --record-type option, and its type will be set to --record-type exactly. The attachment will contain a JSON array of suggestion objects in the chunk.
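The chunking scheme described above can be sketched as follows. This is not the code in chunked_rs_uploader.py: chunk_records() is a hypothetical function, the record ID format shown is an assumption (the text only says the ID is generated from the record type), and a real upload sends the attachment to the server rather than storing it in a dict.

```python
import json


def chunk_records(
    suggestions: list[dict], chunk_size: int, record_type: str
) -> list[dict]:
    """Group suggestions into chunks and describe one record per chunk."""
    records = []
    for start in range(0, len(suggestions), chunk_size):
        chunk = suggestions[start : start + chunk_size]
        end = start + len(chunk)
        records.append(
            {
                # Hypothetical ID scheme derived from the record type.
                "id": f"{record_type}-{start}-{end}",
                # The type is the record type exactly.
                "type": record_type,
                # The attachment is a JSON array of the chunk's suggestions.
                "attachment": json.dumps(chunk),
            }
        )
    return records


suggestions = [{"title": f"Suggestion {i}"} for i in range(5)]
records = chunk_records(suggestions, chunk_size=2, record_type="foo-suggestions")
for record in records:
    print(record["id"], record["type"])
```

With five suggestions and a chunk size of two, the final chunk is shorter than the others, which is why each chunk's end index is computed from its actual length.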