model
Splink model for probabilistic record linkage.
Schema
Property :fields
This property defines how resources are denormalized into MDM tables. An MDM model can use only the fields defined here.
Each key is the name of the field to be defined.
Each value is a path. The path should extract a primitive value from the resource. Each extracted value is converted to a string.
Path syntax
A path is a vector.
Keyword
A keyword specifies the name of the field to follow. For example, path
will extract value
in object
Integer
An integer specifies the array index to follow. For example, path
will extract value
in object
Map
A map can contain only one key-value pair. It tells Aidbox to filter a vector, keeping only the objects that have the specified key with the specified value. For example, path
will extract value
in object
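The three kinds of path elements can be illustrated with a small Python sketch. It is not Aidbox code: keywords are written as plain strings, the `extract` function and the sample patient are hypothetical, and returning `None` when a step cannot be followed is an assumption.

```python
from typing import Any, Optional


def extract(resource: Any, path: list) -> Optional[str]:
    """Follow `path` through nested dicts and lists and return the value as a string."""
    current = resource
    for step in path:
        if isinstance(step, str):
            # Keyword step: follow the named field of an object.
            current = current.get(step) if isinstance(current, dict) else None
        elif isinstance(step, int):
            # Integer step: take the array element at that index.
            in_range = isinstance(current, list) and 0 <= step < len(current)
            current = current[step] if in_range else None
        elif isinstance(step, dict) and len(step) == 1:
            # Map step: keep only the objects that have this key with this value.
            (key, value), = step.items()
            if isinstance(current, list):
                current = [x for x in current
                           if isinstance(x, dict) and x.get(key) == value]
            else:
                current = None
        else:
            raise ValueError(f"unsupported path step: {step!r}")
        if current is None:
            return None  # assumption: a step that cannot be followed yields no value
    return str(current)


# Hypothetical sample resource, loosely shaped like a FHIR Patient.
patient = {
    "name": [{"family": "Smith", "given": ["John"]}],
    "telecom": [
        {"system": "email", "value": "john@example.com"},
        {"system": "phone", "value": "555-0100"},
    ],
}

family = extract(patient, ["name", 0, "family"])
phone = extract(patient, ["telecom", {"system": "phone"}, 0, "value"])
```

In this sketch the map step returns a filtered vector, so an integer step follows it to pick a single element before the final keyword step.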
Property :blocking-conds
Blocking reduces the number of comparisons. A pair of resources is compared only if it satisfies at least one blocking condition.
Choose blocking conditions to be
loose enough so every pair of matching resources falls into some block
tight enough for performance to be acceptable
Blocking conditions are ORed. Essentially, blocking is a fast, rough estimate of which resources match. This estimate is then refined by a slower probabilistic algorithm to produce the actual result.
These conditions are inserted into the WHERE clause of the generated queries as is.
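The effect of ORed blocking conditions can be sketched in Python. This is only an illustration: real blocking conditions are SQL fragments, while here they are written as hypothetical Python predicates over made-up records.

```python
from itertools import combinations

# Made-up records; field names are assumptions for this sketch.
records = [
    {"id": "a", "family": "Smith", "birth_date": "1980-01-01"},
    {"id": "b", "family": "Smith", "birth_date": "1975-05-05"},
    {"id": "c", "family": "Smyth", "birth_date": "1980-01-01"},
    {"id": "d", "family": "Jones", "birth_date": "1990-09-09"},
]

# Two blocking conditions; a pair becomes a candidate if ANY of them holds.
blocking_conds = [
    lambda l, r: l["family"] == r["family"],
    lambda l, r: l["birth_date"] == r["birth_date"],
]


def candidate_pairs(rows, conds):
    """Return id pairs that satisfy at least one blocking condition."""
    return [
        (l["id"], r["id"])
        for l, r in combinations(rows, 2)
        if any(cond(l, r) for cond in conds)
    ]


all_pairs = list(combinations(records, 2))
pairs = candidate_pairs(records, blocking_conds)
```

Of the six possible pairs, only two survive blocking here. Note that the pair with "Smith" and "Smyth" is kept only because the second, looser condition catches it, which is why conditions should be loose enough for true matches to fall into some block.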
Property :use-frequencies-for
This property specifies the fields for which the frequencies of values need to be taken into account.
For example, the surname "Smith-Johnson-Williams" is much rarer than the surname "Smith", so a pair of resources sharing the former surname is more likely to be a match than a pair of resources sharing the latter.
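The intuition can be made concrete with a short Python sketch. It assumes that frequency adjustment replaces the u-probability of an exact match with the relative frequency of the matched value, which is how Splink-style term frequency adjustment is usually described; the exact formula Aidbox uses is not stated here, and the surname counts and the m-probability are made up.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of all values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Made-up population of surnames.
surnames = ["Smith"] * 90 + ["Jones"] * 9 + ["Smith-Johnson-Williams"]
freq = relative_frequencies(surnames)

M_PROB = 0.9  # hypothetical m-probability of the exact-match category


def exact_match_factor(value):
    """Bayes factor m/u, with u taken to be the value's relative frequency (assumption)."""
    return M_PROB / freq[value]


common = exact_match_factor("Smith")
rare = exact_match_factor("Smith-Johnson-Williams")
```

Under this assumption, agreeing on the rare surname carries far more evidence than agreeing on "Smith", because two random records are unlikely to share it by chance.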
Property :comparisons
This property specifies comparison categories. The property is a map in which the keys are field names and the values are comparison definition vectors.
Each comparison is defined using a map with three required key-value pairs and one optional pair:
:cond -- SQL condition for the pair to fall into the category
:m-prob -- m-probability of the comparison category
:u-prob -- u-probability of the comparison category
:use-frequncies -- optional boolean parameter that enables frequency adjustment for this comparison. It applies only to exact equality comparisons.
See the Mathematical details article to learn about m- and u-probabilities.
The SQL condition is inserted into the WHERE clause of the generated queries.
A comparison definition vector is a vector with at least three comparison definitions.
The NULL case is the case when the value is absent in one of the resources of a pair. You need to specify its condition manually and set both probabilities to 1.
The ELSE case is the case when the pair doesn't fall into any of the defined categories. You need to set its condition value to the :else keyword.
All other items are comparison definitions. They should be listed in order of decreasing strictness. For example, the "exact equality" category should come before the "levenshtein distance < 2" category.
In SQL conditions, field names have suffixes: the _l suffix is added for the first record of a pair, and the _r suffix is added for the second record of a pair.
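The ordering rules and the _l/_r suffixes can be tried out with Python's built-in sqlite3 module. This is a sketch under several assumptions: the `pairs` table, the `family` field, and the probabilities are made up; the string `"else"` stands in for the :else keyword; placing the NULL case first and combining the conditions into a CASE expression are choices made for this illustration, not a description of the queries Aidbox generates.

```python
import sqlite3

# Comparison definition vector for a hypothetical `family` field,
# ordered from the NULL case through decreasing strictness to the ELSE case.
comparisons = [
    {"cond": "family_l IS NULL OR family_r IS NULL", "m_prob": 1.0, "u_prob": 1.0},
    {"cond": "family_l = family_r", "m_prob": 0.90, "u_prob": 0.01},
    {"cond": "substr(family_l, 1, 3) = substr(family_r, 1, 3)",
     "m_prob": 0.07, "u_prob": 0.04},
    {"cond": "else", "m_prob": 0.03, "u_prob": 0.95},
]


def category_case_sql(defs):
    """Build a CASE expression that returns the index of the first matching category."""
    branches = []
    else_index = None
    for index, definition in enumerate(defs):
        if definition["cond"] == "else":
            else_index = index
        else:
            branches.append(f"WHEN {definition['cond']} THEN {index}")
    return "CASE " + " ".join(branches) + f" ELSE {else_index} END"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (family_l TEXT, family_r TEXT)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [("Smith", "Smith"), ("Smith", "Smiley"), ("Smith", None), ("Smith", "Jones")],
)

query = f"SELECT {category_case_sql(comparisons)} FROM pairs ORDER BY rowid"
categories = [row[0] for row in conn.execute(query)]
conn.close()
```

Because CASE stops at the first true branch, the stricter exact-equality condition must precede the looser prefix condition; otherwise an exact match would be assigned to the looser category.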
Property :random-math-prob
This property defines the probability that two records picked at random match. See the Mathematical details article for more information.
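As a rough illustration of where this probability enters, the sketch below applies the standard Fellegi-Sunter update: the random match probability is turned into prior odds, multiplied by the m/u ratio of each field's comparison category, and turned back into a probability. Whether Aidbox computes the result exactly this way is an assumption; the function name and the numbers are hypothetical.

```python
def match_probability(random_match_prob, categories):
    """Combine the prior with (m_prob, u_prob) pairs, one per compared field."""
    # Prior odds that a randomly picked pair is a match.
    odds = random_match_prob / (1.0 - random_match_prob)
    for m_prob, u_prob in categories:
        # Each observed category scales the odds by its Bayes factor m/u.
        odds *= m_prob / u_prob
    return odds / (1.0 + odds)


# Two fields fell into categories with these hypothetical (m, u) probabilities.
probability = match_probability(0.001, [(0.9, 0.01), (0.8, 0.05)])

# A NULL category (m = u = 1) leaves the prior unchanged.
unchanged = match_probability(0.001, [(1.0, 1.0)])
```

The second call shows why both probabilities of the NULL case are set to 1: a missing value then contributes no evidence for or against a match.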
Notes
Auto updates of MDM tables
When an MDM model is enabled, Aidbox automatically updates the denormalization table associated with the model. However, you need to call the aidbox.mdm/update-mdm-tables RPC method manually from time to time to keep frequency data in sync.
Auto-update is triggered after any operation that modifies a resource. It updates one row of the denormalization table.
Examples