Matching Model Explanation

This page explains how the MPI matching model works, describing its structure, scoring logic, and configurable elements with an example.

This page provides the matching model code and explains its elements. For an overview of probabilistic matching concepts and match score calculation, see our article Master Patient Index and Record Linkage.

This model is used for patient record matching, but the same approach can be adapted to detect duplicates for any type of resource. If you are interested in applying this approach to your use case, please, contact us.

Core Idea

The model compares selected fields from patient records and evaluates predefined comparison rules. Each rule in the features section contains an expression expr and an associated weight bf (Bayes Factor), indicating how strongly a match or mismatch on that field affects the total score.

All weights are summed into a total score. If the score is above the defined threshold, the record pair is included in the match results; if it is below, it is excluded.

Model Structure

Which fields to compare and how to compare them is described in the example model:

{
    "id": "model",
    "vars": {
        "dob": "(#.resource#>>'{birthDate}')",
        "name": "((#.#family) || ' ' || (#.#given))",
        "given": "(immutable_unaccent_upper(#.resource#>>'{name,0,given,0}'))",
        "family": "(immutable_unaccent_upper(#.resource#>>'{name,0,family}'))",
        "gender": "(#.resource#>>'{gender}')",
        "address": "(#.resource#>>'{address,0,line,0}')",
        "addressLength": "(length(#.resource#>>'{address,0,line,0}'))",
        "telecomArray": "array(select jsonb_array_elements_text(jsonb_path_query_array( #.resource, '$.telecom[*] ? (@.value != \"\").value')))"
    },
    "blocks": {
        "fn": {
            "var": "name"
        },
        "dob": {
            "var": "dob"
        },
        "addr": {
            "sql": "(l.#address % r.#address)"
        }
    },
    "features": {
        "fn": [
            {
                "bf": 0,
                "expr": " ( l.resource->'name' IS NULL OR r.resource->'name' IS NULL )"
            },
            {
                "bf": 13.336495228175629,
                "expr": "l.#name = r.#name"
            },
            {
                "bf": 13.104401641242227,
                "expr": "r.#given = l.#family AND l.#given = r.#family"
            },
            {
                "bf": 9.288385498954133,
                "expr": "levenshtein(l.#name, r.#name) <= 2"
            },
            {
                "bf": 10.36329167966839,
                "expr": "r.#given = l.#given AND string_to_array(l.#family, ' ') && string_to_array(r.#family, ' ')"
            },
            {
                "bf": 10.36329167966839,
                "expr": "r.#family = l.#family AND string_to_array(l.#given, ' ') && string_to_array(r.#given, ' ')"
            },
            {
                "bf": 2.402276401131933,
                "expr": "r.#given = l.#given"
            },
            {
                "else": -12.37233293924643
            }
        ],
        "dob": [
            {
                "bf": 0,
                "expr": " ( l.#dob  IS NULL OR r.#dob IS NULL )"
            },
            {
                "bf": 10.59415069916466,
                "expr": "l.#dob = r.#dob"
            },
            {
                "bf": 3.9911610470417744,
                "expr": "levenshtein(l.#dob, r.#dob) <= 1"
            },
            {
                "bf": 0.5164298695732575,
                "expr": "levenshtein(l.#dob, r.#dob) <= 2"
            },
            {
                "else": -10.322063538772698
            }
        ],
        "ext": [
            {
                "bf": 9.236771286242664,
                "expr": "((l.#telecomArray && r.#telecomArray) AND (((l.#addressLength > r.#addressLength) and (l.#address %>> r.#address)) or ((l.#addressLength <= r.#addressLength) and (l.#address <<% r.#address))))"
            },
            {
                "bf": 7.465648574292063,
                "expr": "(((l.#addressLength > r.#addressLength) and (l.#address %>> r.#address)) or ((l.#addressLength <= r.#addressLength) and (l.#address <<% r.#address)))"
            },
            {
                "bf": 6.465648574292063,
                "expr": "l.#telecomArray && r.#telecomArray"
            },
            {
                "else": -10.517360697819983
            }
        ],
        "sex": [
            {
                "bf": 0,
                "expr": " ( l.#gender IS NULL OR r.#gender IS NULL )"
            },
            {
                "bf": 1.8504082299552485,
                "expr": " l.#gender = r.#gender"
            },
            {
                "else": -4.842034404727677
            }
        ]
    },
    "resource": "Patient",
    "thresholds": {
        "auto": 25,
        "manual": 16
    },
    "resourceType": "AidboxLinkageModel"
}

Variables (vars)

Variables defined in the model can reference resource fields directly or be composed from them using expressions (e.g., concatenating values, applying normalization, or calculating derived values). These variables are used in feature expressions and blocking rules.

  • dob – patient birth date

  • name – concatenation of family and given names

  • given – normalized first name (accents removed, uppercase)

  • family – normalized last name (accents removed, uppercase)

  • gender – gender value

  • address – normalized address line

  • telecomArray – contact information (phone, email)

Comparison Blocks (blocks)

Blocking rules limit the number of candidate record pairs by selecting only those that share key characteristics (e.g., similar names, matching birth dates, or addresses). This reduces the number of comparisons, which significantly speeds up processing, while still preserving potential matches for scoring.

  • fn: blocks by patient name

  • dob: blocks by date of birth

  • addr: blocks by address

Matching Features and Scoring

Features describe how resource fields are compared and how much each comparison influences the overall match score.

Each feature contains:

  • expr – a logical expression that compares values of specific fields or variables between two records.

  • bf (Bayes factor / weight) – a numeric value representing how strongly a match or mismatch on that feature affects the total score.

When records are compared, all satisfied feature expressions add their weights to the total score. If a mismatch is detected, negative weights may be applied. The result is an aggregated score reflecting the likelihood that two records refer to the same entity.

The model uses Levenshtein distance to tolerate typos and small text differences. It counts how many single‑character edits (insertions, deletions, substitutions) are needed to make two strings equal. For example, levenshtein('Jonathan', 'Jonatan') = 1.

Name Matching (fn):

  • Exact match: 13.34 points

  • Swapped first/last names: 13.10 points

  • Levenshtein distance ≤ 2: 9.29 points

  • Partial matches (same first name + matching parts of last name): 10.36 points

  • Same first name only: 2.40 points

  • No match: -12.37 points

"fn": [
    {
        "bf": 0,
        "expr": " ( l.resource->'name' IS NULL OR r.resource->'name' IS NULL )"
    },
    {
        "bf": 13.336495228175629,
        "expr": "l.#name = r.#name"
    },
    {
        "bf": 13.104401641242227,
        "expr": "r.#given = l.#family AND l.#given = r.#family"
    },
    {
        "bf": 9.288385498954133,
        "expr": "levenshtein(l.#name, r.#name) <= 2"
    },
    {
        "bf": 10.36329167966839,
        "expr": "r.#given = l.#given AND string_to_array(l.#family, ' ') && string_to_array(r.#family, ' ')"
    },
    {
        "bf": 10.36329167966839,
        "expr": "r.#family = l.#family AND string_to_array(l.#given, ' ') && string_to_array(r.#given, ' ')"
    },
    {
        "bf": 2.402276401131933,
        "expr": "r.#given = l.#given"
    },
    {
        "else": -12.37233293924643
    }
]

Date of Birth Matching (dob):

  • Exact match: 10.59 points

  • Levenshtein distance ≤ 1: 3.99 points

  • Levenshtein distance ≤ 2: 0.52 points

  • No match: -10.32 points

"dob": [
    {
        "bf": 0,
        "expr": " ( l.#dob  IS NULL OR r.#dob IS NULL )"
    },
    {
        "bf": 10.59415069916466,
        "expr": "l.#dob = r.#dob"
    },
    {
        "bf": 3.9911610470417744,
        "expr": "levenshtein(l.#dob, r.#dob) <= 1"
    },
    {
        "bf": 0.5164298695732575,
        "expr": "levenshtein(l.#dob, r.#dob) <= 2"
    },
    {
        "else": -10.322063538772698
    }
]

Address Matching (ext):

  • Exact address match: 7.47 points

  • Matching contact information: 9.24 points

  • No match: -10.52 points

"ext": [
    {
        "bf": 9.236771286242664,
        "expr": "((l.#telecomArray && r.#telecomArray) AND (((l.#addressLength > r.#addressLength) and (l.#address %>> r.#address)) or ((l.#addressLength <= r.#addressLength) and (l.#address <<% r.#address))))"
    },
    {
        "bf": 7.465648574292063,
        "expr": "(((l.#addressLength > r.#addressLength) and (l.#address %>> r.#address)) or ((l.#addressLength <= r.#addressLength) and (l.#address <<% r.#address)))"
    },
    {
        "bf": 6.465648574292063,
        "expr": "l.#telecomArray && r.#telecomArray"
    },
    {
        "else": -10.517360697819983
    }
]

Gender Matching (sex):

  • Exact match: 1.85 points

  • No match: -4.84 points

"sex": [
    {
        "bf": 0,
        "expr": " ( l.#gender IS NULL OR r.#gender IS NULL )"
    },
    {
        "bf": 1.8504082299552485,
        "expr": " l.#gender = r.#gender"
    },
    {
        "else": -4.842034404727677
    }
]

Thresholds

Thresholds define the decision boundaries for match results. After the total score is calculated based on all feature comparisons, it is compared against threshold values:

  • auto: matching score ≥ 25 → automatic merge can be processed

  • manual: 16 ≤ matching score < 25 → manual review required

  • Below manual – score < 16 → non‑match

Last updated

Was this helpful?