`POST /compose/batch_match/{db_name}`

Performs cross-collection semantic matching. For each row in a source collection, finds the best matching rows in a target collection based on slot embedding similarity. Enables cross-collection joins via the HDC intersection layer.

Request

Content-Type: application/json

URL Parameters:

Parameter	Description
`db_name`	Name of the database containing both collections

Body:

Parameter	Type	Required	Default	Description
`source_collection`	string	✅	—	Namespace of the source collection
`source_slot`	string	✅	—	Field in the source collection to match on
`target_collection`	string	✅	—	Namespace of the target collection
`target_slot`	string	✅	—	Field in the target collection to match against
`top_k_per_source`	int	❌	`2`	Number of top target matches per source row
`similarity_threshold`	float	❌	`0.5`	Minimum similarity score to consider a match
`min_margin`	float	❌	`0.02`	Minimum score gap between top-1 and top-2 required for a confident match
`source_row_ids`	list of int	❌	`null`	Specific source rows to match. If omitted, all rows are used
`target_row_ids`	list of int	❌	`null`	Specific target rows to match against. If omitted, all rows are used

Behavior

Similarity matrix: Neural embeddings are computed for every unique source and target slot value. A full source-by-target similarity matrix is built using batch vector similarity, then the top `top_k_per_source` matches are extracted for each source row.
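
The matrix-then-top-k step can be sketched in plain Python. This is an illustration under stated assumptions, not the server's implementation: the embedding model is not specified here, so the vectors are hand-written toy values, cosine similarity is an assumed metric, and `cosine`, `similarity_matrix`, and `top_k_per_row` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_matrix(source_vecs, target_vecs):
    """One row per source vector, one column per target vector."""
    return [[cosine(s, t) for t in target_vecs] for s in source_vecs]

def top_k_per_row(matrix, k):
    """For each source row, the k best (target_index, score) pairs, best first."""
    return [
        sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)[:k]
        for row in matrix
    ]

# Toy embeddings standing in for real slot embeddings.
source_vecs = [[1.0, 0.0], [0.0, 1.0]]
target_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

matrix = similarity_matrix(source_vecs, target_vecs)
top2 = top_k_per_row(matrix, k=2)
for source_index, candidates in enumerate(top2):
    print(source_index, candidates)
```

Keeping the two best candidates per row, rather than only the best, is what makes a margin check between top-1 and top-2 possible afterwards.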

Margin-based NULL detection: If the gap between the best and second-best match for a source row is below `min_margin`, the match is flagged as ambiguous and, as in the sample response, returned with a `null` `target_id`. This avoids returning overconfident, low-quality joins.
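
The decision rule can be sketched as a small function. The order of the checks is an assumption: the margin test runs first so that the sketch agrees with the sample response, where a row scoring 0.431 (below the default threshold) is still reported as `"ambiguous"`. Treating "best score below `similarity_threshold`" as `"unmatched"` is likewise an assumption, and `classify` is a hypothetical name.

```python
def classify(scores, similarity_threshold=0.5, min_margin=0.02):
    """Classify one source row from its candidate scores, sorted best first.

    Returns (status, best_score, margin). Check order is assumed, not documented.
    """
    if not scores:
        return ("unmatched", None, None)
    best = scores[0]
    # The margin needs at least two candidates (top_k_per_source >= 2).
    margin = round(best - scores[1], 3) if len(scores) > 1 else None
    if margin is not None and margin < min_margin:
        return ("ambiguous", best, margin)
    if best < similarity_threshold:
        return ("unmatched", best, margin)
    return ("matched", best, margin)

print(classify([0.912, 0.769]))  # clear winner above the threshold
print(classify([0.431, 0.419]))  # top two candidates nearly tied
print(classify([0.300, 0.100]))  # decisive gap, but best score is weak
```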

Cross-field matching: Source and target slots can have different names and schemas. Matching is based purely on embedding similarity, which enables joins across heterogeneous collections.

Timing: The response includes a `timing` object with millisecond breakdowns for the embedding (`embed_ms`) and matching (`match_ms`) phases, plus the overall `total_ms`.

Responses

200 OK

{
  "status": "success",
  "results": [
    {
      "source_id": 0,
      "target_id": 14,
      "score": 0.912,
      "margin": 0.143,
      "status": "matched"
    },
    {
      "source_id": 1,
      "target_id": null,
      "score": 0.431,
      "margin": 0.012,
      "status": "ambiguous"
    }
  ],
  "source_count": 10,
  "target_count": 50,
  "timing": {
    "embed_ms": 42.1,
    "match_ms": 8.3,
    "total_ms": 51.7
  }
}

Each result contains:

Field	Description
`source_id`	Row ID from the source collection
`target_id`	Matched row ID from the target collection, or `null` if unmatched
`score`	Similarity score of the best match
`margin`	Score gap between top-1 and top-2 matches
`status`	`"matched"`, `"ambiguous"`, or `"unmatched"`
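
One way to consume the `results` array is to split it by `status`. The sketch below reuses the sample response shown above as a Python dict; `confident_pairs` and `needs_review` are hypothetical helper names, not part of the API.

```python
sample_response = {
    "status": "success",
    "results": [
        {"source_id": 0, "target_id": 14, "score": 0.912, "margin": 0.143, "status": "matched"},
        {"source_id": 1, "target_id": None, "score": 0.431, "margin": 0.012, "status": "ambiguous"},
    ],
    "source_count": 10,
    "target_count": 50,
}

def confident_pairs(response):
    """Map source_id -> target_id for rows the server marked as matched."""
    return {
        row["source_id"]: row["target_id"]
        for row in response["results"]
        if row["status"] == "matched"
    }

def needs_review(response):
    """Source IDs that came back ambiguous or unmatched."""
    return [
        row["source_id"]
        for row in response["results"]
        if row["status"] != "matched"
    ]

print(confident_pairs(sample_response))
print(needs_review(sample_response))
```

Filtering on `status` rather than on `target_id` being non-null keeps the intent explicit if the two ever diverge.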

Error Responses

Status	Condition
`404`	Source or target collection not found
`500`	Unexpected internal error
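
A client may want to turn these status codes into actionable messages. The following is a minimal sketch; the wording of the hints and the `explain_status` helper are illustrative assumptions, and only the two documented codes are mapped.

```python
# Hints for the documented error statuses; the text is illustrative.
ERROR_HINTS = {
    404: "source or target collection not found; check db_name and both collection names",
    500: "unexpected internal error; retry later or report it",
}

def explain_status(status_code):
    """Return 'ok' for 2xx codes, a hint for documented errors, else a generic note."""
    if 200 <= status_code < 300:
        return "ok"
    return ERROR_HINTS.get(status_code, f"undocumented HTTP status {status_code}")

print(explain_status(200))
print(explain_status(404))
print(explain_status(418))
```

With the `requests` library, the status code is available as `exc.response.status_code` on the `requests.HTTPError` raised by `response.raise_for_status()`.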

Notes

Both collections must exist within the same `db_name`.
Raise `similarity_threshold` to reduce false positives; raise `min_margin` to require more decisive matches.
Use `source_row_ids` and `target_row_ids` to scope the match to a subset of rows, which can significantly reduce compute for large collections.
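
Scoping a request to a subset of rows only requires adding the optional ID lists to the body. A minimal sketch, assuming a hypothetical `build_payload` helper that leaves the optional keys out when no IDs are given, so the server falls back to using all rows:

```python
def build_payload(source_collection, source_slot, target_collection, target_slot,
                  source_row_ids=None, target_row_ids=None):
    """Build the request body; row-ID filters are included only when provided."""
    payload = {
        "source_collection": source_collection,
        "source_slot": source_slot,
        "target_collection": target_collection,
        "target_slot": target_slot,
    }
    if source_row_ids is not None:
        payload["source_row_ids"] = list(source_row_ids)
    if target_row_ids is not None:
        payload["target_row_ids"] = list(target_row_ids)
    return payload

# Match only the first three source rows against every target row.
payload = build_payload("expertise", "subject", "employees", "name",
                        source_row_ids=[0, 1, 2])
print(payload)
```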

Example

import requests

SERVER_URL = "http://18.220.128.24:8000"
API_KEY    = "yourapitoken"

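def batch_match(db_name: str, source_collection: str, source_slot: str,
                target_collection: str, target_slot: str) -> dict:
    """POST a batch-match request and return the decoded JSON response."""
    response = requests.post(
        f"{SERVER_URL}/compose/batch_match/{db_name}",
        headers={"X-API-Key": API_KEY},
        # json= serializes the body and sets Content-Type: application/json.
        json={
            "source_collection":  source_collection,
            "source_slot":        source_slot,
            "target_collection":  target_collection,
            "target_slot":        target_slot,
            # The three values below are the documented defaults.
            "top_k_per_source":   2,
            "similarity_threshold": 0.5,
            "min_margin":         0.02,
        },
        timeout=30,  # seconds; without it, requests can wait indefinitely
    )
    # Raises requests.HTTPError on 4xx/5xx responses such as 404 or 500.
    response.raise_for_status()
    return response.json()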


result = batch_match(
    db_name="my_db",
    source_collection="expertise",
    source_slot="subject",
    target_collection="employees",
    target_slot="name",
)
print(result)

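Expected response (shown as JSON; `print(result)` displays the equivalent Python dict, with `None` in place of `null`):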

{
  "status": "success",
  "results": [
    { "source_id": 0, "target_id": 14, "score": 0.912, "margin": 0.143, "status": "matched" },
    { "source_id": 1, "target_id": null, "score": 0.431, "margin": 0.012, "status": "ambiguous" }
  ],
  "source_count": 10,
  "target_count": 50,
  "timing": { "embed_ms": 42.1, "match_ms": 8.3, "total_ms": 51.7 }
}

POST /compose/batch_match/{db_name}