POST /compose/batch_match/{db_name}

Performs cross-collection semantic matching. For each row in a source collection, finds the best matching rows in a target collection based on slot embedding similarity. Enables cross-collection joins via the HDC intersection layer.


Request

Content-Type: application/json

URL Parameters:

| Parameter | Description |
|---|---|
| db_name | Name of the database containing both collections |

Body:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_collection | string | yes | - | Namespace of the source collection |
| source_slot | string | yes | - | Field in the source collection to match on |
| target_collection | string | yes | - | Namespace of the target collection |
| target_slot | string | yes | - | Field in the target collection to match against |
| top_k_per_source | int | no | 2 | Number of top target matches per source row |
| similarity_threshold | float | no | 0.5 | Minimum similarity score to consider a match |
| min_margin | float | no | 0.05 | Minimum score gap between top-1 and top-2 required for a confident match |
| source_row_ids | list of int | no | null | Specific source rows to match. If omitted, all rows are used |
| target_row_ids | list of int | no | null | Specific target rows to match against. If omitted, all rows are used |

Behavior

Similarity matrix — Neural embeddings are computed for all unique source and target slot values. A full similarity matrix is built using batch vector similarity, then the top-k matches per source row are extracted.
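The steps described above can be sketched in plain Python. This is an illustration only, not the server's implementation: it assumes cosine similarity as the metric (the page says only "batch vector similarity"), uses toy 2-D vectors in place of real embeddings, and the names `cosine` and `top_k_matches` are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_matches(source_vecs, target_vecs, k=2):
    """For each source vector, return the k best (target_index, score) pairs."""
    results = []
    for s in source_vecs:
        # One row of the similarity matrix: this source against every target.
        row = [(j, cosine(s, t)) for j, t in enumerate(target_vecs)]
        row.sort(key=lambda pair: pair[1], reverse=True)
        results.append(row[:k])
    return results

# Toy vectors standing in for slot embeddings (assumption for illustration).
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
matches = top_k_matches(source, target, k=2)
for i, row in enumerate(matches):
    print(i, row)
```

A production system would typically compute the whole matrix in one vectorized operation rather than row by row; the loop here is only meant to make the top-k extraction explicit.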

Margin-based NULL detection — If the gap between the best and second-best match for a source row is below min_margin, the match is flagged as ambiguous. This avoids returning overconfident low-quality joins.
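One plausible reading of how similarity_threshold and min_margin combine is sketched below. The order of the checks (margin first, then threshold) is an assumption chosen to agree with the sample response on this page, where a row with score 0.431 and margin 0.012 is reported as "ambiguous"; the page does not specify the exact decision rule. The function name `classify_match` and the handling of a single candidate are likewise hypothetical.

```python
def classify_match(top_scores: list[float],
                   similarity_threshold: float = 0.5,
                   min_margin: float = 0.05) -> dict:
    """Classify one source row from its target scores, sorted best first."""
    if not top_scores:
        return {"score": None, "margin": None, "status": "unmatched"}
    best = top_scores[0]
    # Assumption: with no runner-up, the margin is taken to be the full score.
    margin = best - top_scores[1] if len(top_scores) > 1 else best
    if margin < min_margin:
        status = "ambiguous"      # top-1 and top-2 are too close to call
    elif best < similarity_threshold:
        status = "unmatched"      # decisive, but not similar enough
    else:
        status = "matched"
    return {"score": best, "margin": round(margin, 3), "status": status}

print(classify_match([0.912, 0.769]))  # clear winner above the threshold
print(classify_match([0.431, 0.419]))  # two near-identical candidates
```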

Cross-field matching — Source and target slots can have different names and schemas. Matching is purely based on embedding similarity, enabling joins across heterogeneous collections.

Timing — The response includes a timing object with millisecond breakdowns for embedding and matching phases.


Responses

200 OK

{
  "status": "success",
  "results": [
    {
      "source_id": 0,
      "target_id": 14,
      "score": 0.912,
      "margin": 0.143,
      "status": "matched"
    },
    {
      "source_id": 1,
      "target_id": null,
      "score": 0.431,
      "margin": 0.012,
      "status": "ambiguous"
    }
  ],
  "source_count": 10,
  "target_count": 50,
  "timing": {
    "embed_ms": 42.1,
    "match_ms": 8.3,
    "total_ms": 51.7
  }
}

Each result contains:

| Field | Description |
|---|---|
| source_id | Row ID from the source collection |
| target_id | Matched row ID from the target collection, or null if no match is returned |
| score | Similarity score of the best match |
| margin | Score gap between top-1 and top-2 matches |
| status | "matched", "ambiguous", or "unmatched" |
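As a rough sketch of how a client might consume these fields, the snippet below groups results by status and extracts only the confident pairs. The helper names `group_by_status` and `confident_pairs` are hypothetical, and the sample dictionary simply mirrors the 200 OK response shown on this page.

```python
from collections import defaultdict

def group_by_status(response: dict) -> dict[str, list[dict]]:
    """Bucket result rows by their "status" field."""
    grouped = defaultdict(list)
    for row in response.get("results", []):
        grouped[row["status"]].append(row)
    return dict(grouped)

def confident_pairs(response: dict) -> list[tuple[int, int]]:
    """Return (source_id, target_id) for rows the server marked as matched."""
    return [
        (row["source_id"], row["target_id"])
        for row in response.get("results", [])
        if row["status"] == "matched" and row["target_id"] is not None
    ]

# Sample mirroring the documented response body (JSON null becomes None).
sample = {
    "status": "success",
    "results": [
        {"source_id": 0, "target_id": 14, "score": 0.912, "margin": 0.143, "status": "matched"},
        {"source_id": 1, "target_id": None, "score": 0.431, "margin": 0.012, "status": "ambiguous"},
    ],
}
grouped = group_by_status(sample)
pairs = confident_pairs(sample)
print(pairs)
```

Keeping the ambiguous rows in their own bucket makes it easy to route them to manual review instead of silently dropping them.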

Error Responses

| Status | Condition |
|---|---|
| 404 | Source or target collection not found |
| 500 | Unexpected internal error |
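A client may want to turn these status codes into distinct exceptions so that a mistyped collection name is handled differently from a server fault. The sketch below is one way to do that; the exception classes, the function name `raise_for_batch_match_status`, and the example detail string are all hypothetical, and the format of the server's error body is not documented here.

```python
class CollectionNotFoundError(Exception):
    """Raised for HTTP 404: source or target collection not found."""

class BatchMatchServerError(Exception):
    """Raised for HTTP 5xx: unexpected internal error."""

def raise_for_batch_match_status(status_code: int, detail: str = "") -> None:
    """Map a batch_match HTTP status code to an exception; return None on success."""
    if status_code == 404:
        raise CollectionNotFoundError(detail or "source or target collection not found")
    if status_code >= 500:
        raise BatchMatchServerError(detail or "unexpected internal error")
    if status_code >= 400:
        # Other client errors are not listed on this page; surface them generically.
        raise RuntimeError(f"request failed with status {status_code}: {detail}")

# Illustrative call with a made-up detail message.
try:
    raise_for_batch_match_status(404, "collection 'employees' not found")
except CollectionNotFoundError as exc:
    print(f"Check the collection names: {exc}")
```

With the `requests` library, the function would be called as `raise_for_batch_match_status(response.status_code, response.text)` before reading `response.json()`.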

Notes

  • Both collections must exist within the same db_name.
  • Raise similarity_threshold to reduce false positives; raise min_margin to require more decisive matches.
  • source_row_ids and target_row_ids can be used to scope the match to a subset of rows, which significantly reduces compute for large collections.
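The tuning and scoping options in these notes can be combined in a single request body. The helper below is a minimal sketch: the function name `build_batch_match_body` is hypothetical, the field names and defaults are taken from the Body table, and leaving out the row-ID keys when they are not set relies on the documented behavior that omitted IDs mean all rows.

```python
from typing import Iterable, Optional

def build_batch_match_body(
    source_collection: str,
    source_slot: str,
    target_collection: str,
    target_slot: str,
    *,
    top_k_per_source: int = 2,
    similarity_threshold: float = 0.5,
    min_margin: float = 0.05,
    source_row_ids: Optional[Iterable[int]] = None,
    target_row_ids: Optional[Iterable[int]] = None,
) -> dict:
    """Build the JSON body for POST /compose/batch_match/{db_name}."""
    body = {
        "source_collection": source_collection,
        "source_slot": source_slot,
        "target_collection": target_collection,
        "target_slot": target_slot,
        "top_k_per_source": top_k_per_source,
        "similarity_threshold": similarity_threshold,
        "min_margin": min_margin,
    }
    # Only include row scopes when given, so the server default (all rows) applies otherwise.
    if source_row_ids is not None:
        body["source_row_ids"] = list(source_row_ids)
    if target_row_ids is not None:
        body["target_row_ids"] = list(target_row_ids)
    return body

# Stricter matching, limited to three source rows.
body = build_batch_match_body(
    "expertise", "subject", "employees", "name",
    similarity_threshold=0.7, min_margin=0.1, source_row_ids=[0, 1, 2],
)
print(body)
```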

Example

import json

import requests

SERVER_URL = "http://18.220.128.24:8000"
API_KEY    = "yourapitoken"  # replace with your own API token

def batch_match(db_name: str, source_collection: str, source_slot: str,
                target_collection: str, target_slot: str) -> dict:
    """Match every row of source_collection against target_collection."""
    response = requests.post(
        f"{SERVER_URL}/compose/batch_match/{db_name}",
        headers={"X-API-Key": API_KEY},
        json={
            "source_collection":  source_collection,
            "source_slot":        source_slot,
            "target_collection":  target_collection,
            "target_slot":        target_slot,
            "top_k_per_source":   2,
            "similarity_threshold": 0.5,
            "min_margin":         0.05,
        },
        timeout=30,  # avoid hanging indefinitely if the server is unreachable
    )
    response.raise_for_status()  # raises requests.HTTPError on 404 or 500
    return response.json()


result = batch_match(
    db_name="my_db",
    source_collection="expertise",
    source_slot="subject",
    target_collection="employees",
    target_slot="name",
)
# Print as JSON so that missing matches appear as null rather than Python's None
print(json.dumps(result, indent=2))

Example output (actual IDs, scores, and timings depend on your data):

{
  "status": "success",
  "results": [
    { "source_id": 0, "target_id": 14, "score": 0.912, "margin": 0.143, "status": "matched" },
    { "source_id": 1, "target_id": null, "score": 0.431, "margin": 0.012, "status": "ambiguous" }
  ],
  "source_count": 10,
  "target_count": 50,
  "timing": { "embed_ms": 42.1, "match_ms": 8.3, "total_ms": 51.7 }
}