(Part #7) Product matching via ML: Post-processing
Previous topic: (Part #6) Evaluating ML training results
This part was logically the most difficult to comprehend, at least for me. Let’s give it a try:
- We have 50K products from Set A and 40K products from Set B.
- We have already eliminated the improbable A/B combinations by blocking.
- Each A/B combination that is probable enough has been scored: it has received a matching score between 0 and 1. So, let’s say a product from Set A has 12 potential matches among the products from Set B, each with its own matching score.
- So, which B is it? Should be the one with the highest score, right? Ok, let’s assume that we have taken the combination with the highest score, and that is A1 and B9. That means:
- a. A1 is matched with B9, so A1 cannot be matched with any other B
- b. B9 is matched with A1, so B9 cannot be matched with any other A
- That means that when we start evaluating the next product from Set A (A2), B9 should be left out as a potential match.
- But, can we trust that A1 and B9 are a certain match, and they cannot be used as matches for any other product? If we do trust it but we make a mistake, the result will be a chain reaction of wrong matches – a disaster for both accuracy and sensitivity.
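To make the chain reaction concrete, here is a minimal sketch of the naive approach described above: go through Set A product by product, take the best-scoring B that is still available, and mark it as exhausted. The function name, the data layout, and the scores are made up for illustration; they are not taken from our actual pipeline.

```python
from typing import Dict


def greedy_exclusive_match(candidates: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Naive matching: each A takes its best remaining B, and that B is then exhausted."""
    matched: Dict[str, str] = {}
    exhausted_b = set()
    for a_id, scores in candidates.items():
        # Only B products that no earlier A has claimed are still available.
        available = {b: s for b, s in scores.items() if b not in exhausted_b}
        if not available:
            continue  # every candidate was already taken, so this A stays unmatched
        best_b = max(available, key=available.get)
        matched[a_id] = best_b
        exhausted_b.add(best_b)
    return matched


# Hypothetical scores. Assume the true pairs are A2/B9 and A3/B3.
candidates = {
    "A1": {"B9": 0.81, "B4": 0.40},  # A1/B9 is a wrong but plausible-looking match
    "A2": {"B9": 0.90, "B3": 0.55},
    "A3": {"B3": 0.85},
}
result = greedy_exclusive_match(candidates)
print(result)
```

In this toy example, trusting A1/B9 pushes A2 onto B3, which in turn leaves A3 with no match at all: one wrong decision produces two wrong matches and one missed match.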
We did a lot of experimenting on this one and came up with the following solution:
- If the matching score is above X1 (a configurable threshold), consider it a certain match. Do not consider either this A or this B as a potential match for any other product (consider them exhausted)! The most typical value for X1 is 0.95.
- If the matching score is below X1, take the N best matches (ranked by matching score). Such matches are not certain, so do not consider either A or B exhausted!
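The two rules can be sketched roughly as follows. This is a simplified illustration, not our production code: the names `post_process`, `x1`, and `n_best` are hypothetical, a score exactly equal to X1 is treated as certain (an assumption, since the rule only says "above" and "below"), and certain matches are resolved from the highest score downwards.

```python
from typing import Dict, List, Tuple

ScoredPair = Tuple[str, str, float]  # (a_id, b_id, matching score in [0, 1])


def post_process(
    pairs: List[ScoredPair], x1: float = 0.95, n_best: int = 3
) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, float]]]]:
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    certain: Dict[str, str] = {}
    exhausted_a, exhausted_b = set(), set()

    # Rule 1: scores at or above x1 are certain matches and exhaust both sides.
    for a_id, b_id, score in ranked:
        if score < x1:
            break  # ranked is sorted, so no later pair can reach x1
        if a_id in exhausted_a or b_id in exhausted_b:
            continue
        certain[a_id] = b_id
        exhausted_a.add(a_id)
        exhausted_b.add(b_id)

    # Rule 2: for everything else keep the n_best candidates per A.
    # These are not certain, so nothing is exhausted and a B may appear for several As.
    uncertain: Dict[str, List[Tuple[str, float]]] = {}
    for a_id, b_id, score in ranked:
        if a_id in exhausted_a or b_id in exhausted_b:
            continue
        bucket = uncertain.setdefault(a_id, [])
        if len(bucket) < n_best:
            bucket.append((b_id, score))
    return certain, uncertain


# Hypothetical scored pairs that survived blocking.
pairs = [
    ("A1", "B9", 0.97),
    ("A2", "B9", 0.90),
    ("A2", "B3", 0.80),
    ("A2", "B4", 0.70),
    ("A2", "B5", 0.60),
    ("A3", "B3", 0.75),
]
certain, uncertain = post_process(pairs, x1=0.95, n_best=2)
print(certain)
print(uncertain)
```

Here A1/B9 is accepted as certain, so B9 disappears from A2's candidate list, while B3 stays available to both A2 and A3 because neither of those pairs is certain.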
Yet, this was not enough, so we had to introduce some algorithmic rules, mostly to prevent bad matches from being established.
For example: if both products had the same entity but with different values, the pair was disregarded as a potential match. That meant that the following cases were eliminated:
- Red vs Black
- Red vs Schwarz
- 100g pack vs 250g pack size
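A rule of this kind could look roughly like the sketch below. It assumes each product has already been reduced to a dictionary of extracted entities; the function names and the small `SYNONYMS` table (used so that values in different languages can be compared) are assumptions made for the example, not a description of our actual implementation.

```python
from typing import Dict

# Hypothetical normalization table mapping foreign-language values to one canonical form.
SYNONYMS = {"schwarz": "black", "rot": "red"}


def normalize(value: str) -> str:
    """Lower-case and trim the value, then map known synonyms to a canonical form."""
    cleaned = value.strip().lower()
    return SYNONYMS.get(cleaned, cleaned)


def has_conflicting_entity(entities_a: Dict[str, str], entities_b: Dict[str, str]) -> bool:
    """True if both products carry the same entity but with different values."""
    shared = entities_a.keys() & entities_b.keys()
    return any(normalize(entities_a[k]) != normalize(entities_b[k]) for k in shared)


# The eliminated cases listed above are all conflicts:
print(has_conflicting_entity({"color": "Red"}, {"color": "Black"}))
print(has_conflicting_entity({"color": "Red"}, {"color": "Schwarz"}))
print(has_conflicting_entity({"pack_size": "100g"}, {"pack_size": "250g"}))

# An entity that only one product has is not a conflict, so the pair stays a candidate.
print(has_conflicting_entity({"color": "Red"}, {"pack_size": "250g"}))
```

A pair for which `has_conflicting_entity` returns `True` would be dropped before the threshold logic is applied, regardless of its matching score.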
More information can be found here: