(Part #7) Product matching via ML: Post-processing
Previous topic: (Part #6) Evaluating ML training results
Next topic: (Part #8) Testing on various industries/languages
This part was logically the most difficult to comprehend, at least for me. Let’s give it a try:
- We have 50K products from Set A and 40K products from Set B.
- We have already eliminated the improbable A/B combinations by blocking.
- Each A/B combination that is probable enough has been scored – it has received a matching score between 0 and 1. So, let’s say a product from Set A has 12 potential matches among the products from Set B, each with its own matching score.
- So, which B is it? Should be the one with the highest score, right? Ok, let’s assume that we have taken the combination with the highest score, and that is A1 and B9. That means:
- a. A1 is matched with B9, so A1 cannot be matched with any other B
- b. B9 is matched with A1, so B9 cannot be matched with any other A
- That means that when we start evaluating the next product from Set A (A2), B9 should be left out as a potential match.
- But, can we trust that A1 and B9 are a certain match, and they cannot be used as matches for any other product? If we do trust it but we make a mistake, the result will be a chain reaction of wrong matches – a disaster for both accuracy and sensitivity.
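To make the risk concrete, here is a minimal sketch of the naive "always trust the highest score" strategy described in the list above. The function name `greedy_one_to_one`, the tuple layout `(a, b, score)` and the sample scores are assumptions made for illustration only; they are not taken from the actual Price2Spy implementation.

```python
def greedy_one_to_one(scored_pairs):
    """Naive strategy: trust the best score and exhaust both products."""
    matches = {}
    used_b = set()
    # Walk through all scored pairs, highest matching score first.
    for a, b, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if a in matches or b in used_b:
            continue  # one of the two products is already exhausted
        matches[a] = b
        used_b.add(b)
    return matches


# Invented scores: the wrong pair A1/B9 scores slightly higher than the
# (assumed) true pairs A2/B9 and A1/B3.
scored_pairs = [
    ("A1", "B9", 0.62),
    ("A2", "B9", 0.60),
    ("A1", "B3", 0.58),
    ("A2", "B4", 0.20),
]

result = greedy_one_to_one(scored_pairs)
print(result)
```

In this made-up example, trusting A1/B9 removes B9 from the pool, so A2 ends up paired with its weak 0.20 candidate B4. One wrong decision has produced two wrong matches.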
We have done a lot of experimenting on this one, and came up with the following solution:
- If the matching score is above X1 (a configurable threshold) – consider it a certain match. Do not consider either this A or this B as a potential match for any other product (consider them exhausted)! The most typical value for X1 is 0.95.
- If the matching score is below X1, take the N best matches (according to matching score). Such matches are not certain, so do not consider either A or B to be exhausted!
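The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `post_process`, the default `n_best=3`, the two-pass structure and the way competing certain matches are resolved (highest score wins) are all hypothetical choices, since the text only specifies the X1 threshold and the "N best matches" idea.

```python
from collections import defaultdict


def post_process(scored_pairs, x1=0.95, n_best=3):
    """Split scored pairs into certain matches and N-best candidate lists."""
    ordered = sorted(scored_pairs, key=lambda p: p[2], reverse=True)

    # Pass 1: scores above X1 are certain matches; both sides get exhausted.
    certain = {}
    used_a, used_b = set(), set()
    for a, b, score in ordered:
        if score <= x1:
            break  # the list is sorted, so no later pair is above X1
        if a in used_a or b in used_b:
            continue  # assumption: the higher-scoring certain match wins
        certain[a] = b
        used_a.add(a)
        used_b.add(b)

    # Pass 2: for every remaining A keep up to N best candidates.
    # These are uncertain, so neither A nor B is marked as exhausted.
    candidates = defaultdict(list)
    for a, b, score in ordered:
        if a in used_a or b in used_b:
            continue
        if len(candidates[a]) < n_best:
            candidates[a].append((b, score))

    return certain, dict(candidates)


pairs = [
    ("A1", "B9", 0.97),
    ("A2", "B9", 0.90),
    ("A2", "B4", 0.80),
    ("A2", "B5", 0.70),
    ("A2", "B6", 0.60),
    ("A3", "B4", 0.75),
]
certain, candidates = post_process(pairs, x1=0.95, n_best=2)
print(certain)
print(candidates)
```

In this invented data set, A1/B9 is the only certain match, so B9 disappears from A2's candidates. B4, on the other hand, stays available to both A2 and A3, because an uncertain match exhausts nothing.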
Yet, this was not enough – so we had to introduce some algorithmic rules – mostly to prevent bad matches from being established.
For example: if both products have the same entity (such as color or pack size) but with a different value, disregard the pair as a potential match. That meant that the following cases were eliminated:
- Red vs Black
- Red vs Schwarz (German for black)
- 100g pack vs 250g pack size
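A rule like this might look as follows in code. It is a minimal sketch that assumes the entities (color, pack size and so on) have already been extracted into dictionaries. The `CANONICAL` lookup, the `normalize` helper and the entity names are hypothetical and only serve the illustration.

```python
# Hypothetical lookup mapping translations/synonyms to one canonical value.
CANONICAL = {"schwarz": "black", "rot": "red"}


def normalize(value):
    """Lower-case a value and map known translations to a canonical form."""
    v = value.strip().lower()
    return CANONICAL.get(v, v)


def has_entity_conflict(entities_a, entities_b):
    """Return True if both products share an entity with different values."""
    # Only entities present on both products can contradict each other.
    for name in entities_a.keys() & entities_b.keys():
        if normalize(entities_a[name]) != normalize(entities_b[name]):
            return True
    return False


print(has_entity_conflict({"color": "Red"}, {"color": "Black"}))
print(has_entity_conflict({"color": "Red"}, {"color": "Schwarz"}))
print(has_entity_conflict({"pack_size": "100g"}, {"pack_size": "250g"}))
print(has_entity_conflict({"color": "Black"}, {"color": "Schwarz"}))
```

Normalizing before comparing is what lets the same rule treat "Black" and "Schwarz" as the same color, while still rejecting "Red" against either of them. A pair where only one product carries a given entity is not rejected by this sketch, since there is nothing to contradict.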
More information can be found here:
- Product matching in Price2Spy