(Part #7) Product matching via ML: Post-processing

Best practices in price monitoring, New Price2Spy features 19.6.2020. Reading Time: 2 minutes

Product matching in Price2Spy

Previous topic: (Part #6) Evaluating ML training results

Next topic: (Part #8) Testing on various industries/languages

This part was logically the most difficult to comprehend, at least for me. Let’s give it a try:

  • We have 50K products from Set A and 40K products from Set B.
  • We have already eliminated the improbable A/B combinations by blocking.
  • Each A/B combination that is probable enough has been scored: it has received a matching score between 0 and 1. So, let’s say product A has 12 potential matches among the products from Set B, each with its own matching score.
  • So, which B is it? Should be the one with the highest score, right? Ok, let’s assume that we have taken the combination with the highest score, and that is A1 and B9. That means:
    • a. A1 is matched with B9, so A1 cannot be matched with any other B
    • b. B9 is matched with A1, so B9 cannot be matched with any other A
  • That means when we start evaluating the next product from Set A (A2), B9 should be left out as a potential match
  • But, can we trust that A1 and B9 are a certain match, and they cannot be used as matches for any other product? If we do trust it but we make a mistake, the result will be a chain reaction of wrong matches – a disaster for both accuracy and sensitivity.
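To make the chain-reaction risk concrete, here is a minimal Python sketch of the naive "highest score wins, then both sides are taken" idea described above. The function name, the data layout and the scores are made up for illustration; this is not Price2Spy's actual implementation.

```python
def naive_exclusive_matching(candidates):
    """Naive approach: each A takes its best-scoring B that is still free.

    candidates: dict mapping an A id to a list of (B id, score) pairs.
    The structure and all names here are hypothetical.
    """
    taken_b = set()   # B products already assigned to some A
    matches = {}
    for a_id, scored in candidates.items():
        # Only B products that no earlier A has claimed are still available.
        available = [(b, s) for b, s in scored if b not in taken_b]
        if not available:
            continue
        best_b, best_score = max(available, key=lambda pair: pair[1])
        matches[a_id] = (best_b, best_score)
        taken_b.add(best_b)
    return matches


# Illustrative scores: A1 prefers B9 over B3 by a hair.
candidates = {
    "A1": [("B9", 0.81), ("B3", 0.80)],
    "A2": [("B9", 0.79), ("B5", 0.40)],
}
matches = naive_exclusive_matching(candidates)
print(matches)
```

In this toy example A1 takes B9 with a score only marginally better than B3. If that choice is wrong, A2 (whose best candidate was B9) is pushed to a poor match with B5, so one mistake produces two.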

We have done a lot of experimenting on this one, and came up with the following solution:

  • If the matching score is above X1 (a configurable threshold), consider it a certain match. Do not consider either this A or this B as a potential match for any other product (consider them exhausted)! The most typical value for X1 is 0.95
  • If the matching score is below X1, take the N best matches (according to matching score). Such matches are not certain, so do not consider either A or B exhausted!
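The two rules above could be sketched roughly as follows. This is only an illustration under a few assumptions of mine: the function and variable names are hypothetical, a score exactly equal to X1 is treated as certain, certain matches are resolved from the highest score down, and the value of N (here `n_best`) is arbitrary since no typical value is given.

```python
def post_process(candidates, x1=0.95, n_best=3):
    """Split scored candidates into certain and uncertain matches.

    candidates: dict mapping an A id to a list of (B id, score) pairs.
    x1: threshold for a certain match (0.95 is the typical value).
    n_best: how many uncertain matches to keep per A (assumed value).
    """
    # Pass 1: certain matches, highest score first. Both sides get exhausted.
    pairs = sorted(
        ((score, a, b) for a, scored in candidates.items() for b, score in scored),
        reverse=True,
    )
    certain = {}
    exhausted_a, exhausted_b = set(), set()
    for score, a, b in pairs:
        if score < x1:   # assumption: a score equal to x1 counts as certain
            break
        if a in exhausted_a or b in exhausted_b:
            continue
        certain[a] = (b, score)
        exhausted_a.add(a)
        exhausted_b.add(b)

    # Pass 2: for every other A, keep the N best B's that are not exhausted.
    # These matches are uncertain, so nothing is marked as exhausted here.
    uncertain = {}
    for a, scored in candidates.items():
        if a in exhausted_a:
            continue
        remaining = [(b, s) for b, s in scored if b not in exhausted_b]
        remaining.sort(key=lambda pair: pair[1], reverse=True)
        if remaining:
            uncertain[a] = remaining[:n_best]
    return certain, uncertain


candidates = {
    "A1": [("B9", 0.97), ("B3", 0.60)],
    "A2": [("B9", 0.90), ("B5", 0.70), ("B6", 0.65), ("B7", 0.20)],
}
certain, uncertain = post_process(candidates, x1=0.95, n_best=2)
print(certain)
print(uncertain)
```

Here A1/B9 clears the threshold, so B9 is no longer offered to A2. A2 keeps its two best remaining candidates (B5 and B6), and the same B may still appear as an uncertain candidate for other A products.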

Yet, this was not enough, so we had to introduce some algorithmic rules, mostly to prevent bad matches from being established.

For example: if both products have the same entity (such as color or pack size) but with a different value, disregard the pair as a potential match. That meant that the following cases were eliminated:

  • Red vs Black
  • Red vs Schwarz (German for black)
  • 100g pack vs 250g pack size
ML post-processing
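The entity rule can be illustrated with a small Python sketch. It assumes that entities (color, pack size) have already been extracted into dictionaries. The synonym table, the entity names and the normalization are hypothetical and far simpler than anything a real system would need.

```python
# Hypothetical lookup that maps color words in several languages
# to one canonical value, so "Schwarz" and "Black" compare as equal.
COLOR_SYNONYMS = {
    "red": "red",
    "rot": "red",
    "black": "black",
    "schwarz": "black",
}


def normalize(entity, value):
    """Bring an entity value into a comparable form (assumed rules)."""
    value = value.strip().lower()
    if entity == "color":
        return COLOR_SYNONYMS.get(value, value)
    # For pack sizes, just drop spaces so "250 g" equals "250g".
    return value.replace(" ", "")


def has_entity_conflict(entities_a, entities_b):
    """True if both products carry the same entity with different values."""
    shared = entities_a.keys() & entities_b.keys()
    return any(
        normalize(e, entities_a[e]) != normalize(e, entities_b[e])
        for e in shared
    )


print(has_entity_conflict({"color": "Red"}, {"color": "Black"}))
print(has_entity_conflict({"color": "Red"}, {"color": "Schwarz"}))
print(has_entity_conflict({"pack_size": "100g"}, {"pack_size": "250 g"}))
print(has_entity_conflict({"color": "Black"}, {"color": "Schwarz"}))
```

The first three calls report a conflict, so those pairs would be dropped before any match is established. The last one does not, because both values normalize to the same color. An entity present on only one side is not treated as a conflict in this sketch.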

More information can be found here:

Author

Miša Krunić
Father of 2, Husband of 1, CEO of 3 :-)