(Part #4) Preparing the ML training set

Best practices in price monitoring, New Price2Spy features. 19.6.2020.

Product matching in Price2Spy

Previous topic: (Part #3) For ML experts – why is product matching so difficult?

Next topic: (Part #5) ML training Implementation

This part was particularly tough. When performing product matching, one needs to match Set A (products from Website A) to Set B (products from Website B).

ML training set

However, to avoid biased training, we needed as many different Sets A and Sets B as possible, which means products from as many different websites as possible.

Further, Sets A and Sets B need to vary in size. We cannot expect all the sites we match to have a similar number of products: for example, a client’s website might have 5,000 products, and the client may want them matched against Amazon, where the same product category has more than 500,000 products.

As for the language: to make our model language-independent, we were looking for a non-English language, but in an industry where product naming has many English influences.

The choice of the industry was a bit easier. Going for an industry that relies heavily on MPNs (Manufacturer Part Numbers) in product names, like the automotive industry, would be too easy: the ML model would learn to rely on MPNs too much, and that would not be applicable to other industries. And remember, our solution has to be industry-independent.

Last but not least, as anyone who has dealt with ML knows, size matters. The bigger your training set, the more reliable your ML model gets.

But how do you prepare the training data for an ML model? That’s the toughest part of all: humans need to prepare accurate matches for all combinations of Sets A / Sets B.

Luckily enough, Price2Spy has been in business long enough to acquire more than 650 clients from various parts of the world (various languages) and various industries. However, one industry is our sweet spot, and that is Music. Roughly 25% of our clients are from the music industry, and quite a few are from Germany. Musical gear has a great mixture of product naming variations, as you can see in the examples below:

1. Hagstrom Fantomen Black – naming variations

  • Hagstrom 6-saitige E-Gitarre (FANT-BLK) Schwarz Normale Größe Schwarz
  • Hagstrom Fantomen Black Gloss
  • Hagstrom Fantomen Black E-Gitarre
  • Things to note:
    • a. Black vs Schwarz
    • b. No MPN
    • c. Longest variation has 8 words / 71 characters, while the shortest one is only 3 words / 23 characters long

2. Yamaha Stagepas 600BT – naming variations

  • Yamaha Stagepas 600BT Tragbares PA-System
  • Yamaha STAGEPAS600BT
  • Yamaha Stagepas 600BT Portable PA system
  • Things to note:
    • a. Tragbares vs Portable (and some variations do not contain the Portable keyword at all)
    • b. MPN is there, but in one case it’s written in a non-standard way: STAGEPAS600BT
    • c. Longest variation has 5 words / 41 characters, while the shortest one is only 2 words / 20 characters long
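To make the length differences concrete, here is a minimal Python sketch that counts words and characters for the Yamaha variations listed above. It also shows that a naive normalization (lowercasing and dropping everything except letters and digits) makes the non-standard STAGEPAS600BT spelling line up with the other two names. The helper names `name_stats` and `normalize` are our own illustration, and this is not a description of how Price2Spy's model processes names.

```python
def name_stats(name: str) -> tuple[int, int]:
    """Return (word count, character count) for a product name."""
    return len(name.split()), len(name)


def normalize(name: str) -> str:
    """Lowercase the name and keep only letters and digits (illustrative only)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


variations = [
    "Yamaha Stagepas 600BT Tragbares PA-System",
    "Yamaha STAGEPAS600BT",
    "Yamaha Stagepas 600BT Portable PA system",
]

# Word and character counts per variation.
stats = {v: name_stats(v) for v in variations}
for v, (words, chars) in stats.items():
    print(f"{words} words / {chars} chars: {v}")

# After normalization, all three names start with the same brand + model string.
shared_prefix = normalize("Yamaha Stagepas 600BT")
all_share_model = all(normalize(v).startswith(shared_prefix) for v in variations)
print(all_share_model)
```

A trick like this helps with spacing and casing, but it does nothing for Black vs Schwarz or Tragbares vs Portable, which is part of why simple string rules are not enough here.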

So, here is our training set:

  • 12 major German musical websites
  • Largest website: 51K products
  • Smallest website: 1.7K products
  • A total of 112K products involved
  • A total of 292K matches established by human matching
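To get a feel for what "all combinations of Sets A / Sets B" can mean with 12 websites, the sketch below enumerates every unordered pair of sites. The site labels are hypothetical (the websites are not named here), and treating every pair of sites as one Set A / Set B combination is our assumption, not a statement about how the matching work was actually organized.

```python
from itertools import combinations

# Hypothetical labels standing in for the 12 websites.
sites = [f"site_{i:02d}" for i in range(1, 13)]

# Assumption: each unordered pair of websites is one Set A / Set B combination.
site_pairs = list(combinations(sites, 2))

print(len(site_pairs))  # 66
print(site_pairs[0])    # ('site_01', 'site_02')
```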

And what data was included in our training set? We wanted as much data as possible; however, not all websites shared all the product data fields. Therefore, we had to settle for the 4 fields that were available in all 12 cases:

  • Product name
  • Brand name
  • Product price
  • Product description
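As a rough illustration, one training example built from these four fields could be represented as below. The class names, field types (for instance, price as a float) and the `is_match` label are assumptions made for this sketch, and the price and description values are placeholders, not real data. Only the two product names come from the examples earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """The four fields available on all 12 websites (types are assumed)."""
    name: str
    brand: str
    price: float
    description: str


@dataclass(frozen=True)
class LabeledPair:
    """One human-labeled example: a product from Set A and one from Set B."""
    a: Product
    b: Product
    is_match: bool


# Placeholder price and description values, for illustration only.
pair = LabeledPair(
    a=Product("Hagstrom Fantomen Black Gloss", "Hagstrom", 0.0, "placeholder"),
    b=Product("Hagstrom Fantomen Black E-Gitarre", "Hagstrom", 0.0, "placeholder"),
    is_match=True,
)
print(pair.a.brand == pair.b.brand, pair.is_match)
```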



Miša Krunić
Father of 2, Husband of 1, CEO of 3 :-)