(Part #4) Preparing the ML training set

Written by: Miša Krunić | Best practices in price monitoring, New Price2Spy features | 19.6.2020 | Reading Time: 3 minutes

Previous topic: (Part #3) For ML experts – why is product matching so difficult?

Next topic: (Part #5) ML training Implementation

This part was particularly tough. Product matching means matching Set A (products from Website A) to Set B (products from Website B).

However, to avoid biased training, we needed as many different Sets A and Sets B as possible, drawn from as many different websites as possible.

Further, Sets A and Sets B need to vary in size: we cannot expect all the sites we will be matching to have a similar number of products. For example, a client’s website might have 5,000 products, and the client may want them matched against Amazon, where the same product category has more than 500,000 products.
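To get a feel for what such a size difference means, here is a minimal sketch that counts candidate pairs, assuming a naive approach where every product in Set A is compared with every product in Set B. The function name `candidate_pair_count` is made up for this illustration, and the naive all-pairs comparison is an assumption, not a description of how Price2Spy works.

```python
def candidate_pair_count(size_a: int, size_b: int) -> int:
    """Number of (a, b) pairs if every product in Set A
    is compared with every product in Set B."""
    return size_a * size_b


# Figures from the example above: 5,000 client products
# against 500,000 products in the same category on a marketplace.
pairs = candidate_pair_count(5_000, 500_000)
print(f"{pairs:,} candidate pairs")  # 2,500,000,000 candidate pairs
```

Even for one client and one target site, the number of possible pairs runs into the billions, which is why set sizes have to vary in the training data too.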

As for the language: to make our model language-independent, we were looking for a non-English language, but in an industry where product naming has many English influences.

The choice of the industry was a bit easier. Going for an industry that relies heavily on MPNs (manufacturer part numbers) in product names, like the automotive industry, would be too easy: the ML model would learn to rely on the MPN too much, and that would not carry over to other industries. And remember, our solution has to be industry-independent.

Last but not least, as anyone who has dealt with ML knows, size matters. In general, the bigger your training set, the more reliable your ML model gets.

But how do you prepare the training data for an ML model? That’s the toughest part of all – someone (humans) needs to prepare accurate matches for all combinations of Sets A / Sets B.

Luckily, Price2Spy has been in business long enough to acquire more than 650 clients from various parts of the world (various languages) and various industries. However, one industry is our sweet spot, and that is music. Roughly 25% of our clients are from the music world, and quite a few are from Germany. Musical gear has a great mixture of product naming variations, as you can see in the examples below:

1. Hagstrom Fantomen Black – naming variations

Hagstrom 6-saitige E-Gitarre (FANT-BLK) Schwarz Normale Größe Schwarz
Hagstrom Fantomen Black Gloss
Hagstrom Fantomen Black E-Gitarre
Things to note:
- a. Black vs Schwarz
- b. No MPN
- c. Longest variation has 8 words / 71 characters, while the shortest one is only 3 words, 23 characters long
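To show how little these names have in common at the word level, here is a minimal sketch using token-level Jaccard similarity. Jaccard is our choice for illustration only, not the method used in the actual model; the names are the variations listed above, with spacing normalized.

```python
def tokens(name: str) -> set[str]:
    """Lowercase a product name and split it on whitespace into a set of tokens."""
    return set(name.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


german = "Hagstrom 6-saitige E-Gitarre (FANT-BLK) Schwarz Normale Größe Schwarz"
gloss = "Hagstrom Fantomen Black Gloss"
egitarre = "Hagstrom Fantomen Black E-Gitarre"

# The two short names share three of five distinct tokens.
print(jaccard(gloss, egitarre))
# The long German name shares only the brand with "Black Gloss".
print(jaccard(german, gloss))
```

All three names describe the same guitar, yet the long German listing and the "Black Gloss" listing overlap only on the brand name. A plain word-overlap score would rate them as almost unrelated.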

2. Yamaha Stagepas 600BT – naming variations

Yamaha Stagepas 600BT Tragbares PA-System
Yamaha STAGEPAS600BT
YAMAHA STAGEPAS 600BT 2X340W
Yamaha Stagepas 600BT Portable PA system
Things to note:
- a. Tragbares vs Portable (and some variations do not contain a "portable" keyword at all)
- b. MPN is there, but in one case it’s written in a non-standard way: STAGEPAS600BT
- c. Longest variation has 5 words / 41 characters, while the shortest one is only 2 words, 20 characters long
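The non-standard MPN spelling can be illustrated with a small normalization sketch. It assumes we already know the model key ("Stagepas 600BT") and only want to check whether it appears once case and spacing are ignored. The helper `compact` and the hand-picked key are hypothetical; in real matching nobody tells you the key in advance, which is exactly what makes the problem hard.

```python
import re


def compact(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits,
    so that spacing, case and punctuation no longer matter."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


variations = [
    "Yamaha Stagepas 600BT Tragbares PA-System",
    "Yamaha STAGEPAS600BT",
    "YAMAHA STAGEPAS 600BT 2X340W",
    "Yamaha Stagepas 600BT Portable PA system",
]

# Hand-picked key, for illustration only.
model_key = compact("Stagepas 600BT")

for name in variations:
    # Report whether the compacted key occurs in the compacted name.
    print(f"{name!r}: {model_key in compact(name)}")
```

With this normalization, all four variations contain the same compacted key, including the one written as STAGEPAS600BT. It does nothing for the Tragbares vs Portable difference, though.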

So, this was our training set:

- 12 major German musical websites
- Largest website: 51K products
- Smallest website: 1.7K products
- A total of 112K products involved
- A total of 292K matches established by human matching
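As a rough sketch of how many Set A / Set B combinations 12 websites can produce, the snippet below counts unordered website pairs. It assumes every website is paired with every other one and that order does not matter; that is our reading for illustration and is not stated above. The site labels are placeholders, since the websites are not named here.

```python
from itertools import combinations

# Placeholder labels for the 12 websites (the real sites are not named).
sites = [f"site_{i:02d}" for i in range(1, 13)]

# Every unordered pair of distinct websites is one Set A / Set B combination.
site_pairs = list(combinations(sites, 2))
print(len(site_pairs))  # 66
```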

And what data was included in our training set? We wanted as much data as possible; however, not all websites shared all the product data fields. Therefore, we had to go with the 4 fields that were available in all 12 cases:

- Product name
- Brand name
- Product price
- Product description
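A training record built from these four fields could look something like the sketch below. The class names `Product` and `LabeledPair`, the field types, and the price and description values are all assumptions made up for this illustration; they do not show the actual Price2Spy data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """One product listing, limited to the four fields shared by all sites."""
    name: str
    brand: str
    price: float
    description: str


@dataclass(frozen=True)
class LabeledPair:
    """Two listings from different websites plus the human-assigned label."""
    a: Product
    b: Product
    is_match: bool


# Names come from the Yamaha example; price and description are placeholders.
a = Product("Yamaha Stagepas 600BT Tragbares PA-System", "Yamaha", 1.0, "placeholder")
b = Product("Yamaha STAGEPAS600BT", "Yamaha", 1.0, "placeholder")

pair = LabeledPair(a, b, is_match=True)
print(pair.a.name, "<->", pair.b.name, "| match:", pair.is_match)
```

Keeping the record to the fields every site provides means no training example has to be dropped or padded because a field is missing.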

More information on the following links:

Product matching in Price2Spy


Author

Miša Krunić

Father of 2, Husband of 1, CEO of 3 :-)
