(Part #4) Preparing the ML training set

Written by: Miša Krunić Best practices in price monitoring, New Price2Spy features 19.6.2020. Reading Time: 3 minutes

Previous topic: (Part #3) For ML experts – why is product matching so difficult?

Next topic: (Part #5) ML training Implementation

This part was particularly tough. Product matching means matching Set A (products from Website A) to Set B (products from Website B).

However, to avoid biased training, we needed as many different Sets A and Sets B as possible, which means collecting them from as many different websites as possible.

Further, Sets A and Sets B need to vary in size. We cannot expect all the sites we will be matching to have a similar number of products. For example, a client's website might have 5,000 products, and the client may want them matched on Amazon, where the same product category has more than 500,000 products.
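To put that size gap in numbers, here is a minimal sketch of the arithmetic. It assumes a naive approach in which every product from Set A is compared with every product from Set B. That brute-force assumption is only for illustration and is not a description of how Price2Spy selects candidates.

```python
# Set sizes taken from the example above.
client_products = 5_000          # Set A: the client's website
marketplace_products = 500_000   # Set B: the same category on Amazon

# Assumption for illustration: every A product is compared with every B product.
candidate_pairs = client_products * marketplace_products

# How many times larger Set B is than Set A.
size_ratio = marketplace_products // client_products

print(f"{candidate_pairs:,} candidate pairs, Set B is {size_ratio}x larger")
```

Under that assumption, a training set built only from similarly sized sites would never expose the model to this kind of imbalance.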

As for the language: to make our model language-independent, we were looking for a non-English language, but in an industry where product naming has many English influences.

The choice of the industry was a bit easier. Going for an industry that makes heavy use of MPNs (manufacturer part numbers) in product names, like the automotive industry, would be too easy. The ML model would learn to rely on the MPN too much, and that would not be applicable to other industries. And remember, our solution has to be industry-independent.

Last but not least, as anyone who has been dealing with ML knows, size matters. The bigger your training set, the more reliable your ML model gets.

But how do you prepare the training data for an ML model? That's the toughest part of all: humans need to prepare accurate matches for all combinations of Sets A and Sets B.

Luckily, Price2Spy has been in business long enough to acquire more than 650 clients from various parts of the world (various languages) and various industries. However, one industry is our sweet spot, and that is Music. Roughly 25% of our clients are from the music world, and quite a few are from Germany. Musical gear has a great mixture of product naming variations, as you can see in the examples below:

1. Hagstrom Fantomen Black – naming variations

- Hagstrom 6-saitige E-Gitarre (FANT-BLK) Schwarz Normale Größe Schwarz
- Hagstrom Fantomen Black Gloss
- Hagstrom Fantomen Black E-Gitarre

Things to note:
- a. Black vs Schwarz
- b. No MPN
- c. The longest variation has 8 words / 71 characters, while the shortest one is only 3 words / 23 characters long

2. Yamaha Stagepas 600BT – naming variations

- Yamaha Stagepas 600BT Tragbares PA-System
- Yamaha STAGEPAS600BT
- YAMAHA STAGEPAS 600BT 2X340W
- Yamaha Stagepas 600BT Portable PA system

Things to note:
- a. Tragbares vs Portable (and some variations do not contain the Portable keyword at all)
- b. The MPN is there, but in one case it's written in a non-standard way: STAGEPAS600BT
- c. The longest variation has 5 words / 41 characters, while the shortest one is only 2 words / 20 characters long
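The Yamaha variations differ in length, casing and spacing. The Python sketch below shows one way to measure those differences and to see that the model string survives once case and spacing are ignored. The helper names `name_stats` and `compact_key` are hypothetical, and this is only an illustration of the naming problem, not the matching logic used in Price2Spy.

```python
def name_stats(name: str) -> tuple[int, int]:
    """Return (word count, character count) for a product name."""
    return len(name.split()), len(name)


def compact_key(name: str) -> str:
    """Case-fold the name and keep only letters and digits (no spaces or punctuation)."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


# The four naming variations listed above.
variations = [
    "Yamaha Stagepas 600BT Tragbares PA-System",
    "Yamaha STAGEPAS600BT",
    "YAMAHA STAGEPAS 600BT 2X340W",
    "Yamaha Stagepas 600BT Portable PA system",
]

# Word and character counts per variation.
stats = {name: name_stats(name) for name in variations}

# Check whether every variation contains the same compact model string.
shares_model = all("stagepas600bt" in compact_key(name) for name in variations)

for name, (words, chars) in stats.items():
    print(f"{words} words / {chars} chars: {name}")
print("All variations share the compact model string:", shares_model)
```

A simple normalization like this handles the Yamaha case, but it would not help with the Hagstrom example, where the difference is Black vs Schwarz rather than spacing. That is part of why a trained model is needed at all.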

So, here was our training set:

- 12 major German musical websites
- Largest website: 51K products
- Smallest website: 1.7K products
- A total of 112K products involved
- A total of 292K matches established by human matching

And what data was included in our training set? We wanted as much data as possible. However, not all websites shared all the product data fields. Therefore, we had to go for the 4 fields which were available in all 12 cases:

- Product name
- Brand name
- Product price
- Product description
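As a rough sketch, the four shared fields and a human-made match could be represented like this in Python. The class names, the `site` field, and the `is_match` flag are assumptions for illustration, since the storage format is not described here. The prices and descriptions are placeholders, not real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductRecord:
    site: str          # hypothetical: which website the product came from
    name: str          # Product name
    brand: str         # Brand name
    price: float       # Product price
    description: str   # Product description


@dataclass(frozen=True)
class LabeledPair:
    product_a: ProductRecord   # product from Set A
    product_b: ProductRecord   # product from Set B
    is_match: bool             # label assigned by a human matcher


# Names are taken from the Hagstrom example above; other values are placeholders.
a = ProductRecord(
    site="site-a.example",
    name="Hagstrom Fantomen Black Gloss",
    brand="Hagstrom",
    price=0.0,
    description="placeholder description",
)
b = ProductRecord(
    site="site-b.example",
    name="Hagstrom Fantomen Black E-Gitarre",
    brand="Hagstrom",
    price=0.0,
    description="placeholder description",
)

pair = LabeledPair(product_a=a, product_b=b, is_match=True)
print(pair.product_a.name, "<->", pair.product_b.name, "| match:", pair.is_match)
```

Keeping only the fields that every site provides means each record has the same shape, regardless of which of the 12 websites it came from.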

More information on the following links:

Product matching in Price2Spy
