{"id":7294,"date":"2020-06-19T11:25:42","date_gmt":"2020-06-19T11:25:42","guid":{"rendered":"https:\/\/www.price2spy.com\/blog\/?p=7294"},"modified":"2020-07-28T10:49:04","modified_gmt":"2020-07-28T10:49:04","slug":"part-4-preparing-the-ml-training-set","status":"publish","type":"post","link":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/","title":{"rendered":"(Part #4) Preparing the ML training set"},"content":{"rendered":"\n<p> <a rel=\"noreferrer noopener\" href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\" target=\"_blank\">Product matching in Price2Spy<\/a>  <\/p>\n\n\n\n<p> <strong>Previous topic:<\/strong> <a rel=\"noreferrer noopener\" href=\"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/\" target=\"_blank\">(Part #3) For ML experts \u2013 why is product matching so difficult? <\/a> <\/p>\n\n\n\n<p> <strong>Next topic:<\/strong> <a href=\"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"(Part #5) ML training Implementation  (opens in a new tab)\">(Part #5) ML training Implementation <\/a> <\/p>\n\n\n\n<p>This part was particularly tough. 
When performing <a href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\">product matching<\/a>, one needs to match Set A (products from Website A) to Set B (products from Website B).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"640\" src=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg\" alt=\"ML training set\" class=\"wp-image-7295\" srcset=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg 640w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640-300x300.jpg 300w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640-1024x1024.jpg 1024w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640-150x150.jpg 150w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640-768x768.jpg 768w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640-1536x1536.jpg 1536w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640-2048x2048.jpg 2048w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/figure><\/div>\n\n\n\n<p>However, to avoid biased training, we needed as many different Sets A and Sets B as possible, drawn from as many different websites as possible.<\/p>\n\n\n\n<p>Further, Sets A and Sets B need to vary in size \u2013 we cannot expect all the sites we will be matching to have a similar number of products (for example, a client\u2019s website might have 5,000 products, and the client may want them matched against Amazon, where the same product category has more than 500,000 products).<\/p>\n\n\n\n<p>As for the language \u2013 to make our model language-independent, we looked for a non-English language, but in an industry where product naming has many English 
influences.<\/p>\n\n\n\n<p>The choice of the industry was a bit easier. Going for an industry that relies heavily on MPNs (manufacturer part numbers) in product names, like the automotive industry, would be too easy \u2013 the ML model would learn to rely on the MPN too much, and that would not carry over to other industries. And remember \u2013 our solution has to be industry-independent.<\/p>\n\n\n\n<p>Last but not least \u2013 anyone who has worked with ML knows that size matters. The bigger your training set, the more reliable your ML model gets.<\/p>\n\n\n\n<p>But how do you prepare the training data for an ML model? That\u2019s the toughest part of all \u2013 humans need to prepare accurate matches for all combinations of Sets A \/ Sets B.<\/p>\n\n\n\n<p>Luckily enough, <a href=\"https:\/\/www.price2spy.com\/\">Price2Spy<\/a> has been in business long enough to acquire more than 650 clients from various parts of the world (various languages) and various industries. However, one industry is our sweet spot \u2013 and that is music. Roughly 25% of our clients are from the music industry, and quite a few are from Germany. Musical gear has a great mixture of product naming variations, as you can see in the examples below:<\/p>\n\n\n\n<p>1. <strong>Hagstrom Fantomen Black<\/strong> \u2013 naming variations<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Hagstrom 6-saitige E-Gitarre (FANT-BLK) Schwarz Normale Gr\u00f6\u00dfe Schwarz<\/li><li>Hagstrom Fantomen Black Gloss<\/li><li>Hagstrom Fantomen Black E-Gitarre<\/li><li>Things to note:<ul><li>a. Black vs Schwarz<\/li><li> b. No MPN <\/li><li> c. 
Longest variation has 8 words \/ 71 characters, while the shortest one is only 3 words \/ 23 characters long <\/li><\/ul><\/li><\/ul>\n\n\n\n<p>2. <strong>Yamaha Stagepas 600BT<\/strong> \u2013 naming variations<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Yamaha Stagepas 600BT Tragbares PA-System<\/li><li>Yamaha STAGEPAS600BT<\/li><li>YAMAHA STAGEPAS 600BT 2X340W<\/li><li>Yamaha Stagepas 600BT Portable PA system<\/li><li>Things to note:<ul><li>a. Tragbares vs Portable (and some variations do not contain the Portable keyword at all)<\/li><li> b. MPN is there, but in one case it\u2019s written in a non-standard way: STAGEPAS600BT <\/li><li> c. Longest variation has 5 words \/ 41 characters, while the shortest one is only 2 words \/ 20 characters long <\/li><\/ul><\/li><\/ul>\n\n\n\n<p>So, here was our training set:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>12 major German musical websites<\/li><li>Largest website: 51K products<\/li><li>Smallest website: 1.7K products<\/li><li>A total of 112K products involved<\/li><li>A total of 292K matches established by human matching<\/li><\/ul>\n\n\n\n<p>And what data was included in our training set? We wanted as much data as possible; however, not all websites shared all the product data fields. Therefore, we had to settle for the 4 fields that were available in all 12 cases:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Product name<\/li><li>Brand name<\/li><li>Product price<\/li><li>Product description<\/li><\/ul>\n\n\n\n<p><strong>More information on the following links:<\/strong><\/p>\n\n\n\n<p> <a rel=\"noreferrer noopener\" aria-label=\"Product matching in Price2Spy (opens in a new tab)\" href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\" target=\"_blank\">Product matching in Price2Spy<\/a> <\/p>\n\n\n\n<p> <strong>Previous topic:<\/strong> <a rel=\"noreferrer noopener\" aria-label=\"(Part #3) For ML experts \u2013 why is product matching so difficult?  
(opens in a new tab)\" href=\"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/\" target=\"_blank\">(Part #3) For ML experts \u2013 why is product matching so difficult? <\/a><\/p>\n\n\n\n<p><strong>Next topic:<\/strong> <a href=\"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"(Part #5) ML training Implementation  (opens in a new tab)\">(Part #5) ML training Implementation <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Product matching in Price2Spy Previous topic: (Part #3) For ML experts \u2013 why is product matching so difficult? Next topic: (Part #5) ML training Implementation This part was particularly tough. When performing the product matching, one would need to match Set A (Products from Website&#8230;<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[108,167],"tags":[],"class_list":["post-7294","post","type-post","status-publish","format-standard","hentry","category-best-practices","category-new-price2spy-features"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>(Part #4) Preparing the ML training set<\/title>\n<meta name=\"description\" content=\"In this post, we&#039;re explaining the preparation of the ML training set\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"(Part #4) Preparing the ML training set\" \/>\n<meta property=\"og:description\" content=\"In this 
post, we&#039;re explaining the preparation of the ML training set\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/\" \/>\n<meta property=\"og:site_name\" content=\"Price2Spy\u00ae Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Price2Spy\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-06-19T11:25:42+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-07-28T10:49:04+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg\" \/>\n<meta name=\"author\" content=\"Mi\u0161a Kruni\u0107\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Price2Spy\" \/>\n<meta name=\"twitter:site\" content=\"@Price2Spy\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mi\u0161a Kruni\u0107\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"(Part #4) Preparing the ML training set","description":"In this post, we're explaining the preparation of the ML training set","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/","og_locale":"en_US","og_type":"article","og_title":"(Part #4) Preparing the ML training set","og_description":"In this post, we're explaining the preparation of the ML training set","og_url":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/","og_site_name":"Price2Spy\u00ae Blog","article_publisher":"https:\/\/www.facebook.com\/Price2Spy\/","article_published_time":"2020-06-19T11:25:42+00:00","article_modified_time":"2020-07-28T10:49:04+00:00","og_image":[{"url":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg","type":"","width":"","height":""}],"author":"Mi\u0161a Kruni\u0107","twitter_card":"summary_large_image","twitter_creator":"@Price2Spy","twitter_site":"@Price2Spy","twitter_misc":{"Written by":"Mi\u0161a Kruni\u0107","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#article","isPartOf":{"@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/"},"author":{"name":"Mi\u0161a Kruni\u0107","@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c"},"headline":"(Part #4) Preparing the ML training set","datePublished":"2020-06-19T11:25:42+00:00","dateModified":"2020-07-28T10:49:04+00:00","mainEntityOfPage":{"@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/"},"wordCount":581,"commentCount":0,"image":{"@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#primaryimage"},"thumbnailUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg","articleSection":["Best practices in price monitoring","New Price2Spy features"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/","url":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/","name":"(Part #4) Preparing the ML training 
set","isPartOf":{"@id":"https:\/\/www.price2spy.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#primaryimage"},"image":{"@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#primaryimage"},"thumbnailUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg","datePublished":"2020-06-19T11:25:42+00:00","dateModified":"2020-07-28T10:49:04+00:00","author":{"@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c"},"description":"In this post, we're explaining the preparation of the ML training set","breadcrumb":{"@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#primaryimage","url":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg","contentUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/sport-1013733_640.jpg","width":640,"height":640},{"@type":"BreadcrumbList","@id":"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.price2spy.com\/blog\/"},{"@type":"ListItem","position":2,"name":"(Part #4) Preparing the ML training set"}]},{"@type":"WebSite","@id":"https:\/\/www.price2spy.com\/blog\/#website","url":"https:\/\/www.price2spy.com\/blog\/","name":"Price2Spy\u00ae 
Blog","description":"Price2Spy\u00ae","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.price2spy.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c","name":"Mi\u0161a Kruni\u0107","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","caption":"Mi\u0161a Kruni\u0107"},"description":"Father of 2, Husband of 1, CEO of 3 :-)","sameAs":["http:\/\/www.price2spy.com"],"url":"https:\/\/www.price2spy.com\/blog\/author\/misha\/"}]}},"_links":{"self":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7294","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/comments?post=7294"}],"version-history":[{"count":5,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7294\/revisions"}],"predecessor-version":[{"id":7414,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7294\/revisions\/7414"}],"wp:attachment":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/media?parent=7294"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href
":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/categories?post=7294"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/tags?post=7294"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}