{"id":7292,"date":"2020-06-19T11:11:06","date_gmt":"2020-06-19T11:11:06","guid":{"rendered":"https:\/\/www.price2spy.com\/blog\/?p=7292"},"modified":"2020-07-28T10:47:14","modified_gmt":"2020-07-28T10:47:14","slug":"part-3-for-ml-experts-why-is-product-matching-so-difficult","status":"publish","type":"post","link":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/","title":{"rendered":"(Part #3) For ML experts &#8211; why is product matching so difficult?"},"content":{"rendered":"\n<ul class=\"wp-block-list\"><li> <a rel=\"noreferrer noopener\" href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\" target=\"_blank\">Product matching in Price2Spy<\/a> <\/li><li> <strong>Previous topic:<\/strong>  <a rel=\"noreferrer noopener\" href=\"https:\/\/www.price2spy.com\/blog\/part-2-product-matching-via-machine-learning-important-decisions-to-be-made\/\" target=\"_blank\">(Part #2) Product matching via Machine Learning &#8211; Important decisions to be made<\/a> <\/li><li> <strong>Next topic: <\/strong> <a href=\"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"(Part #4) Preparing the ML training set  (opens in a new tab)\">(Part #4) Preparing the ML training set <\/a><\/li><\/ul>\n\n\n\n<p>This chapter is of a technical nature, and it explains the difficulties <a href=\"https:\/\/www.price2spy.com\/\">Price2Spy\u2019s<\/a> team had to overcome when building the ML model for <a href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\">product matching<\/a>.<\/p>\n\n\n\n<p><strong>1. Computation size<\/strong> \u2013 we\u2019re talking about comparing Set A (expected size varies from 10K to 100K products, so let\u2019s say 50K) to Set B (let\u2019s say that its expected size is slightly less \u2013 40K products). 
This brings us to 50K x 40K = 2 billion potential matching combinations that need to be scored.<\/p>\n\n\n\n<p><strong>2. Diverse training data sources<\/strong> (websites from different languages, industries, product assortments, and product naming conventions)<\/p>\n\n\n\n<p><strong>3. Hugely unbalanced positive and negative labels in the training set <\/strong>(positive labels are the pairs where a match does exist \u2013 in the example given in point 1, there can be a maximum of 40K matches), which means:<\/p>\n\n\n\n<p>         a. Positive labels = 0.002% of the training set<\/p>\n\n\n\n<p>         b. Negative labels = 99.998% of the training set<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"314\" src=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png\" alt=\"ML matches\" class=\"wp-image-7329\" srcset=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png 600w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1-768x401.png 768w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/figure><\/div>\n\n\n\n<p><strong>4. After matches get scored, complex post-processing is needed to determine the best matching candidates<\/strong> (the full matching problem on a bipartite graph)<\/p>\n\n\n\n<p><strong>5. Label noise<\/strong> \u2013 matches supplied in the training set are not 100% accurate:<\/p>\n\n\n\n<p><strong>a.<\/strong> A moderate number of matches were missed \u2013 simply because not all sites\/product categories were in scope for manual product matching, which was the source of the training set<\/p>\n\n\n\n<p><strong>b.<\/strong> A very small portion of matches was wrong (due to human error)<\/p>\n\n\n\n<p><strong>6. Data duplication in the training set<\/strong> \u2013 because websites can have products listed in multiple categories, with multiple product URLs. 
Let&#8217;s suppose that Set A lists one product twice, and Set B lists the same product three times \u2013 this leads to potentially 6 identical matches (2 x 3) in the training set, which will be very misleading for the ML algorithm (this comes down to the entity resolution problem)<\/p>\n\n\n\n<p><strong>7. Difficult to evaluate<\/strong> \u2013 we used precision-recall curves to evaluate model performance. However, due to label noise, we also had to evaluate the results manually (as the ML model was often detecting matches which were missing from the training set)<\/p>\n\n\n\n<p><strong>8. Python vs Java<\/strong> \u2013 while our external consultant was working in Python, we had to translate all the code into Java (Price2Spy\u2019s standard technology)<\/p>\n\n\n\n<p><strong>For more information:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li> <a rel=\"noreferrer noopener\" aria-label=\"Product matching in Price2Spy (opens in a new tab)\" href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\" target=\"_blank\">Product matching in Price2Spy<\/a> <\/li><li> <strong>Previous topic:<\/strong>  <a rel=\"noreferrer noopener\" aria-label=\"(Part #2) Product matching via Machine Learning - Important decisions to be made (opens in a new tab)\" href=\"https:\/\/www.price2spy.com\/blog\/part-2-product-matching-via-machine-learning-important-decisions-to-be-made\/\" target=\"_blank\">(Part #2) Product matching via Machine Learning &#8211; Important decisions to be made<\/a> <\/li><li> <strong>Next topic: <\/strong> <a href=\"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"(Part #4) Preparing the ML training set (opens in a new tab)\">(Part #4) Preparing the ML training set<\/a> <\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Product matching in Price2Spy Previous topic: (Part #2) Product matching via Machine Learning &#8211; Important decisions to be made Next topic: (Part #4) Preparing the ML training 
set This chapter is of technical nature, and it explains the difficulties Price2Spy\u2019s team had to overcome when&#8230;<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[108,167],"tags":[190,645,646,15,81],"class_list":["post-7292","post","type-post","status-publish","format-standard","hentry","category-best-practices","category-new-price2spy-features","tag-ecommerce","tag-machine-learning","tag-ml","tag-price2spy","tag-product-matching"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>#3 For ML experts - why is product matching so difficult?<\/title>\n<meta name=\"description\" content=\"This chapter is of technical nature, and it explains the difficulties Price2Spy\u2019s team had to overcome when building the ML model for product matching.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"#3 For ML experts - why is product matching so difficult?\" \/>\n<meta property=\"og:description\" content=\"This chapter is of technical nature, and it explains the difficulties Price2Spy\u2019s team had to overcome when building the ML model for product matching.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/\" \/>\n<meta property=\"og:site_name\" content=\"Price2Spy\u00ae Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Price2Spy\/\" \/>\n<meta 
property=\"article:published_time\" content=\"2020-06-19T11:11:06+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-07-28T10:47:14+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png\" \/>\n<meta name=\"author\" content=\"Mi\u0161a Kruni\u0107\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Price2Spy\" \/>\n<meta name=\"twitter:site\" content=\"@Price2Spy\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mi\u0161a Kruni\u0107\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"#3 For ML experts - why is product matching so difficult?","description":"This chapter is of technical nature, and it explains the difficulties Price2Spy\u2019s team had to overcome when building the ML model for product matching.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/","og_locale":"en_US","og_type":"article","og_title":"#3 For ML experts - why is product matching so difficult?","og_description":"This chapter is of technical nature, and it explains the difficulties Price2Spy\u2019s team had to overcome when building the ML model for product matching.","og_url":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/","og_site_name":"Price2Spy\u00ae 
Blog","article_publisher":"https:\/\/www.facebook.com\/Price2Spy\/","article_published_time":"2020-06-19T11:11:06+00:00","article_modified_time":"2020-07-28T10:47:14+00:00","og_image":[{"url":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png","type":"","width":"","height":""}],"author":"Mi\u0161a Kruni\u0107","twitter_card":"summary_large_image","twitter_creator":"@Price2Spy","twitter_site":"@Price2Spy","twitter_misc":{"Written by":"Mi\u0161a Kruni\u0107","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#article","isPartOf":{"@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/"},"author":{"name":"Mi\u0161a Kruni\u0107","@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c"},"headline":"(Part #3) For ML experts &#8211; why is product matching so difficult?","datePublished":"2020-06-19T11:11:06+00:00","dateModified":"2020-07-28T10:47:14+00:00","mainEntityOfPage":{"@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/"},"wordCount":414,"commentCount":0,"image":{"@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#primaryimage"},"thumbnailUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png","keywords":["ecommerce","machine learning","ml","price2spy","product matching"],"articleSection":["Best practices in price monitoring","New Price2Spy 
features"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/","url":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/","name":"#3 For ML experts - why is product matching so difficult?","isPartOf":{"@id":"https:\/\/www.price2spy.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#primaryimage"},"image":{"@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#primaryimage"},"thumbnailUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png","datePublished":"2020-06-19T11:11:06+00:00","dateModified":"2020-07-28T10:47:14+00:00","author":{"@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c"},"description":"This chapter is of technical nature, and it explains the difficulties Price2Spy\u2019s team had to overcome when building the ML model for product 
matching.","breadcrumb":{"@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#primaryimage","url":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png","contentUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/Matches-1.png","width":600,"height":314},{"@type":"BreadcrumbList","@id":"https:\/\/www.price2spy.com\/blog\/part-3-for-ml-experts-why-is-product-matching-so-difficult\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.price2spy.com\/blog\/"},{"@type":"ListItem","position":2,"name":"(Part #3) For ML experts &#8211; why is product matching so difficult?"}]},{"@type":"WebSite","@id":"https:\/\/www.price2spy.com\/blog\/#website","url":"https:\/\/www.price2spy.com\/blog\/","name":"Price2Spy\u00ae Blog","description":"Price2Spy\u00ae","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.price2spy.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c","name":"Mi\u0161a 
Kruni\u0107","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","caption":"Mi\u0161a Kruni\u0107"},"description":"Father of 2, Husband of 1, CEO of 3 :-)","sameAs":["http:\/\/www.price2spy.com"],"url":"https:\/\/www.price2spy.com\/blog\/author\/misha\/"}]}},"_links":{"self":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7292","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/comments?post=7292"}],"version-history":[{"count":5,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7292\/revisions"}],"predecessor-version":[{"id":7413,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7292\/revisions\/7413"}],"wp:attachment":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/media?parent=7292"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/categories?post=7292"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/tags?post=7292"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}