{"id":7297,"date":"2020-06-19T11:35:05","date_gmt":"2020-06-19T11:35:05","guid":{"rendered":"https:\/\/www.price2spy.com\/blog\/?p=7297"},"modified":"2020-07-28T10:50:42","modified_gmt":"2020-07-28T10:50:42","slug":"part-5-ml-training-implementation","status":"publish","type":"post","link":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/","title":{"rendered":"(Part #5) ML training Implementation"},"content":{"rendered":"\n<p> <a rel=\"noreferrer noopener\" aria-label=\"Product matching in Price2Spy  (opens in a new tab)\" href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\" target=\"_blank\">Product matching in Price2Spy <\/a><\/p>\n\n\n\n<p> <strong>Previous topic: <\/strong> <a rel=\"noreferrer noopener\" aria-label=\"(Part #4) Preparing the ML training set (opens in a new tab)\" href=\"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/\" target=\"_blank\">(Part #4) Preparing the ML training set<\/a> <\/p>\n\n\n\n<p> <strong>Next topic: <\/strong><a href=\"https:\/\/www.price2spy.com\/blog\/part-6-evaluating-ml-training-results\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"(Part #6)  Evaluating ML training results  (opens in a new tab)\">(Part #6)  Evaluating ML training results <\/a><\/p>\n\n\n\n<p>Our very first implementation step was a bit non-standard. While most of ML is done in Python, <a href=\"https:\/\/www.price2spy.com\/\">Price2Spy<\/a> is a Java shop. We respect other technologies, but our love and our choice go to Java. And finding reliable ML libraries in Java is no piece of cake (though I must say, as months go by, more and more Java enthusiasts are posting ML libraries in the public domain).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pre-processing<\/h3>\n\n\n\n<p>Next problem: sheer numbers. <a href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\">Product matching<\/a> has one nasty logical thing about it. 
A match is a pair \u2013 Product A is (or is not) a match for Product B. And what if Set A has 50K products, while Set B has 40K? <\/p>\n\n\n\n<p>That means 50K x 40K = 2G (2 000 000 000) combinations to be checked! And for each combination, we\u2019ll have to perform 30-40 computing operations (called \u2018features\u2019). Even with modern CPU power, that\u2019s just too much (we ruled out a cloud solution very early, since the cost would be just too high). <\/p>\n\n\n\n<p>A simple calculation showed that our training process (just one single iteration) would take 3.5 years!<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"296\" height=\"269\" src=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png\" alt=\"ML training process calculation\" class=\"wp-image-7298\" srcset=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png 296w, https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5-768x697.png 768w\" sizes=\"auto, (max-width: 296px) 100vw, 296px\" \/><\/figure><\/div>\n\n\n\n<p>This is where we had to pull out the first ML trick \u2013 it\u2019s called \u2018blocking\u2019. What blocking does in practice (by performing operations which are not that heavy on CPU) is eliminate candidate pairs which are too improbable to be matches. We made blocking configurable, so we were able to set a threshold for what is considered improbable enough. <\/p>\n\n\n\n<p>Thanks to blocking, we have reduced 2.0G to 1.8M matching combinations to be checked. So, from 3.5 years, we get to 20 hours of CPU time. What a relief! 
And not only have we reduced the sheer number of combinations, we have also ensured that our training set contains both good matches and wrong matches which are not that improbable \u2013 which means better learning examples.<\/p>\n\n\n\n<p>However, that\u2019s not all \u2013 we\u2019re living in the real world, and in reality, data is not 100% clean. For example, the same product might be listed twice on a particular website \u2013 under 2 different categories. This is very dangerous for ML, because it sees them as 2 separate products \u2013 only one of them will be matched, while both of them have the same characteristics. This means \u2018noise\u2019, which results in lower accuracy.<\/p>\n\n\n\n<p>The answer to this was another ML technique called \u2018deduplication\u2019 \u2013 thank God, that was much easier than blocking.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">ML feature set<\/h3>\n\n\n\n<p>The next thing to tackle was \u2013 which features to use. Please note that we consider this the heart of our ML project, so we won\u2019t be able to reveal all \u2018secret ingredients\u2019. What we can say is that the number of features is around 40 and that they can be categorized as:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Features of linguistic proximity<\/li><li>Features targeting alpha-numeric cases<\/li><li>Price-related features<\/li><li>Brand-related features (including brand synonyms)<\/li><li>Entity recognition (useful for Black vs Schwarz cases)<\/li><li>Image-related features were left out of scope in the 1<sup>st<\/sup> version of the project<\/li><\/ul>\n\n\n\n<p>The next question to solve was \u2013 which ML algorithm to use? Our preference went to Random Forest (RF). RF has a configurable degree of randomness, which makes it more resilient to \u2018real-life\u2019 examples. The term \u2018forest\u2019 refers to the decision-making process \u2013 instead of relying on a single decision tree, the model grows many trees, which together form a forest. 
Let\u2019s try to be practical: suppose that the algorithm performs X comparisons for each combination of potential matches. So, we will ask X questions, and depending on the answers, we will consider it a match or not. If we ask 100 questions and get a \u2018yes\u2019 98 times, that means that our combination has a 98% probability of being a match. Or, in ML terminology, it has a matching score of 0.98.<\/p>\n\n\n\n<p>Of course, the more questions you have, the more CPU time and the more RAM you will need. We have done a lot of experimenting and finally, we chose an RF of 500 trees, with some extra RAM that made our CTO very happy :) The training process took an average of 20 hours, which is not great if you get something wrong and need to run it again. This is how we learned that we need to prepare 3 training sets (small, medium, large). Once you\u2019re happy with the outcome of the small set, you move on to the medium one, and only then do you proceed to the large one \u2013 which will result in your ML model.<\/p>\n\n\n\n<p><strong>Find more information here:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a rel=\"noreferrer noopener\" href=\"https:\/\/www.price2spy.com\/en\/pricing\/product-matching.html\" target=\"_blank\">Product matching in Price2Spy <\/a> <\/li><li> <strong>Previous topic: <\/strong> <a rel=\"noreferrer noopener\" href=\"https:\/\/www.price2spy.com\/blog\/part-4-preparing-the-ml-training-set\/\" target=\"_blank\">(Part #4) Preparing the ML training set<\/a>  <\/li><li> <strong>Next topic: <\/strong><a href=\"https:\/\/www.price2spy.com\/blog\/part-6-evaluating-ml-training-results\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"(Part #6)  Evaluating ML training results  (opens in a new tab)\">(Part #6)  Evaluating ML training results <\/a> <\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Product matching in Price2Spy Previous topic: (Part #4) Preparing the ML training set Next topic: (Part #6) 
Evaluating ML training results Our very first implementation step was a bit non-standard. While most of ML is done in Python, Price2Spy is a Java shop. We respect&#8230;<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[108,167],"tags":[190,645,646,15,81],"class_list":["post-7297","post","type-post","status-publish","format-standard","hentry","category-best-practices","category-new-price2spy-features","tag-ecommerce","tag-machine-learning","tag-ml","tag-price2spy","tag-product-matching"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>(Part #5) ML training Implementation<\/title>\n<meta name=\"description\" content=\"How do we conduct the ML training Implementation?\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"(Part #5) ML training Implementation\" \/>\n<meta property=\"og:description\" content=\"How do we conduct the ML training Implementation?\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/\" \/>\n<meta property=\"og:site_name\" content=\"Price2Spy\u00ae Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Price2Spy\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-06-19T11:35:05+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-07-28T10:50:42+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png\" \/>\n<meta 
name=\"author\" content=\"Mi\u0161a Kruni\u0107\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Price2Spy\" \/>\n<meta name=\"twitter:site\" content=\"@Price2Spy\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mi\u0161a Kruni\u0107\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"(Part #5) ML training Implementation","description":"How do we conduct the ML training Implementation?","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/","og_locale":"en_US","og_type":"article","og_title":"(Part #5) ML training Implementation","og_description":"How do we conduct the ML training Implementation?","og_url":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/","og_site_name":"Price2Spy\u00ae Blog","article_publisher":"https:\/\/www.facebook.com\/Price2Spy\/","article_published_time":"2020-06-19T11:35:05+00:00","article_modified_time":"2020-07-28T10:50:42+00:00","og_image":[{"url":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png","type":"","width":"","height":""}],"author":"Mi\u0161a Kruni\u0107","twitter_card":"summary_large_image","twitter_creator":"@Price2Spy","twitter_site":"@Price2Spy","twitter_misc":{"Written by":"Mi\u0161a Kruni\u0107","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#article","isPartOf":{"@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/"},"author":{"name":"Mi\u0161a Kruni\u0107","@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c"},"headline":"(Part #5) ML training Implementation","datePublished":"2020-06-19T11:35:05+00:00","dateModified":"2020-07-28T10:50:42+00:00","mainEntityOfPage":{"@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/"},"wordCount":767,"commentCount":0,"image":{"@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png","keywords":["ecommerce","machine learning","ml","price2spy","product matching"],"articleSection":["Best practices in price monitoring","New Price2Spy features"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/","url":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/","name":"(Part #5) ML training 
Implementation","isPartOf":{"@id":"https:\/\/www.price2spy.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#primaryimage"},"image":{"@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#primaryimage"},"thumbnailUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png","datePublished":"2020-06-19T11:35:05+00:00","dateModified":"2020-07-28T10:50:42+00:00","author":{"@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c"},"description":"How do we conduct the ML training Implementation?","breadcrumb":{"@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#primaryimage","url":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png","contentUrl":"https:\/\/www.price2spy.com\/blog\/wp-content\/uploads\/2020\/06\/5.png","width":296,"height":269},{"@type":"BreadcrumbList","@id":"https:\/\/www.price2spy.com\/blog\/part-5-ml-training-implementation\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.price2spy.com\/blog\/"},{"@type":"ListItem","position":2,"name":"(Part #5) ML training Implementation"}]},{"@type":"WebSite","@id":"https:\/\/www.price2spy.com\/blog\/#website","url":"https:\/\/www.price2spy.com\/blog\/","name":"Price2Spy\u00ae 
Blog","description":"Price2Spy\u00ae","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.price2spy.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.price2spy.com\/blog\/#\/schema\/person\/382ac9db90cb7d6dd54b9425857fc96c","name":"Mi\u0161a Kruni\u0107","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/31aa4afb2464eca1f1ca0c7979628c87e54e7a6b53ebcb371749e9349d27c850?s=96&d=mm&r=g","caption":"Mi\u0161a Kruni\u0107"},"description":"Father of 2, Husband of 1, CEO of 3 :-)","sameAs":["http:\/\/www.price2spy.com"],"url":"https:\/\/www.price2spy.com\/blog\/author\/misha\/"}]}},"_links":{"self":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7297","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/comments?post=7297"}],"version-history":[{"count":3,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7297\/revisions"}],"predecessor-version":[{"id":7415,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/posts\/7297\/revisions\/7415"}],"wp:attachment":[{"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/media?parent=7297"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href
":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/categories?post=7297"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.price2spy.com\/blog\/wp-json\/wp\/v2\/tags?post=7297"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}