{"id":258,"date":"2018-04-21T21:36:41","date_gmt":"2018-04-21T21:36:41","guid":{"rendered":"http:\/\/ai.intelligentonlinetools.com\/ml\/?p=258"},"modified":"2018-07-19T00:51:09","modified_gmt":"2018-07-19T00:51:09","slug":"document-similarity","status":"publish","type":"post","link":"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/","title":{"rendered":"Document Similarity, Tokenization and Word Vectors in Python with spaCY"},"content":{"rendered":"<p><img decoding=\"async\" loading=\"lazy\" src=\"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2018\/04\/document_similarity_tokenization_word_embeddings-e1524422364824.png\" alt=\"\" width=\"422\" height=\"196\" class=\"aligncenter size-full wp-image-266\" \/><\/p>\n<p>Calculating document similarity is a very frequent task in Information Retrieval or Text Mining. 
Years ago we would have needed to build a document-term (or term-document) matrix that records how often each term occurs in a collection of documents, and then do vector math on it to find similarity. Now, with spaCy, it can be done in just a few lines. Below you will find how to get document similarity, tokenization and word vectors with spaCy.<\/p>\n<p>spaCy is an open-source library designed to help you build NLP applications. It has a lot of features; in this post we will look at only a few of them, but very useful ones.<\/p>\n<h2>Document Similarity<\/h2>\n<p>Here is how to get document similarity:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport spacy\r\n\r\n# 'en' was the model shortcut when this post was written; newer spaCy\r\n# releases require the full package name instead, e.g. 'en_core_web_md'.\r\nnlp = spacy.load('en')\r\n\r\ndoc1 = nlp(u'Hello this is document similarity calculation')\r\ndoc2 = nlp(u'Hello this is python similarity calculation')\r\ndoc3 = nlp(u'Hi there')\r\n\r\n# similarity() returns a float; higher means more similar\r\nprint (doc1.similarity(doc2)) \r\nprint (doc2.similarity(doc3)) \r\nprint (doc1.similarity(doc3))  \r\n\r\n# Output:\r\n# 0.94\r\n# 0.33\r\n# 0.30\r\n<\/pre>\n<p>In more realistic situations we would load documents from files and work with longer text. Here is the experiment that I performed: I saved 3 articles from different random sites, two about deep learning and one about feature engineering. 
<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ndef get_file_contents(filename):\r\n  # Read the whole file into one string\r\n  with open(filename, 'r') as filehandle:  \r\n    filecontent = filehandle.read()\r\n    return (filecontent) \r\n\r\nfn1=&quot;deep_learning1.txt&quot;\r\nfn2=&quot;feature_eng.txt&quot;\r\nfn3=&quot;deep_learning.txt&quot;\r\n\r\nfn1_doc=get_file_contents(fn1)\r\nprint (fn1_doc)\r\n\r\nfn2_doc=get_file_contents(fn2)\r\nprint (fn2_doc)\r\n\r\nfn3_doc=get_file_contents(fn3)\r\nprint (fn3_doc)\r\n \r\n# nlp is the spaCy pipeline loaded in the previous example\r\ndoc1 = nlp(fn1_doc)\r\ndoc2 = nlp(fn2_doc)\r\ndoc3 = nlp(fn3_doc)\r\n \r\nprint (&quot;dl1 - features&quot;)\r\nprint (doc1.similarity(doc2)) \r\nprint (&quot;feature - dl&quot;)\r\nprint (doc2.similarity(doc3)) \r\nprint (&quot;dl1 - dl&quot;)\r\nprint (doc1.similarity(doc3)) \r\n \r\n&quot;&quot;&quot;\r\noutput:\r\ndl1 - features\r\n0.9700237040142454\r\nfeature - dl\r\n0.9656364096761337\r\ndl1 - dl\r\n0.9547075478662724\r\n&quot;&quot;&quot;\r\n\r\n\r\n<\/pre>\n<p>All three scores are above 0.95 and very close to each other. Note that in this run the two deep learning articles (dl1 - dl) actually received the lowest score of the three pairs, so on these longer documents the similarity score did not clearly separate the topics.<\/p>\n<h2>Tokenization<\/h2>\n<p>Another very useful and simple feature of spaCy is tokenization. Here is how easy it is to convert text into tokens (words):<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Iterating over a processed document yields its tokens\r\nfor token in doc1:\r\n    print(token.text)\r\n    print (token.vector)\r\n<\/pre>\n<h2>Word Vectors<\/h2>\n<p>spaCy has integrated word vector support, while other libraries like NLTK do not. The lines below print word embeddings, which in my environment are arrays of 768 numbers.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\"> \r\nprint (token.vector)   #- prints the word vector of token. 
\r\nprint (doc1[0].vector) #- prints the word vector of the first token of the document.\r\nprint (doc1.vector)    #- prints the mean vector for doc1\r\n<\/pre>\n<p>So we looked at how to use a few features (similarity, tokenization and word embeddings) that are very easy to use with spaCy. I hope you enjoyed this post. If you have any tips or anything else to add, please leave a comment below.<\/p>\n<p><strong>References<\/strong><br \/>\n1. <a href=https:\/\/spacy.io\/ target=\"_blank\">spaCy<\/a><br \/>\n2. <a href=https:\/\/www.shanelynn.ie\/word-embeddings-in-python-with-spacy-and-gensim\/ target=\"_blank\">Word Embeddings in Python with Spacy and Gensim<\/a><\/p>\n<style type=\"text\/css\">\r\n.pjpnh6a5e8f5436605 {\r\nmargin: 5px; padding: 0px;\r\n}\r\n@media screen and (min-width: 1201px) {\r\n.pjpnh6a5e8f5436605 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (min-width: 993px) and (max-width: 1200px) {\r\n.pjpnh6a5e8f5436605 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (min-width: 769px) and (max-width: 992px) {\r\n.pjpnh6a5e8f5436605 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (min-width: 768px) and 
(max-width: 768px) {\r\n.pjpnh6a5e8f5436605 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (max-width: 767px) {\r\n.pjpnh6a5e8f5436605 {\r\ndisplay: block;\r\n}\r\n}\r\n<\/style>\r\n","protected":false},"excerpt":{"rendered":"<p>Calculating document similarity is very frequent task in Information Retrieval or Text Mining. Years ago we would need to build a document-term matrix or term-document matrix that describes the frequency of terms that occur in a collection of documents and then do word vectors math to find similarity. Now by using spaCY it can be &#8230; <a title=\"Document Similarity, Tokenization and Word Vectors in Python with spaCY\" class=\"read-more\" href=\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/\" aria-label=\"More on Document Similarity, Tokenization and Word Vectors in Python with spaCY\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[5],"tags":[29,9,20,19,30,11,28],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Document Similarity, Tokenization and Word Vectors in Python with spaCY - Text Analytics Techniques<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Document Similarity, Tokenization and Word Vectors in Python with spaCY - Text Analytics Techniques\" \/>\n<meta property=\"og:description\" content=\"Calculating 
document similarity is very frequent task in Information Retrieval or Text Mining. Years ago we would need to build a document-term matrix or term-document matrix that describes the frequency of terms that occur in a collection of documents and then do word vectors math to find similarity. Now by using spaCY it can be ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/\" \/>\n<meta property=\"og:site_name\" content=\"Text Analytics Techniques\" \/>\n<meta property=\"article:published_time\" content=\"2018-04-21T21:36:41+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-07-19T00:51:09+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2018\/04\/document_similarity_tokenization_word_embeddings-e1524422364824.png\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/\",\"url\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/\",\"name\":\"Document Similarity, Tokenization and Word Vectors in Python with spaCY - Text Analytics Techniques\",\"isPartOf\":{\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#website\"},\"datePublished\":\"2018-04-21T21:36:41+00:00\",\"dateModified\":\"2018-07-19T00:51:09+00:00\",\"author\":{\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857\"},\"breadcrumb\":{\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Document Similarity, Tokenization and Word Vectors in Python with spaCY\"}]},{\"@type\":\"WebSite\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#website\",\"url\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/\",\"name\":\"Text Analytics Techniques\",\"description\":\"Text Analytics Techniques\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Document Similarity, Tokenization and Word Vectors in Python with spaCY - Text Analytics Techniques","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/","og_locale":"en_US","og_type":"article","og_title":"Document Similarity, Tokenization and Word Vectors in Python with spaCY - Text Analytics Techniques","og_description":"Calculating document similarity is very frequent task in Information Retrieval or Text Mining. Years ago we would need to build a document-term matrix or term-document matrix that describes the frequency of terms that occur in a collection of documents and then do word vectors math to find similarity. Now by using spaCY it can be ... 
Read more","og_url":"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/","og_site_name":"Text Analytics Techniques","article_published_time":"2018-04-21T21:36:41+00:00","article_modified_time":"2018-07-19T00:51:09+00:00","og_image":[{"url":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2018\/04\/document_similarity_tokenization_word_embeddings-e1524422364824.png"}],"author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/","url":"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/","name":"Document Similarity, Tokenization and Word Vectors in Python with spaCY - Text Analytics Techniques","isPartOf":{"@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#website"},"datePublished":"2018-04-21T21:36:41+00:00","dateModified":"2018-07-19T00:51:09+00:00","author":{"@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857"},"breadcrumb":{"@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/document-similarity\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"http:\/\/ai.intelligentonlinetools.com\/ml\/"},{"@type":"ListItem","position":2,"name":"Document Similarity, Tokenization and Word Vectors in Python with spaCY"}]},{"@type":"WebSite","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#website","url":"http:\/\/ai.intelligentonlinetools.com\/ml\/","name":"Text Analytics Techniques","description":"Text Analytics 
Techniques","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"http:\/\/ai.intelligentonlinetools.com\/ml\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"_links":{"self":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/258"}],"collection":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/comments?post=258"}],"version-history":[{"count":9,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/258\/revisions"}],"predecessor-version":[{"id":342,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/258\/revisions\/342"}],"wp:attachment":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/media?parent=258"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/categories?post=258"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/tags?post=258"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}