{"id":404,"date":"2018-09-02T16:58:07","date_gmt":"2018-09-02T16:58:07","guid":{"rendered":"http:\/\/ai.intelligentonlinetools.com\/ml\/?p=404"},"modified":"2018-09-10T00:01:40","modified_gmt":"2018-09-10T00:01:40","slug":"topic-modeling-python-textacy","status":"publish","type":"post","link":"http:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/","title":{"rendered":"Topic Modeling Python and Textacy Example"},"content":{"rendered":"<p>Topic modeling is the automatic discovery of the abstract &#8220;topics&#8221; that occur in a collection of documents.[1] It can be used to give a more informative view of search results, a quick overview of a set of documents, or similar services.<\/p>\n<h2>Textacy<\/h2>\n<p>In this post we will look at topic modeling with textacy. 
Textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library.<br \/>\nIt can flexibly tokenize and vectorize documents and corpora, then train, interpret, and visualize topic models using LSA, LDA, or NMF methods. [2]<br \/>\nTextacy is less well known than other Python libraries such as NLTK, spaCy, and TextBlob [3], but it looks very promising because it is built on top of spaCy.<\/p>\n<p>In this post we will use textacy for the following task: we have a group of documents and we want to extract topics from it. We will use the 20 Newsgroups dataset as the source of documents.<\/p>\n<h2>Code Structure<\/h2>\n<p>Our code consists of the following steps:<br \/>\nGet the data. We will use only 2 groups (&#8216;alt.atheism&#8217;, &#8216;soc.religion.christian&#8217;).<br \/>\nTokenize and remove unneeded characters and stopwords.<br \/>\nVectorize.<br \/>\nExtract topics. Here we do the actual topic modeling, using the Non-negative Matrix Factorization (NMF) method.<br \/>\nOutput a graph of the term&#8211;topic matrix.<\/p>\n<h2>Output<\/h2>\n<p>Below is the final output plot.<\/p>\n<figure id=\"attachment_425\" aria-describedby=\"caption-attachment-425\" style=\"width: 690px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" loading=\"lazy\" src=\"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2018\/09\/Topic-modeling-with-textacy-e1536196093704.png\" alt=\"Topic modeling with textacy\" width=\"700\" height=\"461\" class=\"size-full wp-image-425\" \/><figcaption id=\"caption-attachment-425\" class=\"wp-caption-text\">Topic modeling with textacy<\/figcaption><\/figure>\n<p>Looking at the output graph, we can see the term distribution over the topics. We identified more than 2 topics. For example, topic 2 is associated with atheism, while topic 1 is associated with God and religion. 
<\/p>\n<p>While better data preparation is needed to remove a few more non-meaningful words, the example still shows that topic modeling with textacy is much easier than with some other modules (for example, gensim). This is because textacy covers many of the steps you need after the core NLP processing, rather than only doing the NLP and leaving the user to add data views, heatmaps, or diagrams.<\/p>\n<p>Here are a few links on topic modeling with LDA and gensim (not using textacy). The posts show that more coding is required compared with textacy.<br \/>\n<a href=\"https:\/\/intelligentonlinetools.com\/blog\/2017\/01\/08\/topic-extraction-from-blog-posts-with-lsi-and-lda-and-python\/\" target=\"_blank\">Topic Extraction from Blog Posts with LSI , LDA and Python<\/a><br \/>\n<a href=\"https:\/\/intelligentonlinetools.com\/blog\/2017\/01\/22\/data-visualization-visualizing-an-lda-model-using-python\/\" target=\"_blank\">Data Visualization \u2013 Visualizing an LDA Model using Python<\/a><\/p>\n<h2>Source Code<\/h2>\n<p>Below is the full Python source code.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ncategories = ['alt.atheism', 'soc.religion.christian']\r\n\r\n# Load the data set (training subset).\r\nfrom sklearn.datasets import fetch_20newsgroups\r\n\r\nnewsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=categories, remove=('headers', 'footers', 'quotes'))\r\n\r\n# Check the target names (categories) and some of the data with the following commands.\r\nprint(newsgroups_train.target_names)  # prints all the categories\r\nprint(&quot;\\n&quot;.join(newsgroups_train.data[0].split(&quot;\\n&quot;)[:3]))  # prints the first three lines of the first document\r\nprint(len(newsgroups_train.data))\r\n\r\nlabels = newsgroups_train.target\r\ntexts = newsgroups_train.data\r\n\r\nfrom nltk.corpus import stopwords\r\n\r\nimport textacy\r\nfrom textacy.vsm import 
Vectorizer\r\n\r\n# Build the stopword set once; calling stopwords.words() for every token is very slow.\r\nstop_words = set(stopwords.words('english'))\r\n\r\n# Stray symbols and capitalized words that the lowercase stopword list does not catch.\r\nunwanted = {&quot;|&gt;&quot;, &quot;_&quot;, &quot;-&quot;, &quot;#&quot;, &quot;=&quot;, &quot;:&quot;, &quot;_\/&quot;, &quot;I&quot;, &quot;A&quot;, &quot;The&quot;, &quot;But&quot;, &quot;If&quot;, &quot;It&quot;}\r\n\r\n# Filter while building each list; removing items from a list while iterating over it skips elements.\r\nterms_list = [[tok for tok in doc.split() if tok not in stop_words and tok not in unwanted] for doc in texts]\r\n\r\nprint(&quot;=====================terms_list===============================&quot;)\r\nprint(terms_list)\r\n\r\n# Build a TF-IDF weighted document-term matrix.\r\nvectorizer = Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')\r\ndoc_term_matrix = vectorizer.fit_transform(terms_list)\r\n\r\nprint(&quot;========================doc_term_matrix=======================&quot;)\r\nprint(doc_term_matrix)\r\n\r\n# Initialize and train a topic model.\r\nmodel = textacy.tm.TopicModel('nmf', n_topics=20)\r\nmodel.fit(doc_term_matrix)\r\n\r\nprint(&quot;======================model=================&quot;)\r\nprint(model)\r\n\r\n# Show the top terms for the first two topics, then the weight of every topic.\r\ndoc_topic_matrix = model.transform(doc_term_matrix)\r\nfor topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0,1]):\r\n    print('topic', topic_idx, ':', '   '.join(top_terms))\r\n\r\nfor i, val in enumerate(model.topic_weights(doc_topic_matrix)):\r\n    print(i, val)\r\n\r\nprint(&quot;vectorizer.id_to_term&quot;)\r\nprint 
(vectorizer.id_to_term)\r\n\r\n# Plot the term-topic matrix for all topics (topics=-1), then save the model.\r\nmodel.termite_plot(doc_term_matrix, vectorizer.id_to_term, topics=-1, n_terms=25, sort_terms_by='seriation')\r\nmodel.save('nmf-20topics.pkl')\r\n<\/pre>\n<p><strong>References<\/strong><br \/>\n1. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Topic_model\" target=\"_blank\">Topic Model<\/a><br \/>\n2. <a href=\"https:\/\/chartbeat-labs.github.io\/textacy\/index.html\" target=\"_blank\">textacy: NLP, before and after spaCy<\/a><br \/>\n3. <a href=\"https:\/\/elitedatascience.com\/python-nlp-libraries\" target=\"_blank\">5 Heroic Python NLP Libraries<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Topic modeling is automatic discovering the abstract &#8220;topics&#8221; that occur in a collection of documents.[1] It can be used for providing more informative view of search results, quick overview for set of documents or some other services. Textacy In this post we will look at topic modeling with textacy. Textacy is a Python library for &#8230; <a title=\"Topic Modeling Python and Textacy Example\" class=\"read-more\" href=\"http:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/\" aria-label=\"More on Topic Modeling Python and Textacy Example\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[40],"tags":[43,41,42],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Topic Modeling Python and Textacy Example - Text Analytics Techniques<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Topic Modeling Python and Textacy Example - Text Analytics Techniques\" \/>\n<meta property=\"og:description\" content=\"Topic modeling is automatic discovering the abstract &#8220;topics&#8221; that occur in a collection of documents.[1] It can be used for providing more 
informative view of search results, quick overview for set of documents or some other services. Textacy In this post we will look at topic modeling with textacy. Textacy is a Python library for ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/\" \/>\n<meta property=\"og:site_name\" content=\"Text Analytics Techniques\" \/>\n<meta property=\"article:published_time\" content=\"2018-09-02T16:58:07+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-09-10T00:01:40+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2018\/09\/Topic-modeling-with-textacy-e1536196093704.png\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/\",\"url\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/\",\"name\":\"Topic Modeling Python and Textacy Example - Text Analytics Techniques\",\"isPartOf\":{\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#website\"},\"datePublished\":\"2018-09-02T16:58:07+00:00\",\"dateModified\":\"2018-09-10T00:01:40+00:00\",\"author\":{\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857\"},\"breadcrumb\":{\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Topic Modeling Python and Textacy Example\"}]},{\"@type\":\"WebSite\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#website\",\"url\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/\",\"name\":\"Text Analytics Techniques\",\"description\":\"Text Analytics Techniques\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/image\/\",\"url\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Topic Modeling Python and Textacy Example - Text Analytics Techniques","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/","og_locale":"en_US","og_type":"article","og_title":"Topic Modeling Python and Textacy Example - Text Analytics Techniques","og_description":"Topic modeling is automatic discovering the abstract &#8220;topics&#8221; that occur in a collection of documents.[1] It can be used for providing more informative view of search results, quick overview for set of documents or some other services. Textacy In this post we will look at topic modeling with textacy. Textacy is a Python library for ... Read more","og_url":"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/","og_site_name":"Text Analytics Techniques","article_published_time":"2018-09-02T16:58:07+00:00","article_modified_time":"2018-09-10T00:01:40+00:00","og_image":[{"url":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2018\/09\/Topic-modeling-with-textacy-e1536196093704.png"}],"author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/","url":"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/","name":"Topic Modeling Python and Textacy Example - Text Analytics Techniques","isPartOf":{"@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#website"},"datePublished":"2018-09-02T16:58:07+00:00","dateModified":"2018-09-10T00:01:40+00:00","author":{"@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857"},"breadcrumb":{"@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/topic-modeling-python-textacy\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"http:\/\/ai.intelligentonlinetools.com\/ml\/"},{"@type":"ListItem","position":2,"name":"Topic Modeling Python and Textacy Example"}]},{"@type":"WebSite","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#website","url":"http:\/\/ai.intelligentonlinetools.com\/ml\/","name":"Text Analytics Techniques","description":"Text Analytics Techniques","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"http:\/\/ai.intelligentonlinetools.com\/ml\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/image\/","url":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"_links":{"self":[{"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/404"}],"collection":[{"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/comments?post=404"}],"version-history":[{"count":20,"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/404\/revisions"}],"predecessor-version":[{"id":442,"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/404\/revisions\/442"}],"wp:attachment":[{"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/media?parent=404"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/categories?post=404"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/tags?post=404"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}