{"id":908,"date":"2019-05-27T14:36:12","date_gmt":"2019-05-27T14:36:12","guid":{"rendered":"http:\/\/ai.intelligentonlinetools.com\/ml\/?p=908"},"modified":"2019-06-09T13:28:25","modified_gmt":"2019-06-09T13:28:25","slug":"how-to-extract-text-from-website","status":"publish","type":"post","link":"https:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/","title":{"rendered":"How to Extract Text from Website"},"content":{"rendered":"<div class=\"xqbsr69d78391c9283\" ><script async src=\"\/\/pagead2.googlesyndication.com\/pagead\/js\/adsbygoogle.js\"><\/script>\n<!-- Text analytics techniques 728_90 horizontal top -->\n<ins class=\"adsbygoogle\"\n     style=\"display:inline-block;width:728px;height:90px\"\n     data-ad-client=\"ca-pub-3416618249440971\"\n     data-ad-slot=\"2926649501\"><\/ins>\n<script>\n(adsbygoogle = window.adsbygoogle || []).push({});\n<\/script><\/div><style type=\"text\/css\">\r\n.xqbsr69d78391c9283 {\r\nmargin: 5px; padding: 0px;\r\n}\r\n@media screen and (min-width: 1201px) {\r\n.xqbsr69d78391c9283 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (min-width: 993px) and (max-width: 1200px) {\r\n.xqbsr69d78391c9283 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (min-width: 769px) and (max-width: 992px) {\r\n.xqbsr69d78391c9283 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (min-width: 768px) and (max-width: 768px) {\r\n.xqbsr69d78391c9283 {\r\ndisplay: block;\r\n}\r\n}\r\n@media screen and (max-width: 767px) {\r\n.xqbsr69d78391c9283 {\r\ndisplay: block;\r\n}\r\n}\r\n<\/style>\r\n<p>Extracting data from the Web using scripts (web scraping) is widely used today for numerous purposes. One of the parts of this process is downloading actual text from urls.  
This will be the topic of this post.<br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2019\/05\/extracting-text-from-html-pdf.jpg\" alt=\"\" width=\"379\" height=\"417\" class=\"aligncenter size-full wp-image-930\" srcset=\"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2019\/05\/extracting-text-from-html-pdf.jpg 379w, https:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2019\/05\/extracting-text-from-html-pdf-273x300.jpg 273w\" sizes=\"(max-width: 379px) 100vw, 379px\" \/><br \/>\nWe will consider how it can be done using the following case examples:<br \/>\nExtracting information from the visited links in the Chrome browser history.<\/p>\n<p>Extracting information from a list of links. For example, in the previous post we looked at how to extract links from Twitter search results into a CSV file. This file will now serve as the source of links.<\/p>\n<p>Below is the Python implementation of the main parts of the script. It draws on a few code snippets and posts from the web; references and the full source code are provided at the end.<\/p>\n<p><strong>Switching Between Cases<\/strong><br \/>\nThe script uses the variable USE_LINKS_FROM_CHROME_HISTORY to select the program flow: if it is True, links are extracted from Chrome history; otherwise they are read from a CSV file.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nresults = []\r\nif USE_LINKS_FROM_CHROME_HISTORY:\r\n    results = get_links_from_chrome_history()\r\n    fname = &quot;data_from_chrome_history_links.csv&quot;\r\nelse:\r\n    results = get_links_from_csv_file()\r\n    fname = &quot;data_from_file_links.csv&quot;\r\n<\/pre>\n<p><strong>Extracting Content From HTML Links<\/strong><br \/>\nWe use the Python library BeautifulSoup for processing HTML and the requests library for downloading it:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nfrom bs4 import BeautifulSoup\r\nfrom bs4.element import Comment\r\nimport requests\r\n\r\ndef tag_visible(element):\r\n    # skip text nodes inside non-content tags and HTML comments\r\n    if element.parent.name in ['style', 'script', 'head', 'meta', '[document]']:\r\n        return False\r\n    if isinstance(element, Comment):\r\n        return False\r\n    return True\r\n\r\ndef get_text(url):\r\n    print (url)\r\n    try:\r\n        req = requests.get(url, timeout=5)\r\n    except requests.exceptions.RequestException:\r\n        return &quot;TIMEOUT ERROR&quot;\r\n    data = req.text\r\n    soup = BeautifulSoup(data, &quot;html.parser&quot;)\r\n    texts = soup.findAll(text=True)\r\n    visible_texts = filter(tag_visible, texts)\r\n    return u&quot; &quot;.join(t.strip() for t in visible_texts)\r\n<\/pre>\n<p><strong>Extracting Content from PDF Format with PDF to Text Python<\/strong><\/p>\n<p>Not all links return an HTML page. Some lead to a PDF document, which requires a different process for getting the text. There are several possible solutions; here we will use the pdftotext executable [2]. With this method we create the function below and call it when the URL ends with &#8220;.pdf&#8221;. 
<\/p>\n<p>To perform the actual conversion from PDF to text we use subprocess.call and provide the location of the pdftotext.exe file, the filename of the PDF file and the filename of the new text file. Note that we first download the PDF page to a file on the local drive.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport subprocess\r\ndef get_txt_from_pdf(url):\r\n    myfile = requests.get(url, timeout=8)\r\n    myfile_name = url.split(&quot;\/&quot;)[-1]\r\n    myfile_name_wout_ext = myfile_name[0:-4]\r\n    open('C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name, 'wb').write(myfile.content)\r\n    # pass full paths so pdftotext finds the files regardless of the working directory\r\n    subprocess.call(['C:\\\\Users\\\\username\\\\pythonrun\\\\pdftotext\\\\pdftotext',\r\n                     'C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name,\r\n                     'C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name_wout_ext + &quot;.txt&quot;])\r\n    with open('C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name_wout_ext + &quot;.txt&quot;, 'r') as content_file:\r\n        content = content_file.read()\r\n    return content\r\n\r\n# usage in the main loop (full_url is the original link):\r\nif full_url.endswith(&quot;.pdf&quot;):\r\n    txt = get_txt_from_pdf(full_url)\r\n<\/pre>\n<p><strong>Cleaning Extracted Text<\/strong><br \/>\nOnce text is extracted from PDF or HTML we need to remove text that is not useful.<br \/>\nBelow are the processing actions implemented in the script:<\/p>\n<ul>\n<li>remove non-content text such as scripts and HTML tags (HTML pages only)<\/li>\n<li>remove non-text characters<\/li>\n<li>remove repeating spaces<\/li>\n<li>remove documents shorter than a minimum number of characters (MIN_LENGTH_of_document)<\/li>\n<li>remove failed request results &#8211; for example, a request for a specific link that was not successful but still returned some text<\/li>\n<\/ul>\n<p><strong>Getting Links from Chrome History<\/strong><br \/>\nTo get the visited links we query the Chrome web browser database with a simple SQL statement. This is well described on other web blogs. 
You can also find a link in the references below [1].<\/p>\n<p>Additionally, when extracting from Chrome history we need to remove links that are out of scope. For example, if you are extracting the links you used for reading about data mining, then links where you access your banking site or friends on Facebook are not relevant.<\/p>\n<p>To filter out unrelated links we can add filtering criteria to the SQL statement with NOT Like or &lt;&gt; as below:<br \/>\nselect_statement = &#8220;SELECT urls.url FROM urls WHERE urls.url NOT Like &#8216;%localhost%&#8217; AND urls.url NOT Like &#8216;%google%&#8217; AND urls.visit_count &gt; 0 AND urls.url &lt;&gt; &#8216;https:\/\/www.reddit.com\/&#8217; ;&#8221;<\/p>\n<p><strong>Conclusion<\/strong><br \/>\nWe learned how to extract text from a website (PDF or HTML). We built the script for two practical examples: using links from the Chrome web browser history, or using a list of links extracted from elsewhere, for example from Twitter search results. The next step would be to extract insights from the obtained text data using machine learning or text mining. For example, from the Chrome history we could identify the questions a developer searches most frequently in the web browser and create a faster way to access that information.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# -*- coding: utf-8 -*-\r\n\r\nimport os\r\nimport sqlite3\r\nimport time\r\nimport csv\r\nimport re\r\nimport subprocess\r\n\r\nfrom bs4 import BeautifulSoup\r\nfrom bs4.element import Comment\r\nimport requests\r\n\r\nMIN_LENGTH_of_document = 40\r\nMIN_LENGTH_of_word = 2\r\nUSE_LINKS_FROM_CHROME_HISTORY = False  # if False, links are read from a csv file\r\n\r\ndef remove_min_words(txt):\r\n    # remove words of length 1\r\n    shortword = re.compile(r'\\W*\\b\\w{1,1}\\b')\r\n    return shortword.sub('', txt)\r\n\r\ndef clean_txt(text):\r\n    # keep only letters, periods and spaces\r\n    text = re.sub('[^A-Za-z. ]', ' ', text)\r\n    text = ' '.join(text.split())\r\n    text = remove_min_words(text)\r\n    text = text.lower()\r\n    text = text if len(text) &gt;= MIN_LENGTH_of_document else &quot;&quot;\r\n    return text\r\n\r\ndef tag_visible(element):\r\n    # skip text nodes inside non-content tags and HTML comments\r\n    if element.parent.name in ['style', 'script', 'head', 'meta', '[document]']:\r\n        return False\r\n    if isinstance(element, Comment):\r\n        return False\r\n    return True\r\n\r\ndef get_txt_from_pdf(url):\r\n    myfile = requests.get(url, timeout=8)\r\n    myfile_name = url.split(&quot;\/&quot;)[-1]\r\n    myfile_name_wout_ext = myfile_name[0:-4]\r\n    open('C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name, 'wb').write(myfile.content)\r\n    # pass full paths so pdftotext finds the files regardless of the working directory\r\n    subprocess.call(['C:\\\\Users\\\\username\\\\pythonrun\\\\pdftotext\\\\pdftotext',\r\n                     'C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name,\r\n                     'C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name_wout_ext + &quot;.txt&quot;])\r\n    with open('C:\\\\Users\\\\username\\\\Downloads\\\\' + myfile_name_wout_ext + &quot;.txt&quot;, 'r') as content_file:\r\n        content = content_file.read()\r\n    return content\r\n\r\ndef get_text(url):\r\n    print (url)\r\n    try:\r\n        req = requests.get(url, timeout=5)\r\n    except requests.exceptions.RequestException:\r\n        return &quot;TIMEOUT ERROR&quot;\r\n    data = req.text\r\n    soup = BeautifulSoup(data, &quot;html.parser&quot;)\r\n    texts = soup.findAll(text=True)\r\n    visible_texts = filter(tag_visible, texts)\r\n    return u&quot; &quot;.join(t.strip() for t in visible_texts)\r\n\r\ndef parse(url):\r\n    # extract the domain part of the url\r\n    try:\r\n        parsed_url_components = url.split('\/\/')\r\n        sublevel_split = parsed_url_components[1].split('\/', 1)\r\n        domain = sublevel_split[0].replace(&quot;www.&quot;, &quot;&quot;)\r\n        return domain\r\n    except IndexError:\r\n        print (&quot;URL format error!&quot;)\r\n        return &quot;&quot;\r\n\r\ndef get_links_from_chrome_history():\r\n    # path to user's history database (Chrome)\r\n    data_path = os.path.expanduser('~') + &quot;\\\\AppData\\\\Local\\\\Google\\\\Chrome\\\\User Data\\\\Default&quot;\r\n    history_db = os.path.join(data_path, 'history')\r\n\r\n    # querying the db\r\n    c = sqlite3.connect(history_db)\r\n    cursor = c.cursor()\r\n    select_statement = &quot;SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count &gt; 0 AND urls.url &lt;&gt; 'https:\/\/www.reddit.com\/' ;&quot;\r\n    cursor.execute(select_statement)\r\n    results_tuples = cursor.fetchall()\r\n    return [x[0] for x in results_tuples]\r\n\r\ndef get_links_from_csv_file():\r\n    links_from_csv = []\r\n    filename = 'C:\\\\Users\\\\username\\\\pythonrun\\\\links.csv'\r\n    col_id = 0\r\n    with open(filename, newline='', encoding='utf-8-sig') as f:\r\n        reader = csv.reader(f)\r\n        try:\r\n            for row in reader:\r\n                links_from_csv.append(row[col_id])\r\n        except csv.Error as e:\r\n            print('file {}, line {}: {}'.format(filename, reader.line_num, e))\r\n    return links_from_csv\r\n\r\nresults = []\r\nif USE_LINKS_FROM_CHROME_HISTORY:\r\n    results = get_links_from_chrome_history()\r\n    fname = &quot;data_from_chrome_history_links.csv&quot;\r\nelse:\r\n    results = get_links_from_csv_file()\r\n    fname = &quot;data_from_file_links.csv&quot;\r\n\r\nsites_count = {}\r\nfull_sites_count = {}\r\n\r\nwith open(fname, 'w', encoding=&quot;utf8&quot;, newline='') as csvfile:\r\n    fieldnames = ['URL', 'URL Base', 'TXT']\r\n    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n    writer.writeheader()\r\n\r\n    count_url = 0\r\n    for url in results:\r\n        print (url)\r\n        full_url = url\r\n        url = parse(url)\r\n\r\n        if full_url in full_sites_count:\r\n            full_sites_count[full_url] += 1\r\n        else:\r\n            full_sites_count[full_url] = 1\r\n            # check the original link, not the parsed domain, for the pdf extension\r\n            if full_url.endswith(&quot;.pdf&quot;):\r\n                txt = get_txt_from_pdf(full_url)\r\n            else:\r\n                txt = get_text(full_url)\r\n            txt = clean_txt(txt)\r\n            writer.writerow({'URL': full_url, 'URL Base': url, 'TXT': txt})\r\n            time.sleep(4)\r\n\r\n        if url in sites_count:\r\n            sites_count[url] += 1\r\n        else:\r\n            sites_count[url] = 1\r\n\r\n        count_url += 1\r\n<\/pre>\n<p><strong>References<\/strong><br \/>\n1. <a href=\"https:\/\/geekswipe.net\/technology\/computing\/analyze-chromes-browsing-history-with-python\/\" target=\"_blank\">Analyze Chrome\u2019s Browsing History with Python<\/a><br \/>\n2. <a href=\"http:\/\/www.xpdfreader.com\/download.html\" target=\"_blank\">XpdfReader<\/a><br \/>\n3. <a href=\"https:\/\/www.w3resource.com\/python-exercises\/re\/python-re-exercise-49.php\" target=\"_blank\">Python: Remove words from a string of length between 1 and a given number<\/a><br \/>\n4. <a href=\"https:\/\/stackoverflow.com\/questions\/1936466\/beautifulsoup-grab-visible-webpage-text\" target=\"_blank\">BeautifulSoup Grab Visible Webpage Text<\/a><br \/>\n5. <a href=\"https:\/\/codeburst.io\/web-scraping-101-with-python-beautiful-soup-bb617be1f486\" target=\"_blank\">Web Scraping 101 with Python &#038; Beautiful Soup<\/a><br \/>\n6. <a href=\"https:\/\/likegeeks.com\/downloading-files-using-python\/\" target=\"_blank\">Downloading Files Using Python (Simple Examples)<\/a><br \/>\n7. <a href=\"https:\/\/blog.pusher.com\/introduction-web-scraping-python\/\" target=\"_blank\">Introduction to web scraping in Python<\/a><br \/>\n8. 
<a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2018\/02\/the-different-methods-deal-text-data-predictive-python\/\" target=\"_blank\">Ultimate guide to deal with Text Data (using Python) \u2013 for Data Scientists and Engineers<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Extracting data from the Web using scripts (web scraping) is widely used today for numerous purposes. One part of this process is downloading the actual text from URLs. 
This will be the topic of this post. We will consider how it can be done using the following case examples: Extracting information from visited links &#8230; <a title=\"How to Extract Text from Website\" class=\"read-more\" href=\"https:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/\" aria-label=\"More on How to Extract Text from Website\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[65,51],"tags":[69,67,66,20,6,68],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to Extract Text from Website - Text Analytics Techniques<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Extract Text from Website - Text Analytics Techniques\" \/>\n<meta property=\"og:description\" content=\"Extracting data from the Web using scripts (web scraping) is widely used today for numerous purposes. One of the parts of this process is downloading actual text from urls. This will be the topic of this post. We will consider how it can be done using the following case examples: Extracting information from visited links ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/\" \/>\n<meta property=\"og:site_name\" content=\"Text Analytics Techniques\" \/>\n<meta property=\"article:published_time\" content=\"2019-05-27T14:36:12+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-06-09T13:28:25+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2019\/05\/extracting-text-from-html-pdf.jpg\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/\",\"url\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/\",\"name\":\"How to Extract Text from Website - Text Analytics 
Techniques\",\"isPartOf\":{\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/#website\"},\"datePublished\":\"2019-05-27T14:36:12+00:00\",\"dateModified\":\"2019-06-09T13:28:25+00:00\",\"author\":{\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857\"},\"breadcrumb\":{\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How to Extract Text from Website\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/#website\",\"url\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/\",\"name\":\"Text Analytics Techniques\",\"description\":\"Text Analytics Techniques\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ 
Yoast SEO plugin. -->","yoast_head_json":{"title":"How to Extract Text from Website - Text Analytics Techniques","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/","og_locale":"en_US","og_type":"article","og_title":"How to Extract Text from Website - Text Analytics Techniques","og_description":"Extracting data from the Web using scripts (web scraping) is widely used today for numerous purposes. One of the parts of this process is downloading actual text from urls. This will be the topic of this post. We will consider how it can be done using the following case examples: Extracting information from visited links ... Read more","og_url":"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/","og_site_name":"Text Analytics Techniques","article_published_time":"2019-05-27T14:36:12+00:00","article_modified_time":"2019-06-09T13:28:25+00:00","og_image":[{"url":"http:\/\/ai.intelligentonlinetools.com\/ml\/wp-content\/uploads\/2019\/05\/extracting-text-from-html-pdf.jpg"}],"author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/","url":"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/","name":"How to Extract Text from Website - Text Analytics Techniques","isPartOf":{"@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/#website"},"datePublished":"2019-05-27T14:36:12+00:00","dateModified":"2019-06-09T13:28:25+00:00","author":{"@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857"},"breadcrumb":{"@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/"]}]},{"@type":"BreadcrumbList","@id":"http:\/\/ai.intelligentonlinetools.com\/ml\/how-to-extract-text-from-website\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ai.intelligentonlinetools.com\/ml\/"},{"@type":"ListItem","position":2,"name":"How to Extract Text from Website"}]},{"@type":"WebSite","@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/#website","url":"https:\/\/ai.intelligentonlinetools.com\/ml\/","name":"Text Analytics Techniques","description":"Text Analytics Techniques","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ai.intelligentonlinetools.com\/ml\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/832f10562faaa1c7ed668c1ab4388857","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ai.intelligentonlinetools.com\/ml\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"_links":{"self":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/908"}],"collection":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/comments?post=908"}],"version-history":[{"count":26,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/908\/revisions"}],"predecessor-version":[{"id":988,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/posts\/908\/revisions\/988"}],"wp:attachment":[{"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/media?parent=908"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/categories?post=908"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ai.intelligentonlinetools.com\/ml\/wp-json\/wp\/v2\/tags?post=908"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}