- Basic Setup
- Image Downloader
- Next Steps
- Full Code Listing
Building A Concurrent Web Scraper With Haskell
updated: April 16, 2012
Let's make a concurrent web scraper! We will use Haskell, because it allows easy concurrency. We will use the HXT library to do the scraping. If you want to follow the
HXT bits, you should be comfortable with Arrows in Haskell. If you're not, take a moment to read up on Arrows.
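If you just want the one-line version of that refresher: the only Arrow operator we really lean on below is >>>, which composes things left to right. Plain functions are Arrows too, so you can get a feel for it in ghci:

ghci> :m + Control.Arrow
ghci> ((+1) >>> (*2)) 3
8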
If you don't care about the scraping bits, jump straight to the concurrency section.
Basic Setup

Make sure you have the required libraries installed:

cabal install hxt
cabal install url
cabal install http
cabal install maybet
First, let's write some basic functions to make life easier for ourselves:
openUrl :: String -> MaybeT IO String
openUrl url = case parseURI url of
    Nothing -> fail ""
    Just u  -> liftIO (getResponseBody =<< simpleHTTP (mkRequest GET u))

css :: ArrowXml a => String -> a XmlTree XmlTree
css tag = multi (hasName tag)

get :: String -> IO (IOSArrow XmlTree (NTree XNode))
get url = do
  contents <- runMaybeT $ openUrl url
  return $ readString [withParseHTML yes, withWarnings no] (fromMaybe "" contents)
I say basic because they will be our building blocks, not because they are easy :P Let's see how they work.
openUrl is a function that will download a web page for us. It returns its result in the MaybeT monad transformer. We can use it like

contents <- runMaybeT $ openUrl "http://example.com"

and contents will be a Just if the operation was successful, or Nothing if it wasn't.
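For instance, here's a small (hypothetical) helper that consumes openUrl's result and reports failure instead of crashing:

-- print the first 80 characters of a page, or a message if the download failed
preview :: String -> IO ()
preview url = do
  result <- runMaybeT $ openUrl url
  case result of
       Nothing   -> putStrLn ("could not fetch " ++ url)
       Just body -> putStrLn (take 80 body)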
css will allow us to select elements by tag name, CSS-selector style, on the downloaded page.
get is where things get interesting. First, we download a page using
contents <- runMaybeT $ openUrl url
like we talked about. Next, we parse the page using
readString [withParseHTML yes, withWarnings no] contents
readString takes some options as its first parameter:
- withParseHTML: Parse as HTML, which makes sure the parser doesn't break on things like the doctype.
- withWarnings: Prints out warnings about malformed HTML if it's switched on. Since so much of the web is malformed HTML, I switched it off :P
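To see readString and css working together on something tiny, here's a self-contained sketch. The inline HTML snippet is made up, and getAttrValue (an HXT arrow we'll use again in a moment) pulls an attribute out of each match:

-- parse an inline snippet and select the image sources from it
demoCss :: IO [String]
demoCss = runX $ doc >>> css "img" >>> getAttrValue "src"
  where doc = readString [withParseHTML yes, withWarnings no]
                         "<p><img src=\"a.png\"><img src=\"b.png\"></p>"
-- should give something like ["a.png","b.png"]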
Now we are ready to start.
Image Downloader

Let's write something that downloads all the images from a given page.
First, let's get a parsed page:
main = do
  page <- get "http://www.reddit.com/r/pics"
page is now an Arrow. We can run this Arrow at any time to get its value by using
runX. Let's try it now:
ghci> runX page
[NTree (XTag "/" [NTree (XAttr "transfer-Status") [NTree (XText "200") ],NTree (XAttr "transfer-Message") [NTree (XText "OK") ],NTree (XAttr "transfer-URI") [NTree (XText "string:") ],NTree (XAttr "source") [NTree (XText "\"<!DOCTYPE html PUBLIC \\\"-//W3C//DTD XHTML 1.0 ...\"") ],NTree (XAttr "transfer-Encoding") [NTree (XText "UNICODE") ],NTree (XAttr "doctype-name")
(...many lines skipped...)
Wow, that looks confusing. Let's select only what we want. Get just the images:
ghci> runX $ page >>> css "img"
[NTree (XTag "img" [NTree (XAttr "id") [NTree (XText "header-img") ],NTree (XAttr "src") [NTree (XText "http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png") ] ...
Aha! Much nicer. Now let's get just the src attributes:
ghci> runX $ page >>> css "img" >>> getAttrValue "src"
["http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png","http://pixel.redditmedia.com/pixel/of_defenestration.png" ...
Done! That was easy. Now all we need to do is download these images and save them to disk.
main = do
  url  <- parseArgs
  doc  <- get url
  imgs <- runX . images $ doc
  sequence_ $ map download imgs
The first three lines of our main function get a list of image links. The images function is very simple:
images tree = tree >>> css "img" >>> getAttrValue "src"
It gets a list of all the image sources, just like we had talked about.
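If you wanted something other than images, the same shape works; for example, a hypothetical links function (handy for the crawler idea at the end) would be:

-- every link target on the page, in document order
links tree = tree >>> css "a" >>> getAttrValue "href"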
The fourth line maps the download function over this list to create a list of IO actions. Then we feed that list into sequence_, which runs the actions one at a time and throws away the return values. We could have used sequence instead, which runs the actions and collects their return values in a list.
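If the sequence_ vs sequence distinction is new, here's a quick sketch using nothing but the Prelude:

-- sequence_ :: [IO a] -> IO ()   -- run the actions, discard the results
-- sequence  :: [IO a] -> IO [a]  -- run the actions, collect the results
demoSeq :: IO ()
demoSeq = do
  sequence_ [print 1, print 2]          -- prints 1 and 2, returns ()
  xs <- sequence [return 1, return 2]   -- runs the actions and keeps [1,2]
  print xs

Now, the download function itself: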
download url = do
  content <- runMaybeT $ openUrl url
  case content of
       Nothing -> putStrLn $ "bad url: " ++ url
       Just _content -> do
            let name = tail . uriPath . fromJust . parseURI $ url
            B.writeFile name (B.pack _content)
We have to write out binary data, so we use the writeFile defined in Data.ByteString.Char8, which operates on ByteStrings. This is why we need to convert our String to a ByteString first, using B.pack.
We are also able to do error checking, thanks to openUrl returning a MaybeT value. If we didn't get any content, we just print out "bad url: [url]". Otherwise we write the image to disk.
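The only slightly cryptic part is how the file name gets derived. Worked through on one of the image URLs from earlier (with Network.URI and Data.Maybe in scope):

ghci> uriPath . fromJust . parseURI $ "http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png"
"/nDSO6j0fVaKEV5Hw.png"
ghci> tail . uriPath . fromJust . parseURI $ "http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png"
"nDSO6j0fVaKEV5Hw.png"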
After all that work, the concurrent bit seems almost anti-climactic.
First, install the parallel-io package:

cabal install parallel-io
Import it into the script:

import Control.Concurrent.ParallelIO
ParallelIO defines a new function called
parallel_ which we can use anywhere we would have used
sequence_. The IO actions will then get performed concurrently.
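The change really is mechanical. Here's a minimal, standalone sketch, assuming only the parallel-io package:

import Control.Concurrent.ParallelIO

-- three IO actions, performed concurrently instead of one after another
main :: IO ()
main = do
  parallel_ [putStrLn "one", putStrLn "two", putStrLn "three"]
  stopGlobalPool   -- clean up the global thread pool when we're done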
Change the end of the script to this:
  ...
  imgs <- runX . images $ doc
  parallel_ $ map download imgs
  stopGlobalPool
stopGlobalPool needs to be called after the last use of a global parallelism combinator. It cleans up the thread pool before shutdown.
Now build the concurrent version (enabling runtime system options):
$ ghc --make grabber_par.hs -threaded -rtsopts
And run it with +RTS -N[number of threads] (plus the URL, as before):

$ ./grabber_par "http://www.reddit.com/r/pics" +RTS -N4
Here's how the two versions performed on my machine:
$ time ./grabber "http://www.reddit.com/r/pics"

real    0m10.341s
user    0m0.203s
sys     0m0.048s
With parallelization (four threads):
$ time ./grabber_par "http://www.reddit.com/r/pics" +RTS -N4

real    0m3.490s
user    0m0.477s
sys     0m0.154s
Almost a third of the time!
The ParallelIO library uses MVars to keep things in sync. Read more about ParallelIO or MVars.
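We never have to touch MVars ourselves here, but if you haven't met them: an MVar is a mutable cell that at most one thread can hold at a time, which is what makes it useful for synchronization. A toy sketch of the primitive (not how parallel-io is implemented internally):

import Control.Concurrent.MVar

main :: IO ()
main = do
  counter <- newMVar (0 :: Int)          -- a shared, lockable cell
  modifyMVar_ counter (return . (+1))    -- take the value, update it, put it back
  readMVar counter >>= print             -- prints 1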
Next Steps

The next step would be to turn this into a crawler that follows links on the page up to a depth of N, along with some way of keeping track of pages we've already visited. We would also need to handle file name collisions: if you try to download two images that are both named "test.jpg", the concurrent version will error out, and the non-concurrent version would just overwrite one image with the other, which isn't any good either (one possible fix is sketched below). On the crawling side, we should respect robots.txt files and META tag directives to be polite, and ask for gzip'd data to reduce request time.
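For the name collisions, one hypothetical fix (not in the code above, and reusing the Network.URI imports we already have) is to build the file name from the host plus the full path rather than just the last path segment, so two images named "test.jpg" from different places land in different files:

-- hypothetical replacement for the `tail . uriPath . fromJust . parseURI` naming
safeName :: String -> String
safeName url = map flatten (host ++ uriPath u)
  where u    = fromJust (parseURI url)
        host = maybe "" uriRegName (uriAuthority u)
        flatten '/' = '_'   -- keep it a single flat file name
        flatten c   = c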
We could also parallelize more than just the downloads, but it's a start!
Full Code Listing
import qualified Data.ByteString.Char8 as B
import Data.Tree.NTree.TypeDefs
import Data.Maybe
import Text.XML.HXT.Core
import Control.Monad
import Control.Monad.Trans
import Control.Monad.Maybe
import Network.HTTP
import Network.URI
import System.Environment
import Control.Concurrent.ParallelIO

-- helper function for getting page content
openUrl :: String -> MaybeT IO String
openUrl url = case parseURI url of
    Nothing -> fail ""
    Just u  -> liftIO (getResponseBody =<< simpleHTTP (mkRequest GET u))

css :: ArrowXml a => String -> a XmlTree XmlTree
css tag = multi (hasName tag)

get :: String -> IO (IOSArrow XmlTree (NTree XNode))
get url = do
  contents <- runMaybeT $ openUrl url
  return $ readString [withParseHTML yes, withWarnings no] (fromMaybe "" contents)

-- all of the image sources on the page
images tree = tree >>> css "img" >>> getAttrValue "src"

parseArgs = do
  args <- getArgs
  case args of
       (url:_) -> return url
       _       -> error "usage: grabber [url]"

-- download one image and save it under the last segment of its URL path
download url = do
  content <- runMaybeT $ openUrl url
  case content of
       Nothing -> putStrLn $ "bad url: " ++ url
       Just _content -> do
            let name = tail . uriPath . fromJust . parseURI $ url
            B.writeFile name (B.pack _content)

main = do
  url  <- parseArgs
  doc  <- get url
  imgs <- runX . images $ doc
  parallel_ $ map download imgs
  stopGlobalPool