Smalltalk Scripting
With fine timing, Chris Petrilli posts a pointer to freely available online Smalltalk books. A perfect opening for some Smalltalk web scraping.
Mission: Download the PDF files for the book Smalltalk By Example.
Sketch of approach: 1. Fetch the page. 2. Parse the HTML. 3. For each link that is a downloadable file, download it.
Tools: Smalltalk/X system browser, workspace, transcript, inspector and a Unix shell.
1. Using the shell and system browser, look for classes about HTTP. Ah, found HTTPInterface.
2. Click about to figure out how to use the class. In workspace, try:
httpResp := HTTPInterface
    get: '/~ducasse/FreeBooks/ByExample/'
    fromHost: 'www.iam.unibe.ch'
3. Inspect httpResp, see that it is an HTTPResponse instance; use system browser to navigate to its implementation, look-look see-see.
4. Find classes about HTML parsing: Found HTMLParser. Figure out how to use the class.
5. In workspace,
parsed := HTMLParser new parseText: httpResp data
6. Inspect parsed, see that it is an HTMLDocument; use system browser to navigate to its implementation, look-look see-see.
7. Ok, the message (or is it called a selector?) 'anchorElements' returns an OrderedCollection of, duh, anchor elements in said HTML document.
8. I want those links with one of these suffixes: pdf, zip or gif. In workspace,
parsed anchorElements do:
    [:each | each hrefString ifNotNil:
        [ |f| f := each hrefString asFilename.
        (((f hasSuffix: 'pdf') or:
            [f hasSuffix: 'zip']) or:
            [f hasSuffix: 'gif']) ifTrue:
                [Transcript showCR: f asString]]]
The statement (expression? command?) "Transcript ..." prints each 'f' in the transcript window. Its output looks like this:
CodeExamples.zip
SmalltalkByExampleNewRelease.pdf
SmalltalkbyExMissingChapter27.pdf
byExample.gif
From here it is just a little more work to apply class HTTPInterface to 'each' (whose hrefString holds the URL) and save the downloaded content into a file named by the Filename instance 'f'.
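For the curious, here is a rough, untested sketch of what that last step might look like in the workspace. It reuses only the messages already seen above (get:fromHost:, data, hrefString, hasSuffix:) plus the standard writeStream, nextPutAll:, ensure: and close. The names host, basePath, fileResp and out are made up for illustration, and two things are assumptions: that the hrefs are relative to the index page, and that 'data' answers something nextPutAll: will accept for binary files such as PDFs.

```smalltalk
"Workspace sketch: save every pdf/zip/gif linked from the index page.
 host and basePath are placeholders; point them at a server you control."
| host basePath httpResp parsed |
host := 'localhost'.
basePath := '/books/'.

httpResp := HTTPInterface get: basePath fromHost: host.
parsed := HTMLParser new parseText: httpResp data.

parsed anchorElements do: [:each |
    each hrefString ifNotNil: [
        | f fileResp out |
        f := each hrefString asFilename.
        (((f hasSuffix: 'pdf')
            or: [f hasSuffix: 'zip'])
            or: [f hasSuffix: 'gif']) ifTrue: [
                "Assumption: the href is relative to basePath, as in the listing above."
                fileResp := HTTPInterface
                                get: basePath , each hrefString
                                fromHost: host.
                "Assumption: 'data' answers the body in a form nextPutAll: accepts."
                out := f writeStream.
                [out nextPutAll: fileResp data] ensure: [out close].
                Transcript showCR: 'saved ' , f asString]]]
```

The ensure: block is there so the file gets closed even if the write fails halfway. If the saved PDFs come out corrupted, the stream probably needs to be put into binary mode first; I have not checked how HTTPResponse hands back its data.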
All this just by using 'Find Class' and 'Find Method' in the system browser. Nice! I'm looking forward to discovering what the rest of the IDE does.
Conclusion: Don't know how to indent. Don't know the terminology. My fingers want to type a closing parenthesis at the end of each line. ;-) Notwithstanding all that, Smalltalk rocks!
BTW, if you want to try out the above, remember to practise on your own server and not on Ducasse's site.