During the last few weekends, I have needed to brush up on my web site parsing skills. The tools available have moved on nicely since my last dip into this topic.
I am currently keeping an eye on properties in Lyon, France. The process has been tedious and called out for some automation. Megan and I plan to return to France in the future and this little project should ease the burden of finding an apartment or house.
This morning I discovered the Perl module Web::Scraper. It is a port of a Ruby-based tool called scrAPI. The approach avoids regular expression matching in favour of XPath and DOM tree selector matching, both of which are more resilient ways of addressing specific sections of a web page.
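As a rough illustration of the selector approach, here is a minimal sketch that runs Web::Scraper against an inline HTML string. The markup, the class names, and the `titles` and `prices` keys are all made up for the example and are not taken from any real property site.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up listing markup (an assumption for this sketch).
# Note that the table has no tbody element in the source.
my $html = <<'HTML';
<html><body>
<table>
  <tr><td class="title">Apartment A</td><td class="price">1000</td></tr>
  <tr><td class="title">Apartment B</td><td class="price">1200</td></tr>
</table>
</body></html>
HTML

# Each process line pairs an XPath expression with a result key;
# the [] suffix collects every match into an array.
my $listings = scraper {
    process '//td[@class="title"]', 'titles[]' => 'TEXT';
    process '//td[@class="price"]', 'prices[]' => 'TEXT';
};

# scrape accepts an HTML string as well as a URI object.
my $result = $listings->scrape($html);

for my $i (0 .. $#{ $result->{titles} }) {
    print "$result->{titles}[$i]: $result->{prices}[$i]\n";
}
```

Passing a URI object to scrape instead of a string should fetch the live page, which is how it would be used against a real listing.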
I found one stumbling block that took a while to overcome. After a little trial and error, I discovered that the Firefox browser returned misleading XPaths for objects embedded in tables.
The XPaths provided by Firebug and XPather included browser-inserted tbody tags. These tags did not appear in my source web pages. Thus the browser’s XPath did not match the structure used by Web::Scraper, and caused Web::Scraper to miss the desired content.
The solution was easy: strip out the tbody tags and Web::Scraper returns to working as advertised.
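Stripping the tags by hand gets tedious once there are more than a few paths, so the fix can be sketched as a small helper. The name `strip_tbody` and the sample path are hypothetical, and the sketch assumes the source page really has no tbody elements of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every tbody step from an XPath
# that was copied out of a browser tool.
sub strip_tbody {
    my ($xpath) = @_;
    # Remove "/tbody", plus an optional positional predicate such as [1],
    # but only when it is a whole path step.
    $xpath =~ s{/tbody(?:\[\d+\])?(?=/|$)}{}g;
    return $xpath;
}

# Sample path in the style a browser tool reports (made up for this sketch).
my $browser_xpath = '/html/body/table/tbody/tr[2]/td[1]';

print strip_tbody($browser_xpath), "\n";   # prints /html/body/table/tr[2]/td[1]
```

The cleaned path can then be handed to a Web::Scraper process rule in place of the one the browser reported.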
With this problem overcome, the project is already looking helpful.