Daily newspaper download with curl

Today I was reading about the election of Stefano Zacchiroli as Project Leader of the Debian project. Since his surname sounded pretty Italian, I did some research and happily found that he really is Italian and that he studied at the University of Bologna just like me (though he graduated a few years earlier). I never met him in person, even though he was a researcher there in the same years I was attending. I’m very happy for him and proud that an Italian has reached such a position.

Reading his blog, I then found something I should have written about months ago, when the newspaper Il Fatto began publication. Just like Stefano, I wrote a script to download the newspaper on a daily basis. I’m making it public so that anyone who has subscribed to Il Fatto can use it. You can download it from here.

Create a file called .ilfattorc in your home directory with your credentials:

username="USERNAME"
password="PASSWORD"

Replace USERNAME and PASSWORD with your own, of course.
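Since the file uses plain shell variable syntax, a script can load it by sourcing it. The following is only a minimal sketch of that idea, not necessarily how the real script reads the file; the function name load_credentials is made up for illustration.

```shell
# Hypothetical helper: load username and password from a credentials file
# written in shell variable syntax (defaults to ~/.ilfattorc).
load_credentials() {
  rc_file="${1:-$HOME/.ilfattorc}"
  if [ ! -r "$rc_file" ]; then
    echo "Cannot read $rc_file" >&2
    return 1
  fi
  # Sourcing the file defines the variables "username" and "password".
  . "$rc_file"
  if [ -z "${username:-}" ] || [ -z "${password:-}" ]; then
    echo "username or password missing in $rc_file" >&2
    return 1
  fi
}
```

Because the file contains a password in clear text, it is a good idea to make it readable only by you, for example with chmod 600 ~/.ilfattorc.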

The script is made of two files, one written in Bash and the other in Ruby. Save them in the folder where you want the PDFs to be downloaded. The Bash part uses the curl tool for the HTTP requests. The Ruby part calculates the list of dates from a given date up to the current one and prints them in the required format. In short, the script downloads the PDF of the newspaper for every day from the date of the last downloaded PDF up to the current date.
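To give an idea of the date calculation, here is a rough shell equivalent. It is a sketch only: the real script does this in Ruby, the function name list_dates is invented, the output format expected by the site is not shown here (so the format is left as a parameter), and the -d option requires GNU date.

```shell
# Print every date from $1 to $2 inclusive, one per line.
# $1 and $2 are YYYY-MM-DD; $2 defaults to today.
# $3 is an optional strftime format (assumption: defaults to YYYY-MM-DD).
list_dates() {
  local day last format
  day="$1"
  last="${2:-$(date +%Y-%m-%d)}"
  format="${3:-%Y-%m-%d}"
  # Normalise and validate both inputs; GNU date fails on malformed dates.
  day=$(date -u -d "$day" +%Y-%m-%d) || return 1
  last=$(date -u -d "$last" +%Y-%m-%d) || return 1
  # Compare the dates as integers (YYYYMMDD) to decide when to stop.
  while [ "$(printf '%s' "$day" | tr -d '-')" -le "$(printf '%s' "$last" | tr -d '-')" ]; do
    date -u -d "$day" "+$format" || return 1
    # Working in UTC avoids surprises around daylight saving changes.
    day=$(date -u -d "$day +1 day" +%Y-%m-%d) || return 1
  done
}
```

For example, list_dates 2010-04-28 2010-05-01 prints 2010-04-28, 2010-04-29, 2010-04-30 and 2010-05-01, one per line.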

The general steps to authenticate against the web server with curl are the following:

  1. get the login page and save cookies
  2. use the saved cookies to submit username and password along with other login parameters

Once you are authenticated, that is, once you have all the necessary cookies, you simply have to send a request to the download URL and save the output.
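The steps above can be sketched with curl's cookie jar options. This is a generic illustration, not the code used for Il Fatto: the URLs, the form field names, the date query parameter and the function names are all hypothetical placeholders, and a real site usually needs extra parameters.

```shell
# All URLs and form field names below are hypothetical placeholders.
LOGIN_URL="https://example.com/login"
DOWNLOAD_URL="https://example.com/download"
COOKIE_JAR="${COOKIE_JAR:-./cookies.txt}"

# Step 1: get the login page; -c saves the received cookies to the jar.
fetch_login_page() {
  curl --silent --fail --location -c "$COOKIE_JAR" -o /dev/null "$LOGIN_URL"
}

# Step 2: submit the credentials, sending the saved cookies (-b)
# and storing any new ones (-c). $1 is the username, $2 the password.
submit_login() {
  curl --silent --fail --location -b "$COOKIE_JAR" -c "$COOKIE_JAR" \
       --data-urlencode "username=$1" \
       --data-urlencode "password=$2" \
       -o /dev/null "$LOGIN_URL"
}

# Authenticated request: send the cookies and save the response body.
# $1 is the date of the issue, $2 the output file.
download_issue() {
  curl --silent --fail --location -b "$COOKIE_JAR" -o "$2" "$DOWNLOAD_URL?date=$1"
}
```

A run would then look like fetch_login_page && submit_login "$username" "$password" && download_issue 20100416 ilfatto.pdf. Keep in mind that credentials passed on the command line can be visible to other users in the process list.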

How these steps are implemented is very specific to each case, and I suggest reading the source code to understand them in the case of Il Fatto. If you’re trying to do something similar for other services, I suggest you first work out how the whole procedure works, paying particular attention to cookies, redirects and the parameters submitted in POST calls. To do this I would consider using the Firebug and Firecookie plugins for Firefox.
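As a complement to the browser plugins, curl itself can show which cookies a server sets and where it redirects. The snippet below is just a small sketch; the function name is invented and the URL you pass is up to you.

```shell
# Print the status lines, Set-Cookie headers and redirect targets
# returned for a request. Any arguments are passed on to curl.
show_cookies_and_redirects() {
  # --dump-header - writes the response headers to standard output,
  # while -o /dev/null discards the body. With --location the headers
  # of every response in the redirect chain are printed.
  curl --silent --location --dump-header - -o /dev/null "$@" |
    grep -i -E '^(HTTP/|set-cookie:|location:)'
}
```

For example, show_cookies_and_redirects https://example.com/login lists each response in the redirect chain together with the cookies it sets.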

If you are as lazy as me and want the newspaper to be downloaded every day automatically, then configure Anacron.
Edit your user crontab (with crontab -e) and enter this content (adjust paths according to your environment):

# m h  dom mon dow   command
25 * * * * /usr/sbin/anacron -t /home/fabio/.anacrontab -S /home/fabio/.anacronspool

This will run anacron at the 25th minute of every hour.
Then create the .anacrontab file and the .anacronspool directory under your home folder. The content of .anacrontab will be something like this (adjust paths according to your environment):

1   0   ilfatto.daily   /home/fabio/Desktop/ilfatto/download.sh

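This entry ensures that the download script is called only once per day: the first field is the period in days, the second the delay in minutes, the third the job identifier and the fourth the command.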
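If you prefer to set this up from the command line, something like the following sketch should do. The function name setup_anacron is invented for illustration, and the script path is whatever applies to your environment.

```shell
# Create the anacron spool directory and a one-line anacrontab.
# $1 is the home directory, $2 the full path of the download script.
setup_anacron() {
  local home_dir script_path
  home_dir="$1"
  script_path="$2"
  mkdir -p "$home_dir/.anacronspool" || return 1
  # Never overwrite an anacrontab that is already there.
  if [ -e "$home_dir/.anacrontab" ]; then
    echo "$home_dir/.anacrontab already exists, leaving it untouched" >&2
    return 0
  fi
  # Fields: period in days, delay in minutes, job identifier, command.
  printf '1\t0\tilfatto.daily\t%s\n' "$script_path" > "$home_dir/.anacrontab"
}
```

You would call it as setup_anacron "$HOME" "$HOME/Desktop/ilfatto/download.sh". On systems where anacron supports the -T option, running /usr/sbin/anacron -T -t ~/.anacrontab checks the file for syntax errors.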

Enjoy your reading, and good luck to Stefano Zacchiroli.

Update

The scripts have been updated to work with the new site of Il Fatto Quotidiano.