Scraping useful information from the Mexican Senate
In August 2017, the Mexican Senate received an award for fulfilling its transparency obligations. This mostly has to do with the fact that a great deal of information is published every day on the Senate’s website.
Well, yes… the information is massive, but it’s nearly impossible to handle for most users: the site doesn’t show any meaningful statistics, and the useful data is scattered sparsely across the website. That’s where our scraping skills come to save the day! If we collect the information in manageable formats, we can rearrange it to gain some insights.
The first thing we need to do to successfully scrape a website is to visit and study it, in order to discover what we want to extract. In this case we’re going to separate the information into a few principal topics:
- Senators
  - Name
  - State (entity)
  - Party
  - Alternate
- Attendance
  - Present
  - Absent
  - Justified Absence
  - Absent on Official Commission
- Commissions
  - The 96 commissions
- Votes
  - PRO
  - CON
  - Abstention
  - Absent
- Edicts (bills)
JuliaLang is what we’re going to use for this task, with some help from three wonderful packages:
- Requests: to download the raw HTML from a website.
- Gumbo: an HTML parser.
- Cascadia: a CSS selector library.
The pipeline we’ll follow is: download the raw page text -> parse it into HTML -> use the HTML’s tree-like structure to extract the information we need.
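As a rough illustration of that pipeline, here is a minimal sketch using the three packages. The page and the CSS selector below are placeholders, Requests.jl has since been deprecated in favor of HTTP.jl, and older Cascadia versions use matchall instead of eachmatch:

```julia
using Requests, Gumbo, Cascadia

url  = "http://www.senado.gob.mx/"                  # placeholder page, not an id-specific one
raw  = readstring(Requests.get(url))                # 1. download the raw page text
doc  = parsehtml(raw)                               # 2. parse it into an HTML tree (Gumbo)
rows = eachmatch(Selector("table tr"), doc.root)    # 3. select the nodes we care about (Cascadia)
```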
In order to use Julia’s built-in parallelization capabilities, I wrote everything as functions so that the map function would be available. You can look at all of these functions right here.
Now let’s look, for example, at how we would scrape the Senators’ info (name, entity, party and alternate), which is in this file.
The first thing to notice is the 640:767 range. These are the alternates’ ids on the website, and we’re going to use them instead of the titular Senators’ ids because there is no link to the alternates from the original Senator’s page, but there is one the other way around.
Then, since we’re using multiple workers (parallelization), we have to import every function and package into each one of them, and that’s what the @everywhere macro is for. And pmap is a parallel map that applies the senators_and_alternates function over the alternates array.
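A hedged sketch of what that driver could look like (scraper.jl is a placeholder name for the file holding the functions; on Julia 0.7+ the parallel tools live in the Distributed standard library):

```julia
using Distributed                  # only needed on Julia 0.7+; built into Base before that
addprocs(4)                        # or start Julia with `julia -p N`

@everywhere using Requests, Gumbo, Cascadia
@everywhere include("scraper.jl")  # placeholder file defining senators_and_alternates

alternates = 640:767               # the alternates' id range mentioned above
pmap(senators_and_alternates, alternates)
```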
But what’s in the senators_and_alternates function?
It prints a pipe-separated CSV to STDOUT, but it calls another function that takes the id as an argument and returns multiple fields for that particular id.
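Something along these lines, where senator_fields is a hypothetical name for that other function:

```julia
@everywhere function senators_and_alternates(id)
    # senator_fields is a hypothetical stand-in for the function that scrapes
    # one id and returns its fields (sketched in the next step).
    name, entity, party, alternate = senator_fields(id)
    println(join([name, entity, party, alternate], "|"))   # one pipe-separated CSV row to STDOUT
end
```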
Now here’s where the scraping for the Senators actually happens! But, as I said before, it uses data from the alternate’s webpage. Here we can see the use of Selector and the matching functions from the Cascadia package, taking advantage of the tree-like structure of the HTML document, as mentioned before.
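A minimal sketch of that step, assuming made-up CSS classes (the real selectors depend on the Senate page’s markup, and older Cascadia versions use matchall instead of eachmatch):

```julia
@everywhere function senator_fields(id)
    doc = get_info(id)        # parsed HTML document for this alternate's page (see get_info below)
    # Return the text of the first node matching a CSS selector, or "" if nothing matches.
    pick(css) = begin
        nodes = eachmatch(Selector(css), doc.root)
        isempty(nodes) ? "" : strip(nodeText(nodes[1]))
    end
    # The class names below are placeholders, not the site's real markup.
    name      = pick("h1.nombre")
    entity    = pick("span.entidad")
    party     = pick("span.partido")
    alternate = pick("div.suplente a")
    return name, entity, party, alternate
end
```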
Finally, let’s take a look at the get_info function.
This is a very important function because it’s the one that downloads the pages from the website. It takes a Senator’s or alternate’s id as its argument, downloads that specific page via the Requests package, and then parses it into HTML using Gumbo.
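It probably boils down to something like this; the id-specific path is a placeholder, since the real URL pattern is in the repo:

```julia
@everywhere function get_info(id)
    url = "http://www.senado.gob.mx/senator_page?id=$id"   # placeholder URL pattern, not the real one
    res = Requests.get(url)         # download that specific page
    parsehtml(readstring(res))      # parse the raw text into an HTML tree with Gumbo
end
```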
Now that we understand how this works, we need to run the scraper and tell the computer how many workers we want. For that, we can write a simple bash script.
What it does is check how many processors your computer has and use all of them to scrape the Senate! But we need to clean the output a little bit, because we don’t want to know which worker did what, and that’s what the egrep is for.
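Something in this spirit would do it (scraper.jl and senators.csv are placeholder names, and the exact egrep pattern depends on how your rows look; Julia prefixes relayed worker output with “From worker N:”, which is what gets stripped here):

```bash
#!/bin/bash
NPROCS=$(nproc)                     # how many processors this machine has
julia -p "$NPROCS" scraper.jl |
  egrep -o '[^:[:space:]][^:]*\|.*' > senators.csv   # keep only the pipe-separated data, dropping the worker prefix
```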
Well, that’s all you need to know to scrape the Mexican Senate’s website! If you want the info but don’t want to write the code, you can always use mine, which is right in this repo.