Scraping useful information from the Mexican Senate
In August 2017, the Mexican Senate received an award for fulfilling its transparency obligations. This mostly has to do with the fact that a great deal of information is published every day on the Senate’s website.
Well, yes… the information is massive, but it’s nearly impossible to handle for most users: the site doesn’t show any meaningful statistics, and the useful data is scattered sparsely across the website. That’s where our scraping skills come to save the day! If we collect the information in manageable formats, we can rearrange it to gain some insights.
The first thing we need to do to successfully scrape a website is to visit and study it, in order to discover what we want to extract. In this case we’re going to separate the information into a few principal topics:
- Senators
  - Name
  - State (entity)
  - Party
  - Alternate
- Attendance
  - Present
  - Absent
  - Justified Absence
  - Absent on Official Commission
- Commissions
  - The 96 commissions
- Votes
  - PRO
  - CON
  - Abstention
  - Absent
- Edicts (bills)
JuliaLang is what we’re going to use for this task, with some help from three wonderful packages:
- Requests: to download the raw HTML from a website.
- Gumbo: an HTML parser.
- Cascadia: a CSS selector library.
The pipeline we’ll follow is: download the raw page text -> parse it into HTML -> use the HTML’s tree-like structure to extract the information we need.
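As a rough illustration of that pipeline, here is a minimal sketch using the three packages. The page and the CSS selector below are placeholders, Requests.jl has since been deprecated in favor of HTTP.jl, and older Cascadia versions use matchall instead of eachmatch:

```julia
using Requests, Gumbo, Cascadia

url  = "http://www.senado.gob.mx/"                  # placeholder page, not an id-specific one
raw  = readstring(Requests.get(url))                # 1. download the raw page text
doc  = parsehtml(raw)                               # 2. parse it into an HTML tree (Gumbo)
rows = eachmatch(Selector("table tr"), doc.root)    # 3. select the nodes we care about (Cascadia)
```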
In order to use Julia’s built-in parallelization capabilities, I wrote everything as functions so that the map function would be available. You can look at all of these functions right here.
Now let’s look, for example, at how we would scrape the Senators’ info (name, entity, party and alternate), which is in this file.
The first thing to notice is the 640:767 range. These are the alternates’ ids on the website, and we’re going to use them instead of the titular Senators’ ids because there is no link to the alternates from the original Senator’s page, but there is one the other way around.
Then, since we’re using multiple workers (parallelization), we have to import every function and package into each one of them, and that’s what the @everywhere macro is for. And pmap is a parallel map that applies the senators_and_alternates function over the alternates array.
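A hedged sketch of what that driver could look like (scraper.jl is a placeholder name for the file holding the functions; on Julia 0.7+ the parallel tools live in the Distributed standard library):

```julia
using Distributed                  # only needed on Julia 0.7+; built into Base before that
addprocs(4)                        # or start Julia with `julia -p N`

@everywhere using Requests, Gumbo, Cascadia
@everywhere include("scraper.jl")  # placeholder file defining senators_and_alternates

alternates = 640:767               # the alternates' id range mentioned above
pmap(senators_and_alternates, alternates)
```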
But what’s in the senators_and_alternates function?
It prints a pipe-separated CSV to STDOUT, but it calls another function that takes the id as an argument and returns multiple fields for that particular id.
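Something along these lines, where senator_fields is a hypothetical name for that other function:

```julia
@everywhere function senators_and_alternates(id)
    # senator_fields is a hypothetical stand-in for the function that scrapes
    # one id and returns its fields (sketched in the next step).
    name, entity, party, alternate = senator_fields(id)
    println(join([name, entity, party, alternate], "|"))   # one pipe-separated CSV row to STDOUT
end
```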
Now here’s where the scraping for the Senators actually happens! But, as I said before, it uses data from the alternate’s webpage. Here we can see the use of Selector and the matching functions from the Cascadia package, taking advantage of the tree-like structure of the HTML document, as mentioned before.
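A minimal sketch of that step, assuming made-up CSS classes (the real selectors depend on the Senate page’s markup, and older Cascadia versions use matchall instead of eachmatch):

```julia
@everywhere function senator_fields(id)
    doc = get_info(id)        # parsed HTML document for this alternate's page (see get_info below)
    # Return the text of the first node matching a CSS selector, or "" if nothing matches.
    pick(css) = begin
        nodes = eachmatch(Selector(css), doc.root)
        isempty(nodes) ? "" : strip(nodeText(nodes[1]))
    end
    # The class names below are placeholders, not the site's real markup.
    name      = pick("h1.nombre")
    entity    = pick("span.entidad")
    party     = pick("span.partido")
    alternate = pick("div.suplente a")
    return name, entity, party, alternate
end
```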
Finally, let’s take a look at the get_info function.
This is a very important function because it’s the one that downloads the pages from the website. It takes a Senator’s or alternate’s id as its argument, downloads that specific page via the Requests package, and then parses it into HTML using Gumbo.
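It probably boils down to something like this; the id-specific path is a placeholder, since the real URL pattern is in the repo:

```julia
@everywhere function get_info(id)
    url = "http://www.senado.gob.mx/senator_page?id=$id"   # placeholder URL pattern, not the real one
    res = Requests.get(url)         # download that specific page
    parsehtml(readstring(res))      # parse the raw text into an HTML tree with Gumbo
end
```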
Now that we understand how this works, we need to run the scraper and tell the computer how many workers we want. For that, we can write a simple bash script.
What it does is check how many processors your computer has and use all of them to scrape the Senate! But we need to clean the output a little bit, because we don’t want to know which worker did what, and that’s what the egrep is for.
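Something in this spirit would do it (scraper.jl and senators.csv are placeholder names, and the exact egrep pattern depends on how your rows look; Julia prefixes relayed worker output with “From worker N:”, which is what gets stripped here):

```bash
#!/bin/bash
NPROCS=$(nproc)                     # how many processors this machine has
julia -p "$NPROCS" scraper.jl |
  egrep -o '[^:[:space:]][^:]*\|.*' > senators.csv   # keep only the pipe-separated data, dropping the worker prefix
```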
Well, that’s all you need to know to scrape the Mexican Senate’s website! If you want the info but don’t want to write the code, you can always use mine, which is right in this repo.