Saturday 14 January 2012

Downloading information from 23andMe using wget

I want a backup of my 23andMe results for when my account runs out, so I decided to use wget to download the various analyses available. Note that you can download your SNP calls from 23andMe directly, but the real value the site provides is the associated interpretation.

First, let's just grab the extensive information that they already make publicly available (no account necessary):

wget -r -l 2 --include health/ https://www.23andme.com/health/all/
  • -r - makes wget recursively download pages linked to the given URL.
  • -l 2 - keeps recursion to a maximum of 2 levels, which is all we need I think.
  • --include health/ - short for --include-directories (also spelled -I); only follows links into the 'health/' directory, which is all we want in this case.
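
For anyone who finds the short flags hard to read later, here is a sketch of the same command with wget's long option names spelled out. The health_crawl function name and the DRY_RUN switch are made up for this example, and --wait=1 (pause one second between requests) is an addition that is not part of the command above. By default it only prints the command rather than running it.

```shell
#!/bin/sh
# Long-option spelling of the crawl above. Dry run by default: the command is
# only printed. Run with an empty DRY_RUN (e.g. DRY_RUN= sh crawl.sh, where
# crawl.sh is a hypothetical file name) to actually download.

health_crawl() {
    # --recursive = -r, --level=2 = -l 2, --include-directories = -I
    # --wait=1 is an assumption added here to go easy on the server.
    ${DRY_RUN-echo} wget \
        --recursive \
        --level=2 \
        --include-directories=health/ \
        --wait=1 \
        https://www.23andme.com/health/all/
}

health_crawl
```

Printing first is a cheap way to double-check the options before letting a recursive download loose.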

Recursive wget can be 'dangerous' because, left unrestricted, it follows every link it finds and essentially starts downloading the whole internet. Here we just grab pages that are at most two links 'deep', starting from the given URL, and only links under the 'health' directory. In this way we never leave the 23andMe domain, and we only grab what we're interested in, ignoring all the other junk that makes up a web page.
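
One way to check that the crawl really stayed inside health/ is to list anything that was saved elsewhere. This is a small sketch, assuming wget wrote into its default output directory named after the host (www.23andme.com); the stray_files name is just for illustration.

```shell
#!/bin/sh
# Print every downloaded file that is NOT under <dir>/health/.
# Pass the directory without a trailing slash.
stray_files() {
    find "$1" -type f ! -path "$1/health/*"
}

# wget -r saves into a directory named after the host by default.
if [ -d www.23andme.com ]; then
    stray_files www.23andme.com
fi
```

Anything it prints is a file that landed outside health/; no output means nothing was saved elsewhere.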

The result won't be navigable (see wget's -m, i.e. --mirror, option for a more 'robust' solution), but I don't really care; it's the information I'm after, not the layout.

Note that the resulting HTML will still look pretty in your browser, because it uses absolute links to the CSS, JS, and images on 23andMe's servers. This won't last once 23andMe takes those files down, so if you really care about the result being pretty, see wget -m.


Now to download the data in my account...
