Downloading the TTD API documentation site

May 17, 2019

Recently I've been working with The Trade Desk (TTD) API, whose documentation is hosted on a password-protected web site. You get the username and password from your TTD representative.

The site is fine but very dense. In a few cases I wanted to search the entire site, and having a local copy would make that possible.

This post describes how to download the site.

Introduction

My first attempt to download the site with wget didn't work because the site doesn't support Basic Authentication. This means that passing your username and password as parameters to wget will not work.

Notice that when you log in, the site asks for the username and then prompts for the password as a separate step. This means another technique is needed.
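
For reference, the failed first attempt looked roughly like the sketch below. The function name basic_auth_fetch and the WGET override variable are my own additions for illustration; --user and --password are wget's standard options for supplying HTTP credentials. Since the site logs you in through a two-step web form rather than HTTP authentication, these credentials never get you past the login page.

```shell
# Command to run; set WGET=echo for a dry run that only prints the arguments.
WGET=${WGET:-wget}

# basic_auth_fetch USER PASSWORD URL  (hypothetical helper name)
# Ask wget to supply HTTP authentication credentials for the request.
# This is the approach that does NOT work for a form-based login.
basic_auth_fetch() {
  "$WGET" --user="$1" --password="$2" "$3"
}
```

Calling it as basic_auth_fetch myuser mypassword https://api.thetradedesk.com/v3/doc (with placeholder credentials) reproduces the attempt.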

One idea would be to build a crawler that knows how to navigate the login dialogs and then copies the site's files.

Another idea is to split the work between the browser and wget: the browser handles the login and wget handles the file copying. We log in using the browser so that the site sets its cookies there. Then we borrow the cookie values and have wget use them to copy the files.

Technique

  • Install the 'Export Cookie' add-in for Firefox
  • Log in to the site manually with Firefox
  • Export the cookies for the site using the add-in
  • Use wget to download the site using the cookies file

The Export Cookies Add-In is available here:

Install this into your Firefox browser.

Log in to the site manually

Next, log in to the site manually using Firefox. This sets the appropriate cookie values within the browser.

The site's URL is:

https://api.thetradedesk.com/v3/doc

Export the cookies for the site to cookies.txt. Find the 'Export Cookie' add-in's icon in the Firefox toolbar and click it, then export the cookies for the site's domain, api.thetradedesk.com.

Lastly, here is the wget command that copies the site, using the cookies.txt file to gain access to the files.
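
Before running wget it can be worth checking that the export actually contains cookies for the site. The sketch below assumes the add-in writes the usual Netscape cookies.txt layout (seven tab-separated fields with the domain in the first column), which is the format wget's --load-cookies expects. The function name count_cookies is made up for this example.

```shell
# count_cookies FILE DOMAIN  (hypothetical helper name)
# Print how many cookie entries in a Netscape-format cookies file belong to
# DOMAIN, matching both "example.com" and ".example.com" in the first column.
count_cookies() {
  awk -F '\t' -v d="$2" '
    /^#/ || NF < 7 { next }               # skip comment lines and malformed rows
    $1 == d || $1 == ("." d) { n++ }      # domain column, with or without leading dot
    END { print n + 0 }                   # print 0 rather than an empty line
  ' "$1"
}
```

Run it as count_cookies cookies.txt api.thetradedesk.com. A count of zero suggests the export was taken for the wrong domain or before logging in.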

wget --load-cookies cookies.txt --recursive --no-parent --convert-links --no-clobber --html-extension https://api.thetradedesk.com/v3/doc
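
With the copy on disk, the whole-site search that motivated this exercise becomes a one-liner. Here is a minimal sketch, assuming wget's default behaviour of saving the files under a directory named after the host (api.thetradedesk.com) and a grep that supports -r, as GNU and BSD grep do. The function name search_docs is my own.

```shell
# search_docs DIR TERM  (hypothetical helper name)
# List the files under DIR that contain TERM, ignoring case.
# -F treats TERM as a literal string rather than a regular expression.
search_docs() {
  grep -r -i -l -F -- "$2" "$1" | sort
}
```

For example, search_docs api.thetradedesk.com "campaign" lists every downloaded page that mentions campaigns.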

Tags: api ttd wget