Once part 1 of TokenWatch was done, the next step suggested itself when I saw all the details on each entry's interior page. There I was most interested in the links to the whitepapers, so I could collect them and read through them more easily as a group.
For this part of the project I created another script, tokenwatch_details.py, which extends the previous tokenwatch.py script.
To start, I grab the dataframe from the tokenwatch.py script, sorted by NAME.
# Assumption: tokenwatch.py from part 1 is importable as a module
import tokenwatch as t

df = t.process().sort_values(['NAME'])
Each row in the dataframe has a link to the entry's details page. The gist is to fetch that page and parse out the details I need. Inspecting a page shows that all of its tables carry the class table-asset-data, and the last one on the page is the most interesting. The following grabs that table.
import requests
from bs4 import BeautifulSoup

# url is the entry's details-page link taken from the dataframe row
html = requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table", {"class": "table-asset-data"})
# The last table on the page holds the links we want
table = tables[-1]
For convenience I gather the table's data into a dictionary; the snippet below wraps this in a small helper (the function name is mine, since the original snippet only shows the body).
def get_details(table):
    # Helper name assumed; maps each cell's label to its first link (or '-')
    details = {}
    for td in table.find_all('td'):
        # Key on the first word of the cell's label, lowercased
        key = td.text.strip().split(' ')[0].lower()
        vals = td.find_all('a')
        if vals:
            value = vals[0]['href']
        else:
            value = '-'
        details[key] = value
    return details
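Since each key is just the first word of a cell's label, lowercased, the resulting dictionary for a typical entry might look something like this (illustrative values only, not scraped data):

# Illustrative shape only; the actual keys depend on the site's table labels
details = {
    'website': 'https://example-token.org',
    'whitepaper': 'https://example-token.org/whitepaper.pdf',
    'explorer': '-',
}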
If available, there will be a link to a whitepaper. Most are links to PDF files, so I defend against errors by checking the content type and the URL before downloading.
def get_whitepaper(name, details):
    try:
        whitepaper_link = details['whitepaper']
        if whitepaper_link != '-':
            # Only download if the link looks like a PDF
            print(whitepaper_link)
            head = requests.head(whitepaper_link, headers={'User-agent': 'Mozilla/5.0'})
            # Some servers don't return the application/pdf type properly,
            # so double-check the URL as well
            if head.headers.get('Content-Type') == 'application/pdf' or head.url.find(".pdf") > 0:
                whitepaper_filename = get_dir(name) + "/" + name + "-whitepaper.pdf"
                download_file(whitepaper_filename, whitepaper_link)
                print(whitepaper_filename)
            else:
                print("Unknown whitepaper type: " + whitepaper_link)
        else:
            print("Unavailable whitepaper for " + name)
    except KeyError:
        print("No whitepaper link in dictionary")
See the complete project in the GitHub repo listed below.