This is Fundbase Nerds, written by the team behind Fundbase.


Smart Document Downloading

When the servers are not nice

Posted by Marek Stanczyk on

We have a module that parses documents from email attachments that our partners send to us. Sometimes, however, instead of attachments, the mail text contains download links (often in a mess of HTML tags). So we parse the message body (using regular expressions defined by our operations team), collect the URLs, and retrieve the documents with OpenURI#open.
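To make the URL-collection step concrete, here is a minimal sketch. The pattern and the `extract_urls` helper are hypothetical; the real expressions are defined by the operations team and are not shown in this post.

```ruby
# Hypothetical pattern: anything that starts with http(s):// and runs until
# whitespace, a quote or an angle bracket. The real patterns differ.
URL_PATTERN = %r{https?://[^\s"'<>]+}

# Returns every distinct URL-looking substring found in the message body.
def extract_urls(body)
  body.scan(URL_PATTERN).uniq
end

# Made-up message body with one plain link and one link inside a tag.
body = 'See http://a.example/x.pdf or <a href="https://b.example/y.pdf?id=1&amp;k=2">this</a>'
puts extract_urls(body)
```

Note that the second URL still carries an HTML entity at this point, which is one of the problems described below.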

OpenURI is a nice wrapper for Net::HTTP (and other libraries) which can handle redirects, proxies and other issues. But as usual, when dealing with user input, it’s easy to hit limitations of default libraries.

SSL Issues

It’s not rare that a server has a missing or invalid SSL certificate. This should normally raise eyebrows, but since we trust our sources (at least to some extent), and we really want the document, we’ll ignore the SSL verification:

open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
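A slightly more cautious variation, which is not what the line above does, is to attempt the download with verification first and drop it only when the certificate check fails. The sketch below assumes a hypothetical `open_document` helper; the `opener` argument is injectable only so the logic can be tried without network access, and it defaults to `URI.open` (Ruby 2.5+).

```ruby
require 'open-uri'
require 'openssl'

# Hypothetical helper: try with normal verification, and retry without it
# only when the certificate check raises an SSL error.
def open_document(url, opener: URI.method(:open))
  opener.call(url)
rescue OpenSSL::SSL::SSLError
  opener.call(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
end
```

Called as `open_document(url)`, it behaves like a plain `URI.open(url)` for servers with a valid certificate.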

HTML Garbage

The message body that we parse is often HTML source, so we can end up with a URL that contains HTML entities, e.g.

Clearly, this is a "&" encoded as "&amp;". We don’t want to reinvent the wheel, so let’s use the htmlentities gem and decode the URL first:

require 'htmlentities'
url = HTMLEntities.new.decode(url)

results in
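A small runnable illustration of the decoding step, using a made-up URL. To keep it free of dependencies, it uses the standard library's `CGI.unescapeHTML` as a stand-in for the htmlentities gem; for the basic `&amp;` case the two should behave the same.

```ruby
require 'cgi'

# Made-up URL as it might appear inside an HTML message body.
raw = 'http://example.com/download?doc=42&amp;token=abc'

# CGI.unescapeHTML (stdlib) decodes the basic entities; it stands in here
# for HTMLEntities#decode from the htmlentities gem.
url = CGI.unescapeHTML(raw)
puts url
```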

Redirection issues

With a few easy fixes under our belt, we hope for the best, but it doesn’t take long for new issues to be reported. The original URL is often a redirect to a malformed URL, e.g. one that contains spaces: Report.pdf

At this point, the convenience of OpenURI turns against us, because its redirect handling is buried deep in a long method, with no easy way to override it. It seems we need to roll our own redirect-handling loop using Net::HTTP and fix the URLs with URI.encode, for example:

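require 'net/http'
require 'openssl'

# Perform a single GET request without following redirects.
def do_request(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  if uri.is_a?(URI::HTTPS)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE # we trust the source, see above
  end
  http.request(Net::HTTP::Get.new(uri.request_uri))
end

# Follow redirects by hand, escaping each URL before requesting it.
# Note: URI.encode is only available in Ruby < 3.0.
response = nil
loop do
  url = URI.encode(url)
  response = do_request(URI.parse(url))
  break unless [301, 302, 303, 307].include?(response.code.to_i) # redirect?
  url = response['location']
end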

Nice, malformed URLs fixed: Report.pdf


Except that… if the URL was already properly URI-escaped, then we end up with
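The effect is easy to reproduce with a made-up URL. Since `URI.encode` was removed in Ruby 3.0, this sketch calls `URI::RFC2396_Parser#escape`, which applies the same escaping rules; the variable names are only for illustration.

```ruby
require 'uri'

# URI.encode used to delegate to this parser's escape method.
parser = URI::RFC2396_Parser.new

raw   = 'http://example.com/docs/Annual Report.pdf' # made-up URL with a space
once  = parser.escape(raw)   # the space becomes %20
twice = parser.escape(once)  # the % of %20 is escaped again, giving %2520

puts once
puts twice
```

The second call treats the percent sign as just another unsafe character, so an already-escaped URL ends up pointing at a different (and usually nonexistent) resource.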

Oops, now what? One approach is to check whether we need to escape at all: decode the URL, compare it to the original, and if they’re the same, it wasn’t encoded yet. However, this won’t work with partially encoded URLs (which sounds like a very strange case, but since URLs are often assembled from separate parts, at least the base and the document name, I’m sure it would happen sooner or later). So let’s try something different, decoding first and then encoding again:

url = URI.encode(URI.decode(url))
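To see why decode-then-encode copes with all three situations, here is a sketch with made-up URLs. `normalize_url` and `URL_PARSER` are hypothetical names, and `URI::RFC2396_Parser` is used because `URI.encode` and `URI.decode` are not available in Ruby 3.0 and later.

```ruby
require 'uri'

URL_PARSER = URI::RFC2396_Parser.new

# Decode first so that any existing %XX sequences collapse to raw characters,
# then encode once. The result is the same whether the input was raw,
# fully escaped or only partially escaped.
def normalize_url(url)
  URL_PARSER.escape(URL_PARSER.unescape(url))
end

[
  'http://example.com/docs/Annual Report 2015.pdf',     # not escaped
  'http://example.com/docs/Annual%20Report%202015.pdf', # fully escaped
  'http://example.com/docs/Annual%20Report 2015.pdf'    # partially escaped
].each { |u| puts normalize_url(u) }
```

All three inputs come out as the same fully escaped URL, and applying `normalize_url` to its own output changes nothing.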

And this seems to work nicely. Until we encounter some new cases… (to be continued)

Marek Stanczyk

Marek is a Fullstack developer at Fundbase.
Loves beautiful code and enjoys developing with Ruby and Rails the most.