We have a module that parses documents from email attachments that our partners send to us. Sometimes, however, instead of attachments, the mail text contains download links (often in a mess of HTML tags). So we parse the message body (using regular expressions defined by our operations team), collect the URLs, and retrieve the documents with OpenURI#open.
OpenURI is a nice wrapper for Net::HTTP (and other libraries) which handles redirects, proxies, and other issues. But as usual, when dealing with user input, it’s easy to hit the limitations of the default libraries.
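As a minimal sketch of that flow (URI.extract stands in here for the operations team’s regexes, and the message body and URL are invented):
require 'open-uri'
require 'uri'

body = 'The document: <a href="http://example.com/documents/report.pdf">link</a>'

# Stand-in for the real, ops-maintained extraction regexes
urls = URI.extract(body, %w[http https])

urls.each do |url|
  open(url) do |remote|  # OpenURI extends Kernel#open for http(s) URLs
    File.binwrite(File.basename(url), remote.read)
  end
end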
SSL Issues
It’s not rare for a server to have a missing or invalid SSL certificate. This should normally raise eyebrows, but since we trust our sources (at least to some extent), and we really want the document, we’ll simply skip the SSL verification:
open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
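If we’d rather drop verification only when it actually fails, a rescue-based fallback is an option (a sketch, not necessarily what belongs in production):
require 'open-uri'
require 'openssl'

def fetch(url)
  open(url)
rescue OpenSSL::SSL::SSLError
  # Missing/invalid certificate: retry without verification,
  # since we trust the source (to some extent)
  open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
end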
HTML Garbage
The message body that we parse is often an HTML source, and we can end up with a URL that contains HTML entities, e.g.
http://example.com/document?id=1&amp;foo=bar
Clearly, this is a "&" encoded to "&amp;". We don’t want to reinvent the wheel, so let’s use the htmlentities gem and decode the URL first:
require 'htmlentities'

url = HTMLEntities.new.decode(url)
results in
http://example.com/document?id=1&foo=bar
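The gem also handles numeric entities, so the same ampersand decodes correctly in either form (a made-up example):
require 'htmlentities'

coder = HTMLEntities.new
coder.decode('http://example.com/document?id=1&amp;foo=bar') # => "http://example.com/document?id=1&foo=bar"
coder.decode('http://example.com/document?id=1&#38;foo=bar') # => "http://example.com/document?id=1&foo=bar"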
Redirection Issues
With a few easy fixes under our belt, we hope for the best, but it doesn’t take long for new issues to be reported. The original URL is often a redirect to a malformed URL, e.g. one that contains spaces:
http://example.com/documents/Monthly Report.pdf
At this point, the advantage of using OpenURI turned against us, because the handling of redirections is buried deep in a long method, with no easy way to override it. It seems we need to roll our own redirect-handling loop using Net::HTTP and fix the URLs with URI.encode, such as:
require 'net/http'
require 'openssl'

def do_request(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  if uri.is_a?(URI::HTTPS)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE # same trade-off as before
  end
  http.request_get(uri.request_uri)
end
loop do
  url = URI.encode(url)
  response = do_request(URI.parse(url))
  break unless [301, 302, 303, 307].include?(response.code.to_i) # redirect?
  url = response.header['location']
end
Nice, malformed URLs fixed:
http://example.com/documents/Monthly Report.pdf
becomes
http://example.com/documents/Monthly%20Report.pdf
Except that… if the URL was already properly URI-escaped, then we end up with
http://example.com/documents/Monthly%2520Report.pdf
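The culprit: URI.encode has no idea the string is already escaped, so it escapes the "%" itself:
URI.encode('Monthly Report.pdf')   # => "Monthly%20Report.pdf"
URI.encode('Monthly%20Report.pdf') # => "Monthly%2520Report.pdf" ("%" becomes "%25")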
Oops. Now what? There’s one approach to check whether we need to escape or not: decode the URL, compare it to the original, and if they’re the same, then it wasn’t encoded yet. However, this won’t work with partially encoded URLs (which sounds like a very strange case, but assuming the URLs are compiled from different parts, at least the base and the document name, I’m sure it would happen sooner or later). So let’s try something different - decode and encode again:
url = URI.encode(URI.decode(url))
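Both the raw and the already-escaped variants now converge to the same result:
URI.encode(URI.decode('http://example.com/documents/Monthly Report.pdf'))
# => "http://example.com/documents/Monthly%20Report.pdf"
URI.encode(URI.decode('http://example.com/documents/Monthly%20Report.pdf'))
# => "http://example.com/documents/Monthly%20Report.pdf"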
And this seems to work nicely. Until we encounter some new cases… (to be continued)