2011-06-16

How to cURL an ASP.NET UpdatePanel

Let's say you're writing a web spider/crawler/robot to "borrow" some data from a site. But what if the site is using Microsoft AJAX to present the entire page? This means the page initially requested is essentially blank because JavaScript fills in the data after the page loads. In a web browser, this may not even be noticeable, but it tends to be a problem for a spider because spiders usually don't understand JavaScript. The site's developers must have built it this way because (1) they're overexuberant newbies who like to use the UpdatePanel simply because it's the current fad, regardless of whether it's appropriate, or (2) they thought this tactic would prevent a web spider from collecting their data. Either motive was a mistake, I think.

But then how can a spider get the AJAX'd portion of a page without having a JavaScript engine? We simply need to spoof the AJAX request, and the easiest way to learn what a normal AJAX request looks like is to observe one. I like to use an HTTP debugging proxy called Fiddler for this purpose, but there are plenty of other tools that would do the job, such as Firebug, or even a low-level packet sniffer. While you're logging traffic, access the data in a normal web browser. Among other things, you should capture the initial HTML page request & response, followed eventually by the AJAX POST & response. The body of the AJAX response should start with several pipe-delimited numbers and identifiers (lengths, block types, and control IDs), followed by an HTML snippet containing your data.
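To make that response format concrete, here is a minimal sketch of a parser for it. It is written in Python purely as a stand-in for whatever your spider uses, and it assumes the layout I have seen in such captures: repeated `length|type|id|content|` blocks, where `length` counts the characters in `content`. The function name `parse_delta` and the sample control name `UpdatePanel1` are made up for illustration; check the layout against your own captured response before relying on it.

```python
def parse_delta(body):
    """Split a partial-postback response into (type, id, content) tuples.

    Assumed layout: length|type|id|content| repeated. The content can
    contain pipes of its own, so it is sliced out by its declared length
    instead of being split on '|'.
    """
    blocks = []
    pos = 0
    # Keep going while there is at least one more delimiter ahead.
    while body.find("|", pos) != -1:
        header = []
        for _ in range(3):  # read the length, type, and id fields
            end = body.index("|", pos)
            header.append(body[pos:end])
            pos = end + 1
        length = int(header[0])
        content = body[pos:pos + length]
        pos += length + 1  # skip the content and its trailing pipe
        blocks.append((header[1], header[2], content))
    return blocks


# Hypothetical response: a version block followed by one panel update.
sample = "1|#||4|10|updatePanel|UpdatePanel1|<p>a|b</p>|"
for block_type, block_id, content in parse_delta(sample):
    print(block_type, block_id, content)
```

If the server answers with a full HTML page instead (an error page, for example), the first field will not be a number and `int()` will raise a `ValueError`, which is a handy signal that the spoofed request was not accepted as an AJAX request.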

Now that we know what all the HTTP headers in the AJAX POST are supposed to look like, we just need to mimic them in our own request. I don't know or care exactly which headers are required, because copying all of them works. It doesn't matter whether you're using PHP + cURL or the System.Net.WebRequest classes; it can be done in either environment. I should point out that, due to a bug in .NET Framework 4, many ASP.NET pages will not fully render if you don't send a User-Agent that Microsoft likes. So to avoid unnecessary extra problems, be sure to mimic the User-Agent of the browser that successfully got you the data.
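As a rough illustration of copying the captured headers, here is a sketch using Python's standard `urllib`. The header values below are placeholders: `X-MicrosoftAjax: Delta=true` is the header I would expect to see in a Microsoft AJAX partial postback, but the full list, the User-Agent string, and the function name `build_ajax_request` are assumptions, so substitute whatever your own sniffer recorded.

```python
import urllib.request

# Placeholder value: paste the User-Agent from the browser you captured.
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0"


def build_ajax_request(url, form_body, user_agent=BROWSER_UA):
    """Build (but do not send) a POST that mimics the captured AJAX request."""
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "X-MicrosoftAjax": "Delta=true",  # marks the request as a partial postback
        "Cache-Control": "no-cache",
        "Referer": url,
    }
    return urllib.request.Request(
        url,
        data=form_body.encode("utf-8"),  # form_body must already be URL-encoded
        headers=headers,
        method="POST",
    )


# Hypothetical URL and body, only to show the call.
req = build_ajax_request("http://example.com/page.aspx", "__ASYNCPOST=true")
print(req.get_method(), req.full_url)
```

Note that `urllib` rewrites the capitalization of header names when storing them (for example `X-microsoftajax`). Header names are case-insensitive in HTTP, so this should not matter, but it will show up if you diff your request against the browser's.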

You'll probably need to have a session cookie in your AJAX request, so your spider will have to get the base HTML page first, save the cookie, then re-present it in the AJAX request (with cURL, CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE take care of this). Since your very first request won't have a session, ASP.NET may redirect you once as part of the cookie delivery procedure before serving the base HTML page. This means you may need to tell cURL to follow redirects (CURLOPT_FOLLOWLOCATION). But your spider probably does that already, right?
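The cookie-and-redirect handling might look like this in Python's standard library, which plays the role that the cURL cookie and redirect options play in PHP. The names `make_session` and `fetch_base_page` are hypothetical helpers of mine, not part of any library.

```python
import http.cookiejar
import urllib.request


def make_session():
    """Return an opener that remembers cookies between requests.

    build_opener() also installs urllib's default redirect handler, so a
    redirect issued while the session cookie is delivered gets followed.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_base_page(opener, url, user_agent):
    """GET the base HTML page; any cookies it sets land in the opener's jar."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(req, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

The important part is to send the later AJAX POST through the same opener, so that the session cookie picked up by the first request is re-presented automatically.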

So far, this has all been standard spider spoofery. But for Microsoft AJAX, there is one more thing we have to do: the AJAX POST needs to submit a collection of form fields from the base HTML page. Most of these are the names of ASP.NET controls. If you are only targeting one page, then you could copy & paste most of these from your HTTP sniffer because they are unlikely to change. However, a couple of fields in particular change with every request and need to be copied from the initial HTML response each time: ViewState and EventValidation, which appear in the page as hidden inputs named __VIEWSTATE and __EVENTVALIDATION. These are both long Base64-encoded strings that you simply need to parse out and re-present in your POSTed form fields. Remember to URL-encode them when you do, because Base64 uses characters such as + and / that would otherwise be mangled in a form post. There is no need to decode them (although it might be fun). If you wish to write code that could handle many different ASP.NET pages, then you might need to use some RegEx to pull all the control names dynamically and then build the whole form fields string from scratch.
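Here is a sketch of that step in Python: take the fields copied from the sniffer, overwrite the two volatile ones with fresh values from the base page, and URL-encode the result. I used the standard HTML parser instead of RegEx, which is a substitution on my part. The helper names and the sample control names (`ScriptManager1`, `UpdatePanel1`, `Button1`) are invented for illustration; yours come from your capture.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Fields that change on every page load and must be re-read each time.
VOLATILE_FIELDS = ("__VIEWSTATE", "__EVENTVALIDATION")


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_form_body(base_html, captured_fields):
    """Merge fresh volatile fields into the captured ones and URL-encode."""
    fields = dict(captured_fields)
    fresh = extract_hidden_fields(base_html)
    for name in VOLATILE_FIELDS:
        if name in fresh:
            fields[name] = fresh[name]
    return urlencode(fields)  # escapes '+', '/', '=' and '|' for us


# Hypothetical page fragment and captured fields, only to show the call.
page = '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEP+data==" />'
captured = {"ScriptManager1": "UpdatePanel1|Button1", "__VIEWSTATE": "stale"}
print(build_form_body(page, captured))
```

Only the two volatile fields are refreshed here, which matches the single-page case. A spider meant for many different pages would instead start from everything `extract_hidden_fields` returns and add the control names on top.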

I realize that every situation will be a little different, so these steps may not get you all the way there. If all else fails, you should save the requests & responses logged by your sniffer and use a diff tool to compare your attempts with the successful requests made by the browser. This may be the only practical way to see what else you're missing.
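If you would rather script the comparison than use a GUI diff tool, Python's standard `difflib` can do it. This sketch assumes you have saved each request as plain text (request line, headers, body); the function name `diff_requests` and the sample dumps are made up.

```python
import difflib


def diff_requests(browser_dump, spider_dump):
    """Return unified-diff lines between two saved HTTP request dumps."""
    return list(difflib.unified_diff(
        browser_dump.splitlines(),
        spider_dump.splitlines(),
        fromfile="browser",
        tofile="spider",
        lineterm="",
    ))


# Hypothetical dumps: the spider forgot one header.
browser = "POST /page.aspx HTTP/1.1\nUser-Agent: X\nX-MicrosoftAjax: Delta=true"
spider = "POST /page.aspx HTTP/1.1\nUser-Agent: X"
for line in diff_requests(browser, spider):
    print(line)
```

Lines starting with `-` are things the browser sent that your spider did not, which is usually where the missing piece is hiding.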

"The fine details are left as an exercise for the reader."

BradleyMacomber.com