ASP.NET - C#: Crawler project
Could I get some easy-to-follow code examples on the following:
- Use a browser control to launch a request to a target website.
- Capture the response from the target website.
- Convert the response into a DOM object.
- Iterate through the DOM object and capture things like "FirstName", "LastName", etc. if they are part of the response.
Thanks
Here is code that uses a WebRequest object to retrieve data and captures the response stream.
    // Requires: System.IO, System.Net, System.Net.Security,
    //           System.Security.Cryptography.X509Certificates
    public static Stream GetExternalData( string url, string postData, int timeout )
    {
        ServicePointManager.ServerCertificateValidationCallback += delegate( object sender,
            X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors )
        {
            // if you trust the callee implicitly, return true...
            // otherwise, perform your own validation logic here
            return sslPolicyErrors == SslPolicyErrors.None;
        };

        WebRequest request = null;
        HttpWebResponse response = null;

        try
        {
            request = WebRequest.Create( url );
            request.Timeout = timeout; // force a quick timeout

            if( postData != null )
            {
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = postData.Length;

                // disposing the writer also closes the request stream
                using( StreamWriter requestStream = new StreamWriter(
                    request.GetRequestStream(), System.Text.Encoding.ASCII ) )
                {
                    requestStream.Write( postData );
                }
            }

            response = (HttpWebResponse)request.GetResponse();
        }
        catch( WebException ex )
        {
            Log.LogException( ex );
        }
        finally
        {
            request = null;
        }

        if( response == null || response.StatusCode != HttpStatusCode.OK )
        {
            if( response != null )
            {
                response.Close();
                response = null;
            }
            return null;
        }

        // the caller is responsible for reading and disposing this stream
        return response.GetResponseStream();
    }
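The method above hands back the raw response stream, so the caller still has to read and dispose it. A minimal sketch of that step, assuming the body is UTF-8 text; ResponseReader and ReadAllText are hypothetical names used for illustration and are not part of the original code:

```csharp
using System;
using System.IO;
using System.Text;

public static class ResponseReader
{
    // Hypothetical helper: reads the whole stream as UTF-8 text.
    // Disposing the reader also disposes the underlying stream.
    public static string ReadAllText( Stream stream )
    {
        // GetExternalData returns null on failure, so pass that through
        if( stream == null )
        {
            return null;
        }

        using( StreamReader reader = new StreamReader( stream, Encoding.UTF8 ) )
        {
            return reader.ReadToEnd();
        }
    }
}
```

It could then be called as ResponseReader.ReadAllText( GetExternalData( url, null, 5000 ) ), where the 5000 ms timeout is only an example value.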
For managing the response, I have a custom XHTML parser that I use, but it is thousands of lines of code. There are several publicly available parsers (see Darin's comment).
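For the "convert to a DOM and iterate" part of the question, here is a rough sketch that uses only the framework's System.Xml.Linq classes. It assumes the response is well-formed XHTML and that the values of interest sit in input elements with a name attribute; both are assumptions, and FieldExtractor and ExtractInputs are hypothetical names. Real-world HTML is often not well-formed, in which case one of the public parsers is likely the better fit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FieldExtractor
{
    // Returns name -> value for every <input> element whose name
    // attribute matches one of the wanted names (case-insensitive).
    public static Dictionary<string, string> ExtractInputs( string xhtml, params string[] wanted )
    {
        // Throws XmlException if the markup is not well-formed XML
        XDocument doc = XDocument.Parse( xhtml );
        var result = new Dictionary<string, string>( StringComparer.OrdinalIgnoreCase );

        foreach( XElement el in doc.Descendants() )
        {
            // LocalName ignores the XHTML namespace, if one is declared
            if( el.Name.LocalName != "input" )
            {
                continue;
            }

            string name = (string)el.Attribute( "name" );
            if( name != null && wanted.Contains( name, StringComparer.OrdinalIgnoreCase ) )
            {
                // a missing value attribute is stored as an empty string
                result[name] = (string)el.Attribute( "value" ) ?? string.Empty;
            }
        }

        return result;
    }
}
```

A call such as FieldExtractor.ExtractInputs( xhtml, "firstname", "lastname" ) returns only the fields that are actually present in the response, so the caller can check for each key before using it.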
Edit: per the OP's question, headers can be added to the request to emulate a user agent. For example:
    // request must be declared as HttpWebRequest here, since Accept and
    // UserAgent are not available on the WebRequest base class
    request = (HttpWebRequest)WebRequest.Create( url );
    request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/x-shockwave-flash, */*";
    request.Timeout = timeout;
    request.Headers.Add( "Cookie", cookies );

    //
    // manifest as a standard user agent
    request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)";