HTML Snapshot for crawler - Understanding how it works
I was reading an article about this today. To be honest, I'm interested in the point "2. For content created by a server-side technology such as PHP or ASP.NET".
I want to check whether I've understood it correctly :)
I create a PHP script (gethtmlsnapshot.php) that includes the server-side AJAX page (getdata.php) and escapes the parameters (for security), and I append its output at the end of the static HTML page (index-movies.html). Is that right?
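Something like this minimal sketch is what I have in mind (just an illustration; I'm assuming getdata.php prints the HTML for the given parameters when it is included):

    <?php
    // gethtmlsnapshot.php -- rough sketch, untested
    // Take the crawler's _escaped_fragment_ parameter and escape it.
    $fragment = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '';
    parse_str($fragment, $params);                     // e.g. "page=movies&id=3" -> array
    $params = array_map('htmlspecialchars', $params);  // basic escaping

    // Render the dynamic content the same way getdata.php does for the AJAX call.
    $_GET = $params;
    ob_start();
    include 'getdata.php';
    $dynamic = ob_get_clean();

    // Append it to the static page and return the whole snapshot.
    $static = file_get_contents('index-movies.html');
    echo str_replace('</body>', $dynamic . '</body>', $static);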
Now...
1 - Where do I put gethtmlsnapshot.php? In other words, someone needs to call that page (or better, the crawler needs to). If I don't have a link to it on the main page, the crawler can't call it :o So how can the crawler call the page with the _escaped_fragment_ parameters? It can't know them if I don't specify them somewhere :)
2 - How can the crawler call the page with parameters? As before, it needs a link to the script with the parameters, so that the crawler can browse each page and save the content of the dynamic result.
Can you help me? And what do you think about this technique? Wouldn't it be better if the crawler developers handled this in their own bots in some other way? :)
Let me know what you think about it. Cheers
I think you got it slightly wrong, so I'll try to explain what's going on here, including the background and alternatives. It is indeed an important topic that most of us have stumbled upon (or at least something similar) from time to time.
Using AJAX, or rather asynchronous incremental page updating (because most pages actually don't use XML but JSON), has enriched the web and provided a great user experience.
But it has come at a price.
The main problem was that there were clients that didn't support the XMLHttpRequest object, or JavaScript at all. In the beginning you had to provide backwards compatibility. This was usually done by providing ordinary links, capturing the onclick event and firing an AJAX call instead of reloading the page (if the client supported it).
Today almost every client supports the necessary functions.
So the problem today is search engines, because they don't. Well, that's not entirely true, because they do partly (especially Google), but for other purposes. Google evaluates certain JavaScript code to prevent black-hat SEO (for example a link pointing somewhere while the JavaScript opens a completely different webpage, or HTML keyword code that is invisible to the client because it is removed by JavaScript, or the other way round).
But keeping it simple, it's best to think of a search engine crawler as a very basic browser with no CSS or JS support (it's the same with CSS: it is partly parsed for special reasons).
So if you have "AJAX links" on your website and the web crawler doesn't support following them using JavaScript, they don't get crawled. Or do they? The answer is that JavaScript links (like document.location = whatever) do get followed; Google is usually intelligent enough to guess the target. But AJAX calls are not made, simply because they return partial content, and no meaningful whole page can be constructed from it: the context is unknown and the unique URI doesn't represent the location of the content.
So there are basically three strategies to work around that.
- have an onclick event on links with a normal href attribute as a fallback (IMO the best option, as it solves the problem for clients as well as for search engines)
- submitting the content to the search engines via a sitemap so it gets indexed, independently of the site links (usually such pages provide permalink URLs so that external pages can link to them for PageRank)
- the AJAX crawling scheme
The idea is to have your JavaScript XMLHttpRequest calls entangled with corresponding href attributes that look like this: www.example.com/ajax.php#!key=value
So a link looks like this:
<a href="http://www.example.com/ajax.php#!page=imprint" onclick="handleajax()">go to imprint</a>
The function handleajax could evaluate the document.location variable and fire the incremental asynchronous page update. It's also possible to pass an id or a URL or whatever.
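A minimal sketch of what such a handler might look like (the #content container and the getdata.php endpoint are assumptions for illustration; here the handler is wired to the hashchange event so that document.location already contains the new fragment when it runs):

    <script>
    // Hypothetical sketch -- react to the #! fragment and fetch the partial
    // content asynchronously instead of reloading the page.
    function handleajax() {
        var hash = document.location.hash;              // e.g. "#!page=imprint"
        if (hash.indexOf('#!') !== 0) return;           // not an AJAX crawling scheme fragment
        var xhr = new XMLHttpRequest();
        xhr.open('GET', 'getdata.php?' + hash.substring(2), true);  // assumed partial-content endpoint
        xhr.onload = function () {
            document.getElementById('content').innerHTML = xhr.responseText;  // assumed container
        };
        xhr.send();
    }
    window.addEventListener('hashchange', handleajax);  // runs whenever a #! link changes the hash
    </script>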
The crawler recognises the AJAX crawling scheme format and automatically fetches http://www.example.com/ajax.php?_escaped_fragment_=page=imprint
instead of http://www.example.com/ajax.php#!page=imprint
Since the fragment now arrives in the query string, the server can tell which partial content is being requested. You just have to make sure that http://www.example.com/ajax.php?_escaped_fragment_=page=imprint returns the full website, exactly as it should look to the user after the XMLHttpRequest update has been made.
An even more elegant solution is to pass the relevant data to the handler function, which then fetches the same URL the crawler would have fetched via AJAX, just with an additional parameter. The server-side script then decides whether to deliver the whole page or only the partial content.
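A rough sketch of that server-side decision (the partial parameter and the render_*() helpers are made-up names, not part of the scheme):

    <?php
    // ajax.php -- hypothetical sketch of one script serving crawlers, AJAX calls and normal requests
    if (isset($_GET['_escaped_fragment_'])) {
        // Crawler request: #!page=imprint arrived as ?_escaped_fragment_=page=imprint
        parse_str($_GET['_escaped_fragment_'], $params);
        $page = isset($params['page']) ? $params['page'] : 'home';
        echo render_full_page($page);       // full HTML snapshot, as a user would see it
    } elseif (isset($_GET['partial'])) {
        // XMLHttpRequest from the onclick handler: return only the updated fragment
        $page = isset($_GET['page']) ? $_GET['page'] : 'home';
        echo render_partial($page);
    } else {
        // Normal browser request for the initial page
        echo render_full_page('home');
    }

    // Dummy implementations so the sketch is self-contained:
    function render_partial($page) {
        return '<div id="content">Content for ' . htmlspecialchars($page) . '</div>';
    }
    function render_full_page($page) {
        return '<!DOCTYPE html><html><body>' . render_partial($page) . '</body></html>';
    }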
It's indeed a creative approach, and here comes my personal pro/con analysis:
Pro:
- partially updated pages receive a unique identifier, at which point they are fully qualified resources in the semantic web
- partially updated pages receive a unique identifier that can be presented to search engines
Con:
- it's just a fallback solution for search engines, not for clients without JavaScript
- it provides opportunities for black-hat SEO. Google surely won't adopt it fully or rank pages using this technique highly without proper verification of the content.
Conclusion:
Just using usual links with fallback legacy working href attributes and an onclick handler is the better approach, because it also provides the functionality for old browsers.
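For comparison, a minimal sketch of that fallback pattern (imprint.php, the partial parameter and the #content container are made-up names for illustration):

    <a href="imprint.php" onclick="return loadPartial('imprint.php')">go to imprint</a>

    <script>
    // Hypothetical sketch: clients with JavaScript update the page in place;
    // old browsers and crawlers simply follow the href.
    function loadPartial(url) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url + '?partial=1', true);   // ask the server for the fragment only
        xhr.onload = function () {
            document.getElementById('content').innerHTML = xhr.responseText;
        };
        xhr.send();
        return false;                                // cancel the normal navigation
    }
    </script>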
The main advantage of the AJAX crawling scheme is that partially updated websites get a unique URI, and you don't have to create duplicate content that somehow serves as the indexable and linkable counterpart.
You could argue that the AJAX crawling scheme implementation is more consistent and easier to implement. I think that's a question of your application design.