Python - How do I use regular expressions to parse HTML tags?
I was wondering how to extract the value of an HTML element using a regular expression (in Python preferably).
For example, <a href="http://google.com"> hello world! </a>
What regex would I use to extract hello world! from the above HTML?
But BeautifulSoup is a handy library for this.
    >>> from BeautifulSoup import BeautifulSoup
    >>> html = '<a href="http://google.com"> hello world! </a>'
    >>> soup = BeautifulSoup(html)
    >>> soup.a.string
    u' hello world! '
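To answer the literal question: for a single, well-formed tag like the one in the example, a regex can pull out the text. The following is only a minimal sketch, assuming Python 3 and the standard `re` module. The pattern and the `anchor_texts` name are made up for illustration, and the approach breaks on nested tags, comments, or a `>` inside an attribute value, which is why a parser is the safer route.

```python
import re

# Hypothetical pattern for illustration: captures the text between <a ...> and </a>.
# \b keeps it from matching tags such as <abbr>; the non-greedy (.*?) stops at the
# first closing tag; DOTALL lets the captured text span several lines.
ANCHOR_TEXT = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def anchor_texts(html):
    """Return the raw inner text of every <a> element the regex can find."""
    return ANCHOR_TEXT.findall(html)


html = '<a href="http://google.com"> hello world! </a>'
print(anchor_texts(html))  # [' hello world! ']
```

The surrounding whitespace is kept, just as in the BeautifulSoup result; call `.strip()` on each item if it is not wanted.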
BeautifulSoup can also, for instance, print out all the links on a page:
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Fetch the page and parse it (Python 2, BeautifulSoup 3)
    q = urllib2.urlopen('https://stackoverflow.com/questions/3884419/')
    soup = BeautifulSoup(q.read())

    # Walk every <a> tag and report its href, or its id if it has no href
    for link in soup.findAll('a'):
        if link.has_key('href'):
            print str(link.string) + " -> " + link['href']
        elif link.has_key('id'):
            print "id: " + link['id']
        else:
            print "???"
Output:
    stack exchange -> http://stackexchange.com
    log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
    careers -> http://careers.stackoverflow.com
    meta -> http://meta.stackoverflow.com
    ...
    id: flag-post-3884419
    None -> /posts/3884419/revisions
    ...
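The link-listing code above is Python 2 with BeautifulSoup 3 (`urllib2`, `print` statements). On Python 3, a similar listing can be sketched with only the standard library's `html.parser`. Note that this swaps in a different technique and is not BeautifulSoup; `LinkLister` and `list_links` are hypothetical names, and the sketch parses a hard-coded sample string rather than fetching the page.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collects (text, href) pairs for <a> tags. Hypothetical helper for illustration."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._in_a = False  # True while inside an open <a> element
        self._href = None   # href of the <a> currently open, if it has one
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            self.links.append((''.join(self._text).strip(), self._href))
            self._in_a = False


def list_links(html):
    """Parse an HTML string and return its links as (text, href) tuples."""
    parser = LinkLister()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample input; a real page would be downloaded first.
sample = ('<p><a href="http://stackexchange.com">Stack Exchange</a> '
          '<a id="flag-post">flag</a></p>')
for text, href in list_links(sample):
    print(text, '->', href)
```

A link without an `href` attribute comes back with `None` as its target, so the caller can decide how to report it.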