Install
sudo gem install nokogiri
Contribute
github.com/tenderlove/nokogiri

An HTML, XML, SAX, & Reader parser with the ability to search documents via XPath or CSS3 selectors… and much more

Nokogiri

Parsing an HTML / XML Document

From a String

We’ve tried to make this easy on you. Really! We’re here to make your life easier.

html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
xml_doc  = Nokogiri::XML("<root><aliens><alien><name>Alf</name></alien></aliens></root>")

The variables html_doc and xml_doc are Nokogiri documents, which have all kinds of interesting properties and methods that you can read about in the Nokogiri::XML::Document API documentation. We’ll cover the interesting bits in other chapters.

From a File

Note that you don’t need to read the file into a string variable. Nokogiri will do this for you.

# The block form closes the file automatically, even if parsing raises.
doc = File.open("blossom.xml") { |f| Nokogiri::XML(f) }

Clever Nokogiri! With the time you just saved, approach enlightenment by meditating on this koan.

From the Internets

I understand that there may be some HTML documents available on the World Wide Web.

require 'open-uri'
# On Ruby 3.0+, use URI.open; Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open("http://www.threescompany.com/"))

Parse Options

Nokogiri offers quite a few options that affect how a document is parsed. You can read about them in the Nokogiri::XML::ParseOptions documentation, but the most commonly used options are:

  • NOBLANKS - Remove blank nodes
  • NOENT - Substitute entities
  • NOERROR - Suppress error reports
  • STRICT - Strict parsing; raise an error when parsing malformed documents

Here’s how they are used:

doc = Nokogiri::XML(File.open("blossom.xml")) do |config|
  config.strict.noent
end

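Or, setting the option bits directly (note that assignment replaces the default options rather than adding to them):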

doc = Nokogiri::XML(File.open("blossom.xml")) do |config|
  config.options = Nokogiri::XML::ParseOptions::STRICT | Nokogiri::XML::ParseOptions::NOENT
end