👨🏿‍💻
Learn
  • Learn And The Power Of Community
  • Intro
    • learn-co-curriculum/welcome-to-learn-verified
    • learn-co-curriculum/your-first-lab
    • learn-co-curriculum/environment-setup
  • Intro to tic tac toe
    • matz-readme
    • what-is-a-program?
    • hello world
    • Intro to irb
    • Reading-error-messages
    • Data-types
    • variable
    • Variable-assignment lab
    • String interpolation
    • Interpolation-super-power
    • Welcome to tic tac toe
    • Array
    • Tic tac toe board
    • Intro to methods
    • Puts print and return
    • Intro-to-tdd-rspec-and-learn
    • Methods and arguments
    • Say hello (lab)
    • Methods-default-arguments
    • ttt-3-display_board-example
    • ttt-4-display-board-rb
    • Intro-to-cli-applications
    • Greeting-cli
    • cli-applications in Ruby
    • Ruby-gets-input
    • Tic tac toe move
    • Truthiness-in-ruby-readme
    • booleans
    • conditional (if)
    • ttt-6-position-taken
    • ttt-7-valid-move
    • rspec-fizzbuzz
    • Looping-introduction
    • Loop
    • while-and-until-loop
    • Tic Tac Toe Turn lab
    • looping-while-until lab
    • Tic Tac Toe Play Loop (lab)
    • Tic Tac Toe Current Player (lab)
    • Intro to ruby iterators
    • Nested Arrays
    • Boolean Enumerators
    • Search Enumerators
    • Tic Tac Toe Game Status
    • tic-tac-toe
  • OOP tic tac toe
    • intro to oop
    • Intro-to-classes-and-instances
    • Classes-and-instances-lab
    • Instance-methods
    • Instance-methods-lab
    • Object Attributes
    • object-attributes-lab
    • Object Initialization
    • Object-initialize-lab
    • oop barking dog lab
    • Procedural-vs-oop
    • oop tic tac toe
  • Git and github
    • Intro to Version Control
    • Git Repository Basics
    • Git-basics-quiz
    • Forks-and-clones
    • Git Remotes and Github
    • Git Remotes and Github Codealong
    • Thinking Ahead: GitHub as Career Differentiator
    • Github Pull Requests
    • Git Collaboration
    • Git-collaboration-quiz
    • Git Basics Quiz
  • HTML
    • A-quick-tour-of-the-web
    • The-web-is-made-of-strings
    • What-makes-the-web-possible?
    • html-introduction
    • Your first-html-tag-lab
    • Nested-tags-and-attributes
    • Well-formed-html-document-lab
    • HTML elements types overview
    • Researching-HTML-elements
    • Separation-of-content-and-presentation
  • CSS
    • Introduction-to-css
    • introduction-to-css-code-along
  • Procedural Ruby
    • Regex-what-is-a-pattern
    • Regex-basics
    • Regex-lab
    • Regex-match-scan-grep-methods
    • learn-co-curriculum/method-arguments-lab
    • Method-scope
    • Return Values Lab
    • Debugging-with-pry
    • Method-scope-lab
    • Truthiness-code-challenge
    • If Statements Lab
    • Case-statements
    • Case-statements-quiz
    • Logic and Conditionals Quiz
    • Ternary Operators and Statement Modifiers lab
    • Looping Lab
    • looping-quiz
    • learn-co-curriculum/looping-times
    • countdown-to-midnight lab
    • Array introduction
    • Using Arrays
    • Array-CRUD-lab
    • Array-methods
    • Array-methods-lab
    • Square array lab
    • Collect and Return Values
    • Collect Lab
    • Badges and Schedules Lab
    • Oxford comma lab
    • Deli counter lab
    • Reverse Each Word Lab
    • Yield-and-blocks
    • Each Lab
    • Return from Yield Statements
    • My All? Lab
    • My Find Lab
    • Cartoon Collections Lab
    • Enumerators Code Challenge
    • Prime? Lab
    • Sorting
    • Sorting Lab
    • Introduction to Hashes
    • Using Hashes lab
    • Ruby-symbols
    • Hash iteration
    • Hash Iteration Lab
    • Hash Iteration with Collect
    • Intro to Nested Hashes
    • Building Nested Hashes
    • Building Nested Hashes Lab
    • Nested Hash Iteration
    • Nested Hashes Lab
    • Multitype Collections Lab
    • Iterating over Nested Hashes Codealong
    • Other Hashes Codealong
    • Hashes Manipulation Lab
  • OOP Ruby
    • OO Ruby Video: Object Orientation Overview
    • Object Accessors
    • Instance Variables lab
    • Video Review: Object Properties
    • Meowing Cat
    • Intro to Object Orientation - Ruby
    • oo basics lab
    • OO Basics with Class Constants
    • Self
    • OO School Domain lab
    • OO Counting Sentences lab
    • Class Variables and Methods
    • Class Variables and Methods Lab
    • Remembering Objects
    • Puppy Lab
    • Advanced Class Methods
    • Advanced Class Methods Lab
    • Video Review: Object Models
    • OO Email Parser lab
    • OO Anagram Detector lab
    • OO Cash Register lab
    • Intro to Object Relationships
    • Belongs to Object Lab
    • Has Many Object
    • Has Many Object Lab
    • Collaborating Objects Review
    • Collaborating Objects Lab
    • OO My Pets
    • oo kickstarter lab
    • OO Banking lab
    • Has Many Objects Through
    • Has Many Objects Through Lab
    • Intro to Inheritance
    • Intro to Inheritance Lab
    • Super
    • Super Lab
    • Intro to Modules
    • Intro to Modules Lab
    • Mass Assignment
    • Mass Assignment and Metaprogramming
    • Mass Assignment Lab
    • Custom Errors lab
    • OO Triangle lab
  • Scraping and project
    • Gems and Bundler
    • Scraping
    • Scraping Lab
    • Kickstarter Scraping Lab
    • Video Review: Object Orientation and Scraping
    • OO Ruby Object Orientation Video Review
    • Music Library CLI
    • Video Review: Music Library CLI
    • Tic-tac-toe with AI project
    • Student Scraper
    • CLI Data Gem Portfolio Project
    • CLI Data Gem Walkthrough
    • CLI Data Gem Walkthrough: Creating a CLI Scraper Gem
    • Common Anti-Patterns in CLI Data Gem
    • Student Example 1: Refactoring CLI Gem
    • Student Example 2: Refactoring CLI Gem
  • SQL
    • What is SQL
    • SQL Intro and Installation
    • SQL Database Basics
    • SQL Databases and Text Editors
    • SQL Data Types
    • SQL Inserting, Updating, and Selecting
    • Basic SQL Queries
    • SQL Aggregate Functions
    • SQL Aggregate Functions Lab
    • SQL Bear Organizer Lab
    • Edgar Codd and Table Relations
    • Table Relations
    • SQL JOINS
    • SQL Complex Joins
    • SQL Join Tables
    • Grouping and Sorting Data
    • SQL Joins Review Lectures
    • SQL Crowdfunding Lab
    • SQL Library Lab
    • Pokemon Scraper Lab
  • ORM And Active record
    • Why an ORM is Useful
    • Mapping Ruby Classes to Database Tables
    • Mapping Classes to Tables Lab
    • Mapping Database Tables to Ruby Objects
    • Mapping Database Rows to Objects Lab
    • Updating Records in an ORM
    • Updating Records in an ORM Lab
    • Preventing Record Duplication
    • ORMs Lab: Bringing It All Together lab
    • Dynamic ORMs
    • Dynamic ORMs with Inheritance
    • ActiveRecord Mechanics
    • Translating from ORM to ActiveRecord
    • Intro to Rake
    • Mechanics of Migrations
    • Writing Our Own Migrations
    • Migrations and Active Record Lab
    • ActiveRecord CRUD Lab
    • Advanced Finding Lab
    • ActiveRecord Associations
    • ActiveRecord Associations Lab
    • ActiveRecord Associations Lab II
    • ActiveRecord Associations Video Review
    • ActiveRecord Associations Video Review II
    • Video Review: Aliasing ActiveRecord Associations
    • Video Review: Blog CLI with ActiveRecord and Associations
  • Rack
    • How the Internet Works
    • Increasing Layers of Abstraction
    • Inspecting the Web with Rack (lab)
    • The HTTP Request
    • Dynamic URL Routes
    • Dynamic Web Apps with Rack (lab)
    • Rack Responses Lab
    • Rack Routes and GET Params Lab
    • HTTP Status Codes
    • Dynamic URLs and Status Codes Lab
    • Video Review: How The Web Works, Pt 1
    • Video Review: How the Web Works, Pt 2
  • Html
    • How the Web Works
    • Site Planning
    • HTML Fundamentals
    • HTTP Status Codes
    • video review how the web works pt 1
    • How the Web Works, Part 2: Overview
    • Setting Up a New Site
    • Document Structure
    • Text Formatting
    • HTML Tables
    • Html-images
    • HTML Links
    • Html backing-up changes
    • HTML Validation
    • Quiz - HTML Fundamentals
    • Dev Tools Super Power
    • HTML Lists
    • Html issue bot 9000 (lab)
    • HTML Forms and Iframes
    • HTML Map and Contact Form Code-along
    • HTML5 Media
    • HTML5 Video Embed Code-Along
    • HTML5 Semantic Elements
    • HTML5 Semantic Containers Code-along
    • HTML5 Quiz
  • CSS
    • CSS Fundamentals
    • CSS Styling Code Along
    • My Little Rainbow
    • CSS Kitten Wheelbarrow
    • CSS Graffiti Override Lab
    • CSS Issue Bot 9000
    • Your first deployment
    • The Box Model
    • Layout Types
    • Float
    • Clearfix
    • Centering
    • Column Structure
    • CSS Columns Code Along Exercise (lab)
    • Box Model & Page Layout
    • Using Z Index
    • Positioning
    • ZHW Shoes Layout (lab)
    • Zetsy (lab)
    • CSS Box Style Code Along
    • Animal Save (lab)
    • Building Responsive Sites
    • Intro to Responsive Media
    • CSS Media Queries
    • Working with Responsive Type
    • Responsive layout
    • The Viewport Property
    • Responsive Features Code-Along (lab)
    • Bootstrap Introduction
    • Bootstrap Code-Along
    • Bootstrap Grid System
    • Grid Layout Code-Along
    • Bootstrap Navbar Code-Along
  • Sinatra
    • What is Sinatra?
    • Sinatra From Scratch
    • Using the Shotgun Development Server (lab)
    • Sinatra Basics
    • Sinatra Hello World Basics (lab)
    • Routes in Sinatra
    • Sinatra Routes Lab
    • Intro To MVC
    • Sinatra MVC File Structure (lab)
    • Sinatra Views: Using ERB
    • Sinatra Views (lab)
    • Sinatra Basic Views Lab
    • Sinatra Views Lab II
    • Intro To Capybara
    • Dynamic Routes in Sinatra
    • HTML Forms and Params
    • Passing Data Between Views and Controllers in Sinatra
    • Sinatra Forms Lab
    • Sinatra Yield Readme
    • Integrating Models Sinatra Code-along
    • Sinatra MVC Lab - Pig Latinizer
    • Sinatra Basic Forms Lab
    • Sinatra Forms
    • Nested Forms Readme
    • Sinatra Nested Forms Lab: Pirates!
    • Lab Review-- Sinatra Nested Forms Lab: Pirates
    • Sinatra Nested Forms Lab: Superheroes!
    • Sessions and Cookies
    • Mechanics of Sessions
    • Sinatra Basic Sessions Lab
    • Using Sessions
    • Sinatra and Active Record CRUD
    • Sinatra Activerecord Setup
    • Sinatra ActiveRecord CRUD
    • User Authentication in Sinatra
    • Sinatra Sessions Lab - User Logins
    • Securing Passwords
    • Secure Password Lab
    • Sinatra Authentication- Overview
    • RESTful Routes
    • Restful Routes Basic Lab
    • Sinatra ActiveRecord Associations: Join Tables
    • Using Tux in Sinatra with ActiveRecord
    • ActiveRecord Associations in Sinatra
    • Sinatra Multiple Controllers
    • Sinatra and Active Record: Associations and Complex Forms
    • Sinatra Playlister (lab)
    • Welcome to NYC Sinatra! (lab)
    • Building a Site Generator, Part 1- Overview
    • Building a Site Generator, Part 2- Overview
    • Fwitter Group Project
  • Rails
    • Welcome To Rails
      • Rails Application Basics
      • Rails Static Request
      • Rails Hello World Lab
      • Rails Model View Controller
      • Intro to Rails- Overview
    • Intro to REST
    • Active Record Models and Rails
    • ActiveRecord Model Rails Lab
    • RESTful Index Action Lab
    • Rails Dynamic Request
    • Rails Dynamic Request Lab
    • Rails URL Helpers
    • Rails URL Helpers Lab
    • Rails form_tag
    • Rails form_tag Lab
    • Create Action
    • Create Action Lab
    • Index, Show, New, Create Lab
    • Edit/Update Action
    • form_for on Edit
    • Strong Params Basics
    • form_for Lab
    • Rails Generators
    • CRU with form_for Lab
    • Resource and Scaffold Generator
    • Rails Blog scaffold
    • Todo mvc assets and managing lists
    • Rails Forms Overview
    • ActiveRecord Validations
    • ActiveRecord Validations Lab
    • Validations in Controller Actions
    • Validations In Controller Actions Lab
    • Validations with form_tag
    • Validations with form_for
    • DELETE Forms and Requests
    • Testing in Rails
    • Validations with form_tag
    • CRUD With Validations Lab
    • Join the Fun rails
    • Activerecord lifecycle reading
    • Displaying Associations Rails
    • Active Record Associations Review
    • Forms And Basic Associations Rails
    • Forms And Basic Associations Rails Lab
    • Basic Nested Forms
    • Displaying Has Many Through Rails
    • Displaying Has Many Through Rails Lab
    • Has Many Through Forms Rails
    • Has Many Through Forms Rails Labs
    • Has Many Through in Forms Lab Review- Overview
    • Deep Dive into Nested Forms- Overview
    • Layouts And Templates in Rails
    • Rails Layouts And Templates Lab
    • Simple Partials
    • Simple Partials Lab
    • Partials with Locals
    • Partials with Locals
    • Refresher on MVC
    • Refactoring Views With Helpers
    • Refactoring Views With Helpers Lab
    • Model Class Methods
    • Optimal Queries using Active Record (lab)
    • Routing And Nested Resources
    • Nested Resource Routing Lab
    • Modifying Nested Resources
    • Modifying Nested Resources Lab
    • Namespaced Routes
    • Namespaced Routes Lab
    • Todomvc 2 lists have items
    • TodoMVC 3: Mark Items Complete
    • Todomvc 4 refactoring with partials and helpers
    • Todomvc 5 deleting items
    • Introduction to Authentication and Authorization
      • Cookies and sessions
      • Cookies and Sessions Lab
      • Sessions Controller
      • Sessions Controller Lab
      • Login Required Readme
      • Login Required Lab
      • Using has_secure_password
      • Has_secure_password lab
      • Authentication- Overviewn
      • OmniAuth
      • Omniauth Lab
      • Omniauth review lecture in todomvc
      • Authentication and authorization recap and gems
    • Rails Amusement Park lab
    • How to Find Gems
  • JavaScript
    • Intro to JavaScript
      • JavaScript Data Types
      • JavaScript Data Types Quiz
      • JavaScript Variables
      • JavaScript Comparisons
      • Conditionals
      • Logical Operators
      • Functions
      • Intro to Debugging
      • Intro to Testing
      • JavaScript Basics Quiz
    • Scope
      • Scope chain
      • JavaScript Practice Scope Lab
      • Lexical scoping
      • Errors and Stack Traces
      • Hoisting
    • Arrays And Objects
      • Objects
      • JavaScript: Objects and Arrays Quiz
      • Object Iteration
      • JavaScript Logging
      • Traversing Nested Objects
      • Filter
      • Map
    • Functions Revised
      • First-Class Functions Lab
      • First-Class Functions
      • First-Class Functions Practice
      • First-Class Functions Practice Lab
    • OOP
      • Creating Objects
      • Object Methods and Classes
      • Using Prototypes
      • Using Classes in Javascript
      • JavaScript This Walkthrough
      • This Code-along
      • Bind, Call, and Apply Readme
      • Bind, Call, Apply Lab
      • Object Relations
      • Association Methods in Javascript
      • Class Relations Lab
      • JavaScript Closures and Higher Order Functions
      • Closures Lab
      • JavaScript Arrow Functions
      • Daily Lunch Lab
    • DOM
      • Introduction to the DOM
      • Introduction to the DOM Lab
      • More on the DOM
      • Creating and Inserting DOM Nodes
      • The DOM Is a Tree
      • Listening to Nodes
      • Modify HTML With jQuery
      • Modifying HTML Lab
      • jQuery Selectors
      • Document.ready
      • Acting On Events Lab
      • DOM Quiz
    • Templates
      • Introduction to CSS
      • CSS Quiz
      • CSS Libraries
      • CSS Libraries Lab
      • Intro to Templates
      • Template Engines
      • Template Engines Lab
      • Advanced Templating
      • Advanced Templating Lab
    • Asynchronous JavaScript
      • Intro to XHR Code Along
      • Hitting APIs Lab
      • Advanced AJAX Lab
      • AJAX and Callbacks
      • AJAX and Callbacks Lab
      • REST Refresher
      • REST Quiz
      • Fetch
      • JavaScript fetch() Lab
      • Intro to Mocha
      • Testing with Spies
      • Testing with Mocks and Stubs
  • Rails and JavaScript
Powered by GitBook
On this page
  • Learning Goals
  • Introduction
  • What is Scraping and Why Use it?
  • Scraping HTML Using Nokogiri and Open-URI
  • Refresher: What is Open-URI?
  • What is Nokogiri?
  • Installing Nokogiri
  • Opening a Web Page as HTML with Nokogiri and open-uri
  • Using Nokogiri to Extract Data
  • Using CSS Selectors to Get Data
  • Iterating over elements
  • Operating on XML
  • Conclusion
  • Resources
  1. Scraping and project

Scraping

Learning Goals

  • Introduce web scraping and its usages.

  • Learn how to use Nokogiri to scrape data from an HTML document.

Introduction

Web scraping is the act of parsing a web page's HTML and pulling, or "scraping" pertinent data from that HTML. In this reading, we'll take a brief look at what scraping is and how to accomplish it.

A more thorough code-along is coming up next, but if you would like to follow along, lib/scraper.lib is provided for you.

What is Scraping and Why Use it?

As we established above, scraping is a technique used to grab data out of the HTML that makes up a web page. Scraping can be difficult to accomplish––in order to get the data we want, we need to closely examine the HTML and identify exactly which elements contain the information we're interested in. It requires a high degree of precision.

So, if scraping is so tricky, why do we use it? Well, not all of the data we might be interested in using to program is available to use through APIs. For example, let's say we're creating an app that catalogs popular musicians and searches the web for their upcoming concerts. A quick Google search will reveal that, unfortunately for us, there isn't a "Popular Musician" API out there just waiting to be used. There is, however, a very comprehensive list of musicians on the Billboard website. In such a scenario, you may want to programmatically grab every musician's name from the Billboard list and store those artists in your own database.

Here's another example: let's say you're creating an app that allows a user to subscribe to a news feed. You anticipate that your users are super-tech savvy and might be interested in subscribing to some lesser-known tech news sites. Such sites may not have an API that makes their articles available to you. Instead, you would have to scrape those sites for their latest news articles and send those newest articles to your users.

These are just a few examples of situations in which scraping might come in handy. Now that we have a few use-cases that illustrate the utility of scraping, let's talk about how to scrape.

Scraping HTML Using Nokogiri and Open-URI

Refresher: What is Open-URI?

In other words, running:

html = open('http://www.google.com')

stores the HTML of Google into a variable, html. (More specifically, it actually stores the HTML in a temporary file that we can then call read on to get the raw HTML. We won't worry about that here though.)

What is Nokogiri?

Nokogiri is a Ruby gem that helps us to parse HTML and collect data from it. It allows us to treat a huge string of HTML as if it were a series of nested objects that you can use to extract the desired information using provided methods. Put simply, Nokogiri takes in HTML and spits out a collection of objects we can get information from.

The HTML that would normally be rendered as a webpage can be scraped with Nokogiri into a many small pieces. Nokogiri makes the level of precision required to extract the necessary data much easier to attain. It works like a fine-toothed saw to scrape only the necessary data. In fact, that's what "nokogiri" means: a fine-toothed saw.

Let's get Nokogiri up and running and look at a very basic example of its usage. Then, we'll move on to the next lesson, where you'll try it out for yourself.

Installing Nokogiri

Opening a Web Page as HTML with Nokogiri and open-uri

Let's say we have a file, scraper.rb which is responsible for (you guessed it) scraping. We need to require Nokogiri and open-uri:

require 'nokogiri'
require 'open-uri'

# more code coming soon!

We can use the following line to grab the HTML that makes up the Flatiron School's landing page at flatironschool.com:

html = open("https://flatironschool.com/")

Next, we'll use the Nokogiri::HTML method to take the string of HTML returned by open-uri's open method and convert it into a NodeSet (aka, a bunch of nested "nodes") that we can easily play around with.

Let's save the HTML document in a variable, doc that we can then operate on:

doc = Nokogiri::HTML(html)

If we were to puts out doc right now, we'd see something like this in our terminal:

<!DOCTYPE html>
<html lang="en" class="gr__flatironschool_com" data-react-helmet="lang">
<head>
  <meta charset="utf-8">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta name="google-site-verification" content="X--Dsxcv97NzPomhlz80wswUgUOF8iMxYhmaY-qNHFY">
  <link rel="preload" href="https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmWUlfBBc4AMP6lQ.woff2" as="font" type="font/woff2" crossorigin="anonymous">
  <link rel="preload" href="https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmSU5fBBc4AMP6lQ.woff2" as="font" type="font/woff2" crossorigin="anonymous">
  <link rel="preload" href="https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu4mxKKTU1Kg.woff2" as="font" type="font/woff2" crossorigin="anonymous">
  <link rel="preload" href="https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmEU9fBBc4AMP6lQ.woff2" as="font" type="font/woff2" crossorigin="anonymous">
  <link rel="preload" href="https://fonts.gstatic.com/s/roboto/v18/KFOkCnqEu92Fr1Mu51xIIzIXKMny.woff2" as="font" type="font/woff2" crossorigin="anonymous"> 
  <script type="text/javascript" async="" src="https://widget.intercom.io/widget/j4d6dyie"></script>
  <script src="https://connect.facebook.net/signals/plugins/inferredEvents.js?v=2.9.4" async=""></script>
  <script src="https://connect.facebook.net/signals/config/1706055166302798?v=2.9.4&amp;r=stable" async=""></script>
  <script async="" src="https://connect.facebook.net/en_US/fbevents.js"></script>
  <script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script>
  <script type="text/javascript" async="" src="https://www.google-analytics.com/plugins/ua/linkid.js"></script>
  <script src="//js.hs-analytics.net/analytics/1568493600000/69751.js" type="text/javascript" id="hs-analytics"></script>
  <script src="https://js.hsadspixel.net/fb.js" type="text/javascript" id="hs-ads-pixel-69751" data-ads-portal-id="69751" data-ads-env="prod" data-loader="hs-scriptloader" data-hsjs-portal="69751" data-hsjs-env="prod"></script>
...

If you look through further, you can find the body with lots of content.

<body data-gr-c-s-loaded="true">
  <div id="pull-down" class="hb-50 hellobar hb-bottom-right se-714205 inverted" style="background-color: rgb(255, 255, 255);">
    <div class="hellobar-arrow">
      <svg xmlns="http://www.w3.org/2000/svg" width="11px" height="11px" viewBox="43.6 92.5 315 315"><path d="M49.6 92.5c-3.3 0-6 2.7-6 6v303c0 3.3 2.7 6 6 6h303c3.3 0 6-2.7 6-6v-303c0-3.3-2.7-6-6-6H49.6zM229.6 254.3c-3.3 0-6 2.7-6 6V360c0 3.3-2.7 6-6 6h-33c-3.3 0-6-2.7-6-6v-99.7c0-3.3-2.7-6-6-6H99.2c-3.3 0-4.2-2-2-4.5l99.9-111.4c2.2-2.5 5.8-2.5 8 0l99.9 111.4c2.2 2.5 1.3 4.5-2 4.5H229.6z"></path></svg>
    </div>
  </div>
  <iframe src="about:blank" id="o8689c8c25e49f24d1d4bd1d49101d50041f42abc-container" class="HB-Slider hb-animated azuki hb-bottom-right" name="o8689c8c25e49f24d1d4bd1d49101d50041f42abc-container-0" style="display: none;"></iframe>
  <noscript>
    <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-KZZ9JB" height="0" width="0" style="display: none; visibility: hidden"></iframe>
  </noscript>
  <noscript id="gatsby-noscript">This app works best with JavaScript enabled.</noscript>
  <div id="___gatsby">
    <div style="outline:none" tabindex="-1" role="group" id="gatsby-focus-wrapper">
      <header class="siteHeader-2Zios8 blue-2siWYz normal-1mR2Ud" name="site-header">
        <picture>
          <source type="image/webp" media="(min-width: 1920px)" srcset="https://images.ctfassets.net/hkpf2qd2vxgx/6uCU1D03JXktRj8kwW7Nip/a2572151bf67c3552d88139ee0299ef2/homepage-hero.jpg?fm=webp&amp;q=85&amp;w=1920,
      https://images.ctfassets.net/hkpf2qd2vxgx/6uCU1D03JXktRj8kwW7Nip/a2572151bf67c3552d88139ee0299ef2/homepage-hero.jpg?fm=webp&amp;q=40&amp;w=3840 2x">
          <source type="image/jpeg" media="(min-width: 1920px)" srcset="https://images.ctfassets.net/hkpf2qd2vxgx/6uCU1D03JXktRj8kwW7Nip/a2572151bf67c3552d88139ee0299ef2/homepage-hero.jpg?fl=progressive&amp;fm=jpg&amp;q=85&amp;w=1920,
      https://images.ctfassets.net/hkpf2qd2vxgx/6uCU1D03JXktRj8kwW7Nip/a2572151bf67c3552d88139ee0299ef2/homepage-hero.jpg?fl=progressive&amp;fm=jpg&amp;q=40&amp;w=3840 2x">
          <source type="image/webp" media="(min-width: 1280px)" srcset="https://images.ctfassets.net/hkpf2qd2vxgx/6uCU1D03JXktRj8kwW7Nip/a2572151bf67c3552d88139ee0299ef2/homepage-hero.jpg?fm=webp&amp;q=85&amp;w=1280,
      https://images.ctfassets.net/hkpf2qd2vxgx/6uCU1D03JXktRj8kwW7Nip/a2572151bf67c3552d88139ee0299ef2/homepage-hero.jpg?fm=webp&amp;q=40&amp;w=2560 2x">
...

What are Nested Nodes?

Nested nodes refers to any tree of elements in which parent elements branch off to contain children elements. In fact, we've seen similarly nested structures before when we dealt with nested data structures like hashes. By creating a nested structure, Nokogiri allows us to do things like iterate over a collection of elements from the HTML document and use brackets,[], and dot notation to access elements within the nested structure.

Using Nokogiri to Extract Data

Note: For this reading, we'll be using the Flatiron School website. However, how you scrape a page is very specific to the content of the page you are trying to scrape. That means that if the webpage you wrote certain scraping code for ever changes, your scraping code will likely no longer work correctly. So, the Flatiron School website that this reading refers to may have changed! Some of the examples here may specific to an earlier version of the site and won't work look or work exactly as shown when you try them out on your own. That's okay though. Just follow along with the reading and, if you want to try it out, feel free to use the examples provided to guide you in scraping content that is present on the page.

You should see something like this:

Now that we understand what Nokogiri is and have seen how it opens the HTML that makes up a web page, let's look at how we use it to actually scrape information.

Using CSS Selectors to Get Data

Nokogiri allows you to use CSS selectors in order to retrieve specific pieces of information out of an HTML document.

What is a CSS Selector

In the following code:

<div id="my-div">
  <p class="my-paragraph"></p>
</div>

The id and class attributes of the HTML elements are the CSS selectors. You would refer to the div with this selector: #my-div (using the # to indicate id), and the paragraph with this selector: .my-paragraph (using the . to indicate class).

Nokogiri's .css Method

Nokogiri's .css method can be called on the doc variable that we set equal to that giant string of HTML that Nokogiri retrieved for us. The .css method takes in an argument of the CSS selector you want to retrieve. Let's take a look.

Choosing a CSS Selector

In order to identify the CSS selector, click the button in the upper left corner of the console pane that looks like a mouse icon partially in a box.

Once activated, hover over the 'Change things' text. This will highlight its HTML element for us. Notice that:

<h1 class="headline-26OIBN">...</h1>

is highlighted in the above image. If you click on the carrot at the left end of that line, it will open up to show you what that element contains (with lots of spacing around it):

We found it! That text lives in a span whose class is "headline-26OIBN". Now we're ready to use the .css method to grab the text we want:

Calling the .css method

In our scraper.rb file, we had the following code:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("https://flatironschool.com/"))

Let's call .css on doc and give it the argument of our CSS selector:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("https://flatironschool.com/"))
doc.css(".headline-26OIBN")

If we were to copy and paste the above code into IRB, the last line would return something like:

[#<Nokogiri::XML::Element:0x3fdb39ac8380 name="h1" attributes=[#<Nokogiri::XML::Attr:0x3fdb39ac86dc name="class" value="headline-26OIBN">] children=[#<Nokogiri::XML::Text:0x3fdb39ac5d60 "Change things.">]>]

Although dense, it is possible to figure some things out. First of all, doc.css(".headline-26OIBN") returned what looks like an Array containing one Nokogiri object. In actuality, this 'Array' is also a special Nokogiri object, but works very much like an Array. If you look closely at the object contained within it, you'll see that it has the 'Change things' text towards the end! To get it out, we can call .text:

doc.css(".headline-26OIBN").text

Using .text allows us to access text content inside an element scraped by Nokogiri. Run in IRB, we'd see something like this returned:

Aside: In general adding .strip to the end will allow us to clean up the extra whitespace and simply return the text contained inside the element.

An interesting thing to note: If you're coding along in the provided lib/scraper.rb file, using puts or print on doc.css will cause the HTML elements to print out.

puts doc.css(".headline-26OIBN")

Will print out:

<h1 class="headline-26OIBN">Change things.</h1>

However, just as before, we can just add .text (and .strip) and get only the text contained inside the element that we want. Alternatively, using p will produce the array-like object we saw from before.

We did it! We used Nokogiri to get the HTML of a web page. We used the element inspector in the browser to ID the CSS selector of the data we wanted to scrape. We used the .css Nokogiri method, along with that CSS selector, to grab the element that contains our desired data. Finally, we used the .text method to retrieve the desired text.

Iterating over elements

This time, if we hover over one of the elements containing a course, we'll see there are two classes assigned, inlineMobileLeft-2Yo002, and imageTextBlockGrid3-2XAK6G. Since CSS classes are often shared, we'll use all two to try and get only the content we need:

require 'nokogiri'
require 'open-uri'

html = open("https://flatironschool.com/")
doc = Nokogiri::HTML(html)

doc.css(".inlineMobileLeft-2Yo002.imageTextBlockGrid3-2XAK6G")

Notice that each class is listed without spaces! To make sure we only select the three courses in this section and not also other elements that might share the same class name, we will increase the specificity of our CSS selector. In order to make our selector more strict, we'll target the correct section first and then the elements containing a course.

This section has the id 2a778efd-1685-5ec6-9e5a-0843d6a88b7b, and adding that to the final selector will produce: .css("#2a778efd-1685-5ec6-9e5a-0843d6a88b7b .inlineMobileLeft-2Yo002.imageTextBlockGrid3-2XAK6G"). Even though the Nokogiri gem returns a Nokogiri::XML::NodeSet (which looks like an Array in Ruby), we can use Ruby methods, such as .each and .collect, to iterate over it.

[#<Nokogiri::XML::Element:0x3fdf31ee8eb4 name="h2" attributes=[#<Nokogiri::XML::Attr:0x3fdf31ee8e28 name="class" value="inlineMobileLeft-2Yo002 imageTextBlockGrid3-2XAK6G">] children=[#<Nokogiri::XML::Text:0x3fdf31ee8900 "\n          $1M in Scholarships for Women\n        ">]>, #<Nokogiri::XML::Element:0x3fdf31ee8748 name="h2" attributes=[#<Nokogiri::XML::Attr:0x3fdf31ee86e4 name="class" value="tout__label heading heading--level-4">] children=[#<Nokogiri::XML::Text:0x3fdf31ee8270 "\n          What Kind of Coding Program is Right for You?\n        ">]>, #<Nokogiri::XML::Element:0x3fdf31ee807c name="h2" attributes=[#<Nokogiri::XML::Attr:0x3fdf31eedfcc name="class" value="tout__label heading heading--level-4">] children=[#<Nokogiri::XML::Text:0x3fdf31eeda90 "\n          Attend an Online Info Session\n        ">]>, #<Nokogiri::XML::Element:0x3fdf31eed8d8 name="h2" attributes=[#<Nokogiri::XML::Attr:0x3fdf31eed860 name="class" value="tout__label heading heading--level-4">] children=[#<Nokogiri::XML::Text:0x3fdf31eed43c "\n          Coding Bootcamp Prep\n        ">]>, #<Nokogiri::XML::Element:0x3fdf31eed284 name="h2" attributes=[#<Nokogiri::XML::Attr:0x3fdf31eed220 name="class" value="tout__label heading heading--level-4">] children=[#<Nokogiri::XML::Text:0x3fdf31eecdc0 "\n          Online Software Engineering\n        ">]>, #<Nokogiri::XML::Element:0x3fdf31eecc1c name="h2" attributes=[#<Nokogiri::XML::Attr:0x3fdf31eecba4 name="class" value="tout__label heading heading--level-4">] children=[#<Nokogiri::XML::Text:0x3fdf31eec744 "\n          Data Science Bootcamp Prep\n        ">]>, #<Nokogiri::XML::Element:0x3fdf31eec5a0 name="h2" attributes=[#<Nokogiri::XML::Attr:0x3fdf31eec53c name="class" value="tout__label heading heading--level-4">] children=[#<Nokogiri::XML::Text:0x3fdf31ef3fbc "\n          Online Data Science\n        ">]>]

Instead of just outputting the results of doc.css, if we assign them to a variable, we can then iterate over that variable with .each and puts out each course:

courses = doc.css("#2a778efd-1685-5ec6-9e5a-0843d6a88b7b .inlineMobileLeft-2Yo002.imageTextBlockGrid3-2XAK6G")

courses.each do |course|
  puts course.text.strip
end

We'd see something like this:

Software EngineeringLaunch your career as a full-stack web developer ...
Data ScienceOur full-time data science program that gives students ...
UX/UI DesignLearn the user experience, client management, technical, ...

Not exactly the course listing as it scraped some other content as well - a great example how tricky scraping can be - but we've still achieved iteration!

Operating on XML

Let's take another look at the element returned to us by our call on the .css method. In the previous example, we had many Nokogiri objects to iterate over. Looking at just the first one:

p doc.css("#2a778efd-1685-5ec6-9e5a-0843d6a88b7b .inlineMobileLeft-2Yo002.imageTextBlockGrid3-2XAK6G")[0]

We get the following:

#<Nokogiri::XML::Element:0x3fc3494ba054 name="div" attributes=[#<Nokogiri::XML::Attr:0x3fc3494ba298 name="class" value="inlineMobileLeft-2Yo002 imageTextBlockGrid3-2XAK6G">] children=[#<Nokogiri::XML::Element:0x3fc3494abd88 name="div" attributes=[#<Nokogiri::XML::Attr:0x3fc3494abd24 name="class" value="media-3NKI6- horizontalImageContainer-1a2NpA">]...

This is an XML element. XML stands for Extensible Markup Language. Just like HTML, it is a set of rules for encoding and displaying data on the web.

When we use Nokogiri methods, we get XML elements in return. Looking at the output object, we can see it has a name, "h2". We can get this info directly by adding these to the end of our doc.css call:

p doc.css("#2a778efd-1685-5ec6-9e5a-0843d6a88b7b .inlineMobileLeft-2Yo002.imageTextBlockGrid3-2XAK6G")[0].name
# => "div"

This is the name of the XML element, not to be confused with the HTML attribute 'name' that can be assigned to elements. Those types of attributes can be accessed with .attributes. Using .attributes will return ids, names, and classes, but will also return other useful content like alt and src for images.

p doc.css("#2a778efd-1685-5ec6-9e5a-0843d6a88b7b .inlineMobileLeft-2Yo002.imageTextBlockGrid3-2XAK6G")[0].attributes

Since this example doesn't have any attributes besides the CSS classes, we just get back the classes we already know:

{"class"=>#<Nokogiri::XML::Attr:0x3fd0e50e1b04 name="class" value="inlineMobileLeft-2Yo002 imageTextBlockGrid3-2XAK6G">}

One last but important method to note is children. Adding children will return any child nodes nested inside this element. In this particular example, all that is contained is a text node, but this XML element can contain all types of XML elements, nested as children. On a webpage, an h2 HTML element may be nested within a div. When scraped, this relationship can be represented by having an XML object named "div" with a child XML object named "h2".

Nokogiri collects these objects into a hierarchical data structure, much like the nested arrays and hashes we've been building and manipulating for a while now. This structure allows us to iterate over an array of Nokogiri objects and use enumerators to grab the values of attributes and text.

Conclusion

By using Nokogiri, we can get any website's HTML, represented in XML objects, including any text or data displayed on that site. Using methods like .css, we can then filter out the specific parts of the website we need and use additional methods like .text and .attributes to extract the content we want.

As each website is designed differently, scraping tends to require customized code for each site you want to scrape. As sites update their styles and designs, scrapers we've built may no longer work.

However, being able to scrape websites gives us access to information that can be time-consuming or otherwise very difficult to collect. Taking a little time to update a scraper is typically much easier and faster than manually updating data.

Note: One final note about scraping - the content we're getting by scraping is all technically publicly available, as it all visible on public websites. Be careful, however, as some content may not be used without permission or licensing. Images, for instance, often belong to someone and can have a license attached - something you would probably want to look into before scraping any image src attributes from a site!

Resources

  • Scraping is a big topic, and it takes a lot of practice to get comfortable doing it. The below resource is a great place to learn more about scraping and even get some practice with simple examples. If you felt really confused by this reading, we recommend checking it out before moving on.

PreviousGems and BundlerNextScraping Lab

Last updated 5 years ago

is a module in Ruby that allows us to programmatically make HTTP requests. It gives us a bunch of useful methods to make different types of requests, but for this guide, we're interested in only one: open. This method takes one argument, a URL, and will return to us the HTML content of that URL.

Installing Nokogiri is as easy as gem install nokogiri. If you run into any issues with this, check out the following documentation: .

On and on. It is a lot to go through, and it can also look pretty messy and difficult to read. But don't worry! Nokogiri will help us parse this. What we're looking at here is all of the HTML that makes up the web page found at . The massive lines above are actually a snapshot of that HTML converted into a structure of nested nodes by Nokogiri.

Visit and use your browser's developer tools to inspect the page. (You can just right-click anywhere on the page and select "inspect element".)

The view on the bottom half of the page is revealing all of the page's HTML to us! In fact, the HTML it is showing us is exactly the same as the HTML put out to our terminal with the help of Nokogiri and open-uri.

How do we determine which selector to use to retrieve the desired information? Remember that the HTML document that Nokogiri retrieved for us to operate on is exactly the same HTML that makes up the web page. Let's go back to "" and use the element inspector to find the selector of a certain piece of our HTML. In this case, we'll look the element containing the text 'Change things':

Sometimes we want to get a collection of the same elements, so we can iterate over them. For instance, a little further down the are some of the courses offered by Flatiron School. We can practice iterating by trying to scrape the titles of all courses from these elements.

View on Learn.co and start learning to code for free.

Open-URI
Nokogiri Installation Guide
www.flatironschool.com
this Flatiron School link
element inspector
www.flatironschool.com
page
The Bastard's Book of Ruby - Parsing HTML with Nokogiri
Video Review- Scraping and Object Orientation
Nokogiri Installation Guide
Scraping
Nokogiri Scraping
courses
element inspector
fine-toothed saw
browser console example