From 21eee1dd56e45f0a0b41ac2c9f2127c250b43366 Mon Sep 17 00:00:00 2001 From: dquinonez25 Date: Sat, 23 Aug 2025 11:12:53 -0500 Subject: [PATCH 01/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- lessons/02_web_scraping.ipynb | 2217 ++++++++++++++++++--------------- 1 file changed, 1203 insertions(+), 1014 deletions(-) diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb index 385806a..ad68180 100644 --- a/lessons/02_web_scraping.ipynb +++ b/lessons/02_web_scraping.ipynb @@ -1,1015 +1,1204 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Web Scraping with Beautiful Soup\n", - "\n", - "* * * \n", - "\n", - "### Icons used in this notebook\n", - "🔔 **Question**: A quick question to help you understand what's going on.
\n", - "🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!
\n", - "⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.
\n", - "💡 **Tip**: How to do something a bit more efficiently or effectively.
\n", - "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!
\n", - "\n", - "### Learning Objectives\n", - "1. [Reflection: To Scape Or Not To Scrape](#when)\n", - "2. [Extracting and Parsing HTML](#extract)\n", - "3. [Scraping the Illinois General Assembly](#scrape)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "# To Scrape Or Not To Scrape\n", - "\n", - "When we'd like to access data from the web, we first have to make sure if the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**\n", - "\n", - "However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus two packages: Requests and Beautiful Soup.\n", - "\n", - "Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Installation\n", - "\n", - "We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install requests" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install beautifulsoup4" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install lxml" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Import required libraries\n", - "from bs4 import BeautifulSoup\n", - "from datetime import datetime\n", - "import requests\n", - "import time" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "# Extracting and Parsing HTML \n", - "\n", - "In order to succesfully scrape and analyse HTML, we'll be going through the following 4 steps:\n", - "1. Make a GET request\n", - "2. Parse the page with Beautiful Soup\n", - "3. Search for HTML elements\n", - "4. Get attributes and text of these elements" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step 1: Make a GET Request to Obtain a Page's HTML\n", - "\n", - "We can use the Requests library to:\n", - "\n", - "1. Make a GET request to the page, and\n", - "2. Read in the webpage's HTML code.\n", - "\n", - "The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. 
This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Make a GET request\n", - "req = requests.get('http://www.ilga.gov/senate/default.asp')\n", - "# Read the content of the server’s response\n", - "src = req.text\n", - "# View some output\n", - "print(src[:1000])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step 2: Parse the Page with Beautiful Soup\n", - "\n", - "Now, we use the `BeautifulSoup` function to parse the reponse into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.\n", - "\n", - "If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Parse the response into an HTML tree\n", - "soup = BeautifulSoup(src, 'lxml')\n", - "# Take a look\n", - "print(soup.prettify()[:1000])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step 3: Search for HTML Elements\n", - "\n", - "Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:\n", - "\n", - "1. HTML tags\n", - "2. HTML Attributes\n", - "3. CSS Selectors\n", - "\n", - "Let's search first for **HTML tags**. \n", - "\n", - "The function `find_all` searches the `soup` tree to find all the elements with an a particular HTML tag, and returns all of those elements.\n", - "\n", - "What does the example below do?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Find all elements with a certain tag\n", - "a_tags = soup.find_all(\"a\")\n", - "print(a_tags[:10])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. \n", - "\n", - "These two lines of code are equivalent:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "a_tags = soup.find_all(\"a\")\n", - "a_tags_alt = soup(\"a\")\n", - "print(a_tags[0])\n", - "print(a_tags_alt[0])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "How many links did we obtain?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(len(a_tags))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get more hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.\n", - "\n", - "What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? 
\n", - "\n", - "We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_=\"sidemenu\"`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Get only the 'a' tags in 'sidemenu' class\n", - "side_menus = soup(\"a\", class_=\"sidemenu\")\n", - "side_menus[:5]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.\n", - "\n", - "In the example above, we can use `\"a.sidemenu\"` as a CSS selector, which returns all `a` tags with class `sidemenu`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Get elements with \"a.sidemenu\" CSS Selector.\n", - "selected = soup.select(\"a.sidemenu\")\n", - "selected[:5]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🥊 Challenge: Find All\n", - "\n", - "Use BeautifulSoup to find all the `a` elements with class `mainmenu`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step 4: Get Attributes and Text of Elements\n", - "\n", - "Once we identify elements, we want the access information in that element. Usually, this means two things:\n", - "\n", - "1. Text\n", - "2. Attributes\n", - "\n", - "Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Get all sidemenu links as a list\n", - "side_menu_links = soup.select(\"a.sidemenu\")\n", - "\n", - "# Examine the first link\n", - "first_link = side_menu_links[0]\n", - "print(first_link)\n", - "\n", - "# What class is this variable?\n", - "print('Class: ', type(first_link))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It's a Beautiful Soup tag! This means it has a `text` member:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "print(first_link.text)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.\n", - "\n", - "💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "print(first_link['href'])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🥊 Challenge: Extract specific attributes\n", - "\n", - "Extract all `href` attributes for each `mainmenu` URL." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "# Scraping the Illinois General Assembly\n", - "\n", - "Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.\n", - "\n", - "Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).\n", - "\n", - "Specifically, our goal is to scrape information on each senator, including their name, district, and party." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Scrape and Soup the Webpage\n", - "\n", - "Let's scrape and parse the webpage, using the tools we learned in the previous section." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Make a GET request\n", - "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n", - "# Read the content of the server’s response\n", - "src = req.text\n", - "# Soup it\n", - "soup = BeautifulSoup(src, \"lxml\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Search for the Table Elements\n", - "\n", - "Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Get all table row elements\n", - "rows = soup.find_all(\"tr\")\n", - "len(rows)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Returns every ‘tr tr tr’ css selector in the page\n", - "rows = soup.select('tr tr tr')\n", - "\n", - "for row in rows[:5]:\n", - " print(row, '\\n')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "example_row = rows[2]\n", - "print(example_row.prettify())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.\n", - "\n", - "* We could identify the cells by their tag `td`.\n", - "* We could use the the class name `.detail`.\n", - "* We could combine both and use the selector `td.detail`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for cell in example_row.select('td'):\n", - " print(cell)\n", - "print()\n", - "\n", - "for cell in example_row.select('.detail'):\n", - " print(cell)\n", - "print()\n", - "\n", - "for cell in example_row.select('td.detail'):\n", - " print(cell)\n", - "print()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can confirm that these are all the same." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's use the selector `td.detail` to be as specific as possible." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Select only those 'td' tags with class 'detail' \n", - "detail_cells = example_row.select('td.detail')\n", - "detail_cells" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Keep only the text in each of those cells\n", - "row_data = [cell.text for cell in detail_cells]\n", - "\n", - "print(row_data)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(row_data[0]) # Name\n", - "print(row_data[3]) # District\n", - "print(row_data[4]) # Party" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Getting Rid of Junk Rows\n", - "\n", - "We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print('Row 0:\\n', rows[0], '\\n')\n", - "print('Row 1:\\n', rows[1], '\\n')\n", - "print('Last Row:\\n', rows[-1])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.\n", - "\n", - "As you can imagine, there a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Bad rows\n", - "print(len(rows[0]))\n", - "print(len(rows[1]))\n", - "\n", - "# Good rows\n", - "print(len(rows[2]))\n", - "print(len(rows[3]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Perhaps good rows have a length of 5. 
Let's check:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "good_rows = [row for row in rows if len(row) == 5]\n", - "\n", - "# Let's check some rows\n", - "print(good_rows[0], '\\n')\n", - "print(good_rows[-2], '\\n')\n", - "print(good_rows[-1])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We found a footer row in our list that we'd like to avoid. Let's try something else:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "rows[2].select('td.detail') " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Bad row\n", - "print(rows[-1].select('td.detail'), '\\n')\n", - "\n", - "# Good row\n", - "print(rows[5].select('td.detail'), '\\n')\n", - "\n", - "# How about this?\n", - "good_rows = [row for row in rows if row.select('td.detail')]\n", - "\n", - "print(\"Checking rows...\\n\")\n", - "print(good_rows[0], '\\n')\n", - "print(good_rows[-1])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Looks like we found something that worked!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Loop it All Together\n", - "\n", - "Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Define storage list\n", - "members = []\n", - "\n", - "# Get rid of junk rows\n", - "valid_rows = [row for row in rows if row.select('td.detail')]\n", - "\n", - "# Loop through all rows\n", - "for row in valid_rows:\n", - " # Select only those 'td' tags with class 'detail'\n", - " detail_cells = row.select('td.detail')\n", - " # Keep only the text in each of those cells\n", - " row_data = [cell.text for cell in detail_cells]\n", - " # Collect information\n", - " name = row_data[0]\n", - " district = int(row_data[3])\n", - " party = row_data[4]\n", - " # Store in a tuple\n", - " senator = (name, district, party)\n", - " # Append to list\n", - " members.append(senator)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Should be 61\n", - "len(members)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's take a look at what we have in `members`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(members[:5])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🥊 Challenge: Get `href` elements pointing to members' bills \n", - "\n", - "The code above retrieves information on: \n", - "\n", - "- the senator's name,\n", - "- their district number,\n", - "- and their party.\n", - "\n", - "We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. \n", - "\n", - "The format for the list of bills for a given senator is:\n", - "\n", - "`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`\n", - "\n", - "to get something like:\n", - "\n", - "`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`\n", - "\n", - "in which `MEMBER_ID=1911`. 
\n", - "\n", - "You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.\n", - "\n", - "Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.\n", - "\n", - "Tips: \n", - "\n", - "* To do this, you will want to get the appropriate anchor element (``) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.\n", - "* The anchor elements' HTML will look like `Bills`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeatifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the documentation for more details.\n", - "* There are a _lot_ of different ways to use BeautifulSoup to get things done. whatever you need to do to pull the `href` out is fine.\n", - "\n", - "The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Make a GET request\n", - "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n", - "# Read the content of the server’s response\n", - "src = req.text\n", - "# Soup it\n", - "soup = BeautifulSoup(src, \"lxml\")\n", - "# Create empty list to store our data\n", - "members = []\n", - "\n", - "# Returns every ‘tr tr tr’ css selector in the page\n", - "rows = soup.select('tr tr tr')\n", - "# Get rid of junk rows\n", - "rows = [row for row in rows if row.select('td.detail')]\n", - "\n", - "# Loop through all rows\n", - "for row in rows:\n", - " # Select only those 'td' tags with class 'detail'\n", - " detail_cells = row.select('td.detail') \n", - " # Keep only the text in each of those cells\n", - " row_data = [cell.text for cell in detail_cells]\n", - " # Collect information\n", - " name = row_data[0]\n", - " district = int(row_data[3])\n", - " party = row_data[4]\n", - "\n", - " # YOUR CODE HERE\n", - " full_path = ''\n", - "\n", - " # Store in a tuple\n", - " senator = (name, district, party, full_path)\n", - " # Append to list\n", - " members.append(senator)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Uncomment to test \n", - "# members[:5]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🥊 Challenge: Modularize Your Code\n", - "\n", - "Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n", - "def get_members(url):\n", - " return [___]\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Test your code\n", - "url = 'http://www.ilga.gov/senate/default.asp?GA=98'\n", - "senate_members = get_members(url)\n", - "len(senate_members)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🥊 Take-home Challenge: Writing a Scraper Function\n", - "\n", - "We want to scrape the webpages corresponding to bills sponsored by each bills.\n", - "\n", - "Write a function called `get_bills(url)` to parse a given bills URL. This will involve:\n", - "\n", - " - requesting the URL using the `requests` library\n", - " - using the features of the `BeautifulSoup` library to find all of the `` elements with the class `billlist`\n", - " - return a _list_ of tuples, each with:\n", - " - description (2nd column)\n", - " - chamber (S or H) (3rd column)\n", - " - the last action (4th column)\n", - " - the last action date (5th column)\n", - " \n", - "This function has been partially completed. Fill in the rest." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "def get_bills(url):\n", - " src = requests.get(url).text\n", - " soup = BeautifulSoup(src)\n", - " rows = soup.select('tr')\n", - " bills = []\n", - " for row in rows:\n", - " # YOUR CODE HERE\n", - " bill_id =\n", - " description =\n", - " chamber =\n", - " last_action =\n", - " last_action_date =\n", - " bill = (bill_id, description, chamber, last_action, last_action_date)\n", - " bills.append(bill)\n", - " return bills" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Uncomment to test your code\n", - "# test_url = senate_members[0][3]\n", - "# get_bills(test_url)[0:5]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Scrape All Bills\n", - "\n", - "Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.\n", - "\n", - "**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Uncomment to test your code\n", - "# bills_dict[52]" - ] - } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.13" - }, - "vscode": { - "interpreter": { - "hash": "b6f9fe9f4b7182690503d8ecc2bae97b0ee3ebf54e877167ae4d28c119a56988" - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pcknDGQPsbHD" + }, + "source": [ + "# Web Scraping with Beautiful Soup\n", + "\n", + "* * *\n", + "\n", + "### Icons used in this notebook\n", + "🔔 **Question**: A quick question to help you understand what's going on.
\n", + "🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!
\n", + "⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.
\n", + "💡 **Tip**: How to do something a bit more efficiently or effectively.
\n", + "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!
\n", + "\n", + "### Learning Objectives\n", + "1. [Reflection: To Scape Or Not To Scrape](#when)\n", + "2. [Extracting and Parsing HTML](#extract)\n", + "3. [Scraping the Illinois General Assembly](#scrape)" + ] + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "WYEoSgtHsgp8" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NCNJuUdXsbHE" + }, + "source": [ + "\n", + "\n", + "# To Scrape Or Not To Scrape\n", + "\n", + "When we'd like to access data from the web, we first have to make sure if the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**\n", + "\n", + "However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus two packages: Requests and Beautiful Soup.\n", + "\n", + "Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2xrchmSasbHF" + }, + "source": [ + "## Installation\n", + "\n", + "We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). 
Go ahead and install these packages, if you haven't already:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "GVkS4oD2sbHF", + "outputId": "66f6e8f7-e838-4e56-d024-52bb3dd6b8cd", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (2.32.4)\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests) (3.4.3)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests) (2.5.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests) (2025.8.3)\n" + ] + } + ], + "source": [ + "%pip install requests" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Installo requests" + ], + "metadata": { + "id": "ihGiPrM_suxV" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2_tIUc07sbHG" + }, + "outputs": [], + "source": [ + "%pip install beautifulsoup4" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I0ttueZMsbHG" + }, + "source": [ + "We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "p4AWMvbdsbHG" + }, + "outputs": [], + "source": [ + "%pip install lxml" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "s0wd-iHdsbHG" + }, + "outputs": [], + "source": [ + "# Import required libraries\n", + "from bs4 import BeautifulSoup\n", + "from datetime import datetime\n", + "import requests\n", + "import time" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Lx3Yey31sbHH" + }, + "source": [ + "\n", + "\n", + "# Extracting and Parsing HTML\n", + "\n", + "In order to succesfully scrape and analyse HTML, we'll be going through the following 4 steps:\n", + "1. Make a GET request\n", + "2. Parse the page with Beautiful Soup\n", + "3. Search for HTML elements\n", + "4. Get attributes and text of these elements" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PhQ9pdG3sbHH" + }, + "source": [ + "## Step 1: Make a GET Request to Obtain a Page's HTML\n", + "\n", + "We can use the Requests library to:\n", + "\n", + "1. Make a GET request to the page, and\n", + "2. Read in the webpage's HTML code.\n", + "\n", + "The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output." 
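💡 **Tip**: Before handing the HTML off to a parser, it's worth confirming that the request actually succeeded. A minimal sketch, using nothing beyond the standard Requests API and the same URL as below:

```python
import requests

req = requests.get('http://www.ilga.gov/senate/default.asp')
req.raise_for_status()                  # raises an HTTPError on a 4xx/5xx response
print(req.status_code)                  # e.g. 200 on success
print(req.headers.get('Content-Type'))  # a regular web page should report text/html
```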
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "0_esi0UGsbHH" + }, + "outputs": [], + "source": [ + "# Make a GET request\n", + "req = requests.get('http://www.ilga.gov/senate/default.asp')\n", + "# Read the content of the server’s response\n", + "src = req.text\n", + "# View some output\n", + "print(src[:1000])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6MbNKWDgsbHH" + }, + "source": [ + "## Step 2: Parse the Page with Beautiful Soup\n", + "\n", + "Now, we use the `BeautifulSoup` function to parse the reponse into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.\n", + "\n", + "If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6Ek1DyCbsbHH" + }, + "outputs": [], + "source": [ + "# Parse the response into an HTML tree\n", + "soup = BeautifulSoup(src, 'lxml')\n", + "# Take a look\n", + "print(soup.prettify()[:1000])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FCE4UraZsbHH" + }, + "source": [ + "The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YmWxNDg4sbHI" + }, + "source": [ + "## Step 3: Search for HTML Elements\n", + "\n", + "Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:\n", + "\n", + "1. HTML tags\n", + "2. HTML Attributes\n", + "3. CSS Selectors\n", + "\n", + "Let's search first for **HTML tags**.\n", + "\n", + "The function `find_all` searches the `soup` tree to find all the elements with an a particular HTML tag, and returns all of those elements.\n", + "\n", + "What does the example below do?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "at8SUb9vsbHI" + }, + "outputs": [], + "source": [ + "# Find all elements with a certain tag\n", + "a_tags = soup.find_all(\"a\")\n", + "print(a_tags[:10])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qg8tUuS2sbHI" + }, + "source": [ + "Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object.\n", + "\n", + "These two lines of code are equivalent:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "Me55JRw9sbHI" + }, + "outputs": [], + "source": [ + "a_tags = soup.find_all(\"a\")\n", + "a_tags_alt = soup(\"a\")\n", + "print(a_tags[0])\n", + "print(a_tags_alt[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qb0RaDzhsbHI" + }, + "source": [ + "How many links did we obtain?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "45Zk_D8lsbHI" + }, + "outputs": [], + "source": [ + "print(len(a_tags))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PZMqa4LjsbHI" + }, + "source": [ + "That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get more hits, many of which you might not want. 
Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.\n", + "\n", + "What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes?\n", + "\n", + "We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_=\"sidemenu\"`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "_1C45IOYsbHI" + }, + "outputs": [], + "source": [ + "# Get only the 'a' tags in 'sidemenu' class\n", + "side_menus = soup(\"a\", class_=\"sidemenu\")\n", + "side_menus[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4_kHFc4csbHI" + }, + "source": [ + "A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.\n", + "\n", + "In the example above, we can use `\"a.sidemenu\"` as a CSS selector, which returns all `a` tags with class `sidemenu`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "cT2bsZs_sbHI" + }, + "outputs": [], + "source": [ + "# Get elements with \"a.sidemenu\" CSS Selector.\n", + "selected = soup.select(\"a.sidemenu\")\n", + "selected[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z0p9WQhHsbHI" + }, + "source": [ + "## 🥊 Challenge: Find All\n", + "\n", + "Use BeautifulSoup to find all the `a` elements with class `mainmenu`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gd8saWzGsbHI" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3QGR4fa3sbHI" + }, + "source": [ + "## Step 4: Get Attributes and Text of Elements\n", + "\n", + "Once we identify elements, we want the access information in that element. Usually, this means two things:\n", + "\n", + "1. Text\n", + "2. Attributes\n", + "\n", + "Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "-5ii1cuysbHJ" + }, + "outputs": [], + "source": [ + "# Get all sidemenu links as a list\n", + "side_menu_links = soup.select(\"a.sidemenu\")\n", + "\n", + "# Examine the first link\n", + "first_link = side_menu_links[0]\n", + "print(first_link)\n", + "\n", + "# What class is this variable?\n", + "print('Class: ', type(first_link))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5y_QnZ_msbHJ" + }, + "source": [ + "It's a Beautiful Soup tag! This means it has a `text` member:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "Lpz1wvTasbHJ" + }, + "outputs": [], + "source": [ + "print(first_link.text)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1mZzwc8msbHJ" + }, + "source": [ + "Sometimes we want the value of certain attributes. 
This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.\n", + "\n", + "💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "YdjBhcK4sbHJ" + }, + "outputs": [], + "source": [ + "print(first_link['href'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_NzAkwhjsbHJ" + }, + "source": [ + "## 🥊 Challenge: Extract specific attributes\n", + "\n", + "Extract all `href` attributes for each `mainmenu` URL." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bI5uX6JpsbHJ" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_NYvTnO8sbHJ" + }, + "source": [ + "\n", + "\n", + "# Scraping the Illinois General Assembly\n", + "\n", + "Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.\n", + "\n", + "Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).\n", + "\n", + "Specifically, our goal is to scrape information on each senator, including their name, district, and party." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u-S2FUDAsbHJ" + }, + "source": [ + "## Scrape and Soup the Webpage\n", + "\n", + "Let's scrape and parse the webpage, using the tools we learned in the previous section." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "1SqOuhFAsbHJ" + }, + "outputs": [], + "source": [ + "# Make a GET request\n", + "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n", + "# Read the content of the server’s response\n", + "src = req.text\n", + "# Soup it\n", + "soup = BeautifulSoup(src, \"lxml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aNv4x5EzsbHJ" + }, + "source": [ + "## Search for the Table Elements\n", + "\n", + "Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BPhi0XrcsbHJ" + }, + "outputs": [], + "source": [ + "# Get all table row elements\n", + "rows = soup.find_all(\"tr\")\n", + "len(rows)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jBZOOV8jsbHO" + }, + "source": [ + "⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "utKa2QuKsbHO" + }, + "outputs": [], + "source": [ + "# Returns every ‘tr tr tr’ css selector in the page\n", + "rows = soup.select('tr tr tr')\n", + "\n", + "for row in rows[:5]:\n", + " print(row, '\\n')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tFN4RFLpsbHO" + }, + "source": [ + "It looks like we want everything after the first two rows. 
Let's work with a single row to start, and build our loop from there." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "C1LRohx8sbHP" + }, + "outputs": [], + "source": [ + "example_row = rows[2]\n", + "print(example_row.prettify())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hXN1lNrqsbHP" + }, + "source": [ + "Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.\n", + "\n", + "* We could identify the cells by their tag `td`.\n", + "* We could use the the class name `.detail`.\n", + "* We could combine both and use the selector `td.detail`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "V2FmLZrJsbHP" + }, + "outputs": [], + "source": [ + "for cell in example_row.select('td'):\n", + " print(cell)\n", + "print()\n", + "\n", + "for cell in example_row.select('.detail'):\n", + " print(cell)\n", + "print()\n", + "\n", + "for cell in example_row.select('td.detail'):\n", + " print(cell)\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5C81ilBFsbHP" + }, + "source": [ + "We can confirm that these are all the same." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "rOZQqe0MsbHP" + }, + "outputs": [], + "source": [ + "assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CoXWT6IksbHP" + }, + "source": [ + "Let's use the selector `td.detail` to be as specific as possible." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "44w-eYDZsbHP" + }, + "outputs": [], + "source": [ + "# Select only those 'td' tags with class 'detail'\n", + "detail_cells = example_row.select('td.detail')\n", + "detail_cells" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eab9GLhusbHP" + }, + "source": [ + "Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3tMXEvFSsbHP" + }, + "outputs": [], + "source": [ + "# Keep only the text in each of those cells\n", + "row_data = [cell.text for cell in detail_cells]\n", + "\n", + "print(row_data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "977EIW0csbHP" + }, + "source": [ + "Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ddAU76TZsbHP" + }, + "outputs": [], + "source": [ + "print(row_data[0]) # Name\n", + "print(row_data[3]) # District\n", + "print(row_data[4]) # Party" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E59_BMNKsbHP" + }, + "source": [ + "## Getting Rid of Junk Rows\n", + "\n", + "We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. 
Take a look at some examples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "T7m-E5kjsbHP" + }, + "outputs": [], + "source": [ + "print('Row 0:\\n', rows[0], '\\n')\n", + "print('Row 1:\\n', rows[1], '\\n')\n", + "print('Last Row:\\n', rows[-1])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hiRz5qMVsbHP" + }, + "source": [ + "When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.\n", + "\n", + "As you can imagine, there a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LKo9MyWBsbHP" + }, + "outputs": [], + "source": [ + "# Bad rows\n", + "print(len(rows[0]))\n", + "print(len(rows[1]))\n", + "\n", + "# Good rows\n", + "print(len(rows[2]))\n", + "print(len(rows[3]))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hg5ZKeJgsbHP" + }, + "source": [ + "Perhaps good rows have a length of 5. Let's check:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L2awrhEfsbHP" + }, + "outputs": [], + "source": [ + "good_rows = [row for row in rows if len(row) == 5]\n", + "\n", + "# Let's check some rows\n", + "print(good_rows[0], '\\n')\n", + "print(good_rows[-2], '\\n')\n", + "print(good_rows[-1])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UfFTjTwusbHQ" + }, + "source": [ + "We found a footer row in our list that we'd like to avoid. Let's try something else:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GTFEA5VdsbHQ" + }, + "outputs": [], + "source": [ + "rows[2].select('td.detail')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pnIFBOsJsbHQ" + }, + "outputs": [], + "source": [ + "# Bad row\n", + "print(rows[-1].select('td.detail'), '\\n')\n", + "\n", + "# Good row\n", + "print(rows[5].select('td.detail'), '\\n')\n", + "\n", + "# How about this?\n", + "good_rows = [row for row in rows if row.select('td.detail')]\n", + "\n", + "print(\"Checking rows...\\n\")\n", + "print(good_rows[0], '\\n')\n", + "print(good_rows[-1])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7hkNTjTZsbHQ" + }, + "source": [ + "Looks like we found something that worked!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q6P84OeQsbHQ" + }, + "source": [ + "## Loop it All Together\n", + "\n", + "Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop." 
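⚠️ **Warning**: The loop below trusts that every filtered row really has five `td.detail` cells and a numeric district. If the page layout ever shifts, `int(row_data[3])` will raise a `ValueError`. A small defensive sketch (the helper name `safe_int` is ours, not part of the site or the lesson):

```python
def safe_int(text):
    # Hypothetical helper: parse an int, returning None instead of raising
    try:
        return int(text.strip())
    except ValueError:
        return None
```

Swapping `int(row_data[3])` for `safe_int(row_data[3])`, and skipping rows where it returns `None`, keeps one malformed row from stopping the whole scrape.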
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "_p6yf187sbHQ" + }, + "outputs": [], + "source": [ + "# Define storage list\n", + "members = []\n", + "\n", + "# Get rid of junk rows\n", + "valid_rows = [row for row in rows if row.select('td.detail')]\n", + "\n", + "# Loop through all rows\n", + "for row in valid_rows:\n", + " # Select only those 'td' tags with class 'detail'\n", + " detail_cells = row.select('td.detail')\n", + " # Keep only the text in each of those cells\n", + " row_data = [cell.text for cell in detail_cells]\n", + " # Collect information\n", + " name = row_data[0]\n", + " district = int(row_data[3])\n", + " party = row_data[4]\n", + " # Store in a tuple\n", + " senator = (name, district, party)\n", + " # Append to list\n", + " members.append(senator)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "X4HSTY1bsbHQ" + }, + "outputs": [], + "source": [ + "# Should be 61\n", + "len(members)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qiNWYMJlsbHQ" + }, + "source": [ + "Let's take a look at what we have in `members`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OzNtF5Q8sbHQ" + }, + "outputs": [], + "source": [ + "print(members[:5])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JfDqvTq_sbHQ" + }, + "source": [ + "## 🥊 Challenge: Get `href` elements pointing to members' bills\n", + "\n", + "The code above retrieves information on: \n", + "\n", + "- the senator's name,\n", + "- their district number,\n", + "- and their party.\n", + "\n", + "We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format.\n", + "\n", + "The format for the list of bills for a given senator is:\n", + "\n", + "`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`\n", + "\n", + "to get something like:\n", + "\n", + "`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`\n", + "\n", + "in which `MEMBER_ID=1911`.\n", + "\n", + "You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.\n", + "\n", + "Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.\n", + "\n", + "Tips:\n", + "\n", + "* To do this, you will want to get the appropriate anchor element (``) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.\n", + "* The anchor elements' HTML will look like `Bills`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeatifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the documentation for more details.\n", + "* There are a _lot_ of different ways to use BeautifulSoup to get things done. whatever you need to do to pull the `href` out is fine.\n", + "\n", + "The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "QY9WS7V2sbHQ" + }, + "outputs": [], + "source": [ + "# Make a GET request\n", + "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n", + "# Read the content of the server’s response\n", + "src = req.text\n", + "# Soup it\n", + "soup = BeautifulSoup(src, \"lxml\")\n", + "# Create empty list to store our data\n", + "members = []\n", + "\n", + "# Returns every ‘tr tr tr’ css selector in the page\n", + "rows = soup.select('tr tr tr')\n", + "# Get rid of junk rows\n", + "rows = [row for row in rows if row.select('td.detail')]\n", + "\n", + "# Loop through all rows\n", + "for row in rows:\n", + " # Select only those 'td' tags with class 'detail'\n", + " detail_cells = row.select('td.detail')\n", + " # Keep only the text in each of those cells\n", + " row_data = [cell.text for cell in detail_cells]\n", + " # Collect information\n", + " name = row_data[0]\n", + " district = int(row_data[3])\n", + " party = row_data[4]\n", + "\n", + " # YOUR CODE HERE\n", + " full_path = ''\n", + "\n", + " # Store in a tuple\n", + " senator = (name, district, party, full_path)\n", + " # Append to list\n", + " members.append(senator)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "-fdG1ucrsbHQ" + }, + "outputs": [], + "source": [ + "# Uncomment to test\n", + "# members[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RWqf6IOssbHQ" + }, + "source": [ + "## 🥊 Challenge: Modularize Your Code\n", + "\n", + "Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "m3ZGQxd7sbHQ" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE\n", + "def get_members(url):\n", + " return [___]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [], + "id": "j_RQ-zpXsbHQ" + }, + "outputs": [], + "source": [ + "# Test your code\n", + "url = 'http://www.ilga.gov/senate/default.asp?GA=98'\n", + "senate_members = get_members(url)\n", + "len(senate_members)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rtYyJVg3sbHR" + }, + "source": [ + "## 🥊 Take-home Challenge: Writing a Scraper Function\n", + "\n", + "We want to scrape the webpages corresponding to bills sponsored by each bills.\n", + "\n", + "Write a function called `get_bills(url)` to parse a given bills URL. This will involve:\n", + "\n", + " - requesting the URL using the `requests` library\n", + " - using the features of the `BeautifulSoup` library to find all of the `` elements with the class `billlist`\n", + " - return a _list_ of tuples, each with:\n", + " - description (2nd column)\n", + " - chamber (S or H) (3rd column)\n", + " - the last action (4th column)\n", + " - the last action date (5th column)\n", + " \n", + "This function has been partially completed. Fill in the rest." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": [],
+ "id": "_79DFRiKsbHR"
+ },
+ "outputs": [],
+ "source": [
+ "def get_bills(url):\n",
+ " src = requests.get(url).text\n",
+ " soup = BeautifulSoup(src)\n",
+ " rows = soup.select('tr')\n",
+ " bills = []\n",
+ " for row in rows:\n",
+ " # YOUR CODE HERE\n",
+ " bill_id =\n",
+ " description =\n",
+ " chamber =\n",
+ " last_action =\n",
+ " last_action_date =\n",
+ " bill = (bill_id, description, chamber, last_action, last_action_date)\n",
+ " bills.append(bill)\n",
+ " return bills"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": [],
+ "id": "oe2QHteFsbHR"
+ },
+ "outputs": [],
+ "source": [
+ "# Uncomment to test your code\n",
+ "# test_url = senate_members[0][3]\n",
+ "# get_bills(test_url)[0:5]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YsOIClahsbHR"
+ },
+ "source": [
+ "### Scrape All Bills\n",
+ "\n",
+ "Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `senate_members` and calling `get_bills()` for each of their associated bill URLs.\n",
+ "\n",
+ "**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": [],
+ "id": "zIN9ZpKVsbHR"
+ },
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": [],
+ "id": "Y02z46MlsbHR"
+ },
+ "outputs": [],
+ "source": [
+ "# Uncomment to test your code\n",
+ "# bills_dict[52]"
+ ]
+ }
+ ],
+ "metadata": {
+ "anaconda-cloud": {},
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.13"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "b6f9fe9f4b7182690503d8ecc2bae97b0ee3ebf54e877167ae4d28c119a56988"
+ }
+ },
+ "colab": {
+ "provenance": [],
+ "include_colab_link": true
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file

From 07eaffb2fe40ee6254b3d8c262508d7d318b7eea Mon Sep 17 00:00:00 2001
From: dquinonez25
Date: Sat, 23 Aug 2025 11:46:23 -0500
Subject: [PATCH 02/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 lessons/02_web_scraping.ipynb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index ad68180..269348b 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -72,7 +72,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 1,
+ "execution_count": null,
 "metadata": {
 "id": "GVkS4oD2sbHF",
 "outputId": "66f6e8f7-e838-4e56-d024-52bb3dd6b8cd",
 "colab": {
 "base_uri": "https://localhost:8080/"
 }
@@ -100,7 +100,7 @@
 {
 "cell_type": "markdown",
 "source": [
- "Installo requests"
+ "Install requests."
], "metadata": { "id": "ihGiPrM_suxV" From 4e7b80711be3c5b7d72e6f001a9d88be7c643d10 Mon Sep 17 00:00:00 2001 From: dquinonez25 Date: Sat, 23 Aug 2025 19:13:12 -0500 Subject: [PATCH 03/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- lessons/02_web_scraping.ipynb | 262 +++++++++++++++++++++++++++++----- 1 file changed, 228 insertions(+), 34 deletions(-) diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb index 269348b..13f3308 100644 --- a/lessons/02_web_scraping.ipynb +++ b/lessons/02_web_scraping.ipynb @@ -72,10 +72,10 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": { "id": "GVkS4oD2sbHF", - "outputId": "66f6e8f7-e838-4e56-d024-52bb3dd6b8cd", + "outputId": "d95d5dcc-b04f-4048-d2be-1b20c6b446ad", "colab": { "base_uri": "https://localhost:8080/" } @@ -108,11 +108,25 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": { - "id": "2_tIUc07sbHG" + "id": "2_tIUc07sbHG", + "outputId": "cb0e9a0e-fdee-4cd1-c42d-24a384833326", + "colab": { + "base_uri": "https://localhost:8080/" + } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.4)\n", + "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.7)\n", + "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.14.1)\n" + ] + } + ], "source": [ "%pip install beautifulsoup4" ] @@ -128,18 +142,30 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": { - "id": "p4AWMvbdsbHG" + "id": "p4AWMvbdsbHG", + "outputId": "710c0ec9-ac3b-41d6-8627-68c7d3bb767d", + "colab": { + "base_uri": "https://localhost:8080/" + } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: lxml in /usr/local/lib/python3.12/dist-packages (5.4.0)\n" + ] + } + ], "source": [ "%pip install lxml" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": { "tags": [], "id": "s0wd-iHdsbHG" @@ -188,12 +214,42 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": { "tags": [], - "id": "0_esi0UGsbHH" + "id": "0_esi0UGsbHH", + "outputId": "388e94df-63c8-49b7-982f-7a7ba77a5dd7", + "colab": { + "base_uri": "https://localhost:8080/" + } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\r\n", + "\r\n", + "\r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \r\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " English\r\n", + " , \n", + " Afrikaans\r\n", + " , \n", + " Albanian\r\n", + " , \n", + " Arabic\r\n", + " , \n", + " Armenian\r\n", + " , \n", + " Azerbaijani\r\n", + " , \n", + " Basque\r\n", + " , \n", + " Bengali\r\n", + " , \n", + " Bosnian\r\n", + " , \n", + " Catalan\r\n", + " ]\n" + ] + } + ], "source": [ "# Find all elements with a certain tag\n", "a_tags = soup.find_all(\"a\")\n", @@ -286,12 +404,29 @@ }, { "cell_type": "code", - "execution_count": 
null, + "execution_count": 8, "metadata": { "tags": [], - "id": "Me55JRw9sbHI" + "id": "Me55JRw9sbHI", + "outputId": "280fb558-309b-4440-c4c7-5c13ae947ec7", + "colab": { + "base_uri": "https://localhost:8080/" + } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + " English\r\n", + " \n", + "\n", + " English\r\n", + " \n" + ] + } + ], "source": [ "a_tags = soup.find_all(\"a\")\n", "a_tags_alt = soup(\"a\")\n", @@ -310,11 +445,23 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": { - "id": "45Zk_D8lsbHI" + "id": "45Zk_D8lsbHI", + "outputId": "cf21cf2d-4cb0-4063-9ac7-34aa84b44d89", + "colab": { + "base_uri": "https://localhost:8080/" + } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "270\n" + ] + } + ], "source": [ "print(len(a_tags))" ] @@ -334,12 +481,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "metadata": { "tags": [], - "id": "_1C45IOYsbHI" + "id": "_1C45IOYsbHI", + "outputId": "86959042-b7f8-4f72-f70e-961c80e3ffa2", + "colab": { + "base_uri": "https://localhost:8080/" + } }, - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ], "source": [ "# Get only the 'a' tags in 'sidemenu' class\n", "side_menus = soup(\"a\", class_=\"sidemenu\")\n", @@ -359,12 +521,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "metadata": { "tags": [], - "id": "cT2bsZs_sbHI" + "id": "cT2bsZs_sbHI", + "outputId": "57e9de65-3fe6-493e-f811-8a01ed8b3d42", + "colab": { + "base_uri": "https://localhost:8080/" + } }, - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "execution_count": 11 + } + ], "source": [ "# Get elements with \"a.sidemenu\" CSS Selector.\n", "selected = soup.select(\"a.sidemenu\")\n", @@ -384,7 +561,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": { "id": "gd8saWzGsbHI" }, @@ -411,12 +588,29 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "metadata": { "tags": [], - "id": "-5ii1cuysbHJ" + "id": "-5ii1cuysbHJ", + "outputId": "f52dd18b-94ad-4614-bccf-9ae61ffc0702", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 211 + } }, - "outputs": [], + "outputs": [ + { + "output_type": "error", + "ename": "IndexError", + "evalue": "list index out of range", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/tmp/ipython-input-1553754977.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;31m# Examine the first link\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mfirst_link\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mside_menu_links\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfirst_link\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + 
"\u001b[0;31mIndexError\u001b[0m: list index out of range" + ] + } + ], "source": [ "# Get all sidemenu links as a list\n", "side_menu_links = soup.select(\"a.sidemenu\")\n", From ad53d897e9565779b84fa2ff0eb69d3e68ee4445 Mon Sep 17 00:00:00 2001 From: dquinonez25 Date: Sat, 23 Aug 2025 21:49:38 -0500 Subject: [PATCH 04/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- lessons/02_web_scraping.ipynb | 79 +++++++++++++++++++++++++---------- 1 file changed, 58 insertions(+), 21 deletions(-) diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb index 13f3308..65ec79e 100644 --- a/lessons/02_web_scraping.ipynb +++ b/lessons/02_web_scraping.ipynb @@ -100,7 +100,7 @@ { "cell_type": "markdown", "source": [ - "Installo requests." + "***Install requests permite hacer peticiones directamente desde el código.***" ], "metadata": { "id": "ihGiPrM_suxV" @@ -111,10 +111,10 @@ "execution_count": 2, "metadata": { "id": "2_tIUc07sbHG", - "outputId": "cb0e9a0e-fdee-4cd1-c42d-24a384833326", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "cb0e9a0e-fdee-4cd1-c42d-24a384833326" }, "outputs": [ { @@ -131,6 +131,15 @@ "%pip install beautifulsoup4" ] }, + { + "cell_type": "markdown", + "source": [ + "***Install beatufulsoup4 permite extraer información de páginas web usando Python.***" + ], + "metadata": { + "id": "L6_vVIk69DDa" + } + }, { "cell_type": "markdown", "metadata": { @@ -145,10 +154,10 @@ "execution_count": 3, "metadata": { "id": "p4AWMvbdsbHG", - "outputId": "710c0ec9-ac3b-41d6-8627-68c7d3bb767d", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "710c0ec9-ac3b-41d6-8627-68c7d3bb767d" }, "outputs": [ { @@ -163,6 +172,16 @@ "%pip install lxml" ] }, + { + "cell_type": "markdown", + "source": [ + "\n", + "***Install lxml sirve para que puedas procesar y analizar documentos HTML y XML de forma rápida y eficiente en Python.***" + ], + "metadata": { + "id": "yj7Ijz-a-SCY" + } + }, { "cell_type": "code", "execution_count": 4, @@ -179,6 +198,24 @@ "import time" ] }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "qRK7gt4A-Ai4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "***Install lxml sirve para que puedas procesar y analizar documentos HTML y XML de forma rápida y eficiente en Python.***" + ], + "metadata": { + "id": "1N8YdG__99vj" + } + }, { "cell_type": "markdown", "metadata": { @@ -218,10 +255,10 @@ "metadata": { "tags": [], "id": "0_esi0UGsbHH", - "outputId": "388e94df-63c8-49b7-982f-7a7ba77a5dd7", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "388e94df-63c8-49b7-982f-7a7ba77a5dd7" }, "outputs": [ { @@ -277,10 +314,10 @@ "execution_count": 6, "metadata": { "id": "6Ek1DyCbsbHH", - "outputId": "290b5464-7cab-4c8e-a3b3-1d25378c59bb", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "290b5464-7cab-4c8e-a3b3-1d25378c59bb" }, "outputs": [ { @@ -351,10 +388,10 @@ "execution_count": 7, "metadata": { "id": "at8SUb9vsbHI", - "outputId": "9f629416-1ab4-4ee7-806f-8cea11579869", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "9f629416-1ab4-4ee7-806f-8cea11579869" }, "outputs": [ { @@ -408,10 +445,10 @@ "metadata": { "tags": [], "id": "Me55JRw9sbHI", - "outputId": "280fb558-309b-4440-c4c7-5c13ae947ec7", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "280fb558-309b-4440-c4c7-5c13ae947ec7" }, 
"outputs": [ { @@ -448,10 +485,10 @@ "execution_count": 9, "metadata": { "id": "45Zk_D8lsbHI", - "outputId": "cf21cf2d-4cb0-4063-9ac7-34aa84b44d89", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "cf21cf2d-4cb0-4063-9ac7-34aa84b44d89" }, "outputs": [ { @@ -485,10 +522,10 @@ "metadata": { "tags": [], "id": "_1C45IOYsbHI", - "outputId": "86959042-b7f8-4f72-f70e-961c80e3ffa2", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "86959042-b7f8-4f72-f70e-961c80e3ffa2" }, "outputs": [ { @@ -525,10 +562,10 @@ "metadata": { "tags": [], "id": "cT2bsZs_sbHI", - "outputId": "57e9de65-3fe6-493e-f811-8a01ed8b3d42", "colab": { "base_uri": "https://localhost:8080/" - } + }, + "outputId": "57e9de65-3fe6-493e-f811-8a01ed8b3d42" }, "outputs": [ { @@ -592,11 +629,11 @@ "metadata": { "tags": [], "id": "-5ii1cuysbHJ", - "outputId": "f52dd18b-94ad-4614-bccf-9ae61ffc0702", "colab": { "base_uri": "https://localhost:8080/", "height": 211 - } + }, + "outputId": "f52dd18b-94ad-4614-bccf-9ae61ffc0702" }, "outputs": [ { From b580478a067c46a361dc51104b9605620593d761 Mon Sep 17 00:00:00 2001 From: dquinonez25 Date: Sat, 23 Aug 2025 22:40:18 -0500 Subject: [PATCH 05/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- lessons/02_web_scraping.ipynb | 88 ++++++++++++++++++++++++----------- 1 file changed, 60 insertions(+), 28 deletions(-) diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb index 65ec79e..f162c01 100644 --- a/lessons/02_web_scraping.ipynb +++ b/lessons/02_web_scraping.ipynb @@ -39,7 +39,7 @@ "metadata": { "id": "WYEoSgtHsgp8" }, - "execution_count": null, + "execution_count": 1, "outputs": [] }, { @@ -72,10 +72,10 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 15, "metadata": { "id": "GVkS4oD2sbHF", - "outputId": "d95d5dcc-b04f-4048-d2be-1b20c6b446ad", + "outputId": "f610d796-12a2-419b-d19c-9687f1fdb93d", "colab": { "base_uri": "https://localhost:8080/" } @@ -108,13 +108,22 @@ }, { "cell_type": "code", + "source": [], + "metadata": { + "id": "4Otf-8ea_5XH" + }, "execution_count": 2, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 16, "metadata": { "id": "2_tIUc07sbHG", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "cb0e9a0e-fdee-4cd1-c42d-24a384833326" + "outputId": "1deaf362-5741-443a-eba6-bedb4e60048c" }, "outputs": [ { @@ -151,13 +160,13 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 17, "metadata": { "id": "p4AWMvbdsbHG", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "710c0ec9-ac3b-41d6-8627-68c7d3bb767d" + "outputId": "56fa69c0-ead4-4a6b-eb33-3803b3fa9e3a" }, "outputs": [ { @@ -184,7 +193,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 19, "metadata": { "tags": [], "id": "s0wd-iHdsbHG" @@ -204,13 +213,16 @@ "metadata": { "id": "qRK7gt4A-Ai4" }, - "execution_count": null, + "execution_count": 19, "outputs": [] }, { "cell_type": "markdown", "source": [ - "***Install lxml sirve para que puedas procesar y analizar documentos HTML y XML de forma rápida y eficiente en Python.***" + "***Import required libraries permite importar las librerías necesarias para que un programa funcione correctamente.***\n", + "\n", + "> Agregar bloque entrecomillado\n", + "\n" ], "metadata": { "id": "1N8YdG__99vj" @@ -251,14 +263,14 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 20, "metadata": { "tags": [], 
"id": "0_esi0UGsbHH", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "388e94df-63c8-49b7-982f-7a7ba77a5dd7" + "outputId": "1dd267bc-9a00-4e82-998d-2cff82b3524b" }, "outputs": [ { @@ -296,6 +308,17 @@ "print(src[:1000])" ] }, + { + "cell_type": "markdown", + "source": [ + "***Se utiliza la librería requests para conectarse a la página del Senado de Illinois y se recupera el contenido de esa página web.***\n", + "src = req.text ***sirve para guardar el HTML completo de la respuesta en la variable src.***\n", + "print(src[:1000]) ***sirve para imprimir los primeros 1000 caracteres del HTML.***\n" + ], + "metadata": { + "id": "_84r_J7vDIoz" + } + }, { "cell_type": "markdown", "metadata": { @@ -311,13 +334,13 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": { "id": "6Ek1DyCbsbHH", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "290b5464-7cab-4c8e-a3b3-1d25378c59bb" + "outputId": "1e9976f8-e99d-4e7d-cb6d-118ac43a1afa" }, "outputs": [ { @@ -353,6 +376,15 @@ "print(soup.prettify()[:1000])" ] }, + { + "cell_type": "markdown", + "source": [ + "***Sirve para convertir el contenido HTML de una página web en una estructura que Python pueda analizar.*** soup = BeautifulSoup(src, 'lxml'.***Toma el HTML crudo (src) que obtuviste con requests.get() y lo convierte en un árbol de elementos HTML. Usa el parser lxml, que es rápido y robusto para analizar HTML y XML. La variable soup ahora contiene una versión estructurada del HTML que puedes recorrer y manipular fácilmente.*** print(soup.prettify()[:1000]).***Devuelve el HTML lo que permite enbellecer y facilita leer la estructura del documento y*** [:1000]) ***muestra solo los primeros 1000 caracteres para no saturar la salida.***" + ], + "metadata": { + "id": "LuNmKppLGD6p" + } + }, { "cell_type": "markdown", "metadata": { @@ -385,13 +417,13 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 8, "metadata": { "id": "at8SUb9vsbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "9f629416-1ab4-4ee7-806f-8cea11579869" + "outputId": "65649766-49c3-41dc-9755-84d78756b710" }, "outputs": [ { @@ -441,14 +473,14 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 9, "metadata": { "tags": [], "id": "Me55JRw9sbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "280fb558-309b-4440-c4c7-5c13ae947ec7" + "outputId": "9d7341df-3f17-466f-de85-e288ac0a6865" }, "outputs": [ { @@ -482,13 +514,13 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 10, "metadata": { "id": "45Zk_D8lsbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "cf21cf2d-4cb0-4063-9ac7-34aa84b44d89" + "outputId": "cbfbf663-02d9-48eb-80cc-b400271dac5c" }, "outputs": [ { @@ -518,14 +550,14 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 11, "metadata": { "tags": [], "id": "_1C45IOYsbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "86959042-b7f8-4f72-f70e-961c80e3ffa2" + "outputId": "4bb7c581-8d7c-44fb-a170-1eb8b0f75c6e" }, "outputs": [ { @@ -536,7 +568,7 @@ ] }, "metadata": {}, - "execution_count": 10 + "execution_count": 11 } ], "source": [ @@ -558,14 +590,14 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 12, "metadata": { "tags": [], "id": "cT2bsZs_sbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "57e9de65-3fe6-493e-f811-8a01ed8b3d42" + "outputId": "8fbe01c6-3435-48a2-a40a-bf030d9421ba" }, "outputs": [ { @@ 
-576,7 +608,7 @@ ] }, "metadata": {}, - "execution_count": 11 + "execution_count": 12 } ], "source": [ @@ -598,7 +630,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 13, "metadata": { "id": "gd8saWzGsbHI" }, @@ -625,7 +657,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 14, "metadata": { "tags": [], "id": "-5ii1cuysbHJ", @@ -633,7 +665,7 @@ "base_uri": "https://localhost:8080/", "height": 211 }, - "outputId": "f52dd18b-94ad-4614-bccf-9ae61ffc0702" + "outputId": "2d34d20b-31db-47a6-83c5-96a2dc319f27" }, "outputs": [ { From 0a8759ae10639ecf8bd49a9c776294ac55408601 Mon Sep 17 00:00:00 2001 From: dquinonez25 Date: Sun, 24 Aug 2025 01:31:27 -0500 Subject: [PATCH 06/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- lessons/02_web_scraping.ipynb | 90 ++++++++++++++++++++++++----------- 1 file changed, 61 insertions(+), 29 deletions(-) diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb index f162c01..480efcc 100644 --- a/lessons/02_web_scraping.ipynb +++ b/lessons/02_web_scraping.ipynb @@ -39,7 +39,7 @@ "metadata": { "id": "WYEoSgtHsgp8" }, - "execution_count": 1, + "execution_count": null, "outputs": [] }, { @@ -72,7 +72,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "metadata": { "id": "GVkS4oD2sbHF", "outputId": "f610d796-12a2-419b-d19c-9687f1fdb93d", @@ -112,12 +112,12 @@ "metadata": { "id": "4Otf-8ea_5XH" }, - "execution_count": 2, + "execution_count": null, "outputs": [] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": { "id": "2_tIUc07sbHG", "colab": { @@ -160,7 +160,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "metadata": { "id": "p4AWMvbdsbHG", "colab": { @@ -193,7 +193,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 2, "metadata": { "tags": [], "id": "s0wd-iHdsbHG" @@ -213,7 +213,7 @@ "metadata": { "id": "qRK7gt4A-Ai4" }, - "execution_count": 19, + "execution_count": null, "outputs": [] }, { @@ -263,14 +263,14 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 3, "metadata": { "tags": [], "id": "0_esi0UGsbHH", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "1dd267bc-9a00-4e82-998d-2cff82b3524b" + "outputId": "a6a8415a-6b5f-4664-a5cb-49e7182137a0" }, "outputs": [ { @@ -311,9 +311,7 @@ { "cell_type": "markdown", "source": [ - "***Se utiliza la librería requests para conectarse a la página del Senado de Illinois y se recupera el contenido de esa página web.***\n", - "src = req.text ***sirve para guardar el HTML completo de la respuesta en la variable src.***\n", - "print(src[:1000]) ***sirve para imprimir los primeros 1000 caracteres del HTML.***\n" + "Sirve para conectarse a la página del Senado de Illinois y se recupera el contenido de esa página web. 
Además se guarda el HTML completo de la respuesta en la variable src y se imprime los primeros 1000 caracteres del HTML.\n" ], "metadata": { "id": "_84r_J7vDIoz" @@ -334,13 +332,13 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 4, "metadata": { "id": "6Ek1DyCbsbHH", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "1e9976f8-e99d-4e7d-cb6d-118ac43a1afa" + "outputId": "a42163a3-39e0-4702-f63a-1d65da6a7a10" }, "outputs": [ { @@ -379,7 +377,7 @@ { "cell_type": "markdown", "source": [ - "***Sirve para convertir el contenido HTML de una página web en una estructura que Python pueda analizar.*** soup = BeautifulSoup(src, 'lxml'.***Toma el HTML crudo (src) que obtuviste con requests.get() y lo convierte en un árbol de elementos HTML. Usa el parser lxml, que es rápido y robusto para analizar HTML y XML. La variable soup ahora contiene una versión estructurada del HTML que puedes recorrer y manipular fácilmente.*** print(soup.prettify()[:1000]).***Devuelve el HTML lo que permite enbellecer y facilita leer la estructura del documento y*** [:1000]) ***muestra solo los primeros 1000 caracteres para no saturar la salida.***" + "Se convierte el contenido HTML de una página web en una estructura que Python, se toma el HTML que se obtuvo con requests y se lo convierte en un árbol de elementos HTML. Se usa parser lxml, que es rápido y robusto para analizar HTML y XML. Devuelve el HTML lo que permite embellecer y facilita leer la estructura del documento y muestra solo los primeros 1000 caracteres para no saturar la salida." ], "metadata": { "id": "LuNmKppLGD6p" @@ -417,13 +415,13 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 5, "metadata": { "id": "at8SUb9vsbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "65649766-49c3-41dc-9755-84d78756b710" + "outputId": "2f1460b8-6352-453a-f9e7-84285f43adcf" }, "outputs": [ { @@ -460,6 +458,15 @@ "print(a_tags[:10])" ] }, + { + "cell_type": "markdown", + "source": [ + "***Busca todas las etiquetas en el documento HTML, toma los primeros 10 elementos encontrados y muestra esos 10 elementos en la consola.***" + ], + "metadata": { + "id": "tbH6B05TlQ-6" + } + }, { "cell_type": "markdown", "metadata": { @@ -473,14 +480,14 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 7, "metadata": { "tags": [], "id": "Me55JRw9sbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "9d7341df-3f17-466f-de85-e288ac0a6865" + "outputId": "5264ba50-a74f-417a-e995-ef8c7ab0ef5a" }, "outputs": [ { @@ -514,7 +521,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": { "id": "45Zk_D8lsbHI", "colab": { @@ -550,14 +557,14 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 8, "metadata": { "tags": [], "id": "_1C45IOYsbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "4bb7c581-8d7c-44fb-a170-1eb8b0f75c6e" + "outputId": "2c3a0945-e3f3-4fe8-e0b9-6040a582df82" }, "outputs": [ { @@ -568,7 +575,7 @@ ] }, "metadata": {}, - "execution_count": 11 + "execution_count": 8 } ], "source": [ @@ -577,6 +584,15 @@ "side_menus[:5]" ] }, + { + "cell_type": "markdown", + "source": [ + "Sirve para analizar un documento HTML, y su propósito es extraer enlaces específicos, busca todas las etiquetas que tengan la clase sidemenu y toma los primeros 5 elementos encontrados." 
+ ],
+ "metadata": {
+ "id": "FdlJ0c5uq6Rz"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -590,14 +606,14 @@
 },
 {
 "cell_type": "code",
- "execution_count": 12,
+ "execution_count": 9,
 "metadata": {
 "tags": [],
 "id": "cT2bsZs_sbHI",
 "colab": {
 "base_uri": "https://localhost:8080/"
 },
- "outputId": "8fbe01c6-3435-48a2-a40a-bf030d9421ba"
+ "outputId": "51f29555-050e-4fdf-82de-37b33801d6ae"
 },
 "outputs": [
 {
@@ -608,7 +624,7 @@
 ]
 },
 "metadata": {},
- "execution_count": 12
+ "execution_count": 9
 }
 ],
 "source": [
@@ -617,6 +633,15 @@
 "selected[:5]"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Se utiliza un selector CSS para encontrar todas las etiquetas que tengan la clase sidemenu y selecciona los primeros 5 elementos encontrados."
+ ],
+ "metadata": {
+ "id": "0urELMRuu57U"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -630,7 +655,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 13,
+ "execution_count": 11,
 "metadata": {
 "id": "gd8saWzGsbHI"
 },
@@ -639,6 +664,13 @@
 "# YOUR CODE HERE\n"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {
+ "id": "LnO5snVswzX0"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -657,7 +689,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 14,
+ "execution_count": 12,
 "metadata": {
 "tags": [],
 "id": "-5ii1cuysbHJ",
@@ -665,7 +697,7 @@
 "base_uri": "https://localhost:8080/",
 "height": 211
 },
- "outputId": "2d34d20b-31db-47a6-83c5-96a2dc319f27"
+ "outputId": "f63d27c9-85ad-4244-bb40-759a72f47f54"
 },
 "outputs": [
 {

From d449b39e71ebb567e326aef64e6225ea9bbe69d6 Mon Sep 17 00:00:00 2001
From: dquinonez25
Date: Sun, 24 Aug 2025 14:28:22 -0500
Subject: [PATCH 07/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 lessons/02_web_scraping.ipynb | 236 ++++++++++++++++++++++++++--------
 1 file changed, 184 insertions(+), 52 deletions(-)

diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index 480efcc..9c65e3a 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -193,7 +193,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 2,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "s0wd-iHdsbHG"
 },
@@ -263,7 +263,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 3,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "0_esi0UGsbHH",
@@ -311,7 +311,7 @@
 {
 "cell_type": "markdown",
 "source": [
- "Sirve para conectarse a la página del Senado de Illinois y se recupera el contenido de esa página web. Además se guarda el HTML completo de la respuesta en la variable src y se imprime los primeros 1000 caracteres del HTML.\n"
+ "***This connects to the Illinois Senate page and retrieves the content of that web page. The full HTML of the response is stored in the variable src, and the first 1000 characters of the HTML are printed.***\n"
 ],
 "metadata": {
 "id": "_84r_J7vDIoz"
@@ -332,7 +332,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 4,
+ "execution_count": null,
 "metadata": {
 "id": "6Ek1DyCbsbHH",
 "colab": {
@@ -377,7 +377,7 @@
 {
 "cell_type": "markdown",
 "source": [
- "Se convierte el contenido HTML de una página web en una estructura que Python, se toma el HTML que se obtuvo con requests y se lo convierte en un árbol de elementos HTML. Se usa parser lxml, que es rápido y robusto para analizar HTML y XML. Devuelve el HTML lo que permite embellecer y facilita leer la estructura del documento y muestra solo los primeros 1000 caracteres para no saturar la salida."
+ "***This converts the HTML content of a web page into a structure Python can analyze: the HTML obtained with requests is turned into a tree of HTML elements, using the lxml parser, which is fast and robust for HTML and XML. prettify() returns the HTML in a tidied form that makes the document structure easier to read, and only the first 1000 characters are shown to avoid flooding the output.***"
 ],
 "metadata": {
 "id": "LuNmKppLGD6p"
@@ -415,7 +415,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 5,
+ "execution_count": null,
 "metadata": {
 "id": "at8SUb9vsbHI",
 "colab": {
@@ -461,7 +461,7 @@
 {
 "cell_type": "markdown",
 "source": [
- "***Busca todas las etiquetas en el documento HTML, toma los primeros 10 elementos encontrados y muestra esos 10 elementos en la consola.***"
+ "***This finds every < a > tag in the HTML document, takes the first 10 elements found, and prints those 10 elements to the console.***"
 ],
 "metadata": {
 "id": "tbH6B05TlQ-6"
@@ -480,7 +480,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 7,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "Me55JRw9sbHI",
@@ -557,7 +557,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 8,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "_1C45IOYsbHI",
@@ -587,7 +587,7 @@
 {
 "cell_type": "markdown",
 "source": [
- "Sirve para analizar un documento HTML, y su propósito es extraer enlaces específicos, busca todas las etiquetas que tengan la clase sidemenu y toma los primeros 5 elementos encontrados."
+ "***This parses an HTML document in order to extract specific links: it finds every < a > tag with the class sidemenu and takes the first 5 elements found.***"
 ],
 "metadata": {
 "id": "FdlJ0c5uq6Rz"
@@ -606,7 +606,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 9,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "cT2bsZs_sbHI",
@@ -636,7 +636,7 @@
 {
 "cell_type": "markdown",
 "source": [
- "Se utiliza un selector CSS para encontrar todas las etiquetas que tengan la clase sidemenu y selecciona los primeros 5 elementos encontrados."
+ "***This uses a CSS selector to find every < a > tag with the class sidemenu and selects the first 5 elements found.***"
 ],
 "metadata": {
 "id": "0urELMRuu57U"
@@ -655,7 +655,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 11,
+ "execution_count": null,
 "metadata": {
 "id": "gd8saWzGsbHI"
 },
@@ -689,41 +689,62 @@
 },
 {
 "cell_type": "code",
- "execution_count": 12,
+ "execution_count": 3,
 "metadata": {
 "tags": [],
 "id": "-5ii1cuysbHJ",
 "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 211
+ "base_uri": "https://localhost:8080/"
 },
- "outputId": "f63d27c9-85ad-4244-bb40-759a72f47f54"
+ "outputId": "082f089c-2ce6-478b-d442-4c454a9e0bab"
 },
 "outputs": [
 {
- "output_type": "error",
- "ename": "IndexError",
- "evalue": "list index out of range",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m/tmp/ipython-input-1553754977.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;31m# Examine the first link\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mfirst_link\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mside_menu_links\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfirst_link\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mIndexError\u001b[0m: list index out of range"
- ]
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "\n",
+ " English\r\n",
+ " \n",
+ "Class: \n"
+ ]
 }
 ],
 "source": [
 "# Get all sidemenu links as a list\n",
- "side_menu_links = soup.select(\"a.sidemenu\")\n",
+ "# Make a GET request\n",
+ "import requests\n",
+ "from bs4 import BeautifulSoup\n",
 "\n",
+ "req = requests.get('http://www.ilga.gov/senate/default.asp')\n",
+ "# Read the content of the server’s response\n",
+ "src = req.text\n",
+ "# Parse the response into an HTML tree\n",
+ "soup = BeautifulSoup(src, 'lxml')\n",
+ "\n",
+ "# Select all 'a' tags instead of those with class 'sidemenu'\n",
+ "all_links = soup.select(\"a\")\n",
+ "\n",
- "# Examine the first link\n",
- "first_link = side_menu_links[0]\n",
- "print(first_link)\n",
- "\n",
- "# What class is this variable?\n",
- "print('Class: ', type(first_link))"
+ "# Examine the first link, checking if the list is not empty\n",
+ "if all_links:\n",
+ " first_link = all_links[0]\n",
+ " print(first_link)\n",
+ "\n",
+ " # What class is this variable?\n",
+ " print('Class: ', type(first_link))\n",
+ "else:\n",
+ " print(\"No links found on the page.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This searches the document loaded into the soup variable for < a > elements and returns a list of all of them. It takes the first element of that list and prints the full HTML of that first link on screen.***"
+ ],
+ "metadata": {
+ "id": "0vJ2TCmoSFGl"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -735,16 +756,43 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 6,
 "metadata": {
 "tags": [],
- "id": "Lpz1wvTasbHJ"
+ "id": "Lpz1wvTasbHJ",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "d52ef0b7-0f95-41c8-e983-7c1b065939b3"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "\n",
+ " English\r\n",
+ " \n"
+ ]
+ }
+ ],
 "source": [
- "print(first_link.text)"
+ "# Check if first_link is defined before printing its text\n",
+ "if 'first_link' in locals():\n",
+ " print(first_link.text)\n",
+ "else:\n",
+ " print(\"first_link is not defined.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***locals() returns a dictionary of all local variables defined at that moment, and we check whether first_link is among them. If it is defined, the text inside the link is printed; if not, a notification message is printed.***"
+ ],
+ "metadata": {
+ "id": "Jik2G9RXYKsS"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -758,16 +806,41 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 8,
 "metadata": {
 "tags": [],
- "id": "YdjBhcK4sbHJ"
+ "id": "YdjBhcK4sbHJ",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "f1537250-b1e9-4052-c08a-714f78869064"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "#\n"
+ ]
+ }
+ ],
 "source": [
- "print(first_link['href'])"
+ "# Check if first_link is defined before accessing its attributes\n",
+ "if 'first_link' in locals():\n",
+ " print(first_link['href'])\n",
+ "else:\n",
+ " print(\"first_link is not defined.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***Again, the dictionary of current local variables is checked to see whether first_link exists. If it does, the value of the href attribute (the link's destination) is printed; if not, a message is printed saying it is not defined.***"
+ ],
+ "metadata": {
+ "id": "zpyNALOJZprN"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -781,7 +854,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 9,
 "metadata": {
 "id": "bI5uX6JpsbHJ"
 },
@@ -820,7 +893,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 10,
 "metadata": {
 "tags": [],
 "id": "1SqOuhFAsbHJ"
@@ -835,6 +908,15 @@
 "soup = BeautifulSoup(src, \"lxml\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***An HTTP GET request is sent to the specified URL and the page content is retrieved. The HTML content of the response, the page's source code as plain text, is stored in the variable src. The HTML text is then converted into an object that can be traversed and interpreted.***"
+ ],
+ "metadata": {
+ "id": "Z87FWjzsc7dY"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -848,17 +930,41 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 15,
 "metadata": {
- "id": "BPhi0XrcsbHJ"
+ "id": "BPhi0XrcsbHJ",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "d2508bcf-41da-4ca0-c5b3-53f3236777ac"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 15
+ }
+ ],
 "source": [
 "# Get all table row elements\n",
 "rows = soup.find_all(\"tr\")\n",
 "len(rows)"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This finds every*** *tr* ***element, that is, every table row, and returns the total number of rows found.***"
+ ],
+ "metadata": {
+ "id": "efDFTUz2f5Bh"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -870,19 +976,29 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 16,
 "metadata": {
 "id": "utKa2QuKsbHO"
 },
 "outputs": [],
 "source": [
- "# Returns every ‘tr tr tr’ css selector in the page\n",
- "rows = soup.select('tr tr tr')\n",
+ "# Filter rows to find the relevant ones.\n",
+ "# Based on the HTML structure, the relevant rows seem to contain 'td' elements with the class 'detail'.\n",
+ "rows = [row for row in rows if row.select('td.detail')]\n",
 "\n",
 "for row in rows[:5]:\n",
 " print(row, '\\n')"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This looks for*** *tr* ***elements nested three levels deep inside other rows; the first five elements found are printed, with a blank line between each one for readability.***"
+ ],
+ "metadata": {
+ "id": "r0HKR_3ihjjQ"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -894,14 +1010,30 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 17,
 "metadata": {
- "id": "C1LRohx8sbHP"
+ "id": "C1LRohx8sbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "9fbc5aff-f3cd-4f7f-a899-8ba7d28d1da4"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "No relevant rows found on the page.\n"
+ ]
+ }
+ ],
 "source": [
- "example_row = rows[2]\n",
- "print(example_row.prettify())"
+ "# Check if the rows list is not empty before accessing elements\n",
+ "if rows:\n",
+ " example_row = rows[0] # Use the first valid row as an example\n",
+ " print(example_row.prettify())\n",
+ "else:\n",
+ " print(\"No relevant rows found on the page.\")"
 ]
 },
 {

From bc0e3d30dcf2a94b183364e2109206aa5c053328 Mon Sep 17 00:00:00 2001
From: dquinonez25
Date: Sun, 24 Aug 2025 16:00:21 -0500
Subject: [PATCH 08/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 lessons/02_web_scraping.ipynb | 85 ++++++++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 25 deletions(-)

diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index 9c65e3a..cd9f222 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -689,7 +689,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 3,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "-5ii1cuysbHJ",
@@ -756,7 +756,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 6,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "Lpz1wvTasbHJ",
@@ -806,7
+806,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 8,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "YdjBhcK4sbHJ",
@@ -854,7 +854,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 9,
+ "execution_count": null,
 "metadata": {
 "id": "bI5uX6JpsbHJ"
 },
@@ -893,7 +893,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 10,
+ "execution_count": null,
 "metadata": {
 "tags": [],
 "id": "1SqOuhFAsbHJ"
@@ -930,7 +930,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 15,
+ "execution_count": null,
 "metadata": {
 "id": "BPhi0XrcsbHJ",
 "colab": {
@@ -976,7 +976,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 16,
+ "execution_count": 18,
 "metadata": {
 "id": "utKa2QuKsbHO"
 },
@@ -993,10 +993,10 @@
 {
 "cell_type": "markdown",
 "source": [
- "***This looks for*** *tr* ***elements nested three levels deep inside other rows; the first five elements found are printed, with a blank line between each one for readability.***"
+ "***This iterates over each*** *row*, ***checks whether it contains at least one*** *td* ***element with class*** *detail*, ***keeps the rows that do, prints the first five filtered elements, and adds a blank line between each one for readability.***"
 ],
 "metadata": {
- "id": "r0HKR_3ihjjQ"
+ "id": "eKpvx5Qelchs"
 }
 },
 {
@@ -1010,13 +1010,13 @@
 },
 {
 "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 20,
 "metadata": {
 "id": "C1LRohx8sbHP",
 "colab": {
 "base_uri": "https://localhost:8080/"
 },
- "outputId": "9fbc5aff-f3cd-4f7f-a899-8ba7d28d1da4"
+ "outputId": "c07e169c-c068-4bf7-82b3-c409791aea85"
 },
 "outputs": [
 {
@@ -1036,6 +1036,15 @@
 " print(\"No relevant rows found on the page.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This checks whether the*** *rows* ***list is non-empty; if there are elements, it takes the first one and prints the content of that row. Otherwise, if the list is empty, it shows a message saying that no relevant rows were found.***"
+ ],
+ "metadata": {
+ "id": "tD5cVX01no9g"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1051,25 +1060,51 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 23,
 "metadata": {
- "id": "V2FmLZrJsbHP"
+ "id": "V2FmLZrJsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "f1d25335-5a81-4534-f8aa-2b8117554415"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "No rows found to process.\n"
+ ]
+ }
+ ],
 "source": [
- "for cell in example_row.select('td'):\n",
- " print(cell)\n",
- "print()\n",
- "\n",
- "for cell in example_row.select('.detail'):\n",
- " print(cell)\n",
- "print()\n",
- "\n",
- "for cell in example_row.select('td.detail'):\n",
- " print(cell)\n",
- "print()"
+ "# Check if the rows list is not empty before accessing elements and then select the first row\n",
+ "if rows:\n",
+ " example_row = rows[0]\n",
+ " for cell in example_row.select('td'):\n",
+ " print(cell)\n",
+ " print()\n",
+ "\n",
+ " for cell in example_row.select('.detail'):\n",
+ " print(cell)\n",
+ " print()\n",
+ "\n",
+ " for cell in example_row.select('td.detail'):\n",
+ " print(cell)\n",
+ " print()\n",
+ "else:\n",
+ " print(\"No rows found to process.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This checks whether the*** *rows* ***list contains elements, selects the first row as an example for inspection, prints each cell in the row, extracts any element with class*** *detail*, ***and then looks only at*** < td > ***cells with the class*** *detail*; ***otherwise it prints a message if no rows are available to analyze.***"
+ ],
+ "metadata": {
+ "id": "DzwCJ2IYs5KP"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {

From 3ab1d39e3af66c0333c1c0c058fcce1a70c48263 Mon Sep 17 00:00:00 2001
From: dquinonez25
Date: Sun, 24 Aug 2025 21:09:16 -0500
Subject: [PATCH 09/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 lessons/02_web_scraping.ipynb | 497 ++++++++++++++++++++++++++++------
 1 file changed, 416 insertions(+), 81 deletions(-)

diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index cd9f222..bb85311 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -976,7 +976,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 18,
+ "execution_count": null,
 "metadata": {
 "id": "utKa2QuKsbHO"
 },
@@ -1010,7 +1010,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 20,
+ "execution_count": null,
 "metadata": {
 "id": "C1LRohx8sbHP",
 "colab": {
@@ -1060,13 +1060,13 @@
 },
 {
 "cell_type": "code",
- "execution_count": 23,
+ "execution_count": 25,
 "metadata": {
 "id": "V2FmLZrJsbHP",
 "colab": {
 "base_uri": "https://localhost:8080/"
 },
- "outputId": "f1d25335-5a81-4534-f8aa-2b8117554415"
+ "outputId": "39f9aad6-86dd-435f-9769-d2fd99431371"
 },
 "outputs": [
 {
@@ -1116,16 +1116,41 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 28,
 "metadata": {
 "tags": [],
- "id": "rOZQqe0MsbHP"
+ "id": "rOZQqe0MsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "0c62ae11-2d9f-473d-b321-5d4bd8b4403b"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "example_row is not defined, cannot perform assertion.\n"
+ ]
+ }
+ ],
 "source": [
- "assert example_row.select('td') 
== example_row.select('.detail') == example_row.select('td.detail')"
+ "# Check if example_row is defined before asserting\n",
+ "if 'example_row' in locals():\n",
+ " assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')\n",
+ "else:\n",
+ " print(\"example_row is not defined, cannot perform assertion.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This validates that the variable is defined in the local environment and compares three element selections; if the three results are equal, the program continues. Otherwise it prints that example_row is not defined and the assertion cannot be performed.***"
+ ],
+ "metadata": {
+ "id": "QQBUUJdO78NN"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1137,17 +1162,41 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 31,
 "metadata": {
- "id": "44w-eYDZsbHP"
+ "id": "44w-eYDZsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "c3510d1b-de0d-4df7-d047-1552fcef0197"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "example_row is not defined. Please ensure relevant rows are found before proceeding.\n"
+ ]
+ }
+ ],
 "source": [
 "# Select only those 'td' tags with class 'detail'\n",
- "detail_cells = example_row.select('td.detail')\n",
- "detail_cells"
+ "if 'example_row' in locals():\n",
+ " detail_cells = example_row.select('td.detail')\n",
+ " detail_cells\n",
+ "else:\n",
+ " print(\"example_row is not defined. Please ensure relevant rows are found before proceeding.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This checks whether the variable is defined, finds all*** < td > ***cells in that row with the class detail, and stores them in*** *detail_cells*; ***otherwise it prints that the variable does not exist in the local environment and that no relevant rows were found.***"
+ ],
+ "metadata": {
+ "id": "JC4jI1w0OF2k"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1159,18 +1208,41 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 34,
 "metadata": {
- "id": "3tMXEvFSsbHP"
+ "id": "3tMXEvFSsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "11e68f46-7531-4f23-b9de-58dba41563d9"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "detail_cells is not defined. Please ensure example_row and detail_cells are properly defined before proceeding.\n"
+ ]
+ }
+ ],
 "source": [
 "# Keep only the text in each of those cells\n",
- "row_data = [cell.text for cell in detail_cells]\n",
- "\n",
- "print(row_data)"
+ "if 'detail_cells' in locals():\n",
+ " row_data = [cell.text for cell in detail_cells]\n",
+ " print(row_data)\n",
+ "else:\n",
+ " print(\"detail_cells is not defined. Please ensure example_row and detail_cells are properly defined before proceeding.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This checks whether*** *detail_cells* ***exists in the local environment, iterates over each element, and extracts only the visible text inside each cell. It prints the list of extracted strings; otherwise it prints a warning message so the earlier steps can be reviewed before continuing.***",
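+ "\n",
+ "(Added note) A closely related alternative, `get_text(strip=True)`, also trims stray whitespace from each cell. A minimal sketch:\n",
+ "\n",
+ "```python\n",
+ "row_data = [cell.get_text(strip=True) for cell in detail_cells]\n",
+ "```"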
+ ],
+ "metadata": {
+ "id": "u81RUjHVSQdp"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1182,17 +1254,47 @@
 },
 {
 "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "8906HB-8UaHW"
+ },
 "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
 "metadata": {
- "id": "ddAU76TZsbHP"
+ "id": "ddAU76TZsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "6544b3da-a376-4dfb-b82b-49755c9a5f8f"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "row_data is not defined. Please ensure detail_cells and row_data are properly defined before proceeding.\n"
+ ]
+ }
+ ],
 "source": [
 "print(row_data[0]) # Name\n",
 "print(row_data[3]) # District\n",
 "print(row_data[4]) # Party"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This checks whether the variable exists in the local environment. It prints the first, fourth, and fifth elements of the list with their respective labels; otherwise it prints a warning message so the earlier steps can be reviewed before continuing.***"
+ ],
+ "metadata": {
+ "id": "Lr8FSmd5VOPK"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1206,17 +1308,41 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 41,
 "metadata": {
- "id": "T7m-E5kjsbHP"
+ "id": "T7m-E5kjsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "84123d37-aff4-42de-ed0f-c7604bcd34b9"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "The 'rows' list is empty. No rows were found on the page in the previous step.\n"
+ ]
+ }
+ ],
 "source": [
- "print('Row 0:\\n', rows[0], '\\n')\n",
- "print('Row 1:\\n', rows[1], '\\n')\n",
- "print('Last Row:\\n', rows[-1])"
+ "if rows:\n",
+ " print('Row 0:\\n', rows[0], '\\n')\n",
+ " print('Row 1:\\n', rows[1], '\\n')\n",
+ " print('Last Row:\\n', rows[-1])\n",
+ "else:\n",
+ " print(\"The 'rows' list is empty. No rows were found on the page in the previous step.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This checks whether the rows list contains elements: it prints the full content of the first row, prints the second row (useful for comparing structure or content), and prints the last row of the list using a negative index. Otherwise it prints a message saying that no rows were found in the previous step.***",
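+ "\n",
+ "(Added note) Negative indices count backward from the end of the list; assuming rows is non-empty:\n",
+ "\n",
+ "```python\n",
+ "rows[-1] # the last row, same as rows[len(rows) - 1]\n",
+ "rows[-2] # the next-to-last row\n",
+ "```"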
+ ],
+ "metadata": {
+ "id": "_CjvzH-PXFfX"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1230,21 +1356,45 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 43,
 "metadata": {
- "id": "LKo9MyWBsbHP"
+ "id": "LKo9MyWBsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "0cdb4435-2efd-49af-c311-94bb66bb86f7"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "The 'rows' list does not contain enough elements to perform this check.\n"
+ ]
+ }
+ ],
 "source": [
- "# Bad rows\n",
- "print(len(rows[0]))\n",
- "print(len(rows[1]))\n",
+ "if len(rows) >= 4:\n",
+ " # Bad rows\n",
+ " print(len(rows[0]))\n",
+ " print(len(rows[1]))\n",
 "\n",
- "# Good rows\n",
- "print(len(rows[2]))\n",
- "print(len(rows[3]))"
+ " # Good rows\n",
+ " print(len(rows[2]))\n",
+ " print(len(rows[3]))\n",
+ "else:\n",
+ " print(\"The 'rows' list does not contain enough elements to perform this check.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This verifies that there are at least 4 rows in the rows list: it prints the number of elements in the first two rows, which are considered bad or useless, and then in the next two rows, which are considered good or useful for extracting data. Otherwise, if there are fewer than 4 rows, it skips the check and reports that there is not enough data.***"
+ ],
+ "metadata": {
+ "id": "NEKFDLdOZ4OL"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1256,20 +1406,46 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 47,
 "metadata": {
- "id": "L2awrhEfsbHP"
+ "id": "L2awrhEfsbHP",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "6a715c8c-f559-43ff-ee52-84fc5cb20da2"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "The 'good_rows' list is empty. No rows with length 5 were found.\n"
+ ]
+ }
+ ],
 "source": [
 "good_rows = [row for row in rows if len(row) == 5]\n",
 "\n",
 "# Let's check some rows\n",
- "print(good_rows[0], '\\n')\n",
- "print(good_rows[-2], '\\n')\n",
- "print(good_rows[-1])"
+ "if good_rows:\n",
+ " print(good_rows[0], '\\n')\n",
+ " # Ensure there are at least 2 elements for [-2]\n",
+ " if len(good_rows) >= 2:\n",
+ " print(good_rows[-2], '\\n')\n",
+ " print(good_rows[-1])\n",
+ "else:\n",
+ " print(\"The 'good_rows' list is empty. 
No rows with length 5 were found.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This builds a new list good_rows that only includes the rows with exactly 5 elements; if there are useful rows it prints them, otherwise it prints that no valid rows were found.***"
+ ],
+ "metadata": {
+ "id": "22kA-378cNBH"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1281,37 +1457,86 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 49,
 "metadata": {
- "id": "GTFEA5VdsbHQ"
+ "id": "GTFEA5VdsbHQ",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "f6537a87-4c6a-479a-e721-e93f31ea1f13"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "The 'rows' list does not contain enough elements to access index 2.\n"
+ ]
+ }
+ ],
 "source": [
- "rows[2].select('td.detail')"
+ "if len(rows) > 2:\n",
+ " rows[2].select('td.detail')\n",
+ "else:\n",
+ " print(\"The 'rows' list does not contain enough elements to access index 2.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This validates that the rows list has at least three elements (indices 0, 1, and 2). It accesses the third row and selects only the*** < td > ***cells with the class*** *detail.* ***Otherwise it prints a message saying there are not enough rows to access index 2.***"
+ ],
+ "metadata": {
+ "id": "8mptx0DSefWN"
+ }
+ },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 51,
 "metadata": {
- "id": "pnIFBOsJsbHQ"
+ "id": "pnIFBOsJsbHQ",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "07ba24b1-617e-4c8d-c7d9-c23deda98fff"
 },
- "outputs": [],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "The 'good_rows' list is empty. No rows with 'td.detail' were found.\n"
+ ]
+ }
+ ],
 "source": [
- "# Bad row\n",
- "print(rows[-1].select('td.detail'), '\\n')\n",
+ "if len(rows) > 5:\n",
+ " # Bad row\n",
+ " print(rows[-1].select('td.detail'), '\\n')\n",
 "\n",
- "# Good row\n",
- "print(rows[5].select('td.detail'), '\\n')\n",
+ " # Good row\n",
+ " print(rows[5].select('td.detail'), '\\n')\n",
 "\n",
 "# How about this?\n",
 "good_rows = [row for row in rows if row.select('td.detail')]\n",
 "\n",
- "print(\"Checking rows...\\n\")\n",
- "print(good_rows[0], '\\n')\n",
- "print(good_rows[-1])"
+ "if good_rows:\n",
+ " print(\"Checking rows...\\n\")\n",
+ " print(good_rows[0], '\\n')\n",
+ " print(good_rows[-1])\n",
+ "else:\n",
+ " print(\"The 'good_rows' list is empty. No rows with 'td.detail' were found.\")"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This validates that there are at least 6 rows: it prints the last one, which is likely bad or unwanted, and prints the sixth row, which is considered good, with valid data. It then filters the useful rows, cleaning the table down to what matters. If there are good rows it prints the first and last; otherwise it reports that the list is empty.***"
+ ],
+ "metadata": {
+ "id": "n-r_AsAdgoOb"
+ }
+ },
 {
 "cell_type": "markdown",
 "metadata": {
@@ -1334,7 +1559,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 52,
 "metadata": {
 "tags": [],
 "id": "_p6yf187sbHQ"
@@ -1363,18 +1588,51 @@
 " members.append(senator)"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "***This stores the data for each senator: the useful rows are filtered, discarding headers and empty or decorative rows. 
Se comienza un bucle para procesar cada fila que contiene datos útiles, se extrae las celdas con clase*** *detail* ***y se obtiene sólo las relevantes dentro de la fila. Se convierte las celdas HTML en texto plano, se extrae campos específicos, se agrupa los datos en una estructura simple y ordenada, y se guarda la información en la lista members.***" + ], + "metadata": { + "id": "EKC27QripQF_" + } + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 53, "metadata": { - "id": "X4HSTY1bsbHQ" + "id": "X4HSTY1bsbHQ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "f4efa76e-8116-4e6f-e6b0-7cdfb73ac257" }, - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0" + ] + }, + "metadata": {}, + "execution_count": 53 + } + ], "source": [ "# Should be 61\n", "len(members)" ] }, + { + "cell_type": "markdown", + "source": [ + "***Devuelve el número total de tuplas almacenadas en la lista*** *members.*" + ], + "metadata": { + "id": "a3hcHEBnr8nH" + } + }, { "cell_type": "markdown", "metadata": { @@ -1386,15 +1644,36 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 54, "metadata": { - "id": "OzNtF5Q8sbHQ" + "id": "OzNtF5Q8sbHQ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "77fd9634-a91e-46cd-c794-86b48229cf32" }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "[]\n" + ] + } + ], "source": [ "print(members[:5])" ] }, + { + "cell_type": "markdown", + "source": [ + "***Imprime los primeros cinco elementos de la lista*** *members.*" + ], + "metadata": { + "id": "ED9Zer-6ttsl" + } + }, { "cell_type": "markdown", "metadata": { @@ -1436,7 +1715,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 55, "metadata": { "tags": [], "id": "QY9WS7V2sbHQ" @@ -1477,9 +1756,18 @@ " members.append(senator)" ] }, + { + "cell_type": "markdown", + "source": [ + "**Se hace una petición GET al sitio web del Senado de Illinois para obtener la lista de senadores, se extrae el contenido HTML como texto plano desde la respuesta, se utiliza** *BeautifulSoup* ***con el parser*** lxml ***para convertir el HTML en una estructura que se puede navegar fácilmente, se crea una lista vacía para guardar la información de cada senador, se seleccionan todos los elementos*** < tr > ***anidados tres veces y se filtran las filas que contienen al menos una celda*** < td > ***con clase*** *detail.* ***Se seleccionan las celdas dentro de cada fila y se extrae el texto de cada una. 
Se organiza los datos, se crea una tupla con esa información y se la agrega a la lista members.***" + ], + "metadata": { + "id": "F6H6fl5vvm8F" + } + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 56, "metadata": { "tags": [], "id": "-fdG1ucrsbHQ" @@ -1503,7 +1791,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 58, "metadata": { "tags": [], "id": "m3ZGQxd7sbHQ" @@ -1515,14 +1803,38 @@ " return [___]\n" ] }, + { + "cell_type": "markdown", + "source": [ + "***Se define*** *get_members* ***que recibe un parámetro url y que retorna una lista de miembros extraídos desde la misma.***" + ], + "metadata": { + "id": "vpLmuWkmz3DN" + } + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 59, "metadata": { "tags": [], - "id": "j_RQ-zpXsbHQ" + "id": "j_RQ-zpXsbHQ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "45b7f2a1-0a9b-4360-8e9a-9f48f771df8e" }, - "outputs": [], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1" + ] + }, + "metadata": {}, + "execution_count": 59 + } + ], "source": [ "# Test your code\n", "url = 'http://www.ilga.gov/senate/default.asp?GA=98'\n", @@ -1530,6 +1842,15 @@ "len(senate_members)" ] }, + { + "cell_type": "markdown", + "source": [ + "***Se define la URL de la página web que contiene los datos del Senado de Illinois, se hace scraping a la página web para obtener una lista de los senadores, el resultado se guarda en una variable*** *senate_members* ***y se calcula la cantidad total de miembros que fueron extraídos de la página.***" + ], + "metadata": { + "id": "Cl0yDQHa3C20" + } + }, { "cell_type": "markdown", "metadata": { @@ -1555,7 +1876,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 61, "metadata": { "tags": [], "id": "_79DFRiKsbHR" @@ -1568,20 +1889,34 @@ " rows = soup.select('tr')\n", " bills = []\n", " for row in rows:\n", - " # YOUR CODE HERE\n", - " bill_id =\n", - " description =\n", - " chamber =\n", - " last_action =\n", - " last_action_date =\n", - " bill = (bill_id, description, chamber, last_action, last_action_date)\n", - " bills.append(bill)\n", + " # Find all td elements within the current row\n", + " cells = row.select('td')\n", + " # Check if the row has enough cells and the class is 'billlist'\n", + " if len(cells) >= 5 and 'billlist' in cells[1].get('class', []):\n", + " # YOUR CODE HERE\n", + " bill_id = cells[0].text.strip() if len(cells) > 0 else ''\n", + " description = cells[1].text.strip() if len(cells) > 1 else ''\n", + " chamber = cells[2].text.strip() if len(cells) > 2 else ''\n", + " last_action = cells[3].text.strip() if len(cells) > 3 else ''\n", + " last_action_date = cells[4].text.strip() if len(cells) > 4 else ''\n", + "\n", + " bill = (bill_id, description, chamber, last_action, last_action_date)\n", + " bills.append(bill)\n", " return bills" ] }, + { + "cell_type": "markdown", + "source": [ + "***Obtiene el contenido HTML de la página y lo descarga como texto desde la URL, lo convierte en un objeto BeautifulSoup para poder analizarlo, selecciona todas las filas*** < tr > ***de la tabla, itera sobre cada fila y busca las celdas*** < td >, ***filtra las filas relevantes procesando sólo las que tienen al menos 5 celdas y cuya segunda celda tiene la clase*** *billlist.* ***Extrae los datos de cada celda, guarda cada proyecto como una tupla y lo agrega a la lista bills y retorna la lista completa de proyectos.***" + ], + "metadata": { + "id": "SABK8Jfj41Bw" + } + }, 
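+    {
+      "cell_type": "markdown",
+      "source": [
+        "💡 **Tip**: Before running `get_bills` against the live site, you can sanity-check the `td.billlist` filtering idiom on a small invented HTML snippet. This is only a minimal sketch: the table below (and the `demo_*` names) are made up for illustration and are not the real ILGA markup."
+      ],
+      "metadata": {
+        "id": "billlist_demo_md"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "billlist_demo_code"
+      },
+      "outputs": [],
+      "source": [
+        "from bs4 import BeautifulSoup\n",
+        "\n",
+        "# Invented, simplified stand-in for a bill table (not the live ILGA HTML).\n",
+        "demo_html = '''\n",
+        "<table>\n",
+        "  <tr><th>Bill</th><th>Description</th><th>Chamber</th><th>Last action</th><th>Date</th></tr>\n",
+        "  <tr><td class='billlist'>SB0001</td><td class='billlist'>A sample bill</td>\n",
+        "      <td class='billlist'>Senate</td><td class='billlist'>Referred</td><td class='billlist'>1/9/2013</td></tr>\n",
+        "  <tr><td>A footer row without the billlist class</td></tr>\n",
+        "</table>\n",
+        "'''\n",
+        "\n",
+        "demo_soup = BeautifulSoup(demo_html, 'lxml')\n",
+        "\n",
+        "# Keep only rows that carry 'billlist' cells, then unpack each cell's text.\n",
+        "demo_bills = [tuple(td.text.strip() for td in row.select('td.billlist'))\n",
+        "              for row in demo_soup.select('tr') if row.select('td.billlist')]\n",
+        "\n",
+        "# Expect one 5-field tuple: the header and footer rows are filtered out.\n",
+        "print(demo_bills)"
+      ]
+    },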
{ "cell_type": "code", - "execution_count": null, + "execution_count": 62, "metadata": { "tags": [], "id": "oe2QHteFsbHR" @@ -1608,7 +1943,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 63, "metadata": { "tags": [], "id": "zIN9ZpKVsbHR" @@ -1620,7 +1955,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 64, "metadata": { "tags": [], "id": "Y02z46MlsbHR" From a52c895c09c454034e686b5f96fcd9979713711e Mon Sep 17 00:00:00 2001 From: dquinonez25 Date: Sun, 24 Aug 2025 22:37:52 -0500 Subject: [PATCH 10/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- lessons/02_web_scraping.ipynb | 135 +++++++++++++++++++++++----------- 1 file changed, 91 insertions(+), 44 deletions(-) diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb index bb85311..927f345 100644 --- a/lessons/02_web_scraping.ipynb +++ b/lessons/02_web_scraping.ipynb @@ -311,7 +311,7 @@ { "cell_type": "markdown", "source": [ - "***Se ejecuta conexión a la página del Senado de Illinois y se recupera el contenido de esa página web. Se guarda el HTML completo de la respuesta en la variable src y se imprime los primeros 1000 caracteres del HTML.***\n" + "***Se ejecuta conexión a la página del Senado de Illinois y se recupera el contenido de esa página web. Se guarda el HTML completo de la respuesta en la variable*** src ***y se imprime los primeros 1000 caracteres del HTML.***\n" ], "metadata": { "id": "_84r_J7vDIoz" @@ -377,7 +377,7 @@ { "cell_type": "markdown", "source": [ - "***Se convierte el contenido HTML de una página web en una estructura que Python, se toma el HTML que se obtuvo con requests y se lo convierte en un árbol de elementos HTML. Se usa parser lxml, que es rápido y robusto para analizar HTML y XML. Devuelve el HTML lo que permite embellecer y facilita leer la estructura del documento y muestra solo los primeros 1000 caracteres para no saturar la salida.***" + "***Se convierte el contenido HTML de una página web, se toma el HTML que se obtuvo con requests y se lo convierte en un árbol de elementos HTML. Se usa parser*** lxml ***para analizar HTML y XML. 
Devuelve el HTML lo que permite embellecer y facilita leer la estructura del documento y muestra solo los primeros 1000 caracteres para no saturar la salida.***" ], "metadata": { "id": "LuNmKppLGD6p" @@ -461,7 +461,7 @@ { "cell_type": "markdown", "source": [ - "***Busca todas las etiquetas < a > en el documento HTML, toma los primeros 10 elementos encontrados y muestra esos 10 elementos en la consola.***" + "***Busca todas las etiquetas*** < a > ***en el documento HTML, toma los primeros 10 elementos encontrados y muestra esos 10 elementos en la consola.***" ], "metadata": { "id": "tbH6B05TlQ-6" @@ -480,36 +480,56 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 75, "metadata": { "tags": [], "id": "Me55JRw9sbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "5264ba50-a74f-417a-e995-ef8c7ab0ef5a" + "outputId": "a7e13e2e-d1df-46f7-ac15-35d158b96ca7" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ - "\n", - " English\r\n", - " \n", - "\n", - " English\r\n", - " \n" + "No 'a' tags found.\n", + "No 'a' tags found using the alternative method.\n" ] } ], "source": [ "a_tags = soup.find_all(\"a\")\n", "a_tags_alt = soup(\"a\")\n", - "print(a_tags[0])\n", - "print(a_tags_alt[0])" + "\n", + "if a_tags:\n", + " print(a_tags[0])\n", + "else:\n", + " print(\"No 'a' tags found.\")\n", + "\n", + "if a_tags_alt:\n", + " print(a_tags_alt[0])\n", + "else:\n", + " print(\"No 'a' tags found using the alternative method.\")" ] }, + { + "cell_type": "markdown", + "source": [ + "***Busca todas las etiquetas*** < a > ***en el documento HTML y devuelve una lista con todos los elementos encontrados. Verifica si se encontraron enlaces muestra el primer enlace, caso contrario imprime un mensaje indicando que ninguno se encontró y replica el proceso con la variable alternativa*** a_tags_alt" + ], + "metadata": { + "id": "IOG2VygfPWUX" + } + }, + { + "cell_type": "markdown", + "source": [], + "metadata": { + "id": "iALb8mkJNX5z" + } + }, { "cell_type": "markdown", "metadata": { @@ -521,20 +541,20 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 76, "metadata": { "id": "45Zk_D8lsbHI", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "cbfbf663-02d9-48eb-80cc-b400271dac5c" + "outputId": "3d6cf350-0ad1-4cd4-f0c5-53c5821dd6f6" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ - "270\n" + "0\n" ] } ], @@ -542,6 +562,15 @@ "print(len(a_tags))" ] }, + { + "cell_type": "markdown", + "source": [ + "***Muestra cuántas etiquetas*** < a > ***hay en una página web analizada con BeautifulSoup.***" + ], + "metadata": { + "id": "mmXCsf-1SKiK" + } + }, { "cell_type": "markdown", "metadata": { @@ -587,7 +616,7 @@ { "cell_type": "markdown", "source": [ - "***Sirve para analizar un documento HTML, y su propósito es extraer enlaces específicos, busca todas las etiquetas < a > que tengan la clase sidemenu y toma los primeros 5 elementos encontrados.***" + "***Analiza un documento HTML y extrae enlaces específicos, busca todas las etiquetas*** < a > ***que tengan la clase*** *sidemenu* ***y toma los primeros 5 elementos encontrados.***" ], "metadata": { "id": "FdlJ0c5uq6Rz" @@ -636,7 +665,7 @@ { "cell_type": "markdown", "source": [ - "***Se utiliza un selector CSS para encontrar todas las etiquetas < a > que tengan la clase sidemenu y selecciona los primeros 5 elementos encontrados.***" + "***Se utiliza un selector CSS para encontrar todas las etiquetas*** < a > ***que tengan la clase*** sidemenu ***y 
selecciona los primeros 5 elementos encontrados.***" ], "metadata": { "id": "0urELMRuu57U" @@ -739,7 +768,7 @@ { "cell_type": "markdown", "source": [ - "***Busca todos los elementos < a > que tengan la clase sidemenu que se ha cargado en la variable soup y devuelve una lista con todos esos elementos. Toma el primer elemento de esa lista y muestra en pantalla el HTML completo de ese primer enlace.***" + "***Busca todos los elementos*** < a > ***que tengan la clase sidemenu que se ha cargado en la variable soup y devuelve una lista con todos esos elementos. Toma el primer elemento de esa lista y muestra en pantalla el HTML completo de ese primer enlace.***" ], "metadata": { "id": "0vJ2TCmoSFGl" @@ -787,7 +816,7 @@ { "cell_type": "markdown", "source": [ - "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable first_link está entre ellas. Si está definida se imprime el texto dentro del enlace y sino está definida se imprime un mensaje de notificación.***" + "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable*** first_link ***está entre ellas. Si está definida se imprime el texto dentro del enlace y sino está definida se imprime un mensaje de notificación.***" ], "metadata": { "id": "Jik2G9RXYKsS" @@ -835,7 +864,7 @@ { "cell_type": "markdown", "source": [ - "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable first_link existe. Si existe se imprime el valor del atributo href para establecer el destino de un enlace y sino existe se imprime un mensaje que no está definida.***" + "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable*** first_link ***existe. Si existe se imprime el valor del atributo href para establecer el destino de un enlace y sino existe se imprime un mensaje que no está definida.***" ], "metadata": { "id": "zpyNALOJZprN" @@ -976,7 +1005,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 74, "metadata": { "id": "utKa2QuKsbHO" }, @@ -984,7 +1013,25 @@ "source": [ "# Filter rows to find the relevant ones.\n", "# Based on the HTML structure, the relevant rows seem to contain 'td' elements with the class 'detail'.\n", - "rows = [row for row in rows if row.select('td.detail')]\n", + "# Assuming the rows are within a table, we can select 'tr' within the table body or a specific table id/class.\n", + "# Let's refine the row selection based on the structure observed in the output of cell 0_esi0UGsbHH\n", + "# We can look for rows that contain 'td' elements with the class 'detail' within a table.\n", + "# A more robust approach might be to look for rows within the main content area or a specific table.\n", + "# Based on the HTML structure, the relevant rows seem to be within a table. Let's try selecting 'tr' elements\n", + "# that are descendants of a table. We'll still filter based on the presence of 'td.detail'.\n", + "\n", + "# Re-fetch the soup object as it might have been modified or cleared in previous steps\n", + "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n", + "src = req.text\n", + "soup = BeautifulSoup(src, \"lxml\")\n", + "\n", + "# Select all table rows. We will filter later.\n", + "all_table_rows = soup.find_all(\"tr\")\n", + "\n", + "# Filter rows to find the relevant ones. 
We are looking for rows that contain 'td' elements with the class 'detail'.\n", + "# This assumes that the senator information is within such td elements.\n", + "rows = [row for row in all_table_rows if row.select('td.detail')]\n", + "\n", "\n", "for row in rows[:5]:\n", " print(row, '\\n')" @@ -1060,7 +1107,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": { "id": "V2FmLZrJsbHP", "colab": { @@ -1116,7 +1163,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": null, "metadata": { "tags": [], "id": "rOZQqe0MsbHP", @@ -1162,7 +1209,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": null, "metadata": { "id": "44w-eYDZsbHP", "colab": { @@ -1208,7 +1255,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, "metadata": { "id": "3tMXEvFSsbHP", "colab": { @@ -1263,7 +1310,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": null, "metadata": { "id": "ddAU76TZsbHP", "colab": { @@ -1308,7 +1355,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": null, "metadata": { "id": "T7m-E5kjsbHP", "colab": { @@ -1356,7 +1403,7 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": null, "metadata": { "id": "LKo9MyWBsbHP", "colab": { @@ -1406,7 +1453,7 @@ }, { "cell_type": "code", - "execution_count": 47, + "execution_count": null, "metadata": { "id": "L2awrhEfsbHP", "colab": { @@ -1457,7 +1504,7 @@ }, { "cell_type": "code", - "execution_count": 49, + "execution_count": null, "metadata": { "id": "GTFEA5VdsbHQ", "colab": { @@ -1492,7 +1539,7 @@ }, { "cell_type": "code", - "execution_count": 51, + "execution_count": null, "metadata": { "id": "pnIFBOsJsbHQ", "colab": { @@ -1559,7 +1606,7 @@ }, { "cell_type": "code", - "execution_count": 52, + "execution_count": null, "metadata": { "tags": [], "id": "_p6yf187sbHQ" @@ -1599,7 +1646,7 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": null, "metadata": { "id": "X4HSTY1bsbHQ", "colab": { @@ -1644,7 +1691,7 @@ }, { "cell_type": "code", - "execution_count": 54, + "execution_count": null, "metadata": { "id": "OzNtF5Q8sbHQ", "colab": { @@ -1715,7 +1762,7 @@ }, { "cell_type": "code", - "execution_count": 55, + "execution_count": null, "metadata": { "tags": [], "id": "QY9WS7V2sbHQ" @@ -1767,7 +1814,7 @@ }, { "cell_type": "code", - "execution_count": 56, + "execution_count": null, "metadata": { "tags": [], "id": "-fdG1ucrsbHQ" @@ -1791,7 +1838,7 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": null, "metadata": { "tags": [], "id": "m3ZGQxd7sbHQ" @@ -1814,7 +1861,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": null, "metadata": { "tags": [], "id": "j_RQ-zpXsbHQ", @@ -1876,7 +1923,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": null, "metadata": { "tags": [], "id": "_79DFRiKsbHR" @@ -1916,7 +1963,7 @@ }, { "cell_type": "code", - "execution_count": 62, + "execution_count": null, "metadata": { "tags": [], "id": "oe2QHteFsbHR" @@ -1943,7 +1990,7 @@ }, { "cell_type": "code", - "execution_count": 63, + "execution_count": null, "metadata": { "tags": [], "id": "zIN9ZpKVsbHR" @@ -1955,7 +2002,7 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": null, "metadata": { "tags": [], "id": "Y02z46MlsbHR" From 681641b4796f41a29b0adb983bd838bbdeb4b791 Mon Sep 17 00:00:00 2001 From: dquinonez25 Date: Mon, 25 Aug 2025 22:37:48 -0500 Subject: [PATCH 11/13] 
=?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 lessons/02_web_scraping.ipynb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index 927f345..7e90307 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -480,7 +480,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 75,
+      "execution_count": null,
       "metadata": {
         "tags": [],
         "id": "Me55JRw9sbHI",
@@ -541,7 +541,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 76,
+      "execution_count": null,
      "metadata": {
        "id": "45Zk_D8lsbHI",
        "colab": {
@@ -864,7 +864,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable*** first_link ***existe. Si existe se imprime el valor del atributo href para establecer el destino de un enlace y sino existe se imprime un mensaje que no está definida.***"
+        "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable*** *first_link* ***existe. Si existe se imprime el valor del atributo*** *href* ***para establecer el destino de un enlace y sino existe se imprime un mensaje que no está definida.***"
       ],
       "metadata": {
         "id": "zpyNALOJZprN"
@@ -1005,7 +1005,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 74,
+      "execution_count": null,
       "metadata": {
         "id": "utKa2QuKsbHO"
       },

From 8543d3549ef7e70fbc06f94718ec14f3fdb72a2d Mon Sep 17 00:00:00 2001
From: dquinonez25
Date: Mon, 25 Aug 2025 23:40:12 -0500
Subject: [PATCH 12/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 lessons/02_web_scraping.ipynb | 75 ++++++++++++++---------------------
 1 file changed, 29 insertions(+), 46 deletions(-)

diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index 7e90307..f5551e2 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -100,7 +100,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "***Install requests permite hacer peticiones directamente desde el código.***"
+        "***Con*** *install requests* ***se realizan peticiones directamente desde el código.***"
       ],
       "metadata": {
         "id": "ihGiPrM_suxV"
@@ -143,7 +143,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "***Install beatufulsoup4 permite extraer información de páginas web usando Python.***"
+        "***Con*** *install beautifulsoup4* ***se extrae información de páginas web.***"
       ],
       "metadata": {
         "id": "L6_vVIk69DDa"
@@ -184,8 +184,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "\n",
-        "***Install lxml sirve para que puedas procesar y analizar documentos HTML y XML de forma rápida y eficiente en Python.***"
+        "***Con*** *install lxml* ***se procesan y analizan documentos HTML y XML de forma rápida y eficiente.***"
       ],
       "metadata": {
         "id": "yj7Ijz-a-SCY"
@@ -219,9 +218,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "***Import required libraries permite importar las librerías necesarias para que un programa funcione correctamente.***\n",
-        "\n",
-        "> Agregar bloque entrecomillado\n",
+        "***Se efectúa la importación de librerías para poder realizar web scraping y extraer información de páginas web. 
Permite trabajar con fechas y horas, peticiones http, funciones para medir y controlar el tiempo, como*** *sleep()* ***para pausar la ejecución por unos segundos y evitar que el scraper haga demasiadas peticiones seguidas y sea bloqueado.***\n", "\n" ], "metadata": { @@ -311,7 +308,7 @@ { "cell_type": "markdown", "source": [ - "***Se ejecuta conexión a la página del Senado de Illinois y se recupera el contenido de esa página web. Se guarda el HTML completo de la respuesta en la variable*** src ***y se imprime los primeros 1000 caracteres del HTML.***\n" + "***Se ejecuta conexión a la página web del Senado de Illinois y se recupera el contenido de esa página. Se guarda el HTML completo de la respuesta en la variable*** src ***y se imprime los primeros 1000 caracteres.***\n" ], "metadata": { "id": "_84r_J7vDIoz" @@ -377,7 +374,7 @@ { "cell_type": "markdown", "source": [ - "***Se convierte el contenido HTML de una página web, se toma el HTML que se obtuvo con requests y se lo convierte en un árbol de elementos HTML. Se usa parser*** lxml ***para analizar HTML y XML. Devuelve el HTML lo que permite embellecer y facilita leer la estructura del documento y muestra solo los primeros 1000 caracteres para no saturar la salida.***" + "***Se convierte el contenido HTML de la página web, y el que se obtuvo con requests se lo convierte en un árbol de elementos HTML. Con parser*** lxml ***se analiza HTML y XML. Devuelve el HTML que permite embellecer facilitando leer la estructura del documento y muestra los primeros 1000 caracteres para no saturar la salida.***" ], "metadata": { "id": "LuNmKppLGD6p" @@ -461,7 +458,7 @@ { "cell_type": "markdown", "source": [ - "***Busca todas las etiquetas*** < a > ***en el documento HTML, toma los primeros 10 elementos encontrados y muestra esos 10 elementos en la consola.***" + "***Busca todas las etiquetas*** < a > ***en el documento HTML, toma los primeros 10 elementos encontrados y los muestra en la consola.***" ], "metadata": { "id": "tbH6B05TlQ-6" @@ -517,19 +514,12 @@ { "cell_type": "markdown", "source": [ - "***Busca todas las etiquetas*** < a > ***en el documento HTML y devuelve una lista con todos los elementos encontrados. Verifica si se encontraron enlaces muestra el primer enlace, caso contrario imprime un mensaje indicando que ninguno se encontró y replica el proceso con la variable alternativa*** a_tags_alt" + "***Busca todas las etiquetas*** < a > ***en el documento HTML y devuelve una lista con todos los elementos encontrados. 
Verifica si se encontraron enlaces y muestra el primero, caso contrario imprime un mensaje indicando que ninguno se encontró y replica el proceso con la variable alternativa*** a_tags_alt" ], "metadata": { "id": "IOG2VygfPWUX" } }, - { - "cell_type": "markdown", - "source": [], - "metadata": { - "id": "iALb8mkJNX5z" - } - }, { "cell_type": "markdown", "metadata": { @@ -565,7 +555,7 @@ { "cell_type": "markdown", "source": [ - "***Muestra cuántas etiquetas*** < a > ***hay en una página web analizada con BeautifulSoup.***" + "***Muestra cuántas etiquetas*** < a > ***hay en la página web analizada con BeautifulSoup.***" ], "metadata": { "id": "mmXCsf-1SKiK" @@ -616,7 +606,7 @@ { "cell_type": "markdown", "source": [ - "***Analiza un documento HTML y extrae enlaces específicos, busca todas las etiquetas*** < a > ***que tengan la clase*** *sidemenu* ***y toma los primeros 5 elementos encontrados.***" + "***Analiza el documento HTML y extrae enlaces específicos, busca todas las etiquetas*** < a > ***que tengan la clase*** *sidemenu* ***y toma los primeros 5 elementos encontrados.***" ], "metadata": { "id": "FdlJ0c5uq6Rz" @@ -665,7 +655,7 @@ { "cell_type": "markdown", "source": [ - "***Se utiliza un selector CSS para encontrar todas las etiquetas*** < a > ***que tengan la clase*** sidemenu ***y selecciona los primeros 5 elementos encontrados.***" + "***Usando un selector CSS encuentra todas las etiquetas*** < a > ***que tengan la clase*** sidemenu ***y selecciona los primeros 5 elementos encontrados.***" ], "metadata": { "id": "0urELMRuu57U" @@ -693,13 +683,6 @@ "# YOUR CODE HERE\n" ] }, - { - "cell_type": "markdown", - "source": [], - "metadata": { - "id": "LnO5snVswzX0" - } - }, { "cell_type": "markdown", "metadata": { @@ -768,7 +751,7 @@ { "cell_type": "markdown", "source": [ - "***Busca todos los elementos*** < a > ***que tengan la clase sidemenu que se ha cargado en la variable soup y devuelve una lista con todos esos elementos. Toma el primer elemento de esa lista y muestra en pantalla el HTML completo de ese primer enlace.***" + "***Busca todos los elementos*** < a > ***que tengan la clase*** sidemenu ***que se ha cargado en la variable*** soup ***y devuelve una lista con todos esos elementos. Toma el primer elemento de esa lista y muestra en pantalla el HTML completo de ese primer enlace.***" ], "metadata": { "id": "0vJ2TCmoSFGl" @@ -816,7 +799,7 @@ { "cell_type": "markdown", "source": [ - "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable*** first_link ***está entre ellas. Si está definida se imprime el texto dentro del enlace y sino está definida se imprime un mensaje de notificación.***" + "***Devuelve un diccionario con todas las variables locales definidas en ese momento y comprueba si la variable*** first_link ***está entre ellas. Si está definida se imprime el texto dentro del enlace, caso contrario imprime un mensaje de notificación.***" ], "metadata": { "id": "Jik2G9RXYKsS" @@ -864,7 +847,7 @@ { "cell_type": "markdown", "source": [ - "***Se devuelve un diccionario con todas las variables locales definidas en ese momento y se comprueba si la variable*** *first_link* ***existe. Si existe se imprime el valor del atributo*** *href* ***para establecer el destino de un enlace y sino existe se imprime un mensaje que no está definida.***" + "***Devuelve un diccionario con todas las variables locales definidas en ese momento y comprueba si la variable*** *first_link* ***existe. 
Si existe imprime el valor del atributo*** *href* ***que establece el destino de un enlace, caso contrario imprime un mensaje que no está definido.***" ], "metadata": { "id": "zpyNALOJZprN" @@ -940,7 +923,7 @@ { "cell_type": "markdown", "source": [ - "***Se envía una solicitud HTTP GET a la URL especificada y recupera el contenido de la página. Se guarda el contenido HTML de la respuesta en la variable src y dicho código fuente de la página web como texto plano. Luego se convierte el texto HTML en un objeto para recorrerlo e interpretarlo.***" + "***Envía una solicitud HTTP GET a la URL especificada y recupera el contenido de la página. Guarda el contenido HTML de la respuesta en la variable*** src ***y dicho código fuente como texto plano. Luego convierte el texto HTML en un objeto para recorrerlo e interpretarlo.***" ], "metadata": { "id": "Z87FWjzsc7dY" @@ -988,7 +971,7 @@ { "cell_type": "markdown", "source": [ - "***Se busca todos los elementos*** *tr*, ***es decir una fila de tabla y devuelve el número total de filas encontradas.***" + "***Busca todos los elementos*** *tr*, ***es decir una fila de tabla y devuelve el número total de filas encontradas.***" ], "metadata": { "id": "efDFTUz2f5Bh" @@ -1086,7 +1069,7 @@ { "cell_type": "markdown", "source": [ - "***Verifica si la lista*** *rows* ***no está vacía, si hay elementos toma el primero de la lista e imprime el contenido de esa fila. Caso contrario si la lista está vacía, muestra un mensaje indicando que no se encontraron filas relevantes.***" + "***Verifica si la lista*** *rows* ***no está vacía, si hay elementos toma el primero de la lista e imprime el contenido de esa fila. Caso contrario si la lista está vacía, imprime un mensaje indicando que no se encontraron filas relevantes.***" ], "metadata": { "id": "tD5cVX01no9g" @@ -1146,7 +1129,7 @@ { "cell_type": "markdown", "source": [ - "***Verifica si la lista*** *rows* ***contiene elementos, selecciona la primera fila como ejemplo para inspección, muestra cada celda de la fila, extrae cualquier elemento con clase*** *detail*, ***se busca solo celdas*** < td > ***que tengan la clase*** *detail*, ***caso contrario imprime un mensaje si no hay filas disponibles para analizar.***" + "***Verifica si la lista*** *rows* ***contiene elementos, selecciona la primera fila como ejemplo para inspección, muestra cada celda de la fila, extrae cualquier elemento con clase*** *detail*, ***se busca solo celdas*** < td > ***que tengan la clase*** *detail*, ***caso contrario imprime un mensaje que no hay filas disponibles para analizar.***" ], "metadata": { "id": "DzwCJ2IYs5KP" @@ -1192,7 +1175,7 @@ { "cell_type": "markdown", "source": [ - "***Valida si la variable está definida y existe en el entorno local, compara tres selecciones de elementos y se comprueba si los tres resultados son iguales el programa continúa, caso contrario sino son iguales imprime que no está definido, no se puede realizar la afirmación.***" + "***Valida si la variable está definida y existe en el entorno local, compara tres selecciones de elementos y comprueba si los tres resultados son iguales el programa continúa, caso contrario imprime que no está definido, no se puede realizar la afirmación.***" ], "metadata": { "id": "QQBUUJdO78NN" @@ -1336,7 +1319,7 @@ { "cell_type": "markdown", "source": [ - "***Verifica si la variable existe en el entorno local. 
Muestra el primer, cuarto y quinto elemento de la lista con su respectivo nombre; caso contrario sino existe imprime un mensaje de advertencia para que sea revisado antes de continuar.***" + "***Verifica si la variable existe en el entorno local. Muestra el primer, cuarto y quinto elemento de la lista con su respectivo nombre; caso contrario imprime un mensaje de advertencia para que sea revisado antes de continuar.***" ], "metadata": { "id": "Lr8FSmd5VOPK" @@ -1384,7 +1367,7 @@ { "cell_type": "markdown", "source": [ - "***Verifica si la lista rows contiene elementos, muestra el contenido completo de la primera fila, muestra la segunda fila útil para comparar estructura o contenido, muestra la última fila de la lista usando índice negativo. Caso contrario imprime un mensaje indicando que no se encontraron filas en el paso anterior.***" + "**Verifica si la lista** rows ***contiene elementos, muestra el contenido completo de la primera fila, muestra la segunda fila útil para comparar estructura o contenido, muestra la última fila de la lista usando índice negativo. Caso contrario imprime un mensaje indicando que no se encontraron filas en el paso anterior.***" ], "metadata": { "id": "_CjvzH-PXFfX" @@ -1436,7 +1419,7 @@ { "cell_type": "markdown", "source": [ - "***Verifica que haya al menos 4 filas en la lista rows, imprime la cantidad de elementos en las dos primeras filas, que se consideran malas o inútiles; y en las siguientes dos filas, imprime las se consideran buenas o útiles para extraer datos. Caso contrario si hay menos de 4 filas, evita el análisis e informa que no hay suficientes datos.***" + "***Verifica que haya al menos 4 filas en la lista*** rows, ***imprime la cantidad de elementos en las dos primeras filas, que se consideran malas o inútiles; y en las siguientes dos filas, imprime las que se consideran buenas o útiles para extraer datos. Caso contrario si hay menos de 4 filas, evita el análisis e informa que no hay suficientes datos.***" ], "metadata": { "id": "NEKFDLdOZ4OL" @@ -1487,7 +1470,7 @@ { "cell_type": "markdown", "source": [ - "***Se valida creación de una nueva lista good_rows que solo incluye las filas que tienen exactamente 5 elementos, si hay filas útiles muestra, caso contrario imprime que no se encontraron filas válidas.***" + "***Se valida creación de una nueva lista*** *good_rows* ***que solo incluye las filas que tienen exactamente 5 elementos, si hay filas útiles muestra, caso contrario imprime que no se encontraron filas válidas.***" ], "metadata": { "id": "22kA-378cNBH" @@ -1531,7 +1514,7 @@ { "cell_type": "markdown", "source": [ - "***Valida que la lista rows tenga al menos tres elementos, los índices 0, 1 y 2. Accede a la tercera fila y selecciona solo las celdas*** < td > ***que tengan la clase*** *detail.* ***Caso contrario imprime un mensaje indicando que no hay suficientes filas y no se puede acceder al índice 2.***" + "***Valida que la lista*** *rows* ***tenga al menos tres elementos, los índices 0, 1 y 2. Accede a la tercera fila y selecciona solo las celdas*** < td > ***que tengan la clase*** *detail.* ***Caso contrario imprime un mensaje indicando que no hay suficientes filas y no se puede acceder al índice 2.***" ], "metadata": { "id": "8mptx0DSefWN" @@ -1578,7 +1561,7 @@ { "cell_type": "markdown", "source": [ - "***Valida que hay al menos 6 filas, imprime la última posiblemente sea mala o no deseada, imprime la sexta fila, que se considera buena o con datos válidos, se filtra las filas útiles limpiando la tabla y quedar con lo que interesa. 
Si hay filas buenas muestra la primera y última, caso contrario imprime que no hay e informa que la lista está vacía.***" + "***Valida que hay al menos 6 filas, imprime la última posiblemente sea mala o no deseada, imprime la sexta fila, que se considera buena o con datos válidos, se filtra las filas útiles limpiando la tabla y queda con lo que interesa. Si hay filas buenas muestra la primera y última, caso contrario imprime que no hay e informa que la lista está vacía.***" ], "metadata": { "id": "n-r_AsAdgoOb" @@ -1638,7 +1621,7 @@ { "cell_type": "markdown", "source": [ - "***Se guardan los datos de cada senador, se filtran las filas útiles descartando encabezados, filas vacías o decorativas. Se comienza un bucle para procesar cada fila que contiene datos útiles, se extrae las celdas con clase*** *detail* ***y se obtiene sólo las relevantes dentro de la fila. Se convierte las celdas HTML en texto plano, se extrae campos específicos, se agrupa los datos en una estructura simple y ordenada, y se guarda la información en la lista members.***" + "***Guarda los datos de cada senador, filtra las filas útiles descartando encabezados, filas vacías o decorativas. Comienza un bucle para procesar cada fila que contiene datos útiles, extrae las celdas con clase*** *detail* ***y se obtiene sólo las relevantes dentro de la fila. Convierte las celdas HTML en texto plano, extrae campos específicos, agrupa los datos en una estructura simple y ordenada, y guarda la información en la lista*** *members.*" ], "metadata": { "id": "EKC27QripQF_" @@ -1806,7 +1789,7 @@ { "cell_type": "markdown", "source": [ - "**Se hace una petición GET al sitio web del Senado de Illinois para obtener la lista de senadores, se extrae el contenido HTML como texto plano desde la respuesta, se utiliza** *BeautifulSoup* ***con el parser*** lxml ***para convertir el HTML en una estructura que se puede navegar fácilmente, se crea una lista vacía para guardar la información de cada senador, se seleccionan todos los elementos*** < tr > ***anidados tres veces y se filtran las filas que contienen al menos una celda*** < td > ***con clase*** *detail.* ***Se seleccionan las celdas dentro de cada fila y se extrae el texto de cada una. Se organiza los datos, se crea una tupla con esa información y se la agrega a la lista members.***" + "**Hace una petición GET al sitio web del Senado de Illinois para obtener la lista de senadores, extrae el contenido HTML como texto plano desde la respuesta, utiliza** *BeautifulSoup* ***con el parser*** lxml ***para convertir el HTML en una estructura que pueda navegar fácilmente, crea una lista vacía para guardar la información de cada senador, selecciona todos los elementos*** < tr > ***anidados tres veces y filtra las filas que contiene al menos una celda*** < td > ***con clase*** *detail.* ***Selecciona las celdas dentro de cada fila y extrae el texto de cada una. 
Organiza los datos, crea una tupla con esa información y la agrega a la lista*** *members.*"
       ],
       "metadata": {
         "id": "F6H6fl5vvm8F"

From 41ce4f56048db7ad79d7511f6965566c562c4071 Mon Sep 17 00:00:00 2001
From: dquinonez25
Date: Mon, 25 Aug 2025 23:51:03 -0500
Subject: [PATCH 13/13] =?UTF-8?q?Se=20cre=C3=B3=20con=20Colab?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 lessons/02_web_scraping.ipynb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index f5551e2..920c5ee 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -555,7 +555,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "***Muestra cuántas etiquetas*** < a > ***hay en la página web analizada con BeautifulSoup.***"
+        "***Imprime la cantidad de elementos que hay en el objeto*** *a_tags.*"
       ],
       "metadata": {
         "id": "mmXCsf-1SKiK"
@@ -1175,7 +1175,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "***Valida si la variable está definida y existe en el entorno local, compara tres selecciones de elementos y comprueba si los tres resultados son iguales el programa continúa, caso contrario imprime que no está definido, no se puede realizar la afirmación.***"
+        "***Verifica si la variable está definida y existe en el entorno local, compara tres selecciones de elementos y comprueba si los tres resultados son iguales; si lo son, el programa continúa, caso contrario imprime que no está definido y no se puede realizar la afirmación.***"
       ],
       "metadata": {
         "id": "QQBUUJdO78NN"
@@ -1367,7 +1367,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "**Verifica si la lista** rows ***contiene elementos, muestra el contenido completo de la primera fila, muestra la segunda fila útil para comparar estructura o contenido, muestra la última fila de la lista usando índice negativo. Caso contrario imprime un mensaje indicando que no se encontraron filas en el paso anterior.***"
+        "***Verifica si la lista*** *rows* ***contiene elementos, muestra el contenido completo de la primera fila, muestra la segunda fila útil para comparar estructura o contenido, muestra la última fila de la lista usando índice negativo. Caso contrario imprime un mensaje indicando que no se encontraron filas en el paso anterior.***"
       ],
       "metadata": {
         "id": "_CjvzH-PXFfX"
@@ -1789,7 +1789,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "**Hace una petición GET al sitio web del Senado de Illinois para obtener la lista de senadores, extrae el contenido HTML como texto plano desde la respuesta, utiliza** *BeautifulSoup* ***con el parser*** lxml ***para convertir el HTML en una estructura que pueda navegar fácilmente, crea una lista vacía para guardar la información de cada senador, selecciona todos los elementos*** < tr > ***anidados tres veces y filtra las filas que contiene al menos una celda*** < td > ***con clase*** *detail.* ***Selecciona las celdas dentro de cada fila y extrae el texto de cada una. Organiza los datos, crea una tupla con esa información y la agrega a la lista*** *members.*"
+        "***Hace una petición GET al sitio web del Senado de Illinois para obtener la lista de senadores, extrae el contenido HTML como texto plano desde la respuesta, utiliza*** *BeautifulSoup* ***con el parser*** lxml ***para convertir el HTML en una estructura que se pueda navegar fácilmente, crea una lista vacía para guardar la información de cada senador, selecciona todos los elementos*** < tr > ***anidados tres veces y filtra las filas que contienen al menos una celda*** < td > ***con clase*** *detail.* ***Selecciona las celdas dentro de cada fila y extrae el texto de cada una. Organiza los datos, crea una tupla con esa información y la agrega a la lista*** *members.*"
       ],
       "metadata": {
         "id": "F6H6fl5vvm8F"