How I Am Using AI to Automate Web Scraping

2024-12-12

Web scraping has revolutionized the way we gather data from websites, offering endless possibilities for analytics, insights, and building custom tools. However, as web landscapes grow more dynamic and complex, automating this process effectively has become a necessity. Recently, I’ve started integrating AI with automation tools like Playwright to simplify and supercharge web scraping.

This post explores my journey in combining AI with traditional scraping tools for smarter, more efficient data extraction.

Why Automate Web Scraping?

In our data-driven era, information is the lifeblood of innovation. Yet, while the internet overflows with data, accessing and structuring it isn’t always straightforward. Web scraping bridges this gap, transforming raw web pages into actionable insights. But manual or static scraping scripts often fall short when faced with dynamic content, rate limits, or anti-bot measures.

That’s where automation and AI come in, offering adaptability, precision, and speed.

The Power of Automation with Playwright

Playwright is a robust browser automation library that mimics human interaction with web pages. While originally designed for end-to-end testing, it has become a favorite for web scraping tasks. From navigating links to extracting specific data, Playwright offers the flexibility needed to handle modern websites.

Here’s a simple example of scraping quotes from https://quotes.toscrape.com.

Code Walkthrough

from playwright.sync_api import sync_playwright
 
def scrape_quotes(playwright):
    browser = playwright.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    quotes = page.locator('.text').all_text_contents()
    for quote in quotes:
        print(quote)
    browser.close()
 
with sync_playwright() as playwright:
    scrape_quotes(playwright)

Print Output:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
 
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
 
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
 
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
 
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
 
“Try not to become a man of success. Rather become a man of value.”
 
“It is better to be hated for what you are than to be loved for what you are not.”
 
“I have not failed. I've just found 10,000 ways that won't work.”
 
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
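The same selector-based approach can return structured records instead of printed strings. Here is a minimal sketch, assuming each quote on the page sits in a `.quote` container with `.text` and `.author` children. The names `strip_curly_quotes` and `scrape_quote_records` are made up for this illustration.

```python
def strip_curly_quotes(text: str) -> str:
    """Remove the decorative curly quotes that wrap each quote in the output."""
    return text.strip().strip("\u201c\u201d")


def scrape_quote_records(url: str = "https://quotes.toscrape.com") -> list[dict]:
    """Return one {"text", "author"} dict per quote found on the page."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    records = []
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Assumption: ".quote" wraps each quote, with ".text" and ".author" inside.
        # Locator.all() needs a reasonably recent Playwright release.
        for card in page.locator(".quote").all():
            records.append({
                "text": strip_curly_quotes(card.locator(".text").inner_text()),
                "author": card.locator(".author").inner_text(),
            })
        browser.close()
    return records
```

Calling `scrape_quote_records()` should give a list of dictionaries that is easier to save or filter than raw printed lines.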

Playwright's intuitive API made setting this up a breeze. But what if you’re scraping more complex content, like comments on a YouTube video? That’s where AI comes in.

Let's Introduce AI/ML

As before, we use Playwright, but this time we have no selector to extract the data with. Instead, we will use AI/ML to extract the comments from a YouTube video's comment section. AgentQL allows you to define "queries" that describe the data you want, and it uses machine learning to locate and extract that data, no traditional CSS selectors required. We will go through the following steps.

Prerequisites:

Install the agentql library

pip install agentql
agentql init

Get an API key from AgentQL

Scraping YouTube Comments Using AI

Set up logging and launch Playwright with AgentQL

 
import logging

import agentql
from playwright.sync_api import sync_playwright

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)
URL = "https://www.youtube.com/"

with sync_playwright() as playwright, playwright.chromium.launch(headless=False) as browser:
    page = agentql.wrap(browser.new_page())
    page.goto(URL)

Define Queries for Data Extraction

We define a set of queries for extracting video links, titles, descriptions, and comments.

    # The agent understands this is a search input and the attributes associated with it.
    SEARCH_QUERY = """
    {
        search_input
        search_btn
    }
    """
    # We want a list of videos, and they should have a link, title, and channel name.
    VIDEO_QUERY = """
    {
        videos[] {
            video_link
            video_title
            channel_name
        }
    }
    """
    # We want to know the state of the play/pause button and the expand description button.
    VIDEO_CONTROL_QUERY = """
    {
        play_or_pause_btn
        expand_description_btn
    }
    """
    # We want to capture the description text.
    DESCRIPTION_QUERY = """
    {
        description_text
    }
    """
    # We want to capture the comments and the channel name.
    COMMENT_QUERY = """
    {
        comments[] {
            channel_name
            comment_text
        }
    }
    """

 
    try:
        # Prompt the user for a search term via the shell
        search_term = input("Enter your YouTube search query: ")

        # Search query: type the term and submit the search
        response = page.query_elements(SEARCH_QUERY)
        response.search_input.type(search_term, delay=63)
        response.search_btn.click()

        # Video query: click the first YouTube video in the results
        response = page.query_elements(VIDEO_QUERY)
        log.debug(f"Clicking YouTube video: {response.videos[0].video_title.text_content()}")
        response.videos[0].video_link.click()

        # Video control query: expand the description
        response = page.query_elements(VIDEO_CONTROL_QUERY)
        response.expand_description_btn.click()

        # Description query
        response_data = page.query_data(DESCRIPTION_QUERY)
        log.debug(f"Captured the following description: \n{response_data['description_text']}")

        # Scroll down the page to load more comments
        for _ in range(9):
            page.keyboard.press("PageDown")
            page.wait_for_page_ready_state()

        # Comment query; the debug line logs, for example, "Captured 16 comments!"
        response = page.query_data(COMMENT_QUERY)
        log.debug(f"Captured {len(response['comments'])} comments!")

        for each_comment in response['comments']:
            print(f"{each_comment['channel_name']}: {each_comment['comment_text']}")

    except Exception as e:
        # Log the error and re-raise it if something goes wrong
        log.error(f"Found Error: {e}")
        raise

    page.wait_for_timeout(10000)
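All five queries above follow one pattern: braces around a list of field names, with `name[]` marking a list of items. As a rough illustration of that shape, the hypothetical helper `build_query` below renders such a query string from a plain Python list. It is not part of AgentQL, only a sketch of how the strings are structured.

```python
def build_query(fields, indent=0):
    """Render a nested field spec as a query string shaped like the ones above.

    `fields` is a list whose items are either a field name (str) or a
    (list_name, [subfields]) tuple, rendered as `list_name[] { ... }`.
    """
    pad = "    " * indent
    inner = "    " * (indent + 1)
    lines = [pad + "{"]
    for field in fields:
        if isinstance(field, str):
            lines.append(inner + field)
        else:
            name, subfields = field
            # Render the nested block, then attach it to the `name[]` line.
            nested = build_query(subfields, indent + 1).lstrip()
            lines.append(f"{inner}{name}[] {nested}")
    lines.append(pad + "}")
    return "\n".join(lines)


# Rebuild the comment query from the script above.
comment_query = build_query([("comments", ["channel_name", "comment_text"])])
print(comment_query)
```

Writing the queries by hand, as in the script, is just as valid. A helper like this only pays off if you generate many similar queries.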

Output (the asterisks mask the users' names):

Captured 16 comments!
 
@daniel**_f: Amazing! As someone who wants to learn ML but has little to no idea about it yet, this video was really easy to follow. Keep it up!

@Softoni_**: Love and respect from a small village in India i even can't have this type of valuable info from the paid sources Thanks you so much 🥰

@dec**_: You just covered the first month of my 400 level machine learning class, minus the math, examples, and a a couple newer dimension reduction techniques. This video is a good resource.

@_**fou7070: I'm preparing a class about algorithms for high school students and this video has synthesized and simplified like half of the job.

@tan**_: Overview of major machine learning algorithms Supervised learning involves predicting and classifying data

@ap**99: Studying for my midterm next week. This was a great quick overview!

@E**DS: Thanks for this guideline. It makes me want to actually take a class on data science and information theory.

@Roy_**wati: This is just awesome, I was trying to learn ml models since 2-3 months but getting confused, this one video made me understand each with clarity in just 1 hour 😮, this is awesome ❤

@Marc**_ini: Great content! But I'd love to see a series of videos exploring each of these algorithms step by step, with real life examples and with proper time for understanding it.

@foxt_**s501: Dude WHAT? I spent a week trying to understand all of these and here I am, understood everything crystal clear in an hour 🤨

@super*man.: It's very interesting and easy to understand, we need real time example with code in seperate topics

@ria**_53: these short visualized explanations help way more than a certain online course im currently taking 😭👍Great job. Do you have more?!

@sengnaw***ng9179: I do not know whether this is a person or not. This is the best explanation.

@wik**_858: I recommend this video. Not only a time saver, but quite a good description of what these methods do and when they work best 🎉

@Abdull_**lamNabi: Love the animations and simplicity you explained all the topics.

@jainja_: Thank you; it was super helpful for me to understand the big picture of ML!
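The names in this output were masked by hand. If you want to do the same in code before sharing scraped comments, one possible sketch follows. It assumes each comment is a dict with `channel_name` and `comment_text` keys, as in the script. The function names and the rule of keeping three characters at each end are my own choices for illustration.

```python
def mask_handle(handle: str, keep: int = 3) -> str:
    """Replace the middle of a handle with '**', keeping `keep` characters at each end."""
    name = handle.lstrip("@")
    if len(name) <= keep * 2:
        # Too short to keep both ends without revealing the whole name.
        return "@**"
    return f"@{name[:keep]}**{name[-keep:]}"


def mask_comments(comments: list[dict]) -> list[dict]:
    """Return copies of the comment dicts with the channel names masked."""
    return [
        {**comment, "channel_name": mask_handle(comment["channel_name"])}
        for comment in comments
    ]


# Made-up sample data, not taken from a real video.
sample = [{"channel_name": "@example_user", "comment_text": "Great overview!"}]
for comment in mask_comments(sample):
    print(f"{comment['channel_name']}: {comment['comment_text']}")
# prints "@exa**ser: Great overview!"
```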

Why Combine Playwright and AI?

Combining Playwright’s automation capabilities with AI’s intelligence allows for:

Dynamic Adaptability: AI can interpret changing layouts or missing elements.
Enhanced Efficiency: Repetitive tasks are automated, and complex data structures can be extracted without hand-written selectors.
Scalability: Processes that once required manual oversight can now run autonomously.
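One way to picture the adaptability point is a selector-first scraper that only calls the AI query when the selector stops matching. The sketch below is deliberately generic: `extract_with_fallback` is a hypothetical helper that takes two callables, so it makes no assumptions about any particular page or library.

```python
from typing import Callable, Sequence


def extract_with_fallback(
    primary: Callable[[], Sequence[str]],
    fallback: Callable[[], Sequence[str]],
) -> list[str]:
    """Return the primary extractor's results, or the fallback's if it fails or finds nothing."""
    try:
        results = list(primary())
    except Exception:
        # A broken selector or changed layout is treated like "nothing found".
        results = []
    return results if results else list(fallback())


# Hypothetical wiring, assuming `page` is an AgentQL-wrapped Playwright page,
# COMMENT_SELECTOR is a CSS selector you maintain yourself, and COMMENT_QUERY
# is the query from the script:
#
# comments = extract_with_fallback(
#     primary=lambda: page.locator(COMMENT_SELECTOR).all_text_contents(),
#     fallback=lambda: [c["comment_text"] for c in page.query_data(COMMENT_QUERY)["comments"]],
# )
```

The fast, cheap selector stays the default, and the AI query covers the cases where the layout has changed.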

Ready to supercharge your web scraping projects?
