Web Scraping using Python

Letícia Moura - Sep 16 - Dev Community

Web scraping is a technique widely used to collect data from online platforms, such as websites and social networks, which would otherwise be manually retrieved by a user. It is a technology used to automate the process of retrieving public data and can be particularly useful for social listening analyses.

How to Collect Data from Instagram Using Python: A Step-by-Step Guide

In this article, we will explore how you can use Python to collect post data on Instagram. We will explain the code step by step, starting with importing the necessary libraries.

Step 1: Import Libraries

To begin, we need to import some libraries that will help us collect and manipulate the data. In Python, libraries are sets of ready-made code that you can use in your own projects. The main library I will use here is Instaloader; see its official documentation for details.

import instaloader
import csv
import time
import threading
from datetime import datetime

What each library does:

  • instaloader: This library is used to download Instagram data, such as posts and comments.
  • csv: We use this library to create and manipulate CSV files where we will store the collected data.
  • time: Helps add pauses between actions, which is useful to avoid overloading Instagram with too many requests.
  • threading: Allows multiple tasks to run simultaneously (called threads), speeding up the data collection process.
  • datetime: Used to work with dates and times, allowing us to filter posts based on date.
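As a quick illustration of how datetime supports the date filter used later in the script, the sketch below compares two dates and formats one as text. The sample dates are made up for this example only.

```python
from datetime import datetime

# Hypothetical cutoff and post dates, chosen only for this example.
cutoff = datetime(2024, 1, 1)
post_date = datetime(2024, 3, 15, 18, 30)

# datetime objects compare chronologically.
is_after_cutoff = post_date > cutoff

# strftime turns a datetime into text, here as month/day/year.
formatted = post_date.strftime("%m/%d/%Y")

print(is_after_cutoff, formatted)
```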

Step 2: Set Up the Script

At the beginning of the script, we define the date limit for data collection. We also configure the connection to Instagram and prepare the file where we will save the data.

data_limite = datetime(2024, 1, 1)  # cutoff date for the collection
instagram = instaloader.Instaloader()
instagram.load_session_from_file('insert_your_@_here')  # your Instagram handle
dados = []
# newline='' lets the csv module control line endings itself
csvfile = open('data_collection.csv', 'a', newline='', encoding='utf-8')
writer = csv.writer(csvfile)
writer.writerow(['post_number', 'username', 'likes', 'views', 'date',
                 'post_link', 'image', 'post_description', 'comments'])

Explanation:

  • data_limite = datetime(2024, 1, 1): Sets the cutoff date for the collection. In our case, we are only interested in posts made after January 1, 2024.
  • instagram = instaloader.Instaloader(): Creates an object that allows us to interact with Instagram.
  • instagram.load_session_from_file('insert_your_@_here'): Loads a saved session to avoid repeated logins (you should insert your Instagram handle in the placeholder). The session file must already exist, for example after logging in once with the Instaloader command-line tool (instaloader --login=your_handle).
  • dados = []: Initializes an empty list to store the collected data. The script shown here writes directly to the CSV file and does not use this list afterwards.
  • csvfile = open('data_collection.csv', 'a', newline='', encoding='utf-8'): Opens (or creates) a CSV file to store the data. The 'a' mode appends to the file, so earlier data is kept when the script runs again, and newline='' is the setting the csv module recommends for files it writes to.
  • writer = csv.writer(csvfile): Creates an object that allows writing to the CSV file.
  • writer.writerow([...]): Writes the header to the CSV file, defining the column names.
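Because the file is opened in append mode, running the script a second time writes the header row again in the middle of the data. A minimal sketch of one way to avoid that is shown below; the helper name open_csv_for_append is hypothetical and not part of the original script.

```python
import csv
import os

HEADER = ['post_number', 'username', 'likes', 'views', 'date',
          'post_link', 'image', 'post_description', 'comments']

def open_csv_for_append(path):
    """Open path in append mode; write the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' is what the csv module documentation recommends for writer files.
    csvfile = open(path, 'a', newline='', encoding='utf-8')
    writer = csv.writer(csvfile)
    if needs_header:
        writer.writerow(HEADER)
    return csvfile, writer
```

In the script, csvfile, writer = open_csv_for_append('data_collection.csv') could then stand in for the separate open, csv.writer, and writerow calls.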

Step 3: Create Function to Collect Posts

The next part of the code defines a function that collects posts from a specific Instagram user.

def get_posts_by_username(username):
    post_number = 0
    for post in instaloader.Profile.from_username(instagram.context, username).get_posts():
        time.sleep(10)
        post_number += 1
        print('Post: ' + post.shortcode)
        print(post_number)
        # Stop at the first post older than the cutoff, ignoring the first 4 posts
        if post.date < data_limite and post_number > 4:
            print("Post date is before 01/01/2024 and post number exceeds 4. Stopping collection.")
            break
        substring = "insert_the_word_you_want_to_search_here"
        if post.caption is not None and substring in post.caption.lower():
            comments = []
            for comment in post.get_comments():
                comments.append(comment.text)
            formatted_description = post.caption.replace('\n', ' ') if post.caption else ''
            data = {
                "username": post.owner_username,
                "likes": post.likes,
                "views": post.video_view_count,
                "date": post.date.strftime("%m/%d/%Y"),
                "post_link": post.shortcode,
                "image": post.url,
                "post_description": formatted_description,
                "comments": comments,
                "comment_count": post.comments
            }
            formatted_comments = [comment.replace('\n', ' ') for comment in comments]
            # One CSV row per comment; the lock keeps threads from interleaving writes
            with lock:
                for comment in formatted_comments:
                    writer.writerow([post_number, data['username'], data['likes'], data['views'], data['date'],
                                     data['post_link'], data['image'], data['post_description'], comment])
    print('User collection completed: ' + username)

Explanation:

  • def get_posts_by_username(username): Defines a function that collects posts from a user.
  • post_number = 0: Initializes a post counter.
  • for post in instaloader.Profile.from_username(instagram.context, username).get_posts(): Iterates over the user's posts.
  • time.sleep(10): Pauses for 10 seconds between posts to avoid overloading Instagram.
  • if post.date < data_limite and post_number > 4: Checks whether the post is older than the cutoff date and whether more than 4 posts have already been seen. If both are true, it stops the collection. The post_number > 4 condition presumably keeps older pinned posts at the top of a profile from ending the loop early.
  • substring = "insert_the_word_you_want_to_search_here": Defines the keyword to filter the posts. Write it in lowercase, because it is compared against the lowercased caption.
  • if post.caption is not None and substring in post.caption.lower(): Checks if the post caption contains the keyword.
  • comments = []: Initializes a list to store comments.
  • for comment in post.get_comments(): Retrieves and stores the text of every comment on the post.
  • formatted_description = post.caption.replace('\n', ' ') if post.caption else '': Replaces line breaks in the post description with spaces so it stays on one CSV line.
  • data = { ... }: Creates a dictionary with the post information.
  • formatted_comments = [comment.replace('\n', ' ') for comment in comments]: Replaces line breaks in each comment with spaces.
  • with lock: Ensures that only one thread writes to the CSV file at a time (the lock itself is created in Step 4). One row is written per comment, so a matching post with no comments produces no rows.
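The stopping rule and keyword filter can be tried out without contacting Instagram by pulling them into small standalone functions. The sketch below assumes the intent stated in the script's printed message: stop once a post dated before the cutoff appears beyond the first four posts. The function names and the pinned_allowance parameter are hypothetical.

```python
from datetime import datetime

def should_stop(post_date, post_number, cutoff, pinned_allowance=4):
    """Return True once a post older than the cutoff appears after the first few posts."""
    return post_date < cutoff and post_number > pinned_allowance

def matches_keyword(caption, keyword):
    """Case-insensitive substring check that tolerates posts without a caption."""
    return caption is not None and keyword.lower() in caption.lower()

def flatten(text):
    """Replace line breaks with spaces so each CSV cell stays on one line."""
    return text.replace('\n', ' ') if text else ''

cutoff = datetime(2024, 1, 1)
print(should_stop(datetime(2023, 12, 31), 5, cutoff))   # old post, past the first four
print(should_stop(datetime(2023, 12, 31), 3, cutoff))   # old post, still within the first four
print(matches_keyword("Summer SALE starts today", "sale"))
print(flatten("line one\nline two"))
```

Keeping these checks separate from the network calls makes it easier to adjust the cutoff or keyword logic and confirm the behavior before running a long collection.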

Step 4: Run Data Collection in Multiple Threads

To speed up data collection, we use multiple threads, one for each user.

lock = threading.Lock()
usernames = ["insert_user_@1", "insert_user_@2"]
threads = []
for username in usernames:
    thread = threading.Thread(target=get_posts_by_username, args=[username])
    threads.append(thread)
    # Keep at most 10 threads; older entries are dropped before they start
    while len(threads) > 10:
        threads.remove(threads[0])
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
csvfile.close()

Explanation:

  • lock = threading.Lock(): Creates a lock to ensure safe writing to the CSV file.
  • usernames = [...]: List of users to collect data from.
  • threads = []: Initializes a list of threads.
  • for username in usernames: Creates a thread for each user.
  • while len(threads) > 10: Keeps at most 10 threads in the list by discarding the oldest ones. Discarded threads are never started, so with more than 10 usernames only the last 10 are collected.
  • for thread in threads: thread.start(): Starts all threads.
  • for thread in threads: thread.join(): Waits for all threads to complete.
  • csvfile.close(): Closes the CSV file after data collection.
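If the list holds more than 10 usernames, one alternative to trimming the thread list is a fixed-size pool that queues the extra users instead of skipping them. The sketch below uses the standard library's ThreadPoolExecutor, which is not what the original script uses. The collect function is a placeholder for get_posts_by_username, the usernames are made up, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()
buffer = io.StringIO()            # stands in for the CSV file in this example
writer = csv.writer(buffer)

def collect(username):
    """Placeholder worker; the real script would call get_posts_by_username here."""
    with lock:                    # only one thread writes at a time
        writer.writerow([username, 'example row'])

usernames = [f"user_{i}" for i in range(25)]   # made-up handles

# At most 10 workers run at once; the remaining usernames wait in a queue.
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(collect, usernames))         # list() surfaces any worker exception

rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows))
```

Every username is processed here, with concurrency still capped at 10.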

Conclusion

In this article, I showed how to use Python to collect data from Instagram, including posts and comments, and save that information in a CSV file. Using this code, you can search for a keyword across multiple Instagram accounts and analyze the collected data.

If you follow these steps and understand the explained code, you can adapt and expand this example to meet your specific data collection needs.
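As a starting point for analysis, the sketch below counts comments per post from data in the same column layout as data_collection.csv. The rows are made-up sample data held in memory; to work with real results, the sample could be replaced by the opened CSV file.

```python
import csv
import io
from collections import Counter

# Made-up sample in the same column layout as data_collection.csv.
sample = io.StringIO(
    "post_number,username,likes,views,date,post_link,image,post_description,comments\n"
    "1,user_a,10,,01/05/2024,abc,http://example.com/1.jpg,first post,nice\n"
    "1,user_a,10,,01/05/2024,abc,http://example.com/1.jpg,first post,great\n"
    "2,user_b,7,,02/01/2024,def,http://example.com/2.jpg,second post,wow\n"
)

comments_per_post = Counter()
for row in csv.DictReader(sample):
    # Each row holds one comment, so counting rows per post_link counts comments.
    comments_per_post[row['post_link']] += 1

print(comments_per_post.most_common())
```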
