If you've ever wondered how to download an entire website, you're probably familiar with the term web scraping.
What is web scraping?
Web scraping means downloading copies of websites to a computer. The technology is used not only to download entire websites, but also to extract specific data of interest from a given portal. The process is carried out by bots, indexing robots, or scripts written in languages such as Python. During scraping, specific data is collected from the web and copied to a local database.
Web scraping - what's its use?
You already know what web scraping is and you can probably guess how it can be used. Let us show you some uses of web scraping:
Scraping property listings
More advanced real estate agents use web scraping to populate their database of available properties for sale or rent.
Many companies use web scraping to build huge databases and extract industry-specific insights from them, which they can then sell to companies in related industries. For example, a company can scrape and analyze oil price, export, and import data, then sell its insights to oil companies around the world.
Web scraping is also an incredibly popular lead generation tool: for example, scraping online directories, job listings, email addresses, Twitter profiles, and so on.
In short, web scraping is used by companies to collect contact information about potential customers. This is extremely common in the B2B (business-to-business) space, where potential customers publicly post information about their companies on the web.
Web scraping in affiliate marketing
How does web scraping relate to affiliate marketing? Let's start with the biggest argument for getting interested in web scraping: the time you save by downloading competitors' websites. Everyone knows, or at least suspects, that creating a good landing page can be time-consuming, and that success depends, among other things, on speed. Other factors include openness to changing your approach, searching for new campaigns, running tests and, of course, analyzing your advertising. Success goes to those who don't get bogged down in details but look for ways to scale. To run a single campaign, you need to research the target group, select the GEO, choose offers, and prepare your assets, including a landing page.
Some people prefer to use landing pages provided by the affiliate network, others use ready-made templates from page builders, and still others create landing pages from scratch. The first two options are the most common. In some cases they can be profitable, but not in the long term: competition is fierce, and the available template packs wear out quickly.
A high-quality landing page is the key to future success and a good return on investment. It is worth adding that not every landing page taken from a competitor will bring the expected result. It is better to fine-tune the landing page to the criteria of your future advertising campaign.
Of course, you have to remember to do everything legally, i.e. according to certain rules, which you will learn about in a moment.
Is web scraping legal?
Yes. Web scraping is not a prohibited technology, and companies that use it do so legally. Unfortunately, there will always be someone who uses a given tool for piracy. Web scraping can be used for unfair price undercutting and for stealing copyrighted content. Clearly, the owner of a website that gets scraped can suffer huge financial losses. Interestingly, web scraping has been used by several foreign companies to save Instagram and Facebook stories that were supposed to be time-limited.
Scraping is fine as long as you respect copyright and stick to established standards. If you decide to cross over to the darker side, which is not accepted at MyLead, you may face various consequences.
Some good practices when scraping websites
Remember about the GDPR
In EU countries, you must comply with the EU data protection regulation, commonly known as the GDPR. If you aren't scraping personal data, you don't need to worry too much about it. As a reminder, personal data is any data that can identify a person, for example:
- first and last name,
- phone number,
- username (e.g. login / nickname),
- IP address,
- credit or debit card numbers,
- medical or biometric data.
To scrape personal data, you need a legal basis for storing it. Examples of such bases include:
1. Legitimate interest
You must be able to show that the data processing is necessary for your legitimate business purposes. However, this does not apply where those interests are overridden by the interests or fundamental rights and freedoms of the person whose data you want to process.
2. Customer consent
Each person whose data you want to collect must consent to the collection, storage and use of their data in the way you intend to do so, e.g. for marketing purposes.
If you have neither a legitimate interest nor customer consent, you are violating the GDPR, which may result in a fine or, depending on national law, even restriction of liberty or imprisonment for up to two years.
The GDPR protects residents of European Union countries, so it does not cover people in countries such as the United States, Japan, or Afghanistan.
Comply with copyright
Copyright is the exclusive right to a creative work, for example an article, photo, video, or piece of music. As you can guess, copyright is very important in web scraping, because a lot of data on the internet is copyrighted. Of course, there are exceptions in which you can scrape and use data without violating copyright laws, and these are:
- usage for private, personal use,
- usage for didactic purposes or for scientific activity,
- usage under the right to quote.
Web scraping - where to start?
1. URL
The first step is to find the URL of the page you are interested in. Decide on the topic you want to cover. You are limited only by your imagination and the available data sources.
2. HTML code
Learn the structure of the HTML code. Without knowing HTML, you will have a hard time finding the element you want to pull from your competitor's website. The best way is to right-click the element in the browser and use the Inspect option. You will then see the HTML tags and be able to identify the element of interest. Here's an example from Wikipedia:
As you can see, when you hover the mouse over a given line of code, the element corresponding to this line of code is highlighted on the page.
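You can replicate what Inspect shows you in code. A minimal sketch using Python's standard html.parser that pulls out the text of an element by its id (Wikipedia's main heading really does use id="firstHeading"; the sample HTML below is simplified):

```python
from html.parser import HTMLParser

class IdFinder(HTMLParser):
    """Collect the text inside the element whose id you spotted with Inspect."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; check for the id we want
        if dict(attrs).get("id") == self.target_id:
            self.capturing = True

    def handle_endtag(self, tag):
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.text += data

finder = IdFinder("firstHeading")
finder.feed('<h1 id="firstHeading">Web scraping</h1><p>intro</p>')
print(finder.text)  # Web scraping
```

The same pattern works for any id or, with a small change to handle_starttag, for classes and tag names.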
3. Work environment
Your work environment should be ready. You'll find out later that you'll need a text editor such as Visual Studio Code, Notepad++ (Windows), TextEdit (macOS), or Sublime Text, so get one now.
How to save a website?
Saving the page by the browser
In any browser, anyone, including you, can save a selected page to their computer in just a few minutes. A duplicate of the page is saved on the user's computer as an HTML file plus a folder of assets. The copy opens in the browser and looks quite faithful. However, to save a really large site, this process has to be repeated many times, page by page.
Paid web scraping services
There are many companies and freelancers on the internet who will do everything for you for a fee. One website copying service is ProWebScraper. It has a trial version that lets you scrape 100 pages; after that, of course, you have to pay. Plans start at $40 a month, depending on how many pages you want to scrape. You can always find another site with a free trial period. It is worth mentioning that some services let you check whether a given page can be copied, because many sites protect themselves against this.
Free website downloaders
If you want to save some money, take a look at the list of free website downloaders below.
The WebScrapBook plugin is available for Google Chrome and Mozilla Firefox. It downloads an entire page to your computer and offers several download options: each file from the target page saved separately, a single archive, or a single HTML file.
By default, each file is downloaded individually, but if you want to download an archive, go to the options and in the "Capture" tab, change the desired save option.
To download the page as an archive with its files, select the HTZ format. After downloading, use an archiver to unpack it manually.
When working on macOS, we advise you to take a closer look at SiteSucker. Its great advantage is that it works through its own graphical interface. A licensed version can be purchased on the App Store.
Cyotek WebCopy allows you to save a landing page on your computer and scan your competitors' websites. In the free version, you can create up to 40 projects with no more than 500 files in each. After installing and running the program, you need to create a new project. To download a landing page, enter the URL, specify the folder where you want to save the files, and click "Copy Website".
As you can see, there are quite a few options for creating a new project:
- creating a visible copy of the site on your hard drive,
- creating a copy of the website along with the directory structure,
- searching for files of a specific type on the site,
- checking all sites linked from the hub site,
- downloading one or more files from known addresses,
- searching the site for keywords.
To download the page, select the first option and then enter the landing page URL. In the next step, select "All" and then click "Finish". Remember to save the project and check that it has been saved in the project folder. To have the program download all the files, click "Start".
The last free program to create a local copy of a site or set of sites is HTTrack. Its main advantage is many convenient settings. Here you can, for example, configure filters for the required file types. It is also possible to download the necessary data, and all downloaded sites are "scattered" by projects and divided thematically.
Unfortunately, this program has a quite noticeable downside: it obeys the robots.txt file, so photos and pages excluded there may fail to download. To fix this, set the spider option to "no robots.txt rules". Only then can you be sure the page downloads completely. Spiders are the components that determine how a given site (or group of sites) is crawled, including how the crawl itself is performed and how structured data is extracted from its pages.
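If you end up scripting your own crawler rather than using HTTrack, Python's standard urllib.robotparser lets you check (or consciously ignore) the same robots.txt rules. A small sketch with an inline sample file; in a real crawler you would load the live file with set_url() and read() instead:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse a sample robots.txt inline; the rules below are made up for illustration
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))         # True
```

Checking before fetching keeps your crawler polite; skipping the check is exactly what HTTrack's "no robots.txt rules" setting does.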
Online web scraping services
Save a Web 2 ZIP
Save a Web 2 ZIP is the most popular browser-based web scraping service. Its simple, well-thought-out design inspires confidence, and it is completely free. All you need to do is provide the link to the page you want to copy, choose your options, and it's ready.
LPcopier is a Russian service aimed at the affiliate marketing world. The portal offers scraping from about $5 per page. Additional services, such as installing analytics counters, are priced separately. You can also order a custom landing page rather than one copied from a CPA network or an existing page. If the Russian language puts you off, just use the translation option that Google offers.
The Xdan website is also a Russian site (available in English) offering CopySite, a web scraping service. With its help, you can create a local copy of a landing page for free, with options to remove analytics counters from the HTML and to replace links or domains.
The Russian Copyst service is one of the fastest of its kind: they declare they will get back to you within 15 minutes. The scraping itself is ordered via a link, and for an additional fee you can have the site updated.
I downloaded the website. What's next?
Have you already downloaded a website? Great, now think about what you want to do with it. You'll certainly want to modify it a bit. How?
How to redesign a copied page?
To redesign the copied page for your own needs, open it in any editor that allows you to work with code, such as Visual Studio Code, Notepad++ (Windows), TextEdit (macOS), or Sublime Text. Customize the code, save it, and see how the changes are displayed in the browser. Edit the visual appearance of HTML elements with CSS; add web forms, call-to-action buttons, links, and so on. After saving, the modified file remains on your computer with the updated functions, layout, and targeted actions.
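The two most common tweaks, swapping the competitor's links for your own and restyling the page with CSS, can even be scripted. A rough sketch using plain string operations (both URLs below are made-up placeholders):

```python
def retarget_links(html: str, old_url: str, new_url: str) -> str:
    """Swap a competitor's URL for your own everywhere it appears (plain text replace)."""
    return html.replace(old_url, new_url)

def inject_css(html: str, css: str) -> str:
    """Insert a <style> block just before </head> to restyle the copied page."""
    return html.replace("</head>", f"<style>{css}</style></head>", 1)

page = '<html><head></head><body><a href="https://competitor.example/offer">Go</a></body></html>'
page = retarget_links(page, "https://competitor.example/offer", "https://my.example/offer")
page = inject_css(page, "body { font-family: sans-serif; }")
```

For anything beyond simple swaps, a real HTML parser is safer than string replacement, but for one-off landing page edits this is often enough.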
There are also services that rebuild a site from web archives into a website creation and management system (CMS), creating a duplicate of the project complete with an admin panel and disk space. Archivarix is an example of such a service; it can restore and archive projects.
Uploading websites to hosting
The last and most important step in web scraping of landing pages is uploading them to your hosting. Remember that copying and making small visual changes is not enough. Other people's affiliate links, scripts, retargeting pixels, Metrica JS codes, and other counters almost always remain in the page's code. They must be removed manually (or with paid tools) before uploading to your hosting. If you want to know exactly how to upload your website to hosting, check out our article: “How to create a landing page? Creating a website step by step”.
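As a first pass, stripping leftover <script> and <noscript> blocks (where most third-party counters and pixels live) can be automated. A rough regex-based sketch, which is a starting point rather than a substitute for a manual review of the code:

```python
import re

def strip_scripts(html: str) -> str:
    """Remove <script>...</script> blocks and <noscript> tracking fallbacks."""
    html = re.sub(r"<script\b[^>]*>.*?</script>", "", html, flags=re.I | re.S)
    html = re.sub(r"<noscript\b[^>]*>.*?</noscript>", "", html, flags=re.I | re.S)
    return html
```

Affiliate links and pixels embedded in ordinary tags (for example, an <img> pixel or an <a> with someone else's tracking ID) still need to be found and replaced by hand.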
How to defend against web scraping?
If you've ever noticed that your landing page has fallen victim to web scraping techniques, there is a way to redirect some of the traffic back to your page.
The code can be found in THIS THREAD.
Good to see you here!
We hope that you already know what web scraping is, how to download a web page, and, most importantly, how to comply with copyright laws. Now it's your turn to make your move and start earning. However, if you have any questions about affiliate marketing or you do not know which program to choose, please contact us.