While recently doing a small project, I was amazed by how much web scraping I could do with just one line of Bash. I used the text-based Lynx browser [1] and then piped the output to a grep search. Figure 1 shows the one-line Bash example that scrapes the current snow depth from the Sunshine Village Snow Forecast web page.
In this article, I will introduce some techniques to easily scrape web pages, and then I will create a desktop notification script that provides the daily snow forecast.
The Lynx Text Browser
For my Bash web scraping, I started out by looking at using command-line tools such as curl [2] with the htm12text [3] utility. This technique definitely works, but I found that using the Lynx browser offers a one-step solution with a slightly cleaner text output.
To install Lynx on Raspian/Debian/ Ubuntu, use:
sudo apt install lynx
The Lynx -dump option will output a web page to text with HTML tags, HTML encoding, and JavaScript removed. Figure 2 shows that a Lynx dump can greatly clean up the original web page and make searching considerably easier.
Sometimes a simple Bash grep search might be all that you need. However, there are many cases where some text manipulation is required. The good news is that Bash has a nice selection of line and string manipulation tools.
Esta historia es de la edición #262/September 2022 de Linux Magazine.
Comience su prueba gratuita de Magzter GOLD de 7 días para acceder a miles de historias premium seleccionadas y a más de 9,000 revistas y periódicos.
Ya eres suscriptor ? Conectar
Esta historia es de la edición #262/September 2022 de Linux Magazine.
Comience su prueba gratuita de Magzter GOLD de 7 días para acceder a miles de historias premium seleccionadas y a más de 9,000 revistas y periódicos.
Ya eres suscriptor? Conectar
MADDOG'S DOGHOUSE
The stakeholder approach of open source broadens the pool of who can access, influence, and benefit from information technologies.
MakerSpace
Rust, a potential successor to C/C++, claims to solve some memory safety issues while maintaining high performance. We look at Rust on embedded systems, where memory safety, concurrency, and security are equally important
In Harmony
Using the Go Interface mechanism, Mike demonstrates its practical application with a refresh program for local copies of Git repositories.
Monkey Business
Even small changes in a web page can improve the browsing experience. Your preferred web browser provides all the tools you need to inject JavaScript to adapt the page. You just need a browser with its debugging tools, some knowledge of scripting, and the browser extension Tampermonkey.
Smarter Navigation
Zoxide, a modern version of cd, lets you navigate long directory paths with less typing.
Through the Back Door
Cybercriminals are increasingly discovering Linux and adapting malware previously designed for Windows systems. We take you inside the Linux version of a famous Windows ransomware tool.
Page Pulse
Do you want to be alerted when a product is back in stock on your favorite online store? Do you want to know when a website without an RSS feed gets an update? With changedetection.io, you can stay up-to-date on website changes.
Arco Linux
ArcoLinux, an Arch derivative, offers easier installs while educating users about Arch Linux along the way.
Ghost Coder
Artificial intelligence is increasingly supporting programmers in their daily work. How effective are these tools? What are the dangers? And how can you benefit from Al-assisted development today?
Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.