Timeline for Can I get Open Graph Protocol data without behaving as a web scraper?
Current License: CC BY-SA 4.0
Post Revisions
19 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| S Sep 18, 2025 at 18:23 | history | suggested | Ali Khakbaz | CC BY-SA 4.0 | fixed grammar |
| Sep 18, 2025 at 12:11 | review | Suggested edits | |||
| S Sep 18, 2025 at 18:23 | |||||
| Sep 11, 2025 at 17:43 | comment | added | JimmyJames | I know this doesn't help you, but I can't see why they didn't use headers for this. Maybe I'm missing something, but it looks like another example of Meta creating a new standard for something that's already solved by HTTP natively. | |
| Sep 10, 2025 at 19:12 | history | edited | Lamron | CC BY-SA 4.0 | "deny" detail |
| Sep 9, 2025 at 15:46 | answer | added | Doc Brown | timeline score: 2 | |
| Sep 9, 2025 at 12:23 | comment | added | Lamron | To find trending words, my program analyzes Bluesky posts; that part isn't related to OGP. | |
| Sep 9, 2025 at 11:08 | comment | added | Basilevs | Lamron, how do you find trending words? Are you scanning some sites? I could be wrong here. | |
| Sep 9, 2025 at 10:54 | comment | added | Doc Brown | Correct me if I'm wrong, but AFAIK the purpose of a robots.txt is usually to stop search crawlers from frequently scanning an entire website, not to stop anybody from seeing the content of a site (or its headlines) at all. If someone adds OGP data to their site, they want the headlines to be presented on social media / newsfeeds, and the content of robots.txt should usually be in line with that goal (otherwise it is misdesigned, which should not be your concern). | |
| Sep 9, 2025 at 10:29 | comment | added | freakish | @Basilevs yeah, yeah. I'm pretty sure companies around the world are ethical with regard to our data as well. Sorry, I don't give a f**k. | |
| Sep 9, 2025 at 10:10 | comment | added | Basilevs | @freakish an ethical bot respects robots.txt and presents an accurate agent name. | |
| Sep 9, 2025 at 8:30 | comment | added | freakish | If you truly want to download only the meta tags, which typically reside inside the <head></head> element, you can always download and parse the page chunk by chunk until you see the </head> tag. Choose an XML parser that works chunk by chunk; there are plenty of them. Doable, but a pain in the a**. Plus, closing an incomplete connection might look suspicious. | |
| Sep 9, 2025 at 8:25 | comment | added | freakish | "If website denies bots, it becomes impossible to get OGP data." I don't understand this statement. You literally just make a request to the web server and parse the result. There's no way for the server to prevent that (well, unless you make something like millions of requests in a short time). They cannot deny you, just like they cannot deny a human user. There is no difference, as long as you behave. As for the first question: why is downloading the entire page a problem? HTML doesn't weigh that much compared to, say, images or videos. | |
| Sep 9, 2025 at 6:31 | review | Close votes | |||
| Sep 14, 2025 at 3:00 | |||||
| Sep 9, 2025 at 3:05 | history | edited | Lamron | CC BY-SA 4.0 | Add actual cases |
| Sep 8, 2025 at 22:20 | history | edited | Arseni Mourzenko | CC BY-SA 4.0 | added 126 characters in body; edited tags; edited title |
| Sep 8, 2025 at 22:17 | comment | added | Arseni Mourzenko | Good question. I took the liberty of making a few changes to make the question clearer and reduce the risk of it being downvoted and closed. Check whether your intention was preserved. You may also want to add an example of your particular case, i.e. why exactly you want to extract OGP in the first place; answers may vary depending on that. | |
| Sep 8, 2025 at 22:15 | history | edited | Arseni Mourzenko | CC BY-SA 4.0 | added 126 characters in body; edited tags; edited title |
| S Sep 8, 2025 at 21:39 | review | First questions | |||
| Sep 9, 2025 at 1:40 | |||||
| S Sep 8, 2025 at 21:39 | history | asked | Lamron | CC BY-SA 4.0 |
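
freakish's Sep 9 comment suggests reading the page chunk by chunk and stopping once `</head>` is seen. A minimal sketch of that idea follows, using only Python's standard library. The names `OGHeadParser` and `read_og_tags` and the `max_bytes` cap are hypothetical choices for illustration, not something from the thread, and the sketch assumes the page is UTF-8 and uses `html.parser` rather than a strict XML parser, since real-world HTML is often not well-formed XML.

```python
import codecs
from html.parser import HTMLParser
from typing import Dict, Iterable


class OGHeadParser(HTMLParser):
    """Collects <meta property="og:*"> tags and notes when the head ends."""

    def __init__(self) -> None:
        super().__init__()
        self.og: Dict[str, str] = {}
        self.head_closed = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            prop = attr.get("property") or ""
            content = attr.get("content")
            if prop.startswith("og:") and content is not None:
                # Keep the first value seen for each property.
                self.og.setdefault(prop, content)
        elif tag == "body":
            # Some pages omit </head>; <body> also means the head is over.
            self.head_closed = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.head_closed = True


def read_og_tags(chunks: Iterable[bytes], max_bytes: int = 512_000) -> Dict[str, str]:
    """Feed byte chunks to the parser and stop pulling once the head is closed.

    `max_bytes` is an assumed safety cap for pages that never close <head>.
    """
    parser = OGHeadParser()
    # Incremental decoding avoids breaking multi-byte characters at chunk edges.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    seen = 0
    for chunk in chunks:
        parser.feed(decoder.decode(chunk))
        seen += len(chunk)
        if parser.head_closed or seen > max_bytes:
            break
    return parser.og
```

The function takes any iterable of byte chunks, so it can be paired with whatever streaming HTTP client is in use, for example a loop over `response.read(4096)` from `urllib.request.urlopen`. As Basilevs's comment notes, a well-behaved client would still check robots.txt and send an accurate agent name before making the request.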