
For a homework project, I'm creating a PHP-driven website whose main function is aggregating news about various university courses. The main problem is this: (almost) every course has its own website. These are usually just plain HTML or built with some simple free CMS. As a student taking 6-7 courses, you end up going through 6-7 websites almost every day, checking whether there is any news. The idea behind the project is that you don't have to do that; instead, you just check the aggregation site.

My idea is the following: each time a student logs in, go through his course list. For each course, fetch its website (recursively, like wget does) and compute a hash of it. If the hash differs from the one stored in the database, we know the site has changed, and we notify the student.
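A minimal sketch of that check on a single page (the course_hashes table and its column names are made up for illustration, and the upsert uses MySQL syntax):

    <?php
    // Fetch one course page, compare its hash to the stored one,
    // and record the new hash. Returns true if the page changed.
    function siteChanged(PDO $db, $courseId, $url)
    {
        $html = @file_get_contents($url);
        if ($html === false) {
            return false; // fetch failed; report "no change" rather than a false alarm
        }
        $hash = sha1($html);

        $stmt = $db->prepare('SELECT page_hash FROM course_hashes WHERE course_id = ?');
        $stmt->execute(array($courseId));
        $stored = $stmt->fetchColumn();

        $stmt = $db->prepare(
            'INSERT INTO course_hashes (course_id, page_hash) VALUES (?, ?)
             ON DUPLICATE KEY UPDATE page_hash = VALUES(page_hash)'
        );
        $stmt->execute(array($courseId, $hash));

        // On the very first crawl there is nothing to compare against.
        return $stored !== false && $stored !== $hash;
    }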

So, what do you think: is this a reasonable way to achieve the functionality? And if so, what is (technically) the best way to go about it? I was looking at php_curl, but I don't know whether it can fetch a website recursively.

Furthermore, there's a slight problem: I have somewhat limited resources, only a few MB of quota on the public (university) server. However, if that's a big problem, I could use a separate hosting solution.

Thanks :)

  • I don't think crawling the websites on user login is a good idea, because it may take considerable time. PHP does not support threads, so you would have to crawl the pages one by one, and that takes time.
    – Karolis
    Commented Jun 18, 2011 at 12:39
  • @Karolis Yeah, I was thinking about the time issue (especially if there turn out to be many users, hopefully). One thing that came to mind was limiting the check to once per hour (one way to do that is sketched after these comments). Do you have a better idea?
    – Stan
    Commented Jun 18, 2011 at 13:00
  • I think it will be slow enough even if you are the only user :) So, yeah, once per hour would be OK.
    – Karolis
    Commented Jun 18, 2011 at 13:19
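A minimal sketch of that once-per-hour limit, done in PHP rather than cron (handy if the university server doesn't offer cron jobs); the checked_at column is a made-up name:

    <?php
    // On login, only re-crawl a course site if the stored check
    // is more than an hour old.
    function needsCheck(PDO $db, $courseId)
    {
        $stmt = $db->prepare('SELECT checked_at FROM course_hashes WHERE course_id = ?');
        $stmt->execute(array($courseId));
        $checkedAt = $stmt->fetchColumn();
        return $checkedAt === false || time() - strtotime($checkedAt) > 3600;
    }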

1 Answer

Just use file_get_contents, or cURL if you absolutely have to (e.g., if you need cookies).
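For the cookie case, a minimal cURL sketch (the cookie-jar path is just an example location):

    <?php
    // Fetch a page with cURL, keeping cookies between requests.
    function fetchPage($url)
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,               // return the body as a string
            CURLOPT_FOLLOWLOCATION => true,               // follow redirects
            CURLOPT_COOKIEJAR      => '/tmp/cookies.txt', // write cookies here afterwards
            CURLOPT_COOKIEFILE     => '/tmp/cookies.txt', // send them back on the next request
            CURLOPT_TIMEOUT        => 10,                 // don't hang on a dead course site
        ));
        $html = curl_exec($ch);
        curl_close($ch);
        return $html; // false on failure
    }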

You can use your hashing trick to check for modifications, but it's not very elegant. What you really want to know is when the site was last changed. I doubt this information is on the website itself, but maybe they offer an RSS feed or some web service or API you can use for this purpose.
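One concrete place that "last changed" information sometimes lives is the HTTP Last-Modified header; a sketch of checking it (many small course sites won't send it, in which case you fall back to hashing):

    <?php
    // Ask the server when the page last changed, without downloading it.
    // Returns a Unix timestamp, or null if the header is missing.
    function lastModified($url)
    {
        $headers = @get_headers($url, 1); // 1 = associative array of headers
        if ($headers === false || !isset($headers['Last-Modified'])) {
            return null;
        }
        // After a redirect this entry can be an array; take the last value.
        $value = is_array($headers['Last-Modified'])
            ? end($headers['Last-Modified'])
            : $headers['Last-Modified'];
        return strtotime($value);
    }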

Don't worry about fetching recursively in a single call; just make a separate request for each page you need.

"When all else fails, build a scraper"

  • Unfortunately, there really are no patterns to the course sites (sometimes there is no site at all); it all depends on the lecturer. The only pattern is that they are relatively small (say, fewer than 20 pages). So, if I understood correctly, you suggest the following: file_get_contents on the index page, search it for all internal links, and repeat? (Roughly like the sketch below.)
    – Stan
    Commented Jun 18, 2011 at 13:02
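A minimal sketch of that fetch-and-follow loop, using DOMDocument rather than a regexp (regexps tend to choke on real-world HTML). The 20-page cap matches the size estimate above, and resolving relative URLs is deliberately left out:

    <?php
    // Breadth-first crawl of one course site, capped at $maxPages.
    function crawlSite($startUrl, $maxPages = 20)
    {
        $host    = parse_url($startUrl, PHP_URL_HOST);
        $queue   = array($startUrl);
        $visited = array();
        $pages   = array();

        while ($queue && count($pages) < $maxPages) {
            $url = array_shift($queue);
            if (isset($visited[$url])) {
                continue;
            }
            $visited[$url] = true;

            $html = @file_get_contents($url);
            if ($html === false) {
                continue;
            }
            $pages[$url] = $html;

            $dom = new DOMDocument();
            @$dom->loadHTML($html); // suppress warnings from sloppy HTML

            foreach ($dom->getElementsByTagName('a') as $a) {
                $href = $a->getAttribute('href');
                // Only follow absolute links on the same host; relative
                // links would need proper URL resolution first.
                if (parse_url($href, PHP_URL_HOST) === $host) {
                    $queue[] = $href;
                }
            }
        }
        return $pages; // hash each page, or implode() and hash the lot
    }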
