I'm currently developing a web crawler. The first version was developed in Node.js and runs pretty well.
The issues that I encountered with Node.js are in no particular order:
- slow URL and query-string parsing library
- slow HTTP packet parsing due to the C++ binding call overhead
- slow DNS resolution due to a synchronous call to getaddrinfo rather than the asynchronous c-ares library
- lack of good Unicode support in RegExps
These are, for the most part, due to the fact that Node.js is still very young and is a general-purpose platform maintained by a small core developer team.
The proof-of-concept I have at the moment runs pretty well thanks to the asynchronous nature of Node.js and some monkey-patches that I had to add.
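To give an idea of what I mean by monkey-patches, here is a minimal sketch (not my exact code) of the DNS workaround: routing hostname resolution through dns.resolve4, which is backed by c-ares, instead of the default dns.lookup/getaddrinfo path. It assumes a Node version where the http/net connection options accept a custom lookup function.

```js
const http = require('http');
const dns = require('dns');

// Resolve hostnames with c-ares (dns.resolve4) instead of the
// threadpool-bound getaddrinfo call behind dns.lookup.
function caresLookup(hostname, options, callback) {
  dns.resolve4(hostname, (err, addresses) => {
    if (err) return callback(err);
    // Hand the first A record back to net.connect.
    callback(null, addresses[0], 4);
  });
}

http.get({ host: 'example.com', path: '/', lookup: caresLookup }, (res) => {
  console.log('status:', res.statusCode);
});
```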
Now I'm starting to think about what comes after the proof-of-concept.
The question of the programming language is one I wasn't able to answer yet.
The essential qualities that I'm looking for:
- I/O and concurrency management (see the sketch after this list)
- fast string-parsing capabilities for HTTP messages, HTML, JSON, etc.
- bindings to some kind of queuing / persistence layer
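To make the first point concrete, here is a toy version of the kind of bounded-concurrency fetch loop I have in mind, written in Node purely for illustration (the worker count and result fields are placeholders):

```js
// Drain a shared URL queue with a fixed number of concurrent workers,
// using Node 18+'s built-in fetch.
async function crawl(urls, concurrency = 50) {
  const queue = [...urls];
  const results = [];

  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        const res = await fetch(url);
        const body = await res.text();
        results.push({ url, status: res.status, bytes: body.length });
      } catch (err) {
        results.push({ url, error: err.message });
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}
```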
I thought about C, Erlang, Rust, Go and D.
I know this isn't usually the kind of question that's accepted here on Stack Exchange, but I'm not sure where else I could get an answer.
Update: Some more information:
- we currently use 8-core, 3.2 GHz machines
- the persistence layer is backed by a PostgreSQL database
- the point is to crawl all kinds of news sites and blogs as well as RSS feeds
- we currently forward the content to a processing pipeline through Apache Kafka
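For reference, the Kafka forwarding step looks roughly like this (a sketch using the kafkajs client; the broker address and topic name are placeholders, not our actual configuration):

```js
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'crawler', brokers: ['localhost:9092'] });
const producer = kafka.producer();

// Push one crawled document into the processing pipeline.
async function forward(doc) {
  await producer.send({
    topic: 'crawled-documents',
    messages: [{ key: doc.url, value: JSON.stringify(doc) }],
  });
}

async function main() {
  await producer.connect();
  await forward({ url: 'https://example.com/article', html: '<html>...</html>' });
  await producer.disconnect();
}

main().catch(console.error);
```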