I'm currently developing a web crawler. The first version was written in Node.js and runs pretty well.

The issues I encountered with Node.js, in no particular order:

  • slow URL and query-string parsing libraries
  • slow HTTP packet parsing, due to the overhead of calling into the C++ bindings
  • slow DNS resolution, due to synchronous calls to getaddrinfo rather than the asynchronous c-ares library
  • lack of good Unicode support in RegExps
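To illustrate the last point: V8's regexes operate purely on UTF-16 code units unless the `u` flag is set, and that flag only arrived with ES2015, after this was written. A minimal demonstration of the gap:

```javascript
// Without the `u` flag, JavaScript regexes see UTF-16 code units, so an
// astral character (outside the Basic Multilingual Plane) looks like two
// separate units — a surrogate pair — rather than one character.
const emoji = '\u{1F600}'; // 😀, an astral character

// `.` matches a single code unit here, so the two-unit emoji fails:
console.log(/^.$/.test(emoji));  // false

// With the ES2015 `u` flag, `.` matches the full code point:
console.log(/^.$/u.test(emoji)); // true
```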

Most of these come down to Node.js still being very young, and to its being a general-purpose platform with a small core developer team.

The proof-of-concept I have at the moment runs pretty well, thanks to the asynchronous nature of Node.js and some monkey-patches I had to add.

Now I'm starting to think about what comes after the proof-of-concept.

The question of which programming language to use is one I haven't been able to answer yet.

The essential qualities that I'm looking for:

  • I/O and concurrency management
  • fast string parsing for HTTP messages, HTML, JSON, etc.
  • bindings to some kind of queuing / persistence layer
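As a toy illustration of the parsing workload in the second bullet — a sketch only; a real crawler should use a proper streaming parser such as htmlparser2, since regexes break on malformed markup:

```javascript
// Extract href targets from anchor tags in an HTML page. This is the
// kind of hot-loop string work the crawler does per fetched document.
function extractLinks(html) {
  const links = [];
  const re = /<a\s+[^>]*href=["']([^"']+)["']/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

console.log(extractLinks('<a href="/feed.rss">RSS</a> <a href="http://example.com">x</a>'));
// → [ '/feed.rss', 'http://example.com' ]
```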

I've thought about C, Erlang, Rust, Go, and D.

I know this isn't usually the kind of question that's accepted here on Stack Exchange, but I'm not sure where else I could get an answer.

Update: Some more information:

  • we currently use 8-core, 3.2 GHz machines
  • the persistence layer is a PostgreSQL database
  • the goal is to crawl all kinds of news sites and blogs, as well as RSS feeds
  • we currently forward the content to a processing pipeline through Apache Kafka

Update: Would it be possible for the mods to unlock this question? I've provided a lot of detail, so this is not a "which programming language is the best?" question; it's about a specific use case.

Post Closed as "Not suitable for this site" by Telastyn, amon, MetaFight, gnat, CommunityBot