9

I use PHP and have about 10 tasks that need to run. Each one on its own should not time out, but running all 10 together might.

Is it a good solution to use a modular approach with new http requests?

Something like this:

http://example.com/some/module/fetch
http://example.com/some/module/parse
http://example.com/some/module/save

Each of these URLs would do one task. If it succeeds, it triggers the next task, a kind of chain reaction: one URL calls the next (with cURL).

Pros and cons? Is it a good approach? If not, what is a better alternative?

3
  • 1
    I think it's fine and you can do that, but why don't you do it in one PHP script? Do you want to use it as an API, or are these URLs part of your own project? Commented Aug 31, 2016 at 21:08
  • @TeymurMardaliyerLennon I don't want to run everything at the same time in case of a timeout. Commented Sep 1, 2016 at 11:26
  • 1
    For things like running workers in rich configurations I use RabbitMQ rabbitmq.com/tutorials/tutorial-two-php.html "The main idea behind Work Queues (aka: Task Queues) is to avoid doing a resource-intensive task immediately and having to wait for it to complete" Commented Sep 2, 2016 at 0:22

5 Answers

2
+50

The modular approach is a good idea (if one "unit" fails, the job stops as you desire; plus it's simpler to debug/test each individual unit).

It will work, but your approach to chaining has some issues:

  • if there is a bottleneck (i.e. one "unit" takes longer than the others) then you may end up with 100 of the bottleneck processes all running and you lose control of server resources
  • there is a lack of control; say the server needs to be rebooted: to resume the jobs you have to start them all again from the beginning.
  • similarly, if you need to stop/start/debug an individual unit while it's running, you'll have to restart the job from the first unit to repeat it.
  • by making a web request, you are using Apache/NGINX resources (memory, socket connections, etc.) just to run a PHP process. You could run the PHP process directly without that overhead.
  • and finally, if on a DMZ'd web server, the server might not actually be able to make requests to itself.

To get more control, you should use a queuing system for this kind of operation.

Using PHP (or any language, really), your basic process is:

  1. each "unit" is a continuously looping php script that never ends*

  2. each "unit" process listens to a queuing system; when a job arrives on the queue that it can handle then it takes it off the queue

  3. when each unit is finished with the job, it confirms handled and pushes to the next queue.

  4. if the unit decides the job should not continue, confirm the job handled but don't push to the next queue.

Advantages:

  • if a "unit" stops, then the job remains on the queue and can be collected when you restart the "unit". Makes it easier restarting the units/server or if one unit crashes.
  • if one "unit" is very heavy, you can just start a second process doing exactly the same if you have space server capacity. If no server capacity, you accept the bottleneck; you therefore have a very transparent view of how much resource you are using.
  • if you decide that another language will handle the request better, you can mix NodeJS, Python, Ruby and... they can all talk to the same queues.

Side note on "continually looping PHP": this is done by setting max_execution_time "0". Make sure that you don't cause "memory leaks" and have cleanm . You can auto-start the process on boot (systemd, or task scheduler depending on OS) or run manually for testing. If you don't want to have it continuously looping, timeout after 5 minutes and have cron/task scheduler restart.

Side note on queues: you can "roll your own" using a database or in-memory cache for simple applications (a database-backed queue can easily cope with 100,000 items an hour), but avoiding conflicts and managing state/retries is a bit of an art. A better option is RabbitMQ (https://www.rabbitmq.com/). It's a bit of a niggle to install, but once it's running, follow the PHP tutorials and you'll never look back! A worker sketch using RabbitMQ follows.
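
To make that concrete, here is a rough sketch of one "unit" worker built on php-amqplib (the library used in the RabbitMQ PHP tutorials). The queue names and the parse() function are hypothetical, and the exact ack call varies slightly between library versions:

<?php
// Rough sketch: one "unit" consuming from one queue and publishing to the next.
require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

set_time_limit(0);

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();

// The queue this unit consumes from, and the queue it pushes to on success.
$channel->queue_declare('fetched', false, true, false, false);
$channel->queue_declare('parsed', false, true, false, false);

$channel->basic_qos(null, 1, null);    // hand this worker one job at a time

$callback = function (AMQPMessage $msg) use ($channel) {
    $result = parse($msg->getBody());  // the actual "unit" of work (hypothetical)
    if ($result !== null) {
        // push to the next queue only if the job should continue
        $channel->basic_publish(new AMQPMessage($result), '', 'parsed');
    }
    // confirm the job as handled either way
    // (older php-amqplib: $channel->basic_ack($msg->delivery_info['delivery_tag']))
    $msg->ack();
};

$channel->basic_consume('fetched', '', false, false, false, false, $callback);

while (count($channel->callbacks)) {
    $channel->wait();
}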



1

Assuming you want to use HTTP requests, you have a few options. One is to set a timeout on each request, shrinking it as the overall deadline approaches:

function doTaskWithEnd($uri, $end, $ctx = null) {
    if (!$ctx) { $ctx = stream_context_create(); }
    // shrink the per-request timeout to whatever time remains until $end
    stream_context_set_option($ctx, "http", "timeout", $end - time());
    $ret = file_get_contents($uri, false, $ctx);
    if ($ret === false) {
        throw new \Exception("Request failed or timed out!");
    }
    return $ret;
}

$end = time() + 100;
$fetched = doTaskWithEnd("http://example.com/some/module/fetch", $end);
$ctx = stream_context_create(["http" => ["method" => "POST", "content" => $fetched]]);
$parsed = doTaskWithEnd("http://example.com/some/module/parsed", $end, $ctx);
$ctx = stream_context_create(["http" => ["method" => "PUT", "content" => $parsed]]);
doTaskWithEnd("http://example.com/some/module/save", $end, $ctx);

Alternatively, with a non-blocking solution (using amphp/amp + amphp/artax):

function doTaskWithTimeout($requestPromise, $timeout) {
    // whichever settles first wins: the request, or the Pause (which resolves to null)
    $ret = yield \Amp\first([$requestPromise, $timeout]);
    if ($ret === null) {
        throw new \Exception("Timed out!");
    }
    return $ret;
}

\Amp\execute(function() {
    $end = new \Amp\Pause(100000); /* timeout in ms */

    $client = new \Amp\Artax\Client;
    $fetched = yield from doTaskWithTimeout($client->request("http://example.com/some/module/fetch"), $end);
    $req = (new \Amp\Artax\Request)
        ->setUri("http://example.com/some/module/parse")
        ->setMethod("POST")
        ->setBody($fetched)
    ;
    $parsed = yield from doTaskWithTimeout($client->request($req), $end);
    $req = (new \Amp\Artax\Request)
        ->setUri("http://example.com/some/module/save")
        ->setMethod("PUT")
        ->setBody($parsed)
    ;
    yield from doTaskWithTimeout($client->request($req), $end);
});

Now, I ask: do you really want to offload to separate requests? Can't we just assume there are plain functions fetch(), parse($fetched) and save($parsed) in the same script?

In that case it's easy; we can just set up an alarm:

declare(ticks=10); // this declare() line must happen before the first include/require
pcntl_signal(\SIGALRM, function() {
    throw new \Exception("Timed out!");
});
pcntl_alarm(100);

$fetched = fetch();
$parsed = parse($fetched);
save($parsed);

pcntl_alarm(0); // we're done, reset the alarm

Alternatively, the non-blocking solution works too (assuming fetch(), parse($fetched) and save($parsed) properly return Promises and are implemented in a non-blocking way):

\Amp\execute(function() {
    $end = new \Amp\Pause(100000); /* timeout in ms */
    $fetched = yield from doTaskWithTimeout(fetch(), $end);
    $parsed = yield from doTaskWithTimeout(parse($fetched), $end);
    yield from doTaskWithTimeout(save($parsed), $end);
});

If you just want a global timeout for several sequential tasks, I'd prefer doing it all in one script with pcntl_alarm(); alternatively, go with the stream context timeout option.

The non-blocking solutions are mainly applicable if you need to do other things at the same time, e.g. if you want to run that fetch+parse+save cycle multiple times, with each cycle independent of the others. A sketch of that follows.
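
For example, a rough sketch of running three such cycles concurrently, assuming the same Amp v1 API as above together with its resolve()/all() combinators (fetch(), parse() and save() are again assumed to be non-blocking and return promises):

function runCycle($end) {
    // one independent fetch + parse + save cycle, sharing the global deadline
    $fetched = yield from doTaskWithTimeout(fetch(), $end);
    $parsed  = yield from doTaskWithTimeout(parse($fetched), $end);
    yield from doTaskWithTimeout(save($parsed), $end);
}

\Amp\execute(function () {
    $end = new \Amp\Pause(100000); /* global timeout in ms, shared by all cycles */

    $cycles = [];
    for ($i = 0; $i < 3; $i++) {
        // \Amp\resolve() turns each generator into a promise so the cycles run concurrently
        $cycles[] = \Amp\resolve(runCycle($end));
    }

    yield \Amp\all($cycles); // wait until every cycle has completed (or failed)
});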


1

I think "Chain reaction" is a clue that this approach may be over-complicated...

There may be good reasons for switching to a robust messaging/work-queue system such as RabbitMQ or SQS, especially if you are handling significant load. Messaging queues are invaluable in the appropriate context, but they add a lot of complexity/overhead/code if they are used unnecessarily.

Simplest solution

...but if your only concern is preventing a timeout, I would not make it more complicated than it needs to be; you can easily extend or disable timeouts completely using:

set_time_limit(0); //no time limit, not recommended
set_time_limit(300); //5 mins

Your proposed "chaining" pattern is sensible in principle because it allows you to identify precisely where any faults occur, but you can do this all inside the same request/function rather than relying on the network.

By contrast, the HTTP-chaining approach means handling faults in two (or more) layers instead of one neat location: one layer that handles the individual request and another that makes the request.

Assuming the work can be handled successfully in a single request (or even with no remote requests at all), then no, it is not a "good solution to use a modular approach with new http requests", because you are adding unnecessary work and complexity with extra HTTP calls/responses. This introduces additional failure modes: network connectivity/delays, DNS, harder testing and debugging, etc.

Splitting the job into separate remote calls may add the network/server/authentication latency ten times over, and it makes it trickier to do sensible things like database connection pooling.

Other ways to simplify the problem?

If possible, it may be worth investigating why this chain of requests takes so long; if you can optimize it to run faster, you may be able to avoid adding unnecessary complexity to this part of your system. Things like database latency or not reusing database connections can add up to serious overhead across 10 separate processes.

1 Comment

I will download a CSV file which might be 2-20 MB, convert it to an array, parse it and insert it into a database, around 20,000 inserts.
1

This answer assumes you're using PHP and running tasks by making HTTP requests to each of the URLs in your question.

Your solution depends on what your business requirements are. If you don't care about the order of completion for the HTTP requests, I suggest looking at curl_multi_init() to start learning about the cURL PHP extension's curl_multi_* functions.
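
For illustration, a minimal curl_multi sketch that fires the three URLs from the question in parallel (only appropriate if the tasks really are independent of one another):

<?php
// Minimal curl_multi sketch: run independent requests in parallel.
$urls = [
    "http://example.com/some/module/fetch",
    "http://example.com/some/module/parse",
    "http://example.com/some/module/save",
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);   // per-request timeout
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they are done.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    echo $url, " => HTTP ", curl_getinfo($ch, CURLINFO_HTTP_CODE), "\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);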

If you do care about order of completion (e.g., a specific task must complete before the next), take a look at curl_init() and run the requests sequentially.

To eliminate the possibility of your calling script timing out, read about the set_time_limit() function or consider forking your process with pcntl_fork().

Alternatively, I would research a message queue. Specifically, check out Amazon's SQS and read about how to interface with it from PHP.


0

Background jobs with workers are the best way, because:

Applications often need to perform operations that are time (or computationally) intensive, but it is usually not desirable to do so during a request, as the resulting slowness is perceived directly by the application’s users. Instead, any task that takes longer than a few dozen milliseconds, such as image processing, the sending of email, or any kind of background synchronization, should be carried out as a background task. Furthermore, a worker queue also makes it easy to perform scheduled jobs, as the same queue infrastructure can be utilized by a clock process.

Use php-resque to implement background jobs: php-resque workers. A minimal sketch follows.
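
As a rough sketch of what that looks like with php-resque (the class name, queue name and job arguments here are hypothetical, and the worker itself is started separately with the library's resque script, e.g. QUEUE=import php bin/resque):

<?php
// Rough sketch: enqueue and handle a background job with php-resque.
require __DIR__ . '/vendor/autoload.php';

Resque::setBackend('localhost:6379');   // Redis connection used by php-resque

// The job class a worker will run; php-resque calls perform() with $this->args set.
class ImportCsvJob
{
    public function perform()
    {
        $file = $this->args['file'];
        // ... fetch, parse and save here, free of any web-request time limit ...
    }
}

// Somewhere in your web request: enqueue and return immediately.
$token = Resque::enqueue('import', 'ImportCsvJob', ['file' => '/tmp/data.csv'], true);
echo "Queued job $token\n";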

