Easy parallel loops in Python, R, Matlab and Octave


The Domino data science platform makes it trivial to run your analysis in the cloud on very powerful hardware (up to 32 cores and 250GB of memory), allowing massive performance increases through parallelism. In this post, we'll show you how to parallelize your code in a variety of languages to utilize multiple cores. This may sound intimidating, but Python, R, and Matlab have features that make it very simple.

Read on to see how you can get over 3000% CPU output from one machine. And check out our public parallelism project on Domino to see the examples below working in the wild.

Perf stats from some parallelized Python code running on a single, 32-core machine

Is my code parallelizable?

For the purpose of this post, we assume a common analysis scenario: you need to perform some calculation on many items, and the calculation for one item does not depend on any other. More precisely:

  1. Your analysis processes a list of things, e.g., products, stores, files, people, species. Let's call this the inputs.

  2. You can structure your code such that you have a function which takes one such thing and returns a result you care about. Let's call this function processInput. (After this step, you can then combine your results however you want, e.g., aggregating them, saving them to a file — it doesn't matter for our purposes.)

Normally you would loop over your items, processing each one:

for i in inputs:
    results[i] = processInput(i)
# now do something with results

Instead of processing your items in a sequential loop, we'll show you how to process all of them in parallel, spreading the work across multiple cores.

To make our examples below concrete, we use a list of numbers, and a function that squares the numbers. You would use your specific data and logic, of course.
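To make the running example concrete before parallelizing it, here is the serial version in Python (a plain loop; nothing parallel yet):

```python
# Serial version of the running example: square each number in a list.
inputs = range(10)

def processInput(i):
    return i * i

# Process each input one at a time, collecting the results in order.
results = [processInput(i) for i in inputs]
# results is now [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Each parallel version below produces exactly this list; only the mechanism for distributing the calls changes.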

Let's get started!


Python

Python has a great package, joblib, that makes parallelism incredibly easy.

from joblib import Parallel, delayed
import multiprocessing
# what are your inputs, and what operation do you want to 
# perform on each input. For example...
inputs = range(10) 
def processInput(i):
	return i * i

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)

results is now [0, 1, 4, 9, ...]

Get the above code in our sample file, parallel.py
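If you'd rather avoid an extra dependency, the standard library's multiprocessing.Pool can do the same job. Here's a sketch; note the __main__ guard, which is required on Windows (and with the spawn start method generally) because worker processes re-import your script:

```python
from multiprocessing import Pool, cpu_count

def processInput(i):
    return i * i

if __name__ == '__main__':
    inputs = range(10)
    # Pool.map splits the inputs into chunks and farms them out to
    # worker processes, then reassembles the results in order.
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(processInput, inputs)
    # results is now [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The same guard is a good habit with joblib too: without it, spawned workers can re-execute the parallel call on import, which is a common cause of scripts that hang or run forever.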


R

Since version 2.14, R has shipped with the parallel package, which makes this sort of task very easy.

library(parallel)

# what are your inputs, and what operation do you want to
# perform on each input. For example...
inputs <- 1:10
processInput <- function(i) {
	i * i
}

numCores <- detectCores()

results <- mclapply(inputs, processInput, mc.cores = numCores)

# the above won't work on Windows, but this will:
cl <- makeCluster(numCores)
results <- parLapply(cl, inputs, processInput)
stopCluster(cl)

Get the above code in our sample file, parallel.R. You can find more info on the difference between mclapply and parLapply in this StackOverflow post.

As an alternative, you can also use the foreach package, which lets you use a familiar for loop syntax, automatically parallelizing your code under the hood:


library(foreach)
library(doParallel)

numCores <- detectCores()
cl <- makeCluster(numCores)
registerDoParallel(cl)

inputs <- 1:10
processInput <- function(i) {
  i * i
}

results <- foreach(i = inputs) %dopar% {
  processInput(i)
}

stopCluster(cl)

Get the above code in our sample file, parallelForeach.R.


Matlab

Matlab's Parallel Computing Toolbox makes it trivial to write parallel for loops using the parfor construct. For example:

inputs = 1:10;
results = zeros(size(inputs));
% assumes that processInput is defined in a separate function file
parfor i = 1:numel(inputs)
	results(i) = processInput(inputs(i));
end

Note that parfor requires the loop range to be consecutive increasing integers, so if your inputs are not integers (e.g., they are file names or item identifiers), loop over indices as above and use each index to look up the corresponding input.


Octave

Unfortunately, Octave doesn't have a nice parfor equivalent, but it does have its own parallel package. Here's how you can use it:

if exist('OCTAVE_VERSION') ~= 0
	% you'll need to run this once, to install the package:
	% pkg install -forge parallel
	pkg load parallel
end

inputs = 1:10;
numCores = nproc();

% assumes that processInput is defined in a separate function file
results = pararrayfun(numCores, @processInput, inputs);

Note that you can use the parcellfun function if your inputs are not numbers (e.g., if they are file names or product identifiers).

Get the above code in our sample file, parallel.m.


Conclusion

Modern statistical languages make it incredibly easy to parallelize your code across cores, and Domino makes it trivial to access very powerful machines with many cores. By using these techniques, we've seen users speed up their code over 32x while still using a single machine.

If you'd like to try applying this approach to your analysis, please let us know; we're happy to help!

You can see the examples above, along with their output, in my parallelism project on Domino.

  • Good explanation!
    I wonder why the Parallel function gets stuck when I use my own defined function.

    I mean the line
    Parallel(n_jobs=num_cores)(delayed(sqrt)(i) for i in inputs)
    works fine for sqrt Python built-in function.

    However, when I use another function I built it doesn’t work.

    Does anyone know why?

    • dominodatalab

      Can you paste your code? Hard to diagnose the problem without seeing it.

      • Oh, sure.
        Sorry about that.

When I use my own function, the line results = Parallel(n_jobs=num_cores)(delayed(myF)(i) for i in inputs) gets stuck or runs forever with no return. :/

        • Interestingly, with num_cores set to one, it runs really quick. But for num_cores equals 2 or greater number it takes soooo long to run.

          • dominodatalab

            I’m sorry, Jedson, your example works fine when I try it. Without seeing more about your setup it’s hard to speculate about the cause for the behavior you’re seeing. How many cores does your computer actually have?

          • I tested in two different machines and got the same problem.
            One of these computers has 8 cores and the other one, four.

            Actually, I could kinda fix this issue by copying and pasting all the script into my IDLE console, instead of just calling the file.

            Anyway, when I set my script to use only 2 cores, it worked quite fine.
            But with just one core is even faster! And with 4 cores didn’t work.

            Thanks for your attention!

          • constructor

try wrapping your code in
if __name__ == '__main__':

  • Alberto Andreotti

    This sucks. You only get to parallelize each element in the collection. A true parallel for loop will distribute chunks to each of the available processors.
    Take a look at Scala’s. The exact serial code you use to run single-threaded can be used to run on multiple threads. No extra chunking by the user.

  • Alberto Andreotti

Chunks are too small; you are expected to arrange big enough chunks yourself. A crappy approach, imho.

  • Marija Jegorova

    When I try to ‘pip install multiprocessing’ it throws the following error:
    Command “python setup.py egg_info” failed with error code 1 in /private/var/folders/7s/sswmssj51p73hky4mkqs4_zc0000gn/T/pip-build-wesmtf1f/multiprocessing/

    Any ideas how I could fix this?

    • Mark Kenzine

multiprocessing is a built-in module of Python. You don't need to use pip to install it.