3/jekyll_site/en/2022/02/09/matrix-multiplication-parallel-streams.md at 5731feea253ce6ece973ef22b261428caa0301d3

pomodoro/3

Fork 0

golovin 5731feea25 2023-06-30

2023-12-17 07:56:19 +03:00

7.4 KiB

Raw Blame History

title

description

sections

Parallel stream

We bypass the rows of the first matrix L in parallel mode. In each thread there are two nested loops: across the common side of two matrices M and across the columns of the second matrix N. Processing of the rows of the resulting matrix occurs independently of each other in parallel streams.

/**
 * @param l rows of matrix 'a'
 * @param m columns of matrix 'a'
 *          and rows of matrix 'b'
 * @param n columns of matrix 'b'
 * @param a first matrix 'l×m'
 * @param b second matrix 'm×n'
 * @return resulting matrix 'l×n'
 */
public static int[][] parallelMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
    // resulting matrix
    int[][] c = new int[l][n];
    // bypass the indexes of the rows of matrix 'a' in parallel mode
    IntStream.range(0, l).parallel().forEach(i -> {
        // bypass the indexes of the common side of two matrices:
        // the columns of matrix 'a' and the rows of matrix 'b'
        for (int k = 0; k < m; k++)
            // bypass the indexes of the columns of matrix 'b'
            for (int j = 0; j < n; j++)
                // the sum of the products of the elements of the i-th
                // row of matrix 'a' and the j-th column of matrix 'b'
                c[i][j] += a[i][k] * b[k][j];
    });
    return c;
}

Three nested loops

We take the optimized variant of algorithm, that uses cache of the execution environment better than others. The outer loop bypasses the rows of the first matrix L, then there is a loop across the common side of the two matrices M and it is followed by a loop across the columns of the second matrix N. Writing to the resulting matrix occurs row-wise, and each row is filled in layers.

/**
 * @param l rows of matrix 'a'
 * @param m columns of matrix 'a'
 *          and rows of matrix 'b'
 * @param n columns of matrix 'b'
 * @param a first matrix 'l×m'
 * @param b second matrix 'm×n'
 * @return resulting matrix 'l×n'
 */
public static int[][] sequentialMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
    // resulting matrix
    int[][] c = new int[l][n];
    // bypass the indexes of the rows of matrix 'a'
    for (int i = 0; i < l; i++)
        // bypass the indexes of the common side of two matrices:
        // the columns of matrix 'a' and the rows of matrix 'b'
        for (int k = 0; k < m; k++)
            // bypass the indexes of the columns of matrix 'b'
            for (int j = 0; j < n; j++)
                // the sum of the products of the elements of the i-th
                // row of matrix 'a' and the j-th column of matrix 'b'
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Testing

To check, we take two matrices A=[900×1000] and B=[1000×750], filled with random numbers. First, we compare the correctness of the implementation of the two algorithms — matrix products must match. Then we execute each method 10 times and calculate the average execution time.

// start the program and output the result
public static void main(String[] args) {
    // incoming data
    int l = 900, m = 1000, n = 750, steps = 10;
    int[][] a = randomMatrix(l, m), b = randomMatrix(m, n);
    // matrix products
    int[][] c1 = parallelMatrixMultiplication(l, m, n, a, b);
    int[][] c2 = sequentialMatrixMultiplication(l, m, n, a, b);
    // check the correctness of the results
    System.out.println("The results match: " + Arrays.deepEquals(c1, c2));
    // measure the execution time of two methods
    benchmark("Stream", steps, () -> {
        int[][] c = parallelMatrixMultiplication(l, m, n, a, b);
        if (!Arrays.deepEquals(c, c1)) System.out.print("error");
    });
    benchmark("Loops", steps, () -> {
        int[][] c = sequentialMatrixMultiplication(l, m, n, a, b);
        if (!Arrays.deepEquals(c, c2)) System.out.print("error");
    });
}

{% capture collapsed_md %}

// helper method, returns a matrix of the specified size
private static int[][] randomMatrix(int row, int col) {
    int[][] matrix = new int[row][col];
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            matrix[i][j] = (int) (Math.random() * row * col);
    return matrix;
}

// helper method for measuring the execution time of the passed code
private static void benchmark(String title, int steps, Runnable runnable) {
    long time, avg = 0;
    System.out.print(title);
    for (int i = 0; i < steps; i++) {
        time = System.currentTimeMillis();
        runnable.run();
        time = System.currentTimeMillis() - time;
        // execution time of one step
        System.out.print(" | " + time);
        avg += time;
    }
    // average execution time
    System.out.println(" || " + (avg / steps));
}

{% endcapture %} {%- include collapsed_block.html summary="Helper methods" content=collapsed_md -%}

Output depends on the execution environment, time in milliseconds:

The results match: true
Stream | 113 | 144 | 117 | 114 | 114 | 117 | 120 | 125 | 111 | 113 || 118
Loops | 1357 | 530 | 551 | 569 | 535 | 538 | 525 | 517 | 518 | 514 || 615

Comparing algorithms

On an eight-core Linux x64 computer, we create a Windows x64 virtual machine for tests. All other things being equal, in the settings, we change the number of processors. We run the above test 100 times instead of 10 — we get a summary table of results. Time in milliseconds.

   CPU |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |
-------|-----|-----|-----|-----|-----|-----|-----|-----|
Stream | 522 | 276 | 179 | 154 | 128 | 112 | 110 |  93 |
Loops  | 519 | 603 | 583 | 571 | 545 | 571 | 559 | 467 |

Results: with a large number of processors in the system, the multithreaded algorithm works out noticeably faster than three nested loops. The computation time can be reduced by more than two times.

All the methods described above, including collapsed blocks, can be placed in one class.