Beginner's Guide to DML and PyDML


Overview

SystemML enables flexible, scalable machine learning. This flexibility comes from a high-level declarative machine learning language that is available in two flavors: one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).

Algorithm scripts written in DML and PyDML can be run on Spark, on Hadoop, or in Standalone mode. SystemML also features an MLContext API that allows SystemML to be accessed via Scala or Python from a Spark Shell, a Jupyter Notebook, or a Zeppelin Notebook.

This Beginner’s Guide serves as a starting point for writing DML and PyDML scripts.

Script Invocation

DML and PyDML scripts can be invoked in a variety of ways. Suppose that we have hello.dml and hello.pydml scripts containing the following:

print('hello ' + $1)

One way to begin working with SystemML is to download a standalone distribution of SystemML and use the runStandaloneSystemML.sh or runStandaloneSystemML.bat script to run SystemML in standalone mode. The name of the DML or PyDML script is passed as the first argument, followed by any further arguments, such as -python to indicate a PyDML script and -args to supply positional arguments to the script.

./runStandaloneSystemML.sh hello.dml -args world
./runStandaloneSystemML.sh hello.pydml -python -args world

Data Types

SystemML has four value data types. In DML, these are: double, integer, string, and boolean. In PyDML, these are: float, int, str, and bool. In normal usage, the data type of a variable is implicit based on its value. Mathematical operations typically operate on doubles/floats, whereas integers/ints are typically useful for tasks such as iteration and accessing elements in a matrix.

DML:
aDouble = 3.0
bInteger = 2
print('aDouble = ' + aDouble)
print('bInteger = ' + bInteger)
print('aDouble + bInteger = ' + (aDouble + bInteger))
print('bInteger ^ 3 = ' + (bInteger ^ 3))
print('aDouble ^ 2 = ' + (aDouble ^ 2))

cBoolean = TRUE
print('cBoolean = ' + cBoolean)
print('(2 < 1) = ' + (2 < 1))

dString = 'Hello'
eString = dString + ' World'
print('dString = ' + dString)
print('eString = ' + eString)

PyDML:
aFloat = 3.0
bInt = 2
print('aFloat = ' + aFloat)
print('bInt = ' + bInt)
print('aFloat + bInt = ' + (aFloat + bInt))
print('bInt ** 3 = ' + (bInt ** 3))
print('aFloat ** 2 = ' + (aFloat ** 2))

cBool = True
print('cBool = ' + cBool)
print('(2 < 1) = ' + (2 < 1))

dStr = 'Hello'
eStr = dStr + ' World'
print('dStr = ' + dStr)
print('eStr = ' + eStr)

DML Result:
aDouble = 3.0
bInteger = 2
aDouble + bInteger = 5.0
bInteger ^ 3 = 8.0
aDouble ^ 2 = 9.0
cBoolean = TRUE
(2 < 1) = FALSE
dString = Hello
eString = Hello World

PyDML Result:
aFloat = 3.0
bInt = 2
aFloat + bInt = 5.0
bInt ** 3 = 8.0
aFloat ** 2 = 9.0
cBool = TRUE
(2 < 1) = FALSE
dStr = Hello
eStr = Hello World
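
Because the type of a variable is inferred from its value, it can be useful to convert between integers and doubles explicitly. The DML sketch below assumes the as.double() and as.integer() casting functions and the %/% (integer division) and %% (modulus) operators described in the Language Reference. The variable names and values are purely illustrative.

```dml
# Illustrative values only; the variable names are hypothetical.
numItems = 7                          # inferred as integer
ratio = 2.0                           # inferred as double

quotient = numItems %/% 2             # integer division: 7 %/% 2 is 3
remainder = numItems %% 2             # modulus: 7 %% 2 is 1

itemsAsDouble = as.double(numItems)   # explicit cast from integer to double
ratioAsInteger = as.integer(ratio)    # explicit cast from double to integer

print('quotient = ' + quotient)
print('remainder = ' + remainder)
print('itemsAsDouble = ' + itemsAsDouble)
print('ratioAsInteger = ' + ratioAsInteger)
```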

Matrix Basics

Creating a Matrix

A matrix can be created in DML using the matrix() function and in PyDML using the full() function. Indexing a single element, such as m[i,j], still yields a value of the matrix data type, so the example below casts it to a scalar (as.scalar() in DML, scalar() in PyDML) in order to print it. Matrix element values are of type double/float. Note that DML indexes from 1, whereas PyDML indexes from 0.

DML:
m = matrix("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
for (i in 1:nrow(m)) {
    for (j in 1:ncol(m)) {
        n = m[i,j]
        print('[' + i + ',' + j + ']:' + as.scalar(n))
    }
}

PyDML:
m = full("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
for (i in 0:nrow(m)-1):
    for (j in 0:ncol(m)-1):
        n = m[i,j]
        print('[' + i + ',' + j + ']:' + scalar(n))

DML Result:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0
[2,1]:4.0
[2,2]:5.0
[2,3]:6.0
[3,1]:7.0
[3,2]:8.0
[3,3]:9.0
[4,1]:10.0
[4,2]:11.0
[4,3]:12.0

PyDML Result:
[0,0]:1.0
[0,1]:2.0
[0,2]:3.0
[1,0]:4.0
[1,1]:5.0
[1,2]:6.0
[2,0]:7.0
[2,1]:8.0
[2,2]:9.0
[3,0]:10.0
[3,1]:11.0
[3,2]:12.0

We can also output the matrix element values using the toString function:

DML:
m = matrix("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
print(toString(m, sep=" | ", decimal=1))

PyDML:
m = full("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
print(toString(m, sep=" | ", decimal=1))

Result:
1.0 | 2.0 | 3.0
4.0 | 5.0 | 6.0
7.0 | 8.0 | 9.0
10.0 | 11.0 | 12.0

For additional information about the matrix() and full() functions, please see the Matrix Construction section of the Language Reference. For information about the toString() function, see the Other Built-In Functions section of the Language Reference.
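
Besides a string of values, matrix() can also be given a single fill value or an existing matrix to reshape. The following DML sketch illustrates both, along with diag() for building an identity matrix. It assumes the default row-by-row reshape behavior described in the Language Reference, and the variable names are hypothetical.

```dml
# 2x3 matrix in which every cell is 0
zeros = matrix(0, rows=2, cols=3)

# seq(1, 6, 1) is a 6x1 column vector; matrix() reshapes it row by row into 2x3
reshaped = matrix(seq(1, 6, 1), rows=2, cols=3)

# diag() of a 3x1 vector of ones places the ones on the diagonal of a 3x3 matrix
identity = diag(matrix(1, rows=3, cols=1))

print(toString(zeros))
print(toString(reshaped))
print(toString(identity))
```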

Saving a Matrix

A matrix can be saved using the write() function in DML and the save() function in PyDML. SystemML supports four different formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary.

DML:
m = matrix("1 2 3 0 0 0 7 8 9 0 0 0", rows=4, cols=3)
write(m, "m.txt", format="text")
write(m, "m.mm", format="mm")
write(m, "m.csv", format="csv")
write(m, "m.binary", format="binary")

PyDML:
m = full("1 2 3 0 0 0 7 8 9 0 0 0", rows=4, cols=3)
save(m, "m.txt", format="text")
save(m, "m.mm", format="mm")
save(m, "m.csv", format="csv")
save(m, "m.binary", format="binary")

Saving a matrix automatically creates a metadata file for each format except for Matrix Market, since Matrix Market contains metadata within the *.mm file. All formats are text-based except binary. The contents of the resulting files are shown here. Note that the text (i,j,v) and mm (Matrix Market) formats index from 1, even when working with PyDML, which is 0-based.

m.txt:
1 1 1.0
1 2 2.0
1 3 3.0
3 1 7.0
3 2 8.0
3 3 9.0

m.txt.mtd:
{
    "data_type": "matrix"
    ,"value_type": "double"
    ,"rows": 4
    ,"cols": 3
    ,"nnz": 6
    ,"format": "text"
    ,"description": { "author": "SystemML" } 
}

m.mm:
%%MatrixMarket matrix coordinate real general
4 3 6
1 1 1.0
1 2 2.0
1 3 3.0
3 1 7.0
3 2 8.0
3 3 9.0

m.csv:
1.0,2.0,3.0
0,0,0
7.0,8.0,9.0
0,0,0

m.csv.mtd:
{
    "data_type": "matrix"
    ,"value_type": "double"
    ,"rows": 4
    ,"cols": 3
    ,"nnz": 6
    ,"format": "csv"
    ,"header": false
    ,"sep": ","
    ,"description": { "author": "SystemML" } 
}

m.binary:
Not text-based

m.binary.mtd:
{
    "data_type": "matrix"
    ,"value_type": "double"
    ,"rows": 4
    ,"cols": 3
    ,"rows_in_block": 1000
    ,"cols_in_block": 1000
    ,"nnz": 6
    ,"format": "binary"
    ,"description": { "author": "SystemML" } 
}
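
The text (i,j,v) and Matrix Market files list only the non-zero cells, which is why six entries appear for a 4x3 matrix and why each metadata file reports an nnz of 6. As a minimal sketch, the non-zero count can be reproduced in DML by comparing the matrix against 0 and summing the resulting 0/1 indicators. The variable names are illustrative.

```dml
m = matrix("1 2 3 0 0 0 7 8 9 0 0 0", rows=4, cols=3)

# (m != 0) yields a matrix of 1s (non-zero cells) and 0s (zero cells)
nonZeros = sum(m != 0)
totalCells = nrow(m) * ncol(m)

print('non-zero cells: ' + nonZeros + ' of ' + totalCells)
```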

Loading a Matrix

A matrix can be loaded using the read() function in DML and the load() function in PyDML. As with saving, SystemML supports four formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary. Reading a file normally requires a corresponding metadata file, except for the Matrix Market format, which carries its metadata within the file itself. The metadata file can be omitted if a format parameter is passed to the read() or load() function.

DML:
m = read("m.csv")
print("min:" + min(m))
print("max:" + max(m))
print("sum:" + sum(m))
mRowSums = rowSums(m)
for (i in 1:nrow(mRowSums)) {
    print("row " + i + " sum:" + as.scalar(mRowSums[i,1]))
}
mColSums = colSums(m)
for (i in 1:ncol(mColSums)) {
    print("col " + i + " sum:" + as.scalar(mColSums[1,i]))
}

PyDML:
m = load("m.csv")
print("min:" + min(m))
print("max:" + max(m))
print("sum:" + sum(m))
mRowSums = rowSums(m)
for (i in 0:nrow(mRowSums)-1):
    print("row " + i + " sum:" + scalar(mRowSums[i,0]))
mColSums = colSums(m)
for (i in 0:ncol(mColSums)-1):
    print("col " + i + " sum:" + scalar(mColSums[0,i]))

DML Result:
min:0.0
max:9.0
sum:30.0
row 1 sum:6.0
row 2 sum:0.0
row 3 sum:24.0
row 4 sum:0.0
col 1 sum:8.0
col 2 sum:10.0
col 3 sum:12.0

PyDML Result:
min:0.0
max:9.0
sum:30.0
row 0 sum:6.0
row 1 sum:0.0
row 2 sum:24.0
row 3 sum:0.0
col 0 sum:8.0
col 1 sum:10.0
col 2 sum:12.0
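
When no metadata file is available, the format can be stated in the read() call itself. The DML sketch below assumes that m.csv exists as written in the saving example and that read() accepts the header and sep parameters for csv input, mirroring the fields recorded in the csv metadata file. Treat the parameter choices as assumptions to check against the Language Reference.

```dml
# Assumes m.csv was written as in the saving example. The format, header, and sep
# values are given explicitly so they do not have to come from a metadata file.
m = read("m.csv", format="csv", header=FALSE, sep=",")

print('rows: ' + nrow(m) + ', cols: ' + ncol(m))
print(toString(m))
```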

Matrix Operations

DML and PyDML offer a rich set of operators and built-in functions to perform various operations on matrices and scalars. Operators and built-in functions are described in great detail in the Language Reference (Expressions, Built-In Functions).

In this example, we create a matrix A. Next, we create another matrix B by adding 4 to each element in A, and then flip B by taking its transpose. We multiply A and B to obtain matrix C. We create a matrix D with the same number of rows and columns as C and initialize its elements to 5. Finally, we subtract D from C, divide each element of the difference by 2, and assign the resulting matrix back to D.

DML:
A = matrix("1 2 3 4 5 6", rows=3, cols=2)
print(toString(A))
B = A + 4
B = t(B)
print(toString(B))
C = A %*% B
print(toString(C))
D = matrix(5, rows=nrow(C), cols=ncol(C))
D = (C - D) / 2
print(toString(D))

PyDML:
A = full("1 2 3 4 5 6", rows=3, cols=2)
print(toString(A))
B = A + 4
B = transpose(B)
print(toString(B))
C = dot(A, B)
print(toString(C))
D = full(5, rows=nrow(C), cols=ncol(C))
D = (C - D) / 2
print(toString(D))

Result:
1.000 2.000
3.000 4.000
5.000 6.000

5.000 7.000 9.000
6.000 8.000 10.000

17.000 23.000 29.000
39.000 53.000 67.000
61.000 83.000 105.000

6.000 9.000 12.000
17.000 24.000 31.000
28.000 39.000 50.000
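
A common source of confusion is the difference between the element-wise operator * and the matrix multiplication operator %*%. The short DML sketch below applies both to the same 2x2 matrix. The values are illustrative and the expected results in the comments were worked out by hand.

```dml
A = matrix("1 2 3 4", rows=2, cols=2)

# Element-wise product: each cell is multiplied by itself
# rows: (1 4) and (9 16)
E = A * A

# Matrix product: rows of A combined with columns of A
# rows: (7 10) and (15 22)
P = A %*% A

print(toString(E))
print(toString(P))
```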

Matrix Indexing

The elements in a matrix can be accessed by their row and column indices. In the example below, we have a 3x3 matrix A. First, we access the element at the third row and third column. Next, we obtain a row slice (vector) of the matrix by specifying the row and leaving the column blank. We obtain a column slice (vector) by leaving the row blank and specifying the column. After that, we obtain a submatrix via range indexing, where we specify the first and last row, separated by a colon, and the first and last column, separated by a colon. In DML, both ends of a range are included (2:3 selects rows 2 and 3), whereas in PyDML ranges are 0-based and exclude the end (1:3 selects the same two rows).

DML:
A = matrix("1 2 3 4 5 6 7 8 9", rows=3, cols=3)
print(toString(A))
B = A[3,3]
print(toString(B))
C = A[2,]
print(toString(C))
D = A[,3]
print(toString(D))
E = A[2:3,1:2]
print(toString(E))

PyDML:
A = full("1 2 3 4 5 6 7 8 9", rows=3, cols=3)
print(toString(A))
B = A[2,2]
print(toString(B))
C = A[1,]
print(toString(C))
D = A[,2]
print(toString(D))
E = A[1:3,0:2]
print(toString(E))

Result:
1.000 2.000 3.000
4.000 5.000 6.000
7.000 8.000 9.000

9.000

4.000 5.000 6.000

3.000
6.000
9.000

4.000 5.000
7.000 8.000
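
Index expressions do not have to be literals, and indexing can also appear on the left-hand side of an assignment. The DML sketch below uses nrow() and ncol() to select the last row and the last two columns, then overwrites the second row with zeros. The variable names are hypothetical.

```dml
A = matrix("1 2 3 4 5 6 7 8 9", rows=3, cols=3)

# Last row, whatever the size of A: here the values 7 8 9
lastRow = A[nrow(A),]

# Last two columns, using a range built from ncol(A)
lastTwoCols = A[,(ncol(A)-1):ncol(A)]

# Left indexing: replace the second row with a 1x3 row of zeros
A[2,] = matrix(0, rows=1, cols=3)

print(toString(lastRow))
print(toString(lastTwoCols))
print(toString(A))
```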

Control Statements

DML and PyDML both feature if, if-else, and if-else-if conditional statements.

DML and PyDML feature three loop statements: while, for, and parfor (parallel for). In the example, note that the print statements within the parfor loop can occur in any order since the iterations occur in parallel rather than sequentially as in a regular for loop. The parfor statement can include several optional parameters, as described in the Language Reference (ParFor Statement).

DML:
i = 1
while (i <= 3) {
    if (i == 1) {
        print('hello')
    } else if (i == 2) {
        print('world')
    } else {
        print('!!!')
    }
    i = i + 1
}

A = matrix("1 2 3 4 5 6", rows=3, cols=2)

for (i in 1:nrow(A)) {
    print("for A[" + i + ",1]:" + as.scalar(A[i,1]))
}

parfor(i in 1:nrow(A)) {
    print("parfor A[" + i + ",1]:" + as.scalar(A[i,1]))
}

PyDML:
i = 1
while (i <= 3):
    if (i == 1):
        print('hello')
    elif (i == 2):
        print('world')
    else:
        print('!!!')
    i = i + 1

A = full("1 2 3 4 5 6", rows=3, cols=2)

for (i in 0:nrow(A)-1):
    print("for A[" + i + ",0]:" + scalar(A[i,0]))

parfor(i in 0:nrow(A)-1):
    print("parfor A[" + i + ",0]:" + scalar(A[i,0]))

DML Result:
hello
world
!!!
for A[1,1]:1.0
for A[2,1]:3.0
for A[3,1]:5.0
parfor A[2,1]:3.0
parfor A[1,1]:1.0
parfor A[3,1]:5.0

PyDML Result:
hello
world
!!!
for A[0,0]:1.0
for A[1,0]:3.0
for A[2,0]:5.0
parfor A[0,0]:1.0
parfor A[2,0]:5.0
parfor A[1,0]:3.0
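
Because parfor iterations may run in any order, a typical pattern is for each iteration to write to its own cell of a pre-allocated result matrix, so that the final result does not depend on execution order. The DML sketch below is an illustration of that pattern under this assumption. In practice rowSums(A) would compute the same values directly, and the name rowTotals is hypothetical.

```dml
A = matrix("1 2 3 4 5 6", rows=3, cols=2)

# Pre-allocate one result cell per row of A
rowTotals = matrix(0, rows=nrow(A), cols=1)

parfor (i in 1:nrow(A)) {
    # Each iteration writes only to row i, so iterations do not interfere
    rowTotals[i,1] = rowSums(A[i,])
}

# The row totals are 3, 7, and 11 regardless of iteration order
print(toString(rowTotals))
```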

User-Defined Functions

Functions encapsulate useful functionality in SystemML. In addition to built-in functions, users can define their own functions. Functions take 0 or more parameters and return 0 or more values. Currently, if a function returns nothing, it still needs to be assigned to a variable.

DML:
doSomething = function(matrix[double] mat) return (matrix[double] ret) {
    additionalCol = matrix(1, rows=nrow(mat), cols=1) # 3x1 matrix with 1 values
    ret = cbind(mat, additionalCol) # concatenate column to matrix
    ret = cbind(ret, seq(0, 2, 1))  # concatenate column (0,1,2) to matrix
    ret = cbind(ret, rowMaxs(ret))  # concatenate column of max row values to matrix
    ret = cbind(ret, rowSums(ret))  # concatenate column of row sums to matrix
}

A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
write(A, "A.csv", format="csv")
write(B, "B.csv", format="csv")

PyDML:
def doSomething(mat: matrix[float]) -> (ret: matrix[float]):
    additionalCol = full(1, rows=nrow(mat), cols=1) # 3x1 matrix with 1 values
    ret = cbind(mat, additionalCol) # concatenate column to matrix
    ret = cbind(ret, seq(0, 2, 1))  # concatenate column (0,1,2) to matrix
    ret = cbind(ret, rowMaxs(ret))  # concatenate column of max row values to matrix
    ret = cbind(ret, rowSums(ret))  # concatenate column of row sums to matrix

A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
save(A, "A.csv", format="csv")
save(B, "B.csv", format="csv")

In the above example, a 3x2 matrix of random doubles between 0 and 2 is created using the rand() function. Additional parameters can be passed to rand() to control sparsity and other matrix characteristics.

Matrix A is passed to the doSomething function, which concatenates four columns to it in turn: a column of 1 values, a column consisting of the values (0, 1, 2), a column of the row maximums, and a column of the row sums, with the maximums and sums each computed over all columns present at that point. The resulting matrix is returned to variable B. Matrix A is saved to the A.csv file and matrix B is saved to the B.csv file.

A.csv:
1.6091961493071,0.7088614208099939
0.5984862383600267,1.5732118950764993
0.2947607068519842,1.9081406573366781

B.csv:
1.6091961493071,0.7088614208099939,1.0,0,1.6091961493071,4.927253719424194
0.5984862383600267,1.5732118950764993,1.0,1.0,1.5732118950764993,5.744910028513026
0.2947607068519842,1.9081406573366781,1.0,2.0,2.0,7.202901364188662
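
Since a function can return more than one value, the caller needs a way to receive them. The DML sketch below defines a hypothetical minMax function with two scalar outputs and captures both using a bracketed list on the left-hand side of the assignment. The function and variable names are illustrative only.

```dml
# Hypothetical helper returning two scalar values
minMax = function(matrix[double] mat) return (double lo, double hi) {
    lo = min(mat)   # smallest element of the matrix
    hi = max(mat)   # largest element of the matrix
}

A = matrix("1 2 3 4 5 6", rows=3, cols=2)

# Multiple return values are received in a bracketed list
[smallest, largest] = minMax(A)
print('smallest = ' + smallest + ', largest = ' + largest)
```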

Command-Line Arguments and Default Values

Command-line arguments can be passed to DML and PyDML scripts either as named arguments or as positional arguments. Named arguments are the preferred technique. Named arguments are passed using the -nvargs switch, and positional arguments are passed using the -args switch.

Default values can be set using the ifdef() function.

In the example below, a matrix is read from the file system using named argument M. The number of rows to print is specified using the rowsToPrint argument, which defaults to 2 if no argument is supplied. Likewise, the number of columns is specified using colsToPrint with a default value of 2.

DML:
fileM = $M

numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2

m = read(fileM)

for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}

PyDML:
fileM = $M

numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2

m = load(fileM)

for (i in 0:numRowsToPrint-1):
    for (j in 0:numColsToPrint-1):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))

DML Arguments and Results:
Example #1 Arguments:
-f ex.dml -nvargs M=m.csv rowsToPrint=1 colsToPrint=3

Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0

Example #2 Arguments:
-f ex.dml -nvargs M=m.csv

Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0

PyDML Arguments and Results:
Example #1 Arguments:
-f ex.pydml -python -nvargs M=m.csv rowsToPrint=1 colsToPrint=3

Example #1 Results:
[0,0]:1.0
[0,1]:2.0
[0,2]:3.0

Example #2 Arguments:
-f ex.pydml -python -nvargs M=m.csv

Example #2 Results:
[0,0]:1.0
[0,1]:2.0
[1,0]:0.0
[1,1]:0.0

Here, we see identical functionality but with positional arguments.

DML:
fileM = $1

numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2

m = read(fileM)

for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}

PyDML:
fileM = $1

numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2

m = load(fileM)

for (i in 0:numRowsToPrint-1):
    for (j in 0:numColsToPrint-1):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))

DML Arguments and Results:
Example #1 Arguments:
-f ex.dml -args m.csv 1 3

Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0

Example #2 Arguments:
-f ex.dml -args m.csv

Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0

PyDML Arguments and Results:
Example #1 Arguments:
-f ex.pydml -python -args m.csv 1 3

Example #1 Results:
[0,0]:1.0
[0,1]:2.0
[0,2]:3.0

Example #2 Arguments:
-f ex.pydml -python -args m.csv

Example #2 Results:
[0,0]:1.0
[0,1]:2.0
[1,0]:0.0
[1,1]:0.0
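
The ifdef() function is not limited to numeric defaults. As a minimal sketch, the DML script below uses string defaults for an output file name and an output format, so that only the input file has to be supplied. The argument names out and fmt and the default file name are hypothetical, chosen only for illustration.

```dml
# Hypothetical named arguments: M (required), out and fmt (optional)
fileM = $M
fileOut = ifdef($out, "m_copy.csv")   # default output file name
fmtOut = ifdef($fmt, "csv")           # default output format

m = read(fileM)
write(m, fileOut, format=fmtOut)
```

Under these assumptions, passing -nvargs M=m.csv would write a csv copy to m_copy.csv, while -nvargs M=m.csv out=m_copy.txt fmt=text would write the same matrix in the text (i,j,v) format.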

Additional Information

The Language Reference contains highly detailed information regarding DML.

In addition, many excellent examples of DML and PyDML can be found in the scripts directory.