Recently a few customers have asked me how to enable multiple users to access R Server on HDInsight, so I thought it would be a good idea to blog about the different ways to do it.
To provide some background: you need to provide two users when creating an HDInsight cluster. One is the so-called “http user”, i.e. the “Cluster login user name” below. The other is the “ssh user”, i.e. the “SSH user name” below.
Basically, the “http user” is used to authenticate through the HDInsight gateway, which protects the HDInsight clusters you create. This user is used to access the Ambari UI, the YARN UI, and many other UI components.
The “ssh user” is used to access the cluster through secure shell. This user is an actual Linux user on all the head nodes, worker nodes, edge nodes, etc., so you can use secure shell to access those remote nodes.
For a Microsoft R Server on HDInsight cluster, it’s a bit more complex, because we put RStudio Server Community edition on HDInsight, and it only accepts a Linux user name and password as the login mechanism (it does not support passing tokens). So if you have created a new cluster and want to use RStudio, you first need to log in through the HDInsight gateway with the http user’s credentials, and then log in to RStudio with the ssh user’s credentials.
One limitation of existing HDInsight clusters is that only one SSH user account can be created at cluster provisioning time. So in order to allow multiple users to access Microsoft R Server on HDInsight clusters, we need to create additional users in the Linux system.
Because RStudio Server Community edition runs on the cluster’s edge node, we need three steps here:
- Use the SSH user created at provisioning time to log in to the edge node
- Add more Linux users on the edge node
- Use RStudio Community edition with the newly created user
Step 1: Use the SSH user created at provisioning time to log in to the edge node
You can follow this documentation: Connect to HDInsight (Hadoop) using SSH (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix) to access the edge node. To start simple, download any SSH tool (such as PuTTY) and use the existing SSH user to log in.
The edge node address for an R Server on HDInsight cluster is:
clustername-ed-ssh.azurehdinsight.net
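For example, from a terminal (or the PuTTY command line), the connection looks roughly like the line below, assuming a cluster named mycluster and an SSH user named sshuser (both are placeholders for your own values):
ssh sshuser@mycluster-ed-ssh.azurehdinsight.net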
Step 2: Add more Linux users on the edge node
Execute the commands below:
sudo useradd yournewusername -m
sudo passwd yournewusername
You will see something like below. When prompted for “Current Kerberos password:”, just press Enter to skip it. The -m option of useradd tells the system to create a home folder for the user.
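For example, to create the user sshuser6 that is used in the rest of this post:
sudo useradd sshuser6 -m
sudo passwd sshuser6
When prompted, press Enter at “Current Kerberos password:”, then type and confirm the new password. You can verify that the account exists with id sshuser6.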
Step 3: Use RStudio Community edition with the newly created user
Use the newly created user to log in to RStudio. Remember that you will first pass the HDInsight gateway with the http user’s credentials, and then log in to RStudio with the new Linux user’s credentials.
You will see that we are now using the new user (sshuser6) to log in to the cluster.
You can then submit a job using ScaleR functions:
# Set the HDFS (WASB) location of example data
bigDataDirRoot <- "/example/data"
# Create a local folder to store the data temporarily
source <- "/tmp/AirOnTimeCSV2012"
dir.create(source)
# Download data to the tmp folder
remoteDir <- "http://packages.revolutionanalytics.com/datasets/AirOnTimeCSV2012"
download.file(file.path(remoteDir, "airOT201201.csv"), file.path(source, "airOT201201.csv"))
download.file(file.path(remoteDir, "airOT201202.csv"), file.path(source, "airOT201202.csv"))
download.file(file.path(remoteDir, "airOT201203.csv"), file.path(source, "airOT201203.csv"))
download.file(file.path(remoteDir, "airOT201204.csv"), file.path(source, "airOT201204.csv"))
download.file(file.path(remoteDir, "airOT201205.csv"), file.path(source, "airOT201205.csv"))
download.file(file.path(remoteDir, "airOT201206.csv"), file.path(source, "airOT201206.csv"))
download.file(file.path(remoteDir, "airOT201207.csv"), file.path(source, "airOT201207.csv"))
download.file(file.path(remoteDir, "airOT201208.csv"), file.path(source, "airOT201208.csv"))
download.file(file.path(remoteDir, "airOT201209.csv"), file.path(source, "airOT201209.csv"))
download.file(file.path(remoteDir, "airOT201210.csv"), file.path(source, "airOT201210.csv"))
download.file(file.path(remoteDir, "airOT201211.csv"), file.path(source, "airOT201211.csv"))
download.file(file.path(remoteDir, "airOT201212.csv"), file.path(source, "airOT201212.csv"))
# Set directory in bigDataDirRoot to load the data into
inputDir <- file.path(bigDataDirRoot,"AirOnTimeCSV2012")
# Make the directory
rxHadoopMakeDir(inputDir)
# Copy the data from source to input
rxHadoopCopyFromLocal(source, bigDataDirRoot)
# Define the HDFS (WASB) file system
hdfsFS <- RxHdfsFileSystem()
# Create info list for the airline data
airlineColInfo <- list(
    DAY_OF_WEEK = list(type = "factor"),
    ORIGIN = list(type = "factor"),
    DEST = list(type = "factor"),
    DEP_TIME = list(type = "integer"),
    ARR_DEL15 = list(type = "logical"))
# get all the column names
varNames <- names(airlineColInfo)
# Define the text data source in hdfs
airOnTimeData <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames, fileSystem = hdfsFS)
# Define the text data source in local system
airOnTimeDataLocal <- RxTextData(source, colInfo = airlineColInfo, varsToKeep = varNames)
# formula to use
formula = "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"
# Define the Spark compute context
mySparkCluster <- RxSpark()
# Set the compute context
rxSetComputeContext(mySparkCluster)
# Run a logistic regression
system.time(
    modelSpark <- rxLogit(formula, data = airOnTimeData)
)
# Display a summary
summary(modelSpark)
And you will see in the YARN UI that the submitted jobs run under different user names.
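If you prefer the command line over the YARN UI, a quick way to check this is the sketch below, assuming you are on a cluster node where the YARN client is available (such as the edge node):
yarn application -list
The output includes a User column, so jobs submitted from different RStudio sessions should show up under their respective Linux user names.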
Please note that the newly added users do not have root privileges in the Linux system, but they have the same access to all the files in the remote storage (HDFS or WASB storage).
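For example, logged in to the edge node as one of the newly created users, you can confirm storage access by listing the example data folder used above (the path assumes you ran the ScaleR example in this post):
hdfs dfs -ls /example/data/AirOnTimeCSV2012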