=====Welcome to the Data Science Lab Cluster=====

===== DSL LAB IS OPEN =====

You will need an account to get started (contact the Lab Admin). This wiki will be updated as new capabilities are added.

  * [[how_do_i#using_python|Python Anaconda]] is installed.
  * [[how_do_i#r_studio|RStudio]] is installed.
  * [[how_do_i#using_the_zeppelin_web_notebook|Zeppelin Notebooks]] are now fully configured for Python, PySpark, Spark, Hive, and shell programming. A notebook called **Basic Tests (Python, PySpark, sh, and Hive)** is available for learning more about Zeppelin (clone it first).
  * [[how_do_i#transfer_files_to_from_the_cluster|Transferring Files from the Cloud]] has been added. The [[using_rclone|rclone]] package, a command-line tool, has been installed on all workstations.
  * [[how_do_i#use_tensorflow|Python TensorFlow]] (CPU and GPU) and Keras are installed.

**Watch this space for updates.**

====About The System====

This computing resource is a collection of nine individual workstations that can work together as a scalable data science cluster for Big Data processing. The system can run large Hadoop and Spark jobs using the 10 TByte Hadoop Distributed File System (HDFS) on up to 120 cores. There are also three GPU-equipped nodes that are configured to run TensorFlow. Total system memory is 600 GBytes spread across 30 separate motherboards. Each workstation provides a Linux desktop environment that supports Anaconda Navigator (Python), RStudio, and the Zeppelin web notebook (Spark, PySpark, Hadoop Hive, HBase, Python).

====FOR HELP CLICK ON THE "How Do I" LINK BELOW====

  * [[System Description|System Description]]
  * [[System Access|System Access]]
  * [[How Do I|How Do I]]
  * [[Adminstration|Administration]]

**HINT:** To get back to this main page from any page in the wiki, click on **Data Science Lab** in the upper left corner.

**System News:**

  * Feb-18-2022: Python TensorFlow (CPU and GPU) and Keras installed.
  * Feb-14-2022: Zeppelin Notebooks are configured and rclone installed.
  * Feb-07-2022: Anaconda Navigator
  * Jan-21-2022: Anaconda Python and RStudio installed.
  * Jan-20-2022: System is ready for users.
  * Nov-11-2012: Upgrade to CentOS 7 in progress.

---- OLD SYSTEM ----

  * Aug-30-2019: Python options are now the default (V-2.6.6) or Anaconda (V-3.7.1). Zeppelin now supports Python3, PySpark, Spark1, Spark2, and SparkR. See the "How Do I" page for information.
  * Feb-20-2019: Python Anaconda is now available; see the "How Do I" page for information on how to access it.
  * Nov-27-2018: Default Spark version is now 2.1.0; default PySpark uses Python 3.6.3.
  * Nov-07-2018: The Zeppelin Notebook is now available; see System Access above.
  * Nov-01-2018: Python 3.6 updated on all systems with modules: numpy, matplotlib, TextBlob, scipy, scikit-learn, gensim, pillow, h5py, xgboost, mysqlclient, happybase. (See "How Do I" for usage information.)
  * Jul-26-2018: RStudio Server is installed. Enter "http://localhost:8787" in a browser to access it.
  * May-02-2018: R Libraries: See /opt/share/doc/Installing-R-Libraries for how to install your own R libraries.
  * Apr-20-2018: Python HBase library HappyBase installed. TensorFlow now running on Limulus8-TF and Limulus9-TF.
  * Feb-21-2018: A current Wikipedia snapshot is in HDFS at /data/Wikipedia.
  * Feb-19-2018: HDFS is now available on all limulus machines as /mnt/hdfs. Annotated examples from the purple Hadoop book are in /opt/share/doc/Hadoop2_Quick_Start_V1.
  * Feb-14-2018: The following Python 2.7 modules are installed: nltk, keras, numpy, pandas, matplotlib, TextBlob, scipy, TensorFlow, scikit-learn, gensim, pillow, h5py. !!! Be sure to run "scl enable devtoolset-6 python27 bash" to use Python 2.7.