This repository has been archived by the owner on May 15, 2019. It is now read-only.

Spot naming #73

Open. Wants to merge 7 commits into base branch `tmp_spot`.
Changes from 6 commits
20 changes: 10 additions & 10 deletions INSTALL.md
@@ -1,17 +1,17 @@
- # oni-ml
+ # spot-ml

- Machine learning routines for OpenNetworkInsight, version 1.1
+ Machine learning routines for Apache Spot (incubating).

## Prerequisites, Installation and Configuration

- Install and configure oni-ml as a part of the Open-Network-Insight project, per the instruction at
- [the Open-Network-Insight wiki](https://github.com/Open-Network-Insight/open-network-insight/wiki).
+ Install and configure spot-ml as a part of the Spot project, per the instructions at
+ [the Spot wiki].

**Review comment:** So, for now you are just keeping the links as placeholders rather than letting them point to the current repo structure?

**Author reply:** Yeah; will add the new repo structure when it all comes online.

- The oni-ml routines must be built into a jar stored at `target/scala-2.10/oni-ml-assembly-1.1.jar` on the master node. This requires Scala 2.10 or later to be installed on the system building the jar. To build the jar, from the top-level of the oni-ml repo, execute the command `sbt assembly`.
+ The spot-ml routines must be built into a jar stored at `target/scala-2.10/spot-ml-assembly-1.1.jar` on the master node. This requires Scala 2.10 or later to be installed on the system building the jar. To build the jar, from the top-level of the spot-ml repo, execute the command `sbt assembly`.
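For reference, a minimal build sketch (assumes `sbt` is on the PATH; the repo path is illustrative):

```bash
# From the top level of the spot-ml repo:
cd ~/spot-ml
sbt assembly

# Confirm the assembly jar landed where the pipeline scripts expect it:
ls -lh target/scala-2.10/spot-ml-assembly-1.1.jar
```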

- Names and language that we will use from the configuration variables for Open-Network-Insight (that are set in the file [duxbay.conf](https://github.com/Open-Network-Insight/oni-setup/blob/dev/duxbay.conf))
+ Names and language that we will use from the configuration variables for Spot (set in the file [spot.conf]):

- - MLNODE The node from which the oni-ml routines are invoked
+ - MLNODE The node from which the spot-ml routines are invoked
- NODES The list of MPI worker nodes that execute the topic modelling analysis
- HUSER An HDFS user path that will be the base path for the solution; this is usually the same user that you created to run the solution
- LPATH The local path for the ML intermediate and final results, dynamically created and populated when the pipeline runs
@@ -22,7 +22,7 @@ Names and language that we will use from the configuration variables for Open-Ne
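A hypothetical sketch of these variables in `/etc/spot.conf` (names from this PR; values are placeholders, and `ml_ops.sh` notes that `FDATE` and `DSOURCE` must be defined before the file is sourced):

```bash
# Hypothetical /etc/spot.conf excerpt; values are illustrative only.
MLNODE='spot-master'                      # node from which the spot-ml routines are invoked
NODES=('worker01' 'worker02')             # MPI worker nodes running the topic modelling analysis
HUSER='/user/spot'                        # base HDFS path for the solution
LPATH="${HOME}/ml/${DSOURCE}/${FDATE}"    # local path for ML intermediate and final results
```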

### Prepare data for input

- Load data for consumption by oni-ml by running [oni-ingest](https://github.com/Open-Network-Insight/oni-ingest/tree/dev).
+ Load data for consumption by spot-ml by running [spot-ingest].


## Run a suspicious connects analysis
@@ -52,15 +52,15 @@ As the maximum probability of an event is 1, a threshold of 1 can be used to sel
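A hypothetical invocation, with argument roles inferred from the `ml_ops.sh` changes below (a date, a data source, and an optional third argument that overrides the default TOL from the conf file):

```bash
# Hypothetical run: score DNS traffic for 2016-09-01 with an explicit threshold.
./ml_ops.sh 20160901 dns 1e-6
```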

## Licensing

- oni-ml is licensed under Apache Version 2.0
+ spot-ml is licensed under the Apache License, Version 2.0.

## Contributing

Create a pull request and contact the maintainers.

## Issues

- Report issues at the [OpenNetworkInsight issues page](https://github.com/Open-Network-Insight/open-network-insight/issues).
+ Report issues at the [Spot issues page].

## Maintainers

28 changes: 14 additions & 14 deletions README.md
@@ -1,27 +1,27 @@
- # oni-ml
+ # spot-ml

- Machine learning routines for OpenNetworkInsight, version 1.1
+ Machine learning routines for Apache Spot (incubating).

- At present, oni-ml contains routines for performing *suspicious connections* analyses on netflow, DNS or proxy data gathered from a network. These
+ At present, spot-ml contains routines for performing *suspicious connections* analyses on netflow, DNS or proxy data gathered from a network. These
analyses consume a (possibly very large) collection of network events and produce a list of the events considered to be the least probable (or most suspicious).

- oni-ml is designed to be run as a component of Open-Network-Insight. It relies on the ingest component of Open-Network-Insight to collect and load
- netflow and DNS records, and oni-ml will try to load data to the operational analytics component of Open-Network-Insight. It is strongly suggested that when experimenting with oni-ml, you do so as a part of the unified Open-Network-Insight system: Please see [the Open-Network-Insight wiki](https://github.com/Open-Network-Insight/open-network-insight/wiki)
+ spot-ml is designed to be run as a component of Spot. It relies on the ingest component of Spot to collect and load
+ netflow and DNS records, and spot-ml will try to load data to the operational analytics component of Spot. It is suggested that, when experimenting with spot-ml, you do so as part of the unified Spot system; please see [the Spot wiki].

- The remaining instructions in this README file treat oni-ml in a stand-alone fashion that might be helpful for customizing and troubleshooting the
+ The remaining instructions in this README file treat spot-ml in a stand-alone fashion that might be helpful for customizing and troubleshooting the
component.

## Prepare data for input

- Load data for consumption by oni-ml by running [oni-ingest](https://github.com/Open-Network-Insight/oni-ingest/tree/dev).
+ Load data for consumption by spot-ml by running [spot-ingest].

The data format and location where the data is stored differs for netflow and DNS analyses.

**Netflow Data**

Netflow data for the year YEAR, month MONTH, and day DAY is stored in HDFS at `HUSER/flow/csv/y=YEAR/m=MONTH/d=DAY/*`
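For example, a hypothetical listing of one day of input, assuming `HUSER=/user/spot`:

```bash
# Hypothetical: list the netflow CSVs ingested for 2016-09-01.
hdfs dfs -ls /user/spot/flow/csv/y=2016/m=09/d=01/
```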

- Data for oni-ml netflow analyses is currently stored in text csv files using the following schema:
+ Data for spot-ml netflow analyses is currently stored in text csv files using the following schema:

- time: String
- year: Double
@@ -55,7 +55,7 @@ Data for oni-ml netflow analyses is currently stored in text csv files using the

DNS data for the year YEAR, month MONTH and day DAY is stored in Hive at `HUSER/dns/hive/y=YEAR/m=MONTH/d=DAY/`

- The Hive tables containing DNS data for oni-ml analyses have the following schema:
+ The Hive tables containing DNS data for spot-ml analyses have the following schema:

- frame_time: STRING
- unix_tstamp: BIGINT
@@ -124,12 +124,12 @@ As the maximum probability of an event is 1, a threshold of 1 can be used to sel
```


- ### oni-ml output
+ ### spot-ml output

Final results are stored in the following file on HDFS.

Depending on which data source is analyzed,
- oni-ml output will be found under the ``HPATH`` at one of
+ spot-ml output will be found under the ``HPATH`` at one of

$HPATH/dns/scored_results/YYYYMMDD/scores/dns_results.csv
$HPATH/proxy/scored_results/YYYYMMDD/scores/results.csv
@@ -138,7 +138,7 @@

It is a csv file in which network events are annotated with estimated probabilities and sorted in ascending order.
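A hypothetical spot-check of that output, using the path pattern above (date and data source are illustrative):

```bash
# Hypothetical: peek at the most suspicious DNS events for 2016-09-01.
hdfs dfs -cat "${HPATH}/dns/scored_results/20160901/scores/dns_results.csv" | head
```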

- A successful run of oni-ml will also create and populate a directory at `LPATH/<source>/YYYYMMDD` where `<source>` is one of flow, dns or proxy, and
+ A successful run of spot-ml will also create and populate a directory at `LPATH/<source>/YYYYMMDD` where `<source>` is one of flow, dns or proxy, and
`YYYYMMDD` is the date argument provided to `ml_ops.sh`.
This directory will contain the following files generated during the LDA procedure used for topic-modelling:

@@ -152,15 +152,15 @@ In addition, on each worker node identified in NODES, in the `LPATH/<source>/YYY

## Licensing

- oni-ml is licensed under Apache Version 2.0
+ spot-ml is licensed under the Apache License, Version 2.0.

## Contributing

Create a pull request and contact the maintainers.

## Issues

- Report issues at the [OpenNetworkInsight issues page](https://github.com/Open-Network-Insight/open-network-insight/issues).
+ Report issues at the [Spot issues page].

## Maintainers

2 changes: 1 addition & 1 deletion build.sbt
@@ -1,4 +1,4 @@
name := "oni-ml"
name := "spot-ml"

version := "1.1"

7 changes: 3 additions & 4 deletions install_ml.sh
@@ -1,13 +1,12 @@
#!/bin/bash

# lda stage
- source /etc/duxbay.conf
+ source /etc/spot.conf

# copy solution files to all nodes
for d in "${NODES[@]}"
do
- rsync -v -a --include='target' --include='target/scala-2.10' --include='target/scala-2.10/oni-ml-assembly-1.1.jar' \
- --include='oni-lda-c' --include='oni-lda-c/*' --include 'top-1m.csv' --include='*.sh' \
+ rsync -v -a --include='target' --include='target/scala-2.10' --include='target/scala-2.10/spot-ml-assembly-1.1.jar' \
+ --include='spot-lda-c' --include='spot-lda-c/*' --include 'top-1m.csv' --include='*.sh' \
--exclude='*' . $d:${LUSER}/ml
done

8 changes: 4 additions & 4 deletions ml_ops.sh
@@ -23,7 +23,7 @@ fi
# read in variables (except for date) from etc/.conf file
# note: FDATE and DSOURCE *must* be defined prior sourcing this conf file

- source /etc/duxbay.conf
+ source /etc/spot.conf

# third argument if present will override default TOL from conf file

@@ -78,13 +78,13 @@ rm -f ${LPATH}/*.{dat,beta,gamma,other,pkl} # protect the flow_scores.csv file
hdfs dfs -rm -R -f ${HDFS_SCORED_CONNECTS}

# Add -p <command> to execute pre MPI command.
- # Pre MPI command can be configured in /etc/duxbay.conf
+ # Pre MPI command can be configured in /etc/spot.conf
# In this script, after the line after --mpicmd ${MPI_CMD} add:
# --mpiprep ${MPI_PREP_CMD}

${MPI_PREP_CMD}

- time spark-submit --class "org.opennetworkinsight.SuspiciousConnects" \
+ time spark-submit --class "org.apache.spot.SuspiciousConnects" \
--master yarn-client \
--driver-memory ${SPK_DRIVER_MEM} \
--conf spark.driver.maxResultSize=${SPK_DRIVER_MAX_RESULTS} \
@@ -101,7 +101,7 @@ time spark-submit --class "org.opennetworkinsight.SuspiciousConnects" \
--conf spark.shuffle.service.enabled=true \
--conf spark.yarn.am.waitTime=1000000 \
--conf spark.yarn.driver.memoryOverhead=${SPK_DRIVER_MEM_OVERHEAD} \
- --conf spark.yarn.executor.memoryOverhead=${SPAK_EXEC_MEM_OVERHEAD} target/scala-2.10/oni-ml-assembly-1.1.jar \
+ --conf spark.yarn.executor.memoryOverhead=${SPAK_EXEC_MEM_OVERHEAD} target/scala-2.10/spot-ml-assembly-1.1.jar \
--analysis ${DSOURCE} \
--input ${RAWDATA_PATH} \
--dupfactor ${DUPFACTOR} \
@@ -1,4 +1,4 @@
- package org.opennetworkinsight
+ package org.apache.spot

import org.apache.spark.rdd.RDD
import java.io.PrintWriter
@@ -15,14 +15,14 @@ import scala.sys.process._
* 4. Calculates and returns probability of word given topic: p(w|z)
*/

- object OniLDACWrapper {
+ object SpotLDACWrapper {

- case class OniLDACInput(doc: String, word: String, count: Int) extends Serializable
+ case class SpotLDACInput(doc: String, word: String, count: Int) extends Serializable

- case class OniLDACOutput(docToTopicMix: Map[String, Array[Double]], wordResults: Map[String, Array[Double]])
+ case class SpotLDACOutput(docToTopicMix: Map[String, Array[Double]], wordResults: Map[String, Array[Double]])


- def runLDA(docWordCount: RDD[OniLDACInput],
+ def runLDA(docWordCount: RDD[SpotLDACInput],
modelFile: String,
topicDocumentFile: String,
topicWordFile: String,
Expand All @@ -35,19 +35,19 @@ object OniLDACWrapper {
localUser: String,
dataSource: String,
nodes: String,
- prgSeed: Option[Long]): OniLDACOutput = {
+ prgSeed: Option[Long]): SpotLDACOutput = {

// Create word Map Word,Index for further usage
val wordDictionary: Map[String, Int] = {
val words = docWordCount
.cache
- .map({case OniLDACInput(doc, word, count) => word})
+ .map({case SpotLDACInput(doc, word, count) => word})
.distinct
.collect
words.zipWithIndex.toMap
}

- val distinctDocument = docWordCount.map({case OniLDACInput(doc, word, count) => doc}).distinct.collect
+ val distinctDocument = docWordCount.map({case SpotLDACInput(doc, word, count) => doc}).distinct.collect
//distinctDocument.cache()

// Create document Map Index, Document for further usage
@@ -113,7 +113,7 @@
// Create word results
val wordResults = getWordToProbPerTopicMap(topicWordData, wordDictionary)

- OniLDACOutput(docToTopicMix, wordResults)
+ SpotLDACOutput(docToTopicMix, wordResults)
}

/**
@@ -147,18 +147,18 @@
}
}

- def createModel(documentWordData: RDD[OniLDACInput], wordToIndex: Map[String, Int], distinctDocument: Array[String])
+ def createModel(documentWordData: RDD[SpotLDACInput], wordToIndex: Map[String, Int], distinctDocument: Array[String])
: Array[String]
= {
val documentCount = documentWordData
- .map({case OniLDACInput(doc, word, count) => doc})
+ .map({case SpotLDACInput(doc, word, count) => doc})
.map(document => (document, 1))
.reduceByKey(_ + _)
.collect
.toMap

val wordIndexdocWordCount = documentWordData
- .map({case OniLDACInput(doc, word, count) => (doc, wordToIndex(word) + ":" + count)})
+ .map({case SpotLDACInput(doc, word, count) => (doc, wordToIndex(word) + ":" + count)})
.groupByKey()
.map(x => (x._1, x._2.mkString(" ")))
.collect
@@ -1,12 +1,12 @@
- package org.opennetworkinsight
+ package org.apache.spot

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
- import org.opennetworkinsight.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
- import org.opennetworkinsight.dns.DNSSuspiciousConnectsAnalysis
- import org.opennetworkinsight.netflow.FlowSuspiciousConnects
- import org.opennetworkinsight.proxy.ProxySuspiciousConnectsAnalysis
+ import org.apache.spot.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
+ import org.apache.spot.dns.DNSSuspiciousConnectsAnalysis
+ import org.apache.spot.netflow.FlowSuspiciousConnects
+ import org.apache.spot.proxy.ProxySuspiciousConnectsAnalysis

/**
* Top level entrypoint to execute suspicious connections analysis on network data.
@@ -39,7 +39,7 @@ object SuspiciousConnects {
Logger.getLogger("akka").setLevel(Level.OFF)

val analysis = config.analysis
- val sparkConfig = new SparkConf().setAppName("ONI ML: " + analysis + " lda")
+ val sparkConfig = new SparkConf().setAppName("Spot ML: " + analysis + " suspicious connects analysis")
val sparkContext = new SparkContext(sparkConfig)
val sqlContext = new SQLContext(sparkContext)
implicit val outputDelimiter = OutputDelimiter
@@ -1,4 +1,4 @@
- package org.opennetworkinsight
+ package org.apache.spot


/**
@@ -1,4 +1,4 @@
- package org.opennetworkinsight
+ package org.apache.spot

import org.apache.spark.broadcast.Broadcast

@@ -1,8 +1,8 @@
- package org.opennetworkinsight.dns
+ package org.apache.spot.dns

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
- import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel.ModelSchema
+ import org.apache.spot.dns.model.DNSSuspiciousConnectsModel.ModelSchema

/**
* Data frame schemas and column names used in the DNS suspicious connects analysis.
@@ -1,17 +1,17 @@
- package org.opennetworkinsight.dns
+ package org.apache.spot.dns

import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SQLContext}
- import org.opennetworkinsight.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
- import org.opennetworkinsight.dns.DNSSchema._
- import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel
- import org.opennetworkinsight.dns.sideinformation.DNSSideInformation
+ import org.apache.spot.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
+ import org.apache.spot.dns.DNSSchema._
+ import org.apache.spot.dns.model.DNSSuspiciousConnectsModel
+ import org.apache.spot.dns.sideinformation.DNSSideInformation
import org.apache.log4j.Logger

- import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel.ModelSchema
- import org.opennetworkinsight.dns.sideinformation.DNSSideInformation.SideInfoSchema
+ import org.apache.spot.dns.model.DNSSuspiciousConnectsModel.ModelSchema
+ import org.apache.spot.dns.sideinformation.DNSSideInformation.SideInfoSchema

/**
* The suspicious connections analysis of DNS log data develops a probabilistic model the DNS queries
@@ -1,9 +1,9 @@
- package org.opennetworkinsight.dns
+ package org.apache.spot.dns

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions._
- import org.opennetworkinsight.utilities.DomainProcessor.{DomainInfo, extractDomainInfo}
- import org.opennetworkinsight.utilities.Quantiles
+ import org.apache.spot.utilities.DomainProcessor.{DomainInfo, extractDomainInfo}
+ import org.apache.spot.utilities.Quantiles


/**
@@ -1,9 +1,9 @@
- package org.opennetworkinsight.dns.model
+ package org.apache.spot.dns.model

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
- import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel.{ModelSchema, modelColumns}
+ import org.apache.spot.dns.model.DNSSuspiciousConnectsModel.{ModelSchema, modelColumns}

import scala.io.Source

@@ -1,8 +1,8 @@
- package org.opennetworkinsight.dns.model
+ package org.apache.spot.dns.model

import org.apache.spark.broadcast.Broadcast
- import org.opennetworkinsight.SuspiciousConnectsScoreFunction
- import org.opennetworkinsight.dns.DNSWordCreation
+ import org.apache.spot.SuspiciousConnectsScoreFunction
+ import org.apache.spot.dns.DNSWordCreation


/**