سؤال

I have a SQLContext data frame derived from pandas data frame consisting of several numerical columns. I want to perform multivariate statistical analysis using the pyspark.mllib.stats package. The statistics function expects a RDD of vectors. I could not convert this data frame into RDD of vectors. Is there a way to convert the data frame?

Code:

 rdd = sqlCtx.createDataFrame(df_new)
 summary = Statistics.colStats(rdd)

I am getting df_new from

 df_new = df.applymap(lambda s: dic.get(s) if s in dic else s) #df is a pandas dataframe

I am getting a PY4JJava error at the summary line. The issue is with the format of rdd.

هل كانت مفيدة؟

المحلول

The Dataframe Python API exposes the RDD of a Dataframe by calling the following :

df.rdd # you can save it, perform transformations of course, etc. 

df.rdd returns the content as an pyspark.RDD of Row.

You can then map on that RDD of Row transforming every Row into a numpy vector. I can't be more specific about the transformation since I don't know what your vector represents with the information given.

Note 1: dfis the variable define our Dataframe.

Note 2: this function is available since Spark 1.3

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى datascience.stackexchange
scroll top