fileio
The functionality provided by this module is used in Context.textFile() for reading and in RDD.saveAsTextFile() for writing.

You can use this submodule with File.dump(), File.load() and File.exists() to read, write and check for the existence of a file. All methods transparently handle various schemes (for example http://, s3:// and file://) and compression/decompression of .gz and .bz2 files (among others).
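A quick sketch of the round trip (the file names and bucket below are placeholders, and dump() is assumed to accept an io.BytesIO, mirroring FileSystem.dump() documented further down):

    import io
    from pysparkling.fileio import File

    # Write a gzip-compressed file; the codec is picked from the .gz suffix.
    File('my_output.txt.gz').dump(io.BytesIO(b'hello world\n'))

    # Check for existence and read the data back as a byte stream.
    if File('my_output.txt.gz').exists():
        data = File('my_output.txt.gz').load().read()

    # The same calls work for other schemes, for example
    # File('s3://my-bucket/my_file.txt').load() or
    # File('http://example.com/data.txt').load().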
class pysparkling.fileio.File(file_name)
    File object.

    Parameters: file_name – Any file name.

    static resolve_filenames(all_expr)
        Resolve an expression to a list of file names.

        Parameters: all_expr – A comma-separated list of expressions. The expressions can contain the wildcard characters * and ?. Spark datasets are also resolved to the paths of their individual partitions (i.e. my_data gets resolved to [my_data/part-00000, my_data/part-00001]).
        Returns: A list of file names.
        Return type: list

    load()
        Load the data from a file.

        Return type: io.BytesIO
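For instance, a wildcard expression and a partitioned dataset name can be resolved like this (the paths are placeholders):

    from pysparkling.fileio import File

    # Wildcards expand to the matching file names.
    logs = File.resolve_filenames('logs/2023-*.txt')

    # A Spark-style dataset directory expands to its partition files,
    # e.g. ['my_data/part-00000', 'my_data/part-00001'].
    parts = File.resolve_filenames('my_data')

    # Several expressions can be combined in a comma-separated list.
    combined = File.resolve_filenames('logs/2023-*.txt,my_data')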
class pysparkling.fileio.TextFile(file_name)
    Derived from File.

    Parameters: file_name – Any text file name.
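A minimal sketch, assuming that TextFile mirrors File but works with text streams (the file name is a placeholder and the io.StringIO-based dump() is an assumption):

    import io
    from pysparkling.fileio import TextFile

    # Write text (the .bz2 suffix triggers compression) and read it back.
    TextFile('notes.txt.bz2').dump(io.StringIO('line 1\nline 2\n'))
    content = TextFile('notes.txt.bz2').load().read()
    # content == 'line 1\nline 2\n'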
File System
class pysparkling.fileio.fs.FileSystem(file_name)
    Interface class for the file system.

    Parameters: file_name (str) – File name.

    static resolve_filenames(expr)
        Resolve the given glob-like expression to filenames.

        Return type: list

    load()
        Load a file to a stream.

        Return type: io.BytesIO
    load_text(encoding='utf8', encoding_errors='ignore')
        Load a file to a text stream.

        Parameters:
            encoding (str) – The text encoding (default 'utf8').
            encoding_errors (str) – How decoding errors are handled (default 'ignore').
        Return type: io.StringIO
    dump(stream)
        Dump a stream to a file.

        Parameters: stream (io.BytesIO) – Input stream.
    make_public(recursive=False)
        Make the file public (only on some file systems).

        Parameters: recursive (bool) – Recurse.
        Return type: FileSystem
class pysparkling.fileio.fs.Local(file_name)
    FileSystem implementation for the local file system.
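A minimal sketch of using the FileSystem interface through Local directly (the paths are placeholders; in most cases you would go through File or TextFile, which pick the right FileSystem for a given path):

    import io
    from pysparkling.fileio.fs import Local

    # Write a local file through the FileSystem interface.
    Local('/tmp/example.txt').dump(io.BytesIO(b'some bytes\n'))

    # Read it back as text with an explicit encoding.
    text = Local('/tmp/example.txt').load_text(encoding='utf8').read()

    # Resolve a glob-like expression to concrete file names.
    names = Local.resolve_filenames('/tmp/example*.txt')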
class pysparkling.fileio.fs.GS(file_name)
    FileSystem implementation for Google Storage.

    Paths are of the form gs://bucket_name/file_path or gs://project_name:bucket_name/file_path.

    mime_type = 'text/plain'
        Default mime type.

    project_name = None
        Set a default project name.
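As a sketch, a default project name can be set on the class before accessing gs:// paths (project and bucket names are placeholders, and relying on the class attribute rather than the path prefix is an assumption based on the project_name attribute above):

    from pysparkling.fileio import File
    from pysparkling.fileio.fs import GS

    # Default project for plain gs://bucket/path paths (assumed behavior).
    GS.project_name = 'my-project'
    data = File('gs://my-bucket/my_file.txt').load().read()

    # Alternatively, encode the project in the path itself:
    # File('gs://my-project:my-bucket/my_file.txt').load()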
class pysparkling.fileio.fs.Hdfs(file_name)
    FileSystem implementation for HDFS.
class pysparkling.fileio.fs.Http(file_name)
    FileSystem implementation for HTTP.
class pysparkling.fileio.fs.S3(file_name)
    FileSystem implementation for S3.

    Use the environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID for authentication and use file paths of the form s3://bucket_name/filename.txt.

    connection_kwargs = {}
        Keyword arguments for new connections. Example: set to {'anon': True} for anonymous connections.
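For example, reading a public object anonymously could look like this (the bucket and key are placeholders):

    from pysparkling.fileio import File
    from pysparkling.fileio.fs import S3

    # Anonymous access to a public bucket; for private buckets set
    # AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the environment instead.
    S3.connection_kwargs = {'anon': True}
    data = File('s3://my-public-bucket/filename.txt').load().read()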
Codec
class pysparkling.fileio.codec.Codec
    Codec.

    compress(stream)
        Compress.

        Parameters: stream (io.BytesIO) – Uncompressed input stream.
        Return type: io.BytesIO

    decompress(stream)
        Decompress.

        Parameters: stream (io.BytesIO) – Compressed input stream.
        Return type: io.BytesIO
class pysparkling.fileio.codec.Lzma
    Implementation of Codec for lzma compression.

    Needs Python >= 3.3.
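Codecs are normally applied automatically based on the file name, but they can also be used on their own. A minimal round-trip sketch with the Lzma codec (the payload is arbitrary):

    import io
    from pysparkling.fileio.codec import Lzma

    codec = Lzma()
    payload = b'some repetitive payload ' * 100

    # compress() and decompress() both take and return io.BytesIO streams.
    compressed = codec.compress(io.BytesIO(payload))
    restored = codec.decompress(io.BytesIO(compressed.getvalue()))
    assert restored.read() == payload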