InputFiles Operation in PySpark DataFrames: A Comprehensive Guide
PySpark’s DataFrame API is a cornerstone for managing big data, and the inputFiles operation offers a practical way to peek behind the curtain by revealing the source files that make up your DataFrame. It’s a handy tool that hands you a list of file paths—think of it as a map showing where your data came from—without fussing over the data itself. Whether you’re tracking down the origins of a loaded dataset, debugging a pipeline, or keeping tabs on your data’s roots, inputFiles gives you a straightforward path to that info. Built into the Spark SQL engine and powered by the Catalyst optimizer, it pulls this list quickly, relying on metadata rather than heavy computation. In this guide, we’ll explore what inputFiles does, walk through how you can use it with plenty of detail, and highlight where it fits into real-world scenarios, all with examples that bring it to life.
Ready to dig into inputFiles? Check out PySpark Fundamentals and let’s get rolling!
What is the InputFiles Operation in PySpark?
The inputFiles operation in PySpark is a method you call on a DataFrame to get a list of the file paths that Spark used to build it. It’s like asking Spark, “Where did you get this data?” and getting back a plain Python list of strings—each one a path to a file that fed into your DataFrame. Available in the Python API since Spark 3.1.0, it’s designed to give you a best-effort snapshot, pulling together the files from whatever sources (like CSV, Parquet, or JSON) your DataFrame came from. When you run it, Spark digs into the metadata it’s already got, managed by the Catalyst optimizer, and hands you that list without scanning the data itself. It’s a lightweight move—no big cluster jobs, no shuffling—just a quick peek at the origins. You’ll see it pop up when you need to trace back to your source files, maybe to debug a load gone wrong or log where your data’s been pulled from, making it a neat trick for keeping your pipeline transparent and under control.
Here’s a simple taste of how it works:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("QuickPeek").getOrCreate()
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["name", "age"])
df.write.parquet("/tmp/example.parquet")
loaded_df = spark.read.parquet("/tmp/example.parquet")
files = loaded_df.inputFiles()
print(files)
# Output might look like:
# ['file:///tmp/example.parquet/part-00000-...parquet']
spark.stop()
We start with a SparkSession, whip up a small DataFrame, save it as Parquet, and load it back. Calling inputFiles on the loaded DataFrame gives us a list of file paths—here, it’s just one Parquet file, but it could be many depending on how the data’s split. It’s a clean way to see what Spark’s working with. Want more on DataFrames? Check out DataFrames in PySpark. For setup help, see Installing PySpark.
Various Ways to Use InputFiles in PySpark
The inputFiles method offers several natural ways to tap into the source files of your DataFrame, each fitting into different parts of your workflow. Let’s go through them with examples that show how it all plays out.
1. Tracking Down Where Your Data Came From
Sometimes you just need to know where your DataFrame’s data started. Calling inputFiles hands you that list of file paths Spark pulled from—a quick way to see the origins without digging through logs or guessing.
This is a lifesaver when you’re loading data from a bunch of files, maybe a directory full of CSVs or Parquets, and want to confirm what Spark grabbed. It’s your first step to make sure the right files got picked up, especially if you’re dealing with a mix of sources or a dynamic folder.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TrackOrigin").getOrCreate()
data1 = [("Alice", 25)]
data2 = [("Bob", 30)]
spark.createDataFrame(data1, ["name", "age"]).write.parquet("/tmp/part1.parquet")
spark.createDataFrame(data2, ["name", "age"]).write.parquet("/tmp/part2.parquet")
df = spark.read.parquet("/tmp/*.parquet")
files = df.inputFiles()
print(f"Source files: {files}")
# Output might look like:
# ['file:///tmp/part1.parquet/part-00000-...parquet', 'file:///tmp/part2.parquet/part-00000-...parquet']
spark.stop()
We save two DataFrames as Parquet files, load them back with a wildcard, and inputFiles shows both paths. If you’re pulling daily logs from a folder, this confirms Spark scooped up all the files you meant it to.
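If you already know which directories should be in the mix, a quick check against that list catches anything missing. Here is a minimal sketch that continues from the example above, reusing its files variable (the expected_dirs list is a hypothetical set of directory names you define yourself, not something Spark provides):
expected_dirs = ["part1.parquet", "part2.parquet"]  # hypothetical: the directories you expect Spark to read
missing = [d for d in expected_dirs if not any(d in path for path in files)]
if missing:
    print(f"Warning: nothing loaded from {missing}")
else:
    print("All expected sources accounted for")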
2. Debugging a Load That Went Sideways
When something’s off with your loaded DataFrame—maybe it’s missing data or looks funky—inputFiles helps you trace it back to the source. It lists the files Spark used, so you can check if it grabbed the right ones or skipped something.
This comes up when you’re troubleshooting a pipeline. Say your DataFrame’s short on rows, and you suspect Spark missed a file—or grabbed an old one by mistake. Pulling inputFiles lets you see exactly what went in, narrowing down the culprit fast.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DebugLoad").getOrCreate()
data = [("Alice", 25)]
spark.createDataFrame(data, ["name", "age"]).write.parquet("/tmp/debug.parquet")
df = spark.read.parquet("/tmp/debug.parquet")
files = df.inputFiles()
print(f"Loaded from: {files}")
# Output might look like:
# ['file:///tmp/debug.parquet/part-00000-...parquet']
spark.stop()
We save a DataFrame, load it, and inputFiles shows the file path. If your pipeline’s pulling from a directory and the data’s off, this tells you if Spark hit the right spot—or if an old file’s sneaking in.
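To dig a little deeper, you can compare what Spark reports against what is actually sitting on disk and spot any files it skipped. A small sketch using Python's glob module, continuing from the example above with its files variable (paths are local here, so the file:// prefix is stripped before comparing):
import glob
on_disk = set(glob.glob("/tmp/debug.parquet/*.parquet"))
loaded = {path.replace("file://", "") for path in files}
skipped = on_disk - loaded
print(f"On disk but not loaded: {skipped or 'none'}")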
3. Logging Where Data’s Coming From
Keeping a record of your data’s origins is smart for audits or tracking. InputFiles gives you that list of paths to log, tying your DataFrame back to its source files without extra hassle.
This fits when you’re running a job and need to note where everything came from—maybe for compliance or just to keep your process clear. You grab inputFiles and stash it in a log, so you’ve got a trail to follow later.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LogSource").getOrCreate()
data = [("Alice", 25)]
spark.createDataFrame(data, ["name", "age"]).write.parquet("/tmp/log.parquet")
df = spark.read.parquet("/tmp/log.parquet")
files = df.inputFiles()
with open("data_log.txt", "a") as log:
log.write(f"DataFrame loaded from: {files}\n")
print(f"Logged files: {files}")
# Output might look like:
# Logged files: ['file:///tmp/log.parquet/part-00000-...parquet']
spark.stop()
We save a DataFrame, load it, and log the inputFiles to a file. If you’re processing customer data and need to track sources, this keeps a tidy record without slowing you down.
4. Checking Partitioned Data Sources
When your DataFrame comes from partitioned files—like Parquet split by year or region—inputFiles lists all the pieces Spark pulled together. It shows you every file path, reflecting how the partitions got rolled into one.
This is key when you’re working with partitioned datasets and want to make sure Spark grabbed everything—or the right subset. Maybe you’re loading sales data split by month, and you need to confirm all the months you expected are in there.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PartitionCheck").getOrCreate()
data1 = [("Alice", "2023")]
data2 = [("Bob", "2024")]
spark.createDataFrame(data1, ["name", "year"]).write.partitionBy("year").parquet("/tmp/sales")
spark.createDataFrame(data2, ["name", "year"]).write.partitionBy("year").mode("append").parquet("/tmp/sales")
df = spark.read.parquet("/tmp/sales")
files = df.inputFiles()
print(f"Partition files: {files}")
# Output might look like:
# ['file:///tmp/sales/year=2023/part-00000-...parquet', 'file:///tmp/sales/year=2024/part-00000-...parquet']
spark.stop()
We save two DataFrames partitioned by year, load them back, and inputFiles lists both paths. If you’re analyzing yearly sales, this shows Spark got 2023 and 2024, no gaps.
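Because partitioned directories encode their values in the path (year=2023, year=2024), you can pull those values straight out of the list and compare them with what you expect. A minimal sketch, reusing the files list from the example above (expected_years is a hypothetical set you would define yourself):
import re
years_loaded = set(re.findall(r"year=(\d+)", " ".join(files)))
expected_years = {"2023", "2024"}  # hypothetical: the partitions you expect to see
print(f"Partitions loaded: {sorted(years_loaded)}")
print(f"Missing partitions: {expected_years - years_loaded or 'none'}")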
5. Verifying Filtered Loads
If you’ve loaded a subset of files—like Parquet with a partition filter—inputFiles shows which files Spark resolved for that read. When the filter is reflected in the query plan, the list narrows to just those paths, helping you verify the filter worked as expected. Keep in mind it’s a best-effort view: for plain path-based reads, partition pruning often happens later at execution time, so the list can still include every file Spark discovered.
This is useful when you’re narrowing down a big dataset, maybe loading only recent files with a condition. You run inputFiles to get a first read on what Spark resolved, catching obvious slip-ups before you process further.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FilterVerify").getOrCreate()
data1 = [("Alice", "2023")]
data2 = [("Bob", "2024")]
spark.createDataFrame(data1, ["name", "year"]).write.partitionBy("year").parquet("/tmp/filtered")
spark.createDataFrame(data2, ["name", "year"]).write.partitionBy("year").mode("append").parquet("/tmp/filtered")
df = spark.read.parquet("/tmp/filtered").filter("year = '2024'")
files = df.inputFiles()
print(f"Filtered files: {files}")
# Output might look like (when the filter is reflected in the plan; you may see both years otherwise):
# ['file:///tmp/filtered/year=2024/part-00000-...parquet']
spark.stop()
We save partitioned data, load it with a filter for 2024, and inputFiles shows the files Spark resolved for the read, ideally just that year’s. If you’re pulling recent transactions, this gives you a quick first check that older data isn’t sneaking in; for a definitive answer, check the physical plan as shown below.
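If you need a definitive answer on which partitions a filtered scan will actually read, the physical plan is the place to look; for file sources it usually lists the pushed partition filters. A quick sketch, run before spark.stop() in the example above:
# Print the physical plan; the FileScan node usually includes a
# PartitionFilters entry reflecting the year = '2024' predicate
df.explain()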
Common Use Cases of the InputFiles Operation
The inputFiles method fits into all kinds of moments where knowing your data’s source files matters. Here’s where it naturally shines.
1. Pinning Down Data Origins
When you need to know where your DataFrame came from, inputFiles gives you that list of file paths—a clear way to trace back to the source.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("OriginPin").getOrCreate()
data = [("Alice", 25)]
spark.createDataFrame(data, ["name", "age"]).write.parquet("/tmp/origin.parquet")
df = spark.read.parquet("/tmp/origin.parquet")
files = df.inputFiles()
print(f"From: {files}")
# Output: From: ['file:///tmp/origin.parquet/...']
spark.stop()
2. Sorting Out Load Problems
If your DataFrame’s acting up, inputFiles shows you the files Spark used, helping you figure out if it grabbed the wrong ones or missed something.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LoadSort").getOrCreate()
df = spark.read.parquet("/tmp/mystery.parquet")
files = df.inputFiles()
print(f"Loaded files: {files}")
# Output: See what Spark pulled
spark.stop()
3. Keeping a Source Log
For tracking or audits, inputFiles lets you log where your data’s from, tying it back to its roots.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SourceLog").getOrCreate()
df = spark.read.parquet("/tmp/track.parquet")
files = df.inputFiles()
with open("source_log.txt", "a") as log:
log.write(f"Loaded from: {files}\n")
# Keeps a record
spark.stop()
4. Confirming Partition Loads
With partitioned data, inputFiles lists all the files Spark pulled in, making sure you’ve got the full set—or just what you wanted.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PartConfirm").getOrCreate()
df = spark.read.parquet("/tmp/parts/*.parquet")
files = df.inputFiles()
print(f"Partition files: {files}")
# Output: All partition paths
spark.stop()
FAQ: Answers to Common InputFiles Questions
Here’s a natural take on questions folks might have about inputFiles, with answers that dig into the details.
Q: How’s inputFiles different from tracking files myself?
Using inputFiles is Spark doing the work for you—it pulls the file paths straight from the DataFrame’s metadata, no manual tracking needed. If you tracked them yourself, you’d have to note every file path you fed into spark.read, which works but gets messy with wildcards or partitions. inputFiles reports the files Spark’s readers actually resolved, wildcards and partition discovery included, saving you the hassle and keeping it accurate.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TrackCompare").getOrCreate()
df = spark.read.parquet("/tmp/multi/*.parquet")
files = df.inputFiles()
print(f"Spark says: {files}")
# Output: Exact files Spark loaded, no guesswork
spark.stop()
Q: Does inputFiles always get every file?
It tries its best, but it’s not perfect—it’s a “best-effort” snapshot. InputFiles asks each source relation (like a Parquet reader) for its files and combines them, dropping duplicates. If a source doesn’t report right—like a funky custom reader—or if Spark’s filtering skips some files under the hood, you might not see everything. Usually, it’s spot-on for standard loads like CSV or Parquet.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("EveryFile").getOrCreate()
df = spark.read.parquet("/tmp/all/*.parquet")
files = df.inputFiles()
print(f"Files: {len(files)} found")
# Output: Should match files loaded, barring edge cases
spark.stop()
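One detail worth seeing in action: because inputFiles walks every file-based relation in the plan and deduplicates the result, a DataFrame built from several sources reports the files of all of them. A minimal sketch, assuming the two Parquet directories written in the earlier tracking example still exist:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CombinedSources").getOrCreate()
df_a = spark.read.parquet("/tmp/part1.parquet")
df_b = spark.read.parquet("/tmp/part2.parquet")
combined = df_a.union(df_b)
print(f"Files behind the union: {combined.inputFiles()}")
# Output: paths from both /tmp/part1.parquet and /tmp/part2.parquet, no duplicates
spark.stop()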
Q: Can I use inputFiles to change my DataFrame?
No—it’s just a peek, not a tool to tweak. InputFiles gives you the list of source paths, but it doesn’t let you mess with the DataFrame itself. If you want to reload or adjust based on those files, you’d use spark.read again with the paths—or filter them yourself—but inputFiles itself is read-only.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ChangeNo").getOrCreate()
df = spark.read.parquet("/tmp/change.parquet")
files = df.inputFiles()
# Can’t modify df with files directly, but can reload
new_df = spark.read.parquet(files[0])
new_df.show()
# Output: New DataFrame from one file
spark.stop()
Q: Does calling inputFiles take a lot of time?
Not at all—it’s quick. When you run inputFiles, Spark’s just pulling metadata it’s already got, not scanning data or firing up a big job. It’s fast, even with a massive DataFrame, so you can call it whenever without slowing things down.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TimeCheck").getOrCreate()
data = [("Alice", 25)] * 1000000
spark.createDataFrame(data, ["name", "age"]).write.parquet("/tmp/big.parquet")
df = spark.read.parquet("/tmp/big.parquet")
files = df.inputFiles()
print(f"Fast grab: {files}")
# Output: Instant list, no delay
spark.stop()
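If you want to see the speed for yourself, a rough timing check with Python's time module does the job (a sketch; exact numbers depend on your setup, but the call only touches metadata):
import time
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TimeIt").getOrCreate()
df = spark.read.parquet("/tmp/big.parquet")  # the dataset written in the example above
start = time.perf_counter()
files = df.inputFiles()
print(f"Fetched {len(files)} paths in {time.perf_counter() - start:.3f}s")
spark.stop()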
Q: What if my DataFrame’s from a database, not files?
If your DataFrame comes from a database—like via JDBC—inputFiles won’t have much to say. It’s built for file-based sources (CSV, Parquet, etc.), so with a database, it’ll likely return an empty list since there’s no file path to report. You’d need to track the source another way, like the query or table name.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DBCheck").getOrCreate()
# Hypothetical JDBC load (needs a SQLite JDBC driver on the classpath)
df = spark.read.format("jdbc").option("url", "jdbc:sqlite:/tmp/db").option("dbtable", "people").load()
files = df.inputFiles()
print(f"Database files: {files}")
# Output: Database files: [] (no files here)
spark.stop()
InputFiles vs Other DataFrame Operations
The inputFiles method gives you a list of source file paths, unlike columns, which lists column names, or dtypes, which pairs names with their types. It’s not about data like show or stats like describe—it’s metadata about origins, pulled straight from the query plan, which keeps it cheap compared to heavy ops like collect.
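To see the contrast side by side, here is a quick sketch printing what each of these metadata calls returns for the same DataFrame (using the Parquet directory written in the first example; the dtypes shown are what Spark typically infers for that data):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MetadataCompare").getOrCreate()
df = spark.read.parquet("/tmp/example.parquet")
print(df.columns)       # column names, e.g. ['name', 'age']
print(df.dtypes)        # names paired with types, e.g. [('name', 'string'), ('age', 'bigint')]
print(df.inputFiles())  # source file paths
spark.stop()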
More details at DataFrame Operations.
Conclusion
The inputFiles operation in PySpark is a slick, no-nonsense way to see where your DataFrame’s data came from, handing you a list of file paths to track or debug with ease. Master it with PySpark Fundamentals to sharpen your data skills!