Moving to the Cloud
Warning: This is a work in progress and not yet fully supported.
In the Quick Start guide, you learned how to implement a simple app that trains an image classifier and serves it once trained.
In this tutorial, you’ll learn how to extend that application so that it works seamlessly both locally and in the cloud.
Step 1: Distributed Application
Distributed Storage
When running your application in a fully-distributed setting, the data available on one machine won’t necessarily be available on another.
To solve this problem, Lightning introduces the Path object, which ensures that your code can run both locally and in the cloud.
The Path object keeps track of the work which creates the path. This enables Lightning to transfer the files correctly in a distributed setting.
Instead of passing a string representing a file or directory, Lightning simply wraps it in a Path object and makes it an attribute of your LightningWork. Unless you do this conscientiously for every single path, your application will fail in the cloud.
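For instance, here’s a minimal sketch of the pattern (the ProducerWork class and checkpoint.txt file name are hypothetical):

from lightning.app import LightningWork
from lightning.app.storage.path import Path


class ProducerWork(LightningWork):
    def __init__(self):
        super().__init__()
        # Wrapping the string in a Path lets Lightning track which work
        # created the file and transfer it between machines.
        self.checkpoint_path = Path("checkpoint.txt")

    def run(self):
        # Path behaves like pathlib.Path, so it works with open().
        with open(self.checkpoint_path, "w") as f:
            f.write("weights")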
In the example below, a file written by SourceFileWork is being transferred by the flow to the DestinationFileAndServeWork work. The Path object is the reference to the file.
import os

from lightning.app import CloudCompute, LightningApp, LightningFlow, LightningWork
from lightning.app.components import TracerPythonScript
from lightning.app.storage.path import Path

FILE_CONTENT = """
Hello there!
This tab is currently an IFrame of the FastAPI Server running in `DestinationFileAndServeWork`.
Also, the content of this file was created in `SourceFileWork` and then transferred to `DestinationFileAndServeWork`.
Are you already 🤯 ? Stick with us, this is only the beginning. Lightning is 🚀.
"""


class SourceFileWork(LightningWork):
    def __init__(self, cloud_compute: CloudCompute = CloudCompute(), **kwargs):
        super().__init__(parallel=True, **kwargs, cloud_compute=cloud_compute)
        self.boring_path = None

    def run(self):
        # This should be used as a REFERENCE to the file.
        self.boring_path = "lit://boring_file.txt"
        with open(self.boring_path, "w", encoding="utf-8") as f:
            f.write(FILE_CONTENT)


class DestinationFileAndServeWork(TracerPythonScript):
    def run(self, path: Path):
        assert path.exists()
        self.script_args += [f"--filepath={path}", f"--host={self.host}", f"--port={self.port}"]
        super().run()


class BoringApp(LightningFlow):
    def __init__(self):
        super().__init__()
        self.source_work = SourceFileWork()
        self.dest_work = DestinationFileAndServeWork(
            script_path=os.path.join(os.path.dirname(__file__), "scripts/serve.py"),
            port=1111,
            parallel=False,  # runs until killed.
            cloud_compute=CloudCompute(),
            raise_exception=True,
        )

    @property
    def ready(self) -> bool:
        return self.dest_work.is_running

    def run(self):
        self.source_work.run()
        if self.source_work.has_succeeded:
            # the flow passes the file from one work to another.
            self.dest_work.run(self.source_work.boring_path)
            self.stop("Boring App End")

    def configure_layout(self):
        return {"name": "Boring Tab", "content": self.dest_work.url + "/file"}


app = LightningApp(BoringApp())
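If this script is saved as app.py (the file name is an assumption), the same app runs locally or in the cloud with the lightning CLI:

lightning run app app.py          # run locally
lightning run app app.py --cloud  # run on the Lightning cloud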
In the scripts/serve.py file, we create a FastAPI service running on port 1111 that returns the content of the file received from SourceFileWork when a GET request is sent to /file.
import argparse
import os

import uvicorn
from fastapi import FastAPI
from fastapi.requests import Request
from fastapi.responses import HTMLResponse

if __name__ == "__main__":
    parser = argparse.ArgumentParser("Server Parser")
    parser.add_argument("--filepath", type=str, help="Where to find the `filepath`")
    parser.add_argument("--host", type=str, default="0.0.0.0", help="Server host")
    parser.add_argument("--port", type=int, default=8888, help="Server port")
    hparams = parser.parse_args()

    fastapi_service = FastAPI()

    if not os.path.exists(str(hparams.filepath)):
        content = ["The file wasn't transferred"]
    else:
        with open(hparams.filepath) as fo:
            content = fo.readlines()  # read the file received from SourceFileWork.

    @fastapi_service.get("/file", response_class=HTMLResponse)
    async def get_file_content(request: Request):
        lines = "\n".join(["<p>" + line + "</p>" for line in content])
        return HTMLResponse(f"<html><head></head><body><ul>{lines}</ul></body></html>")

    uvicorn.run(app=fastapi_service, host=hparams.host, port=hparams.port)
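For reference, the server can also be launched by hand with the same arguments the work passes in (the file path below is hypothetical):

python scripts/serve.py --filepath=boring_file.txt --host=0.0.0.0 --port=1111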
Distributed Frontend
In the above example, the FastAPI service was running on one machine, and the frontend UI on another.
In order to assemble them, you need to do two things:
- Provide the port argument to your work’s __init__ method to expose a single service.
- Expose the service within the configure_layout flow hook (both are shown below).
Here’s how to expose the port:
class BoringApp(LightningFlow):
    def __init__(self):
        super().__init__()
        self.source_work = SourceFileWork()
        self.dest_work = DestinationFileAndServeWork(
            script_path=os.path.join(os.path.dirname(__file__), "scripts/serve.py"),
            port=1111,
            parallel=False,  # runs until killed.
            cloud_compute=CloudCompute(),
            raise_exception=True,
        )
And here’s how to expose your services within the configure_layout flow hook:

    def configure_layout(self):
        return {"name": "Boring Tab", "content": self.dest_work.url + "/file"}
In this example, we’re appending /file to our FastAPI service URL. This means that our Boring Tab triggers get_file_content from the FastAPI service and embeds its content as an IFrame.
@fastapi_service.get("/file", response_class=HTMLResponse)
async def get_file_content(request: Request):
    lines = "\n".join(["<p>" + line + "</p>" for line in content])
    return HTMLResponse(f"<html><head></head><body><ul>{lines}</ul></body></html>")
[Figure: a visualization of the application described above]
Step 2: Scalable Application
The benefit of defining long-running code inside a LightningWork component is that you can run it on different hardware by providing CloudCompute to the __init__ method of your LightningWork.
By adapting the Quick Start example as follows, you can easily run your component on multiple GPUs:
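A minimal sketch of that change, with a hypothetical TrainWork standing in for the Quick Start training component and "gpu-fast-multi" as an assumed machine-type name:

from lightning.app import CloudCompute, LightningFlow, LightningWork


class TrainWork(LightningWork):
    # Hypothetical stand-in for the Quick Start training component.
    def run(self):
        ...


class RootFlow(LightningFlow):
    def __init__(self):
        super().__init__()
        # "gpu-fast-multi" is an assumed machine type; any instance
        # name supported by your Lightning version works here.
        self.train_work = TrainWork(cloud_compute=CloudCompute("gpu-fast-multi"))

    def run(self):
        self.train_work.run()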
Without doing much, you’re now running a script on its own cluster of machines! 🤯
Step 3: Resilient Application
We designed Lightning with a strong emphasis on supporting failure cases. The framework shines when developers embrace our fault-tolerance best practices, enabling them to create ML applications of high complexity with strong support for unhappy cases.
An entire section will be dedicated to this concept.
TODO