Introduction to serverless functions
fal is a serverless Python runtime provider that simplifies scaling your workloads in the cloud. By adding a simple annotation to your existing function, you can enable it to perform processing that may exceed the capabilities of your local environment, such as access to more raw compute power or to AI accelerator cards like GPUs.
Leveraging Python's rich ecosystem
fal functions can leverage Python's powerful package ecosystem. Let's start with a fun example: using pyjokes inside one of our serverless functions.
import fal

@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
)
def tell_joke() -> str:
    import pyjokes

    joke = pyjokes.get_joke()
    return joke

print("Joke from the clouds: ", tell_joke())
As soon as this function is called, fal creates a new virtual environment in the cloud and installs the requirements we passed. From that point on, our code is executed as if it were running locally, and the joke prepared by the pyjokes library is returned.
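Since the requirements are installed into a fresh virtual environment in the cloud, you will often want to pin versions so the remote environment stays reproducible. A minimal sketch, assuming the requirements list accepts standard pip-style version specifiers (the pinned version below is only illustrative):

import fal

# Assumption: requirements accept pip-style specifiers; the pinned
# version below is illustrative only.
@fal.function(
    "virtualenv",
    requirements=["pyjokes==0.6.0"],
)
def tell_pinned_joke() -> str:
    import pyjokes

    return pyjokes.get_joke()

print("Pinned joke: ", tell_pinned_joke())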
One thing you might have noticed is that we imported the pyjokes library inside the function definition rather than at the top of the file. Since pyjokes is only needed in the cloud environment, the snippet above works even if pyjokes is not installed locally; the dependency is still available inside the remote function. This is particularly important when dealing with complex dependency chains.
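To make the difference concrete, here is a sketch contrasting a top-level import (which runs on your machine) with the in-function import used above:

# A top-level import executes on *your* machine as soon as the file is
# imported, so it would raise ModuleNotFoundError if pyjokes is not
# installed locally:
#
#     import pyjokes  # fails locally without pyjokes

import fal

@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
)
def tell_joke() -> str:
    # This import only executes in the cloud environment, where fal has
    # already installed pyjokes for us.
    import pyjokes

    return pyjokes.get_joke()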
Implementation note
When using 3rd party objects as inputs or outputs for fal functions, be aware that your local computer and the fal serverless runtime must have the exact same set of dependencies.
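For example, returning a third-party object instead of a plain built-in type means your local Python also needs that package (ideally at the same version) to deserialize the result. A sketch, using numpy purely as an illustration:

import fal

@fal.function(
    "virtualenv",
    requirements=["numpy"],
)
def make_array() -> "numpy.ndarray":
    import numpy

    return numpy.arange(10)

# This only works if numpy is *also* installed locally (ideally the same
# version), because the returned array must be deserialized on your machine.
# Returning plain built-in types (lists, tuples, dicts, strings) avoids the issue.
print(make_array())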
Running basic ML workflows
Now that we know how to leverage an existing Python package, we can use transformers to run a simple ML workflow such as text classification, all in the cloud and with no infrastructure to manage!
import fal

# Can be any model from HF's model hub, see https://huggingface.co/models
TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    machine_type="M",
)
def classify_text(text: str) -> tuple[str, float]:
    from transformers import pipeline

    pipe = pipeline("text-classification", model=TEXT_CLASSIFICATION_MODEL)
    [result] = pipe(text)
    return result["label"], result["score"]

if __name__ == "__main__":
    sentiment, confidence = classify_text("I like apples.")
    print(
        f"Sentiment of the subject prompt: {sentiment!r} "
        f"with a confidence of {confidence}"
    )
One new concept you might have noticed is the machine_type argument, which denotes what kind of machine your workflow runs on. Since this is an ML inference workload, we choose an M tier machine (which offers much more compute than the default XS machines). Running our workflow on one of these machines takes about 15 seconds on the initial invocation (after the environment has been built).
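If you want to see that cold-start cost for yourself, you can time the call from the client side. A minimal sketch using only the standard library and the classify_text function defined above:

import time

start = time.perf_counter()
sentiment, confidence = classify_text("I like apples.")
print(f"Cold invocation took {time.perf_counter() - start:.1f}s -> {sentiment!r}")

start = time.perf_counter()
sentiment, confidence = classify_text("I hate oranges.")
print(f"Next invocation took {time.perf_counter() - start:.1f}s -> {sentiment!r}")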
Faster subsequent invocations
Each time you invoke a fal function, the fal runtime automatically provisions a new machine in the cloud that matches the function's properties (its environment, its machine type, and so on), runs your function, returns the result, and then discards that machine.
This is useful for saving cost when usage is sparse. But if your traffic pattern includes back-to-back invocations, it can be a good idea to keep the machine around a little longer. This avoids the cost of a new cold start each time: machine provisioning and, more importantly, losing the warm Python process.
import fal

# Can be any model from HF's model hub, see https://huggingface.co/models
TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    machine_type="M",
    keep_alive=60,
)
def classify_text(text: str) -> tuple[str, float]:
    from transformers import pipeline

    pipe = pipeline("text-classification", model=TEXT_CLASSIFICATION_MODEL)
    [result] = pipe(text)
    return result["label"], result["score"]

if __name__ == "__main__":
    for prompt in [
        "I like apples.",
        "I hate oranges.",
        "I have mixed feelings about pineapples.",
    ]:
        sentiment, confidence = classify_text(prompt)
        print(
            f"Sentiment of the subject prompt: {sentiment!r} "
            f"with a confidence of {confidence}"
        )
After setting a keep_alive on the function, our invocations go from ~15 seconds to ~3 seconds each: almost a 5x speed-up, simply from being able to keep the same Python modules in memory.
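Since the speed-up comes from the same Python process being reused, a natural next step is to cache expensive objects, such as the pipeline itself, instead of rebuilding them on every call. This is only a sketch: whether module-level state actually survives between invocations depends on how the runtime reuses the warm process, so treat it as an optimization to verify rather than guaranteed behavior.

import fal

TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

# Hypothetical cache: populated on the first (cold) call; warm calls served
# by the same process can skip rebuilding the pipeline.
_pipe = None

@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    machine_type="M",
    keep_alive=60,
)
def classify_text(text: str) -> tuple[str, float]:
    global _pipe
    if _pipe is None:
        from transformers import pipeline

        _pipe = pipeline("text-classification", model=TEXT_CLASSIFICATION_MODEL)
    [result] = _pipe(text)
    return result["label"], result["score"]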
Implementation note
Each fal function has a default keep_alive value of 30 seconds, so the machine behind it is kept around for 30 more seconds after each call so that subsequent invocations stay fast.
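If your traffic is steady rather than bursty, you can set a longer window than the default. The value is in seconds, as in the keep_alive=60 example above; the 300 below is just an illustrative choice:

import fal

@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
    keep_alive=300,  # keep the machine warm for five minutes after the last call
)
def tell_joke() -> str:
    import pyjokes

    return pyjokes.get_joke()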