Frank Shen
04/07/2023, 9:15 PM
@dynamic(
    requests=Resources(cpu="4", mem="20Gi"),
)
def train_foreach_tenure_small(
    df: pd.DataFrame,
) -> None:
    for tenure in range(1, 3, 1):
        data = df[df['TENURE']==tenure]
        xgbse_training(df)

@task(requests=Resources(mem="5Gi"))
def xgbse_training()
Pryce
04/07/2023, 11:15 PM
Can you run kubectl describe on the pod mentioned in the status message to see what the last event was?
karthikraj
04/10/2023, 10:40 AM
$ kubectl describe pod af49sl86c8p8kc2qqrh8-n2-0-dn0-0 -n marketing-development
Name: af49sl86c8p8kc2qqrh8-n2-0-dn0-0
Namespace: marketing-development
Priority: 0
Node: ip-10-69-46-24.ec2.internal/10.69.46.24
Start Time: Mon, 10 Apr 2023 07:51:35 +0000
Labels: domain=development
execution-id=af49sl86c8p8kc2qqrh8
interruptible=false
node-id=dn0
project=marketing
shard-key=3
task-name=train-monthly-ltv-train-wf-xgbse-training
workflow-name=train-monthly-ltv-train-wf-wf-train-small
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.69.47.41
IPs:
IP: 10.69.47.41
Controlled By: flyteworkflow/af49sl86c8p8kc2qqrh8
Containers:
af49sl86c8p8kc2qqrh8-n2-0-dn0-0:
Container ID: docker://85c96fc54a94dc53944482c7091c802f01948f4f2a604c2b33e676074ad3c248
Image: 613630599026.dkr.ecr.us-east-1.amazonaws.com/dai-mlp-flyte-spark-root-user:v-5
Image ID: docker-pullable://613630599026.dkr.ecr.us-east-1.amazonaws.com/dai-mlp-flyte-spark-root-user@sha256:f7331f453275fddba83e6793ca16179f04c2dcebd6ed7ed07dc37ffc0a845aee
Port: <none>
Host Port: <none>
Args:
pyflyte-fast-execute
--additional-distribution
s3://dev-wm-max-ml-flyte-us-east-1/6y/marketing/development/7NFZOXKPRBVVDC7QIGQGPL5J4A======/fastdf2af7949e4597d1ee1840f9322a9993.tar.gz
--dest-dir
/root
--
pyflyte-execute
--inputs
s3://dev-wm-max-ml-flyte-us-east-1/metadata/propeller/marketing-development-af49sl86c8p8kc2qqrh8/n2/data/0/dn0/inputs.pb
--output-prefix
s3://dev-wm-max-ml-flyte-us-east-1/metadata/propeller/marketing-development-af49sl86c8p8kc2qqrh8/n2/data/0/dn0/0
--raw-output-data-prefix
s3://dev-wm-max-ml-flyte-us-east-1/m3/af49sl86c8p8kc2qqrh8-n2-0-dn0-0
--checkpoint-path
s3://dev-wm-max-ml-flyte-us-east-1/m3/af49sl86c8p8kc2qqrh8-n2-0-dn0-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
train.monthly_ltv_train_wf
task-name
xgbse_training
State: Running
Started: Mon, 10 Apr 2023 07:51:36 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 40Gi
Requests:
cpu: 2
memory: 40Gi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: marketing:development:train.monthly_ltv_train_wf.wf_train_small
FLYTE_INTERNAL_EXECUTION_ID: af49sl86c8p8kc2qqrh8
FLYTE_INTERNAL_EXECUTION_PROJECT: marketing
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 0
FLYTE_INTERNAL_TASK_PROJECT: marketing
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: train.monthly_ltv_train_wf.xgbse_training
FLYTE_INTERNAL_TASK_VERSION: dW-H1Ra7T_2-SM6A3K-nkg==
FLYTE_INTERNAL_PROJECT: marketing
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: train.monthly_ltv_train_wf.xgbse_training
FLYTE_INTERNAL_VERSION: dW-H1Ra7T_2-SM6A3K-nkg==
DEFAULT_ENV_VAR: VALUE
MY_NAME: KARTHIKRAJ
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bcpld (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-bcpld:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Pryce
04/10/2023, 7:09 PM
Babis Kiosidis
04/11/2023, 6:43 AM
karthikraj
04/11/2023, 11:15 AM
Babis Kiosidis
04/11/2023, 12:23 PM
xgbse_training(df)
Frank Shen
04/11/2023, 4:11 PM
I think it makes sense not to pass large datasets as inputs/outputs of tasks, to simplify the system's metadata handling a bit.
Babis Kiosidis
04/12/2023, 7:01 AM
Ketan (kumare3)
Babis Kiosidis
04/12/2023, 3:17 PM
Pryce
04/12/2023, 8:06 PM
range() function above what's significantly larger? If that's the case, the delay with the larger dataset may be caused by dynamically building a DAG with many, many task nodes. Perhaps refactoring to a map task would help.
Frank Shen
04/12/2023, 8:28 PM
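A minimal sketch of the shape of the map-task refactor Pryce suggests above, using only the Python standard library so it runs anywhere. It is not Flyte code: the built-in map() stands in for flytekit's map_task, plain lists of dicts stand in for the pandas DataFrame, and split_by_tenure and train_one_partition are hypothetical names (the latter a placeholder for xgbse_training). The idea is to split the data once, then map a single-input function over the partitions, so each call receives only its own tenure's rows rather than the whole dataset.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, float]


def split_by_tenure(rows: List[Row], tenures: range) -> List[List[Row]]:
    """Group rows by their TENURE value, returning one partition per requested tenure."""
    buckets: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        buckets[int(row["TENURE"])].append(row)
    # A tenure with no matching rows yields an empty partition.
    return [buckets[t] for t in tenures]


def train_one_partition(partition: List[Row]) -> int:
    """Hypothetical stand-in for xgbse_training: reports how many rows it received."""
    return len(partition)


# Toy data (assumption): four rows across three tenures.
rows = [
    {"TENURE": 1, "x": 0.1},
    {"TENURE": 1, "x": 0.4},
    {"TENURE": 2, "x": 0.7},
    {"TENURE": 3, "x": 0.9},
]

# range(1, 3) covers tenures 1 and 2 only, matching range(1, 3, 1) in the thread.
partitions = split_by_tenure(rows, range(1, 3))

# Built-in map() as a stand-in for a Flyte map task over the partitions.
rows_per_task = list(map(train_one_partition, partitions))
```

In a Flyte workflow the partitions would be produced by one task and the mapping done with flytekit's map_task. Whether that removes the delay discussed in this thread is not confirmed here.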