Skip to content

Conversation

@YYStreet
Copy link
Contributor

@YYStreet YYStreet commented May 28, 2020

Issue #, if available:

Checklist

  • I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [build] | [test] | [build, test] | [ec2, ecs, eks, sagemaker]
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Description:

Tests run:

DLC image/dockerfile:

Additional context:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
region: &REGION <set-$REGION-in-environment>
framework: &FRAMEWORK pytorch
version: &VERSION 1.5.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our main buildspec file will be for 1.5 only.
Changes in 1.4 will far and fewer than PT1.5.
Please create a sperate buildspce for PT1.4 not PT1.5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I want to do is to use it as a one time release purpose file. Will recover to PT1.5 once PT 1.4 patch release done.
Which method do you prefer? To do modification on root buildspec.yml? It also requires a recover operation

Copy link
Contributor

@nikhil-sk nikhil-sk May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, this is fine. Let's use buildspec.yml for containers we want to build currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming.

@aws aws deleted a comment from YYStreet May 28, 2020

# Add any script or repo as required
RUN cd /var && wget https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py
COPY mnist.py /var/mnist.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is introduced since PT 1.5. PT team change the script and we now maintain it as a separate artifact https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/build_artifacts/mnist.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary now that we are building a PT 1.4 image?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with Tushar. We think it should be a general change and should also apply to PT1.4. @TusharKanekiDey

LOG_DIR=${HOME_DIR}/logs

git clone https://github.com/pytorch/examples.git ${HOME_DIR}/artifacts/examples
sed -i "s/param.add_(-0.1 \* param.grad)/param.data.add_(-0.1 \* param.grad)/g" ${HOME_DIR}/artifacts/examples/regression/main.py # temporary patch for isssue https://github.com/pytorch/examples/issues/782
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a issue with this test. The PT example repo fixed a similar issue, but seems forgot to fix this file. pytorch/examples#779

I opened a PR for their repo but did not have response yet. This is a temp approach to make the test pass.

@YYStreet YYStreet merged commit b9929da into aws:master May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants